arXiv
Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He
Tue, Jun 2, 2026, 9:32 AM PDT

Benchmark and toolkit improve math reasoning in language models

Original: PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Source: arxiv.org
