arXiv · Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He · Tue, Jun 2, 2026, 9:32 AM PDT
score 16.4
Benchmark and toolkit improve math reasoning in language models
Original: PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
Source: arxiv.org