arXiv
Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan
Tue, Jun 2, 2026, 6:47 AM PDT

New benchmark catches AI reasoning errors in chemistry tasks

Original: From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
