arXiv
Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan
Tue, Jun 2, 2026, 6:47 AM PDT
New benchmark catches AI reasoning errors in chemistry tasks
Original: From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Source: arxiv.org