arXiv
Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li
Tue, Jun 30, 2026, 3:08 AM PDT
score 16.5
New benchmark reveals LLMs fail at multi-step data tasks
Original: CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes
Source: arxiv.org