arXiv
Rohit Patel, Alexandre Rezende, Steven McClain
Mon, May 18, 2026, 10:09 AM PDT
New benchmark tests AI reasoning in realistic problem-solving
Original: GIM: Evaluating models via tasks that integrate multiple cognitive domains
Source: arxiv.org