x.com · Omar Khattab · Sun, May 31, 2026, 5:22 PM PDT
score 15.9
118 likes · 6 RT · 8 replies

New benchmark tests search and language models beyond saturated metrics

Original: if you're testing a new retrieval model or long-context LLM, it's a waste of your time (and ours...) to report 0.2% gains on the many saturated and expired benchmarks

Source: x.com
