New benchmark tests search and language models beyond saturated metrics

Original: if you're testing a new retrieval model or long-context LLM, it's a waste of your time (and ours...) to report 0.2% gains on the many saturated and expired benchmarks

Source: x.com
