AI benchmarks should evolve, not stay frozen over time

Original: Another great way to evolve a benchmark is by mining failure cases from production usage of an agent / LLM. A good example of this idea is CursorBench, which continually pulls new eval tasks from actu

Source: arxiv.org
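To make the idea of mining production failures concrete, here is a minimal sketch of one way such a loop could look. It is not a description of how CursorBench works. The `Interaction` schema, the `succeeded` flag, and the prompt-hash deduplication are all assumptions made for illustration.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Interaction:
    """One logged production interaction (hypothetical schema)."""
    prompt: str
    response: str
    succeeded: bool  # assumption: some signal, e.g. user acceptance, marks success


@dataclass(frozen=True)
class EvalTask:
    """One benchmark task derived from a failed interaction (hypothetical)."""
    task_id: str
    prompt: str


def task_id_for(prompt: str) -> str:
    """Build a stable ID from the normalized prompt so repeats collapse to one task."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def mine_failures(logs, benchmark):
    """Return a new benchmark with one task appended per distinct failed prompt."""
    seen = {task.task_id for task in benchmark}
    updated = list(benchmark)
    for item in logs:
        if item.succeeded:
            continue  # only failures become new eval tasks
        tid = task_id_for(item.prompt)
        if tid in seen:
            continue  # already covered by an existing task
        seen.add(tid)
        updated.append(EvalTask(task_id=tid, prompt=item.prompt))
    return updated


logs = [
    Interaction("Rename foo to bar", "...", succeeded=True),
    Interaction("Fix the failing test", "...", succeeded=False),
    Interaction("fix the  failing test", "...", succeeded=False),  # near-duplicate
]
benchmark = mine_failures(logs, [])
print(len(benchmark))  # prints 1
```

Running `mine_failures` again on the same logs leaves the benchmark unchanged, so the job can be rerun as new logs arrive. A real pipeline would also need privacy filtering and a way to define the expected outcome for each task, which this sketch leaves out.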
