New coding benchmark grades AI like human reviewers, not just tests

Original: SWE-Bench-style grading has been the standard for years now: you ask the agent to solve an issue and then run its code against pre-constructed unit tests.
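
As a rough illustration of the test-based grading described above, here is a minimal Python sketch. The names `candidate_patch`, `HiddenTests`, and `grade` are hypothetical and made up for this example; this is not the actual SWE-Bench harness, only the general idea of running an agent's code against tests written in advance.

```python
import io
import unittest


def candidate_patch(values):
    """Hypothetical agent-written fix: mean of a list, 0.0 for empty input."""
    return sum(values) / len(values) if values else 0.0


class HiddenTests(unittest.TestCase):
    """Hypothetical pre-constructed tests that the agent never sees."""

    def test_regular_input(self):
        self.assertEqual(candidate_patch([2, 4, 6]), 4.0)

    def test_empty_input(self):
        self.assertEqual(candidate_patch([]), 0.0)


def grade(test_case):
    """Return True only if every test in the given TestCase passes."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    # Send the runner's report to a buffer so only the verdict is exposed.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()


# Assumption: a task counts as solved only when all pre-built tests pass.
resolved = grade(HiddenTests)
```

The outcome is a single pass/fail verdict: the grader never inspects the code itself, only whether the tests pass.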

Source: x.com ↗
