Researcher questions whether new AI agent tests measure real weaknesses

Original: @cwolferesearch the failure-case approach makes intuitive sense, but I'm curious how they avoid the new evals just matching Cursor's current blind spots instead of general agent weaknesses

Source: x.com
