x.com · Vivek · Fri, Jun 5, 2026, 2:55 AM PDT
score 16.1
8 likes · 1 RT · 1 reply
AI evaluation scores depend on testing setup, not just the model
Original: excellent talk on why evals are so hard to judge. core point: an eval score isn't one number, it's the output of a whole stack: harness, sandbox, hardware, prompt, llm, grader, engine/api; and perturb
Source: x.com ↗
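One way to read the core point: two scores are only comparable when every layer of the stack behind them matches. A minimal Python sketch of that idea, assuming hypothetical names (`EvalStack`, `stack_fingerprint`, `changed_components`) and placeholder values that do not come from the talk or the post. It records the layers the post lists and reports which ones differ between two runs; it does not run or score anything.

```python
from dataclasses import dataclass, asdict, replace
import hashlib
import json


@dataclass(frozen=True)
class EvalStack:
    """The layers named in the post. All values used below are made-up placeholders."""
    harness: str
    sandbox: str
    hardware: str
    prompt: str
    llm: str
    grader: str
    engine: str  # inference engine or API


def stack_fingerprint(stack: EvalStack) -> str:
    """Short stable hash of the whole stack, to store next to any reported score."""
    payload = json.dumps(asdict(stack), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def changed_components(a: EvalStack, b: EvalStack) -> list[str]:
    """Names of the layers that differ between two runs, in field order."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Hypothetical baseline run; none of these identifiers refer to real tools.
baseline = EvalStack(
    harness="harness-v1",
    sandbox="container-a",
    hardware="gpu-type-x",
    prompt="prompt-template-1",
    llm="model-under-test",
    grader="grader-a",
    engine="engine-a",
)

# Same model, different grader: the fingerprints differ, so the two scores
# should not be compared as if only the model had been measured.
variant = replace(baseline, grader="grader-b")

print(stack_fingerprint(baseline) == stack_fingerprint(variant))
print(changed_components(baseline, variant))
```

Storing the fingerprint with each score makes it explicit when a difference between two numbers may come from the setup instead of the model.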