x.com · Vivek · Fri, Jun 5, 2026, 2:55 AM PDT
score 16.1
8 likes · 1 RT · 1 reply

AI evaluation scores depend on testing setup, not just the model

Original: excellent talk on why evals are so hard to judge. core point: an eval score isn't one number, it's the output of a whole stack: harness, sandbox, hardware, prompt, llm, grader, engine/api; and perturb

Source: x.com
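
To make the "whole stack" idea concrete, here is a minimal sketch, not anything shown in the talk or the post. It assumes a score is only meaningful when stored next to the setup that produced it. The names `EvalStack`, `EvalResult`, and `stack_differences` are hypothetical, and the component labels and scores are made-up placeholders rather than real measurements.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class EvalStack:
    """Hypothetical record of the components named in the post."""
    harness: str
    sandbox: str
    hardware: str
    prompt: str
    llm: str
    grader: str
    engine: str  # inference engine or API


@dataclass(frozen=True)
class EvalResult:
    """A score stored together with the stack that produced it."""
    score: float
    stack: EvalStack


def stack_differences(a: EvalResult, b: EvalResult) -> list[str]:
    """Return the names of the stack components that differ between two runs."""
    da, db = asdict(a.stack), asdict(b.stack)
    return [name for name in da if da[name] != db[name]]


# Placeholder values for illustration only.
base = EvalStack(
    harness="harness-v1",
    sandbox="container-a",
    hardware="gpu-x",
    prompt="prompt-v1",
    llm="model-m",
    grader="grader-v1",
    engine="engine-e",
)
run_a = EvalResult(score=0.62, stack=base)
run_b = EvalResult(score=0.55, stack=replace(base, grader="grader-v2"))

print(stack_differences(run_a, run_b))  # ['grader']
```

In this sketch the two runs use the same model, yet they are not directly comparable because the grader changed. Checking for such differences before comparing scores is one possible way to act on the post's point.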
