AI evaluation scores depend on testing setup, not just the model

Original: Excellent talk on why evals are so hard to judge. Core point: an eval score isn't one number; it's the output of a whole stack (harness, sandbox, hardware, prompt, LLM, grader, engine/API), and perturb

Source: x.com
