Legal AI benchmark shows different models win different tasks
Original: awesome artifact from the Harvey team on model perf, cost, latency, and trace analysis across legal domains
Source: x.com
Who: Posted by @Vtrivedy10, sharing a report from the Harvey team. Harvey is an AI company that builds legal research and drafting tools for law firms, and the report was authored or shared publicly by Gabe Pereyra, who works there.
What's new: Harvey ran a head-to-head comparison of three leading AI models across real legal work, and found that no single model wins everywhere. Different models beat the others in different corners of legal practice, which is a clean empirical argument against picking one model and using it for everything.
The numbers: One model leads in regulated industries and emerging-company work, where lawyers spend most of their time digging through documents and statutes. A second leads in corporate transactions and investment funds work, where the job is more about analyzing and synthesizing information than searching for it. A third leads in privacy, tax, and private-client work, which tend to involve careful reading of dense, rule-heavy material.
Why it matters: Domain-specific measurement of this kind is rare and useful. Most AI comparisons test models on general trivia or coding puzzles. Harvey's data comes from actual legal workflows, so the rankings reflect something closer to real-world usefulness than a typical benchmark would. The result supports what practitioners have suspected: routing different tasks to different models, rather than picking a single winner, is likely to produce better outcomes.
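For readers who want to see what per-task routing could look like in practice, here is a minimal sketch. It assumes a static lookup from practice area to model; the labels `model_a`, `model_b`, and `model_c` are hypothetical placeholders rather than the models in the report, and the grouping simply mirrors the practice areas listed under "The numbers". Nothing here reflects how Harvey actually routes requests.

```python
# Hypothetical routing table: practice area -> model label.
# The labels are placeholders, not the models named in Harvey's report.
ROUTES = {
    "regulated industries": "model_a",
    "emerging companies": "model_a",
    "corporate transactions": "model_b",
    "investment funds": "model_b",
    "privacy": "model_c",
    "tax": "model_c",
    "private client": "model_c",
}

# Assumed fallback for practice areas the table does not cover.
DEFAULT_MODEL = "model_a"


def pick_model(practice_area: str) -> str:
    """Return the model label for a practice area, or the default if unknown."""
    key = practice_area.strip().lower()
    return ROUTES.get(key, DEFAULT_MODEL)


if __name__ == "__main__":
    for area in ["Tax", "Investment Funds", "Litigation"]:
        print(area, "->", pick_model(area))
```

A fixed table like this is the simplest possible router. A real system would likely also weigh cost and latency, which the report is said to cover as well.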
Caveats: The tweet summarizes a longer artifact, and the underlying methodology (how tasks were scored, who scored them, and how much data was involved) is not visible here. Harvey also has commercial relationships with the model providers it is evaluating, which is worth keeping in mind when reading the rankings.