Open benchmark compares full AI agent systems, not just models

What's new: IBM Research has launched the Open Agent Leaderboard, an evaluation framework that benchmarks full agent systems rather than isolated models, paired with the Exgentic framework for reproducible runs and an accompanying methodology paper. The central argument is that model choice alone is insufficient to predict agent performance — the surrounding architecture (tool selection, planning strategy, memory, error recovery) produces materially different results even when the underlying model is held constant.

How it works: The leaderboard aggregates six established benchmarks under a unified evaluation protocol that maps each task to a shared triple of task, context, and allowed actions. The benchmarks are SWE-Bench Verified (real repository bug fixes), BrowseComp+ (open-ended web research), AppWorld (multi-app personal assistant tasks), and three tau2-Bench domains: airline and retail customer service, and telecom technical support. Each benchmark retains its original design; the shared protocol standardizes how agents connect to the benchmarks rather than reshaping the tasks themselves. Agents are evaluated as general-purpose systems without benchmark-specific tuning or prompt optimization, which is why scores may diverge from results reported on individual benchmark leaderboards.
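To make the shared triple concrete, here is a minimal Python sketch of how two differently shaped benchmark records could be adapted into one task/context/actions structure. It is an illustration only, not the leaderboard's or Exgentic's actual schema: `UnifiedTask`, both adapter functions, the record field names, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class UnifiedTask:
    """Hypothetical shared representation: task, context, allowed actions."""
    task: str                                      # natural-language instruction
    context: dict[str, Any] = field(default_factory=dict)  # benchmark-provided state
    actions: tuple[str, ...] = ()                  # names of the allowed actions


def from_repo_bug(record: dict[str, Any]) -> UnifiedTask:
    # Adapter for a repository bug-fix style record (field names are assumed).
    return UnifiedTask(
        task=record["issue_text"],
        context={"repo": record["repo"], "commit": record["base_commit"]},
        actions=("read_file", "edit_file", "run_tests"),
    )


def from_service_dialog(record: dict[str, Any]) -> UnifiedTask:
    # Adapter for a customer-service style record (field names are assumed).
    return UnifiedTask(
        task=record["customer_request"],
        context={"policy": record["policy"]},
        actions=tuple(record["tools"]),
    )
```

The point of the sketch is that the adapters change only how an agent connects to a benchmark; the task content passes through untouched.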

The numbers: The top five configurations all use the same underlying model yet differ in both success rate and cost, demonstrating that agent architecture drives variance independent of the model. Failed runs cost 20 to 54% more than successful ones across the tested configurations, a finding with direct implications for production cost modeling. Tool shortlisting — a technique that narrows the set of tools presented to the agent at each step rather than exposing the full catalog — improved performance across every model tested and converted otherwise failing configurations into viable ones.
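Tool shortlisting can be sketched in a few lines of Python. The version below ranks tools by simple word overlap between the current request and each tool's name and description, then exposes only the top `k`. The scoring rule, the function names, and the tool catalog are assumptions made for illustration; the source does not describe how the evaluated agents implement shortlisting.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase alphanumeric tokens; underscores act as separators.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: dict[str, str], k: int = 3) -> list[str]:
    """Return up to k tool names whose name/description overlap the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in tools.items():
        overlap = len(query_tokens & tokenize(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical catalog, for illustration only.
catalog = {
    "cancel_booking": "Cancel an existing flight booking",
    "search_flights": "Search available flights by date",
    "send_email": "Send an email message",
    "get_weather": "Look up the weather forecast",
}
shortlisted = shortlist_tools("cancel my flight booking", catalog, k=2)
```

A production system would more likely use embeddings or a learned retriever for the ranking step, but the contract is the same: the agent sees a small, relevant subset instead of the full catalog.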

Why it matters: The leaderboard surfaces two dimensions that single-model evaluations obscure: the quality-cost tradeoff and the cost of failure modes. An agent that fails expensively is worse than one that fails cheaply, and this distinction is invisible from success-rate tables alone. The finding that general-purpose agents already match or outperform specialized systems on several benchmarks challenges the assumption that benchmark-specific tuning is necessary for competitive performance. One of these benchmarks in particular has historically favored heavily tuned systems, which makes competitive general-agent results notable.
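The effect of failure cost on budgeting can be shown with a short calculation: total spend divided by the number of successes. The sketch below assumes a fixed success rate and a fixed average cost per successful run, and expresses the failure cost as a premium over that. Only the 20 to 54% premium range comes from the reported results; the function name, the 50% success rate, and the unit cost are hypothetical.

```python
def cost_per_success(success_rate: float,
                     success_cost: float,
                     failure_premium: float) -> float:
    """Expected spend per successful task when failed runs cost more.

    failure_premium is the fractional extra cost of a failed run,
    e.g. 0.20 means a failure costs 20% more than a success.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    failure_cost = success_cost * (1.0 + failure_premium)
    expected_cost_per_attempt = (
        success_rate * success_cost + (1.0 - success_rate) * failure_cost
    )
    return expected_cost_per_attempt / success_rate


# Hypothetical agent: 50% success rate, 1.00 cost unit per successful run.
low = cost_per_success(0.5, 1.0, 0.20)   # failure premium at the low end
high = cost_per_success(0.5, 1.0, 0.54)  # failure premium at the high end
```

Under these assumed inputs, the two ends of the reported premium range give 2.20 and 2.54 cost units per success, a difference that a success-rate table alone would not reveal.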

Caveats: The leaderboard currently covers five task domains, which is broader than most prior work but still excludes multimodal tasks, long-horizon planning, and embodied settings. The unified protocol introduces a normalization layer that may inadvertently disadvantage agents whose native tool interfaces diverge significantly from the shared schema. The claim that model choice remains the dominant factor is stated qualitatively without a reported variance decomposition, so the relative magnitude of model vs. architecture contributions remains imprecise. All submissions are currently researcher-run rather than independently verified, and the degree to which Exgentic enforces reproducibility under different compute environments is not yet established.
