x.com · Dr Milan Milanović · Mon, May 18, 2026, 1:30 AM PDT · score 16.6

ProgramBench: frontier models solve 0% of full-stack software rebuild tasks

Original: AI agents still can't build software from scratch

31 likes · 11 reposts · 5 replies
https://x.com/milan_milanovic/status/2056291198957240391

Deep summary

What's new: A collaboration between Meta, Stanford, and Harvard has released ProgramBench, a benchmark designed to evaluate whether AI agents can reconstruct real software from scratch. The setup gives an agent a compiled binary plus its documentation and asks it to produce a functionally equivalent implementation. Across 9 frontier models, 200 tasks, and 1,800 total runs evaluated against 248,853 behavioral tests, no model solved a single task end-to-end.

How it works: Tasks in ProgramBench span a range of complexity, from small CLI utilities like jq and gron up to FFmpeg, SQLite, and the PHP interpreter. An agent receives the binary and docs, then has unconstrained agentic turns to produce code. That code is evaluated behaviorally: the reconstructed program must pass the same test suite that the original binary passes. This distinguishes ProgramBench from code-completion benchmarks, where models patch existing code rather than build from nothing.
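To make behavioral evaluation concrete, here is a minimal Python sketch. It is not ProgramBench's actual harness: it assumes a test case is just a list of command-line arguments plus stdin bytes, and that "equivalent" means matching exit code and stdout. The names `Observation`, `observe`, and `pass_rate` are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one run (assumed: exit code + stdout)."""
    returncode: int
    stdout: bytes


def observe(cmd, stdin_data=b"", timeout=10.0):
    """Run a command once and record what a black-box test can see."""
    proc = subprocess.run(
        cmd, input=stdin_data, capture_output=True, timeout=timeout
    )
    return Observation(proc.returncode, proc.stdout)


def pass_rate(reference_cmd, candidate_cmd, cases):
    """Fraction of cases where the candidate behaves like the reference.

    `cases` is a list of (extra_args, stdin_bytes) pairs. The reference
    command plays the role of the original binary; the candidate is the
    rebuilt program. No source code is inspected, only behavior.
    """
    if not cases:
        return 0.0
    passed = 0
    for extra_args, stdin_data in cases:
        expected = observe(reference_cmd + extra_args, stdin_data)
        actual = observe(candidate_cmd + extra_args, stdin_data)
        passed += expected == actual
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-ins for a real binary and its rebuild: two tiny Python one-liners.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read())"]
    print(pass_rate(reference, candidate, [([], b"abc"), ([], b"ABC")]))
```

In this toy setup the candidate only matches the reference on input that is already uppercase, so a partial pass rate does not imply a working rebuild. A task counts as solved only when every case matches.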

The numbers: Claude Opus 4.7 led all models by passing 95% of tests on 3% of tasks, meaning it came close to a correct reconstruction on only 6 out of 200 programs and never fully solved any. Opus 4.6 reached 2.5% and Sonnet 4.6 reached 1.6% on that same metric. GPT-5.4 and Gemini 3.1 Pro scored zero. Structural analysis of model outputs reveals a consistent failure pattern: 60% of model solutions use only 1 to 3 files, median directory depth is 1 versus 2 for human-written originals, and models retain only 10 to 29% of the original function count while making each function 1.08x to 1.62x longer. GPT-5.4 writes 96% of its final code in a single turn; Sonnet 4.6 issues an average of 868 commands and 18.3 file edits per task. Neither strategy produces a working program. Models also abandon the reference language roughly half the time, with GPT-5.4 choosing Python 79% of the time even when the original is Rust or C. Pass rates by language sit at 27.7% for C/C++ tasks, 38.5% for Rust, and 38.4% for Go, but these are per-test partial scores, not full-task completions.
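The gap between per-test scores and solved tasks can be illustrated with a short Python sketch. The function name `summarize`, the reading of "95% of tests" as "at least 95%", and the sample pass fractions in the usage note are assumptions for illustration, not data from the benchmark.

```python
from statistics import mean


def summarize(pass_fractions, near_threshold=0.95):
    """Aggregate per-task test pass fractions into three headline rates.

    pass_fractions: one value in [0, 1] per task (share of tests passed).
    near_threshold: assumed cutoff for "came close" (0.95 here).
    """
    if not pass_fractions:
        raise ValueError("need at least one task")
    n = len(pass_fractions)
    return {
        # A task is solved only if every behavioral test passes.
        "solved_rate": sum(f == 1.0 for f in pass_fractions) / n,
        # "Near-solved" counts tasks at or above the threshold.
        "near_solved_rate": sum(f >= near_threshold for f in pass_fractions) / n,
        # The per-test partial score averages over all tasks.
        "mean_pass_fraction": mean(pass_fractions),
    }
```

For 200 hypothetical tasks where 6 pass 96% of their tests and 194 pass 30%, `summarize([0.96] * 6 + [0.30] * 194)` gives a solved rate of 0.0, a near-solved rate of 0.03, and a mean pass fraction of about 0.32. A respectable partial score can therefore coexist with zero completed tasks.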

Why it matters: The benchmark makes explicit a gap that production experience already suggests: agents are capable assistants for localized edits within existing codebases, but the cognitive task of decomposing a system into modular, language-appropriate, build-ready source files remains well outside current capability. The output pattern (long functions, shallow directory trees, Python defaults regardless of original language) points to models optimizing for token-level plausibility rather than software architecture. Closing this gap likely requires improvements in long-horizon planning, build-system awareness, and structured reasoning about system decomposition that current models do not exhibit.
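The shallow-tree pattern is easy to measure on any source tree. The Python sketch below works under stated assumptions: the summary does not say how the benchmark authors define directory depth, so here a file's depth is the number of path components below the project root (a file directly in the root has depth 1), and `tree_shape` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def tree_shape(root):
    """Return (file_count, max_depth) for the tree under `root`.

    Depth is an assumed definition: the number of path components of a
    file relative to `root`, so `main.py` is 1 and `src/core/util.py` is 3.
    An empty tree yields (0, 0).
    """
    root = Path(root)
    depths = [
        len(path.relative_to(root).parts)
        for path in root.rglob("*")
        if path.is_file()
    ]
    return len(depths), max(depths, default=0)


if __name__ == "__main__":
    # Build a throwaway two-file project and measure it.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "main.py").write_text("print('hi')\n")
        (project / "src" / "core").mkdir(parents=True)
        (project / "src" / "core" / "util.py").write_text("x = 1\n")
        print(tree_shape(project))
```

Comparing this pair of numbers for a model's output and for the original repository gives a rough, language-agnostic signal of how much modular structure was lost.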

Caveats: The benchmark's pass/fail criterion is behavioral equivalence to the original binary, which is a strict and arguably unusual definition of "building from scratch": a functionally equivalent but structurally different program still fails if it misses edge-case behaviors baked into the original. The task framing also excludes interactive development workflows where a human collaborates with the agent during construction, which is closer to real practice. Still, the complete absence of end-to-end successes across all 1,800 runs, including models with very large context windows and strong tool-use capabilities, is a clear signal rather than a measurement artifact.
