ProgramBench

Academic

Updated: 6/2/2026

Can language models rebuild programs from scratch?

Key Takeaways

  • The original metric (Fully Resolved) is intentionally unforgiving: Claude Opus 4.8 leads with only 2 fully resolved tasks, while GPT 5.5, GPT 5.4 (high), and Claude Sonnet 4.6 each solve only one task entirely.
  • Despite the strict headline metric, top models pass more than 70% of hidden behavioral tests on average (Raw Pass Rate).
  • Claude Opus 4.8 also leads by Raw Pass Rate at 71.88%, with 31 Almost Resolved tasks.
  • GPT 5.5 is close behind on Raw Pass Rate at 70.77%, with 19 Almost Resolved tasks and 1 Fully Resolved task.

Background

ProgramBench evaluates whether models can reconstruct command-line programs from an executable binary and a behavioral specification. Each task asks the model to produce an implementation from scratch that compiles and passes the benchmark’s hidden tests.

The benchmark’s headline metric is intentionally strict: a task is counted as Fully Resolved only when the submitted implementation passes all tests. Like the original benchmark, we also report Almost Resolved, the percentage of tasks where at least 95% of hidden tests pass. Finally, Raw Pass Rate reports the average percent of hidden behavioral tests passed per task.
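The three metrics above can be sketched from per-task test counts. This is a minimal illustration, not the benchmark's actual scoring code: the function name, the (passed, total) input shape, and the choice to let fully resolved tasks also count toward Almost Resolved (which follows from the "at least 95%" wording) are assumptions.

```python
from typing import Dict, Sequence, Tuple


def summarize(results: Sequence[Tuple[int, int]], almost_pct: int = 95) -> Dict[str, float]:
    """Summarize one model's run.

    results: one (tests_passed, tests_total) pair per task (hypothetical shape).
    """
    if not results:
        raise ValueError("need at least one task")
    if any(total <= 0 or not 0 <= passed <= total for passed, total in results):
        raise ValueError("each task needs total > 0 and 0 <= passed <= total")

    n = len(results)
    # Fully Resolved: every hidden test passes.
    fully = sum(1 for passed, total in results if passed == total)
    # Almost Resolved: at least 95% of hidden tests pass (integer math avoids float rounding).
    almost = sum(1 for passed, total in results if passed * 100 >= almost_pct * total)
    # Raw Pass Rate: mean of the per-task pass fractions.
    raw = sum(passed / total for passed, total in results) / n

    return {
        "fully_resolved_pct": 100.0 * fully / n,
        "almost_resolved_pct": 100.0 * almost / n,
        "raw_pass_rate_pct": 100.0 * raw,
    }
```

Averaging per task, rather than pooling all tests, keeps a task with many tests from dominating the Raw Pass Rate, which matches the "per task" wording above.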

This benchmark was developed by the ProgramBench team; we’d like to thank them for their efforts in building this benchmark. If you’re interested in learning more about ProgramBench, visit programbench.com.

Results

ProgramBench Results
Raw Pass Rate vs. Cost / Test

The efficient region is narrow. GPT 5.5 and GPT 5.4 (xhigh) sit near the best raw pass rates under $8 per task, while Claude Opus 4.8 reaches the highest raw pass rate at a much higher cost. GPT 5.4 (high) is the strongest low-cost point that also fully resolves a task.
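One way to make "efficient region" concrete is a cost/quality Pareto frontier: a model is efficient if no other model is at least as cheap and at least as accurate. The sketch below is an illustration under that assumption, not the method used to draw the chart, and the function name and input shape are hypothetical.

```python
from typing import List, Sequence, Tuple


def pareto_frontier(points: Sequence[Tuple[str, float, float]]) -> List[str]:
    """Return names of models not dominated on (cost, score).

    points: (name, cost, raw_pass_rate) triples; lower cost and higher score are better.
    """
    # Sort by cost ascending; for equal cost, put the higher score first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_score = float("-inf")
    for name, _cost, score in ordered:
        # Keep a point only if it beats every cheaper (or equally cheap) point.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier
```

Any model left off the returned list is matched or beaten on pass rate by a model that costs no more.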

Task Difficulty

The hard/easy split is steep: across all public models, average Raw Pass Rate rises from 21.4% on the hardest 50 tasks to 66.7% on the easiest 50. Even Claude Opus 4.8 and GPT 5.5 only reach 39.5% and 37.2% on the hardest quartile, despite clearing 91% on the easiest quartile.
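A hard/easy split like this can be sketched as follows. The ranking rule is an assumption: the text does not say how task difficulty is ordered, so this sketch ranks tasks by their mean pass rate across all models. The function name and input shape are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def difficulty_split(scores: Dict[str, Sequence[float]], k: int) -> Dict[str, object]:
    """Split tasks into the k hardest and k easiest by cross-model mean pass rate.

    scores: model name -> per-task pass rates, all in the same task order.
    """
    models = list(scores)
    if not models:
        raise ValueError("need at least one model")
    n_tasks = len(scores[models[0]])
    if any(len(scores[m]) != n_tasks for m in models):
        raise ValueError("all models must cover the same tasks")
    if not 0 < k <= n_tasks:
        raise ValueError("k must be between 1 and the number of tasks")

    # Assumed difficulty proxy: the average pass rate over all models.
    task_mean = [mean(scores[m][t] for m in models) for t in range(n_tasks)]
    order: List[int] = sorted(range(n_tasks), key=lambda t: task_mean[t])
    hardest, easiest = order[:k], order[-k:]

    return {
        "hardest": hardest,
        "easiest": easiest,
        # Per-model average pass rate inside each bucket.
        "per_model": {
            m: {
                "hardest": mean(scores[m][t] for t in hardest),
                "easiest": mean(scores[m][t] for t in easiest),
            }
            for m in models
        },
    }
```

With 200 tasks, k=50 gives the hardest and easiest quartiles discussed above.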

Behavioral Test Pass-Rate Distribution

This graph shows the relative performance of models at various test pass-rate thresholds. For example, a threshold of 80% means that a model must pass at least 80% of a task's hidden behavioral tests for that task to count as resolved. A threshold of 100% is equivalent to the original benchmark metric (the "Fully Resolved" rate).

The leaders stay close through the middle: Claude Opus 4.8 and GPT 5.5 pass at least half the tests on 166 and 161 tasks, respectively. The separation shows up near the finish line: Opus 4.8 reaches the 95% threshold on 31 tasks and fully resolves 2, while GPT 5.5 reaches it on 19 tasks and fully resolves 1.
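The threshold curve described in this section can be sketched as a count of tasks at or above each cutoff. This is a minimal illustration assuming per-task pass rates expressed in percent; the function name is hypothetical.

```python
from typing import Dict, Iterable, Sequence


def tasks_resolved_at(pass_rates_pct: Sequence[float], thresholds_pct: Iterable[float]) -> Dict[float, int]:
    """Count tasks whose pass rate meets each threshold.

    pass_rates_pct: one value per task, in percent (0 to 100).
    thresholds_pct: cutoffs in percent, for example 50, 80, 95, 100.
    """
    return {
        threshold: sum(1 for rate in pass_rates_pct if rate >= threshold)
        for threshold in thresholds_pct
    }
```

Because each count can only shrink as the threshold rises, the curve is non-increasing, and its value at 100% is the Fully Resolved count.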

Score by Model x Task

[Heatmap: per-task score for all models x 200 tasks, on a color scale from 0% to 100%. Models shown, in order: Claude Opus 4.8, GPT 5.5, GPT 5.4 (high), Claude Sonnet 4.6, GPT 5.4 (xhigh), Claude Opus 4.7, Gemini 3.5 Flash, GPT 5.4 Mini, GLM 5.1, Kimi K2.6, DeepSeek V4, Gemini 3.1 Pro Preview (02/26), Gemini 3 Flash (12/25), Qwen 3.6 Plus, Claude Haiku 4.5 (Thinking), Grok 4.3, Gemini 3.1 Flash Lite Preview, MiniMax-M2.7.]

The long tail is task-driven, not just model-driven. Only 43 of the 200 tasks reach 95% pass rate for any model, and only 3 are fully solved by at least one model. At the other end, 8 tasks stay below 25% for every model, which points to shared blind spots across providers.
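These cross-model counts all reduce to the best score any model achieves on each task. The sketch below assumes a model-by-task matrix of pass rates in percent; the function name, key names, and input shape are hypothetical.

```python
from typing import Dict, Sequence


def cross_model_summary(scores: Dict[str, Sequence[float]], high: float = 95.0, low: float = 25.0) -> Dict[str, int]:
    """Count tasks by the best pass rate any model reaches.

    scores: model name -> per-task pass rates in percent, same task order for all models.
    """
    models = list(scores)
    if not models:
        raise ValueError("need at least one model")
    n_tasks = len(scores[models[0]])
    if any(len(scores[m]) != n_tasks for m in models):
        raise ValueError("all models must cover the same tasks")

    # Best result per task across all models.
    best = [max(scores[m][t] for m in models) for t in range(n_tasks)]

    return {
        "reach_high_for_any_model": sum(1 for b in best if b >= high),
        "fully_solved_by_any_model": sum(1 for b in best if b >= 100.0),
        # "Below low for every model" is the same as "best score is below low".
        "below_low_for_every_model": sum(1 for b in best if b < low),
    }
```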


Methodology

We evaluate models on the 200 public ProgramBench tasks. Each task gives the model a compiled command-line program and a behavioral specification, then asks the model to produce a cleanroom source-code implementation that matches the original program’s behavior.

All models use the same mini-SWE-agent harness, orchestrated through Valkyrie with model calls routed through model-library. The agent has a bash tool, an offline sandbox, and the ProgramBench cleanroom prompt. It may inspect the provided files and run the executable, but it may not use the internet, look up source code or package registries, wrap the provided binary, reuse object files, or use decompilers, disassemblers, tracing, or instrumentation. The provided binary is execute-only.

We follow the public ProgramBench evaluation setup. Models receive 1,000 steps, a 6-hour wall-clock limit, a 180-second action timeout, and 10,000-character head/tail tool-output truncation. A submission contains source files and a compile script. We compile the submission once, then evaluate each hidden test branch in a fresh container created from the compiled image. First-pass branch evaluations use 10 xdist workers.
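Head/tail truncation of tool output can be sketched as below. This is an illustration of the general technique, not the harness's actual code: the function name, the marker text, and the reading of the 10,000-character figure as a total budget split evenly between head and tail are all assumptions.

```python
def truncate_head_tail(text: str, limit: int = 10_000) -> str:
    """Keep the start and end of long output, dropping the middle.

    limit is treated as the total number of original characters kept
    (an assumption; the real harness may apply the limit differently).
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    if len(text) <= limit:
        return text

    head_len = limit // 2
    tail_len = limit - head_len
    omitted = len(text) - limit
    # Hypothetical marker so the reader can tell that output was cut.
    marker = f"\n[... {omitted} characters omitted ...]\n"
    return text[:head_len] + marker + text[-tail_len:]
```

Keeping both ends preserves the command's opening lines and its final error or summary lines, which are usually the most informative parts of long build or test output.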