ProgramBench

Key Takeaways

The original metric (Fully Resolved) is intentionally unforgiving: Claude Fable 5 leads with only 4 fully resolved tasks, while GPT-5.6 Sol fully resolves 3 and Claude Opus 4.8 solves 2 tasks entirely.
Despite the strict headline metric, top models pass more than 70% of hidden behavioral tests on average (Raw Pass Rate).
GPT-5.6 Sol leads by Raw Pass Rate at 77.64%. Claude Fable 5 is close behind at 76.80% and has the most Almost Resolved tasks (66).
Claude Sonnet 5 is next on Raw Pass Rate at 72.07%, just ahead of Claude Opus 4.8 at 71.88%.

Background

ProgramBench evaluates whether models can reconstruct command-line programs from an executable binary and a behavioral specification. Each task asks the model to produce an implementation from scratch that compiles and passes the benchmark’s hidden tests.

The benchmark’s headline metric is intentionally strict: a task is counted as Fully Resolved only when the submitted implementation passes all tests. Like the original benchmark, we also report Almost Resolved, the percentage of tasks where at least 95% of hidden tests pass. Finally, Raw Pass Rate reports the average percent of hidden behavioral tests passed per task.
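The three metrics defined above can be sketched from per-task test counts. This is a minimal illustration, not the benchmark's own scoring code: `TaskResult`, `summarize`, and the dictionary keys are hypothetical names, and the sketch assumes Almost Resolved includes tasks that are also Fully Resolved.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hidden-test outcome for one task (hypothetical container)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Treat a task with no tests as 0% rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def summarize(results, almost_threshold=0.95):
    """Compute the three reported metrics for one model."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    # Fully Resolved: every hidden test passes.
    fully = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    # Almost Resolved: at least 95% of hidden tests pass.
    almost = sum(1 for r in results if r.pass_rate >= almost_threshold)
    # Raw Pass Rate: mean of per-task pass rates, so each task counts
    # equally no matter how many tests it has.
    raw = sum(r.pass_rate for r in results) / n
    return {
        "fully_resolved": fully,
        "almost_resolved_pct": 100.0 * almost / n,
        "raw_pass_rate_pct": 100.0 * raw,
    }
```

Averaging per-task rates, rather than pooling all tests, follows the "per task" wording in the definition; pooling would weight tasks with larger test suites more heavily.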

This benchmark was developed by the ProgramBench team; we’d like to thank them for their efforts in building this benchmark. If you’re interested in learning more about ProgramBench, visit programbench.com.

Results

ProgramBench Results

Raw Pass Rate vs. Cost / Test

Among models with reported cost data, the efficient region is narrow: GPT 5.4 reaches a 68.6% Raw Pass Rate for under $8 per task, while Claude Opus 4.8 reaches the highest Raw Pass Rate of any model with reported cost, at a much higher cost. Models without cost data are omitted from this chart, since cost is its x-axis.
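One way to make "efficient region" concrete is a Pareto frontier over cost and Raw Pass Rate: a model is efficient if no other model is both cheaper (or equal in cost) and at least as accurate. The sketch below is an assumption about how such a region could be computed, not the method behind the chart; `pareto_frontier` and the tuple layout are hypothetical.

```python
def pareto_frontier(points):
    """Return names of models not dominated on (cost, pass rate).

    points: iterable of (name, cost, pass_rate) tuples.
    Lower cost is better, higher pass rate is better.
    """
    frontier = []
    best_rate = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, visit the
    # higher pass rate first so the weaker tied model is dropped.
    for name, cost, rate in sorted(points, key=lambda p: (p[1], -p[2])):
        if rate > best_rate:
            frontier.append(name)
            best_rate = rate
    return frontier
```

A model enters the frontier only if it beats the pass rate of everything at or below its cost, which is why the frontier tends to contain few models.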

Task Difficulty

The hard/easy split is steep: across all public models, average Raw Pass Rate rises from 21.9% on the hardest 50 tasks to 65.5% on the easiest 50. Claude Fable 5 reaches 45.0% on the hardest quartile and 95.5% on the easiest quartile, ahead of Claude Opus 4.8 and GPT 5.5.
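A hard/easy split like the one above can be sketched by ranking tasks on their average pass rate across models and comparing the two ends. The page does not say how difficulty is ranked, so ranking by cross-model mean is an assumption here, and `difficulty_split` and its input layout are hypothetical.

```python
from statistics import mean


def difficulty_split(scores, k):
    """Compare the k hardest and k easiest tasks.

    scores: dict mapping model name -> list of per-task pass rates,
            with every list in the same task order.
    """
    n_tasks = len(next(iter(scores.values())))
    # Difficulty proxy (assumption): mean pass rate across all models.
    task_means = [mean(s[t] for s in scores.values()) for t in range(n_tasks)]
    order = sorted(range(n_tasks), key=lambda t: task_means[t])
    hardest, easiest = order[:k], order[-k:]
    return {
        "hardest_avg": mean(task_means[t] for t in hardest),
        "easiest_avg": mean(task_means[t] for t in easiest),
        # Per-model (hardest, easiest) averages on the same task subsets.
        "per_model": {
            m: (mean(s[t] for t in hardest), mean(s[t] for t in easiest))
            for m, s in scores.items()
        },
    }
```

With 200 tasks, `k=50` corresponds to the hardest and easiest quartiles discussed above.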

Behavioral Test Pass-Rate Distribution

This graph shows the relative performance of models at various behavioral-test pass-rate thresholds. For example, a threshold of 80% means that a model must pass at least 80% of a task's hidden tests for that task to count as resolved. A threshold of 100% is equivalent to the original benchmark metric (the "Fully Resolved" rate).

The leaders stay close through the middle: Claude Fable 5 and Claude Opus 4.8 each pass at least half the tests on 166 tasks. The separation shows up near the finish line: Fable 5 reaches the 95% threshold on 66 tasks and fully resolves 4, while Opus 4.8 reaches it on 31 tasks and fully resolves 2.
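The threshold view can be sketched as a count of tasks whose pass rate meets each cutoff. This is an illustrative sketch under the definitions above; `tasks_at_threshold` and `threshold_curve` are hypothetical names, and the default cutoffs are only examples.

```python
def tasks_at_threshold(pass_rates, threshold):
    """Count tasks whose pass rate is at least `threshold` (0.0 to 1.0)."""
    return sum(1 for r in pass_rates if r >= threshold)


def threshold_curve(pass_rates, thresholds=(0.5, 0.8, 0.95, 1.0)):
    """Map each threshold to the number of tasks that meet it."""
    return {t: tasks_at_threshold(pass_rates, t) for t in thresholds}
```

Because each higher threshold is a stricter filter on the same list, the counts can only stay flat or fall as the threshold rises, which is why models that look similar at 50% can separate sharply at 95% and 100%.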

Score by Model x Task

Figure: score for every model on each of the 200 tasks. Models shown: Claude Fable 5, GPT-5.6 Sol, Claude Opus 4.8, GPT 5.5, GLM 5.2, GPT-5.6 Terra, GPT 5.4 (high), Claude Sonnet 4.6, Claude Sonnet 5, Claude Opus 4.7, GPT-5.6 Luna, GPT 5.4 (xhigh), Grok 4.5, GPT 5.4 Mini, Gemini 3.5 Flash, GLM 5.1, Gemini 3.1 Pro Preview (02/26), Qwen 3.6 Plus, Muse Spark 1.1, Kimi K2.7 Code, Kimi K2.6, Nemotron 3 Ultra, DeepSeek V4, Gemini 3 Flash (12/25), Claude Haiku 4.5 (Thinking), Inkling, Grok 4.3, Laguna M.1, Laguna XS.2, Gemini 3.1 Flash Lite Preview, MiniMax-M2.7.

The long tail is task-driven, not just model-driven. Only 77 of the 200 tasks reach a 95% pass rate for any model, and only 6 are fully resolved by at least one model. At the other end, 8 tasks stay below 25% for every model, which points to shared blind spots across providers.
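Task-level statistics of this kind come from taking, for each task, the best pass rate achieved by any model. The sketch below shows one way to compute them from a model-by-task score table; `task_ceiling_summary` and its input layout are hypothetical, and the cutoffs mirror the 95% and 25% figures quoted above.

```python
def task_ceiling_summary(scores, high=0.95, low=0.25):
    """Summarize tasks by the best score any model achieves on them.

    scores: dict mapping model name -> list of per-task pass rates,
            with every list in the same task order.
    """
    n_tasks = len(next(iter(scores.values())))
    # Best pass rate per task across all models.
    best = [max(s[t] for s in scores.values()) for t in range(n_tasks)]
    return {
        "reach_high": sum(1 for b in best if b >= high),
        "fully_solved": sum(1 for b in best if b >= 1.0),
        # If even the best model is below `low`, every model is.
        "below_low_for_all": sum(1 for b in best if b < low),
    }
```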

Methodology

We evaluate models on the 200 public ProgramBench tasks. Each task gives the model a compiled command-line program and a behavioral specification, then asks the model to produce a cleanroom source-code implementation that matches the original program’s behavior.

All models use the same mini-SWE-agent harness, orchestrated through Valkyrie with model calls routed through model-library. The agent has a bash tool, an offline sandbox, and the ProgramBench cleanroom prompt. It may inspect the provided files and run the executable, but it may not use the internet, look up source code or package registries, wrap the provided binary, reuse object files, or use decompilers, disassemblers, tracing, or instrumentation. The provided binary is execute-only.

We follow the public ProgramBench evaluation setup. Models receive 1,000 steps, a 6-hour wall-clock limit, a 180-second action timeout, and 10,000-character head/tail truncation of tool output. A submission contains source files and a compile script. We compile the submission once, then evaluate each hidden test branch in a fresh container created from the compiled image. First-pass branch evaluations use 10 xdist workers.
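Head/tail truncation keeps the start and end of long tool output and drops the middle. The sketch below illustrates the idea only; it is not the harness's code. It assumes the 10,000-character limit is a total budget split evenly between head and tail, which the setup description does not specify, and `truncate_head_tail` and the marker text are hypothetical.

```python
def truncate_head_tail(text, limit=10_000):
    """Keep the first and last parts of `text` within `limit` characters.

    Assumption: `limit` is the total number of original characters kept,
    split evenly between head and tail. The omission marker is extra.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    if len(text) <= limit:
        return text
    head = limit // 2
    tail = limit - head
    omitted = len(text) - limit
    return f"{text[:head]}\n... [{omitted} characters omitted] ...\n{text[-tail:]}"
```

Keeping both ends preserves the command's opening output and its final lines, which is where error messages and exit summaries usually appear.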