Key Takeaways
- Claude Fable 5 leads Overall at 55.1%, ahead of Claude Opus 4.8 (47.2%), GPT 5.5 (45.2%), and Claude Opus 4.7 (43.9%).
- Code Migration grades behavior, not code similarity: a migration is scored by the share of hidden behavior tests the rebuilt program passes. A separate anti-cheat check zeroes wrappers, copied reference artifacts, and wrong-language submissions.
- The two task types reward different skills. Claude Fable 5 dominates CLI migration at 60.1%, but the COBOL-to-Java split is much tighter, with GPT 5.5 and Claude Opus 4.7 sharing the top spot at 70.0%.
- Correctness and code quality diverge: Claude Opus 4.8 writes the highest-rated code (averaging 4.6/5) while Claude Fable 5 passes the most CLI tests.
Background
Translating a working program from one language to another is a costly, high-stakes engineering task, whether the goal is porting a tool to a new ecosystem or modernizing legacy systems written in languages that are increasingly hard to staff.
COBOL-to-Java migration carries real economic weight: decades-old COBOL still runs core transaction systems at many of the world’s largest banks, insurers, and government agencies. These institutions are spending heavily on multi-year programs to modernize that code onto maintainable languages like Java. The work is slow, risky, and exactly the kind of task reliable automation could transform.
Migration between modern languages matters too, driven by the pursuit of better runtime performance, richer tooling and ecosystems, or the memory-safety guarantees of a language like Rust.
Code Migration measures how well today’s frontier models can automate both kinds of work (modernizing legacy COBOL to Java, and translating between modern languages) by taking an existing program and faithfully reproducing its behavior in the target language.
Results
Accuracy vs. Cost
Cost and accuracy trade off sharply across the leaderboard. Claude Fable 5 leads at 55% but is by far the most expensive at about $115 per test, roughly 3× the next-priciest run. GPT 5.5 is the clearest efficiency winner: 45% at about $6 per test, nearly matching Claude Opus 4.8 (47%) for roughly a fifth of the cost, and ahead of the pricier Claude Opus 4.7 and Claude Sonnet 4.6. Further down, DeepSeek V4 reaches about 26% at roughly $2 per test, and MiMo V2.5 Pro about 22% for under $0.25.
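One way to read this trade-off is as a Pareto frontier: a run is worth considering if no other run is both cheaper and more accurate. The sketch below is illustrative only. It uses the rounded figures quoted above, treats MiMo V2.5 Pro's "under $0.25" as an upper bound of $0.25, and adds one made-up run to show what a dominated point looks like; the function names are hypothetical.

```python
def dominates(a, b):
    """Return True if run `a` is at least as cheap and as accurate as `b`,
    and strictly better on at least one of the two. Runs are (cost, accuracy)."""
    cost_a, acc_a = a
    cost_b, acc_b = b
    return cost_a <= cost_b and acc_a >= acc_b and (cost_a < cost_b or acc_a > acc_b)


def pareto_frontier(runs):
    """Return the names of runs that no other run dominates."""
    return {
        name
        for name, point in runs.items()
        if not any(
            dominates(other_point, point)
            for other_name, other_point in runs.items()
            if other_name != name
        )
    }


# (cost per test in USD, accuracy in %), rounded figures from the text above.
runs = {
    "Claude Fable 5": (115.0, 55.0),
    "GPT 5.5": (6.0, 45.0),
    "DeepSeek V4": (2.0, 26.0),
    "MiMo V2.5 Pro": (0.25, 22.0),  # "under $0.25", used here as an upper bound
    "made-up pricier run": (20.0, 40.0),  # hypothetical, for illustration only
}

frontier = pareto_frontier(runs)
print(sorted(frontier))
```

With these inputs the four quoted runs all sit on the frontier, while the made-up run is dominated by GPT 5.5, which is cheaper and more accurate.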
Correctness vs. Code Quality
Passing hidden tests and writing clean code are related but distinct: the strongest test-passers are not always the highest-rated implementations.
Difficulty by Target Language
The chart groups hidden-test pass rate by target language, with one bar per model in each group. Overall difficulty is fairly even across the five languages: model averages cluster in the high teens to low 20s, with Python marginally the easiest (about 21%) and Rust the hardest (about 19%). Among the top cluster of models, Python is consistently the strongest target, though each model's spread across languages is modest (Fable 5, for instance, runs from about 56% on Java to about 65% on Python); the weakest target varies from model to model.
Behavioral Test Pass-Rate Distribution
This graph shows how many tasks each model resolves at different hidden-test pass-rate thresholds. A threshold of 80%, for example, counts a task as resolved only if the rebuilt program passes at least 80% of its hidden behavior tests; a threshold of 100% requires reproducing the tested behavior exactly. The steeper a model’s drop-off near the high thresholds, the more all-or-nothing its migrations.
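The threshold rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the function names and the per-task pass rates are made up.

```python
def resolved_at(pass_rates, threshold):
    """Count tasks whose hidden-test pass rate is at least `threshold`."""
    return sum(1 for rate in pass_rates if rate >= threshold)


def threshold_curve(pass_rates, thresholds):
    """Map each threshold to the number of tasks resolved at that threshold."""
    return {t: resolved_at(pass_rates, t) for t in thresholds}


# Hypothetical per-task pass rates for one model (fractions of hidden tests passed).
rates = [1.0, 0.85, 0.80, 0.55, 0.10]

curve = threshold_curve(rates, [0.5, 0.8, 1.0])
print(curve)  # {0.5: 4, 0.8: 3, 1.0: 1}
```

Because every task resolved at a higher threshold is also resolved at every lower one, the curve can only fall or stay flat as the threshold rises.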
On the CLI split, Claude Fable 5 separates from the field: it clears at least half of a task’s hidden tests on 73 of the 120 migrations, while the chasing pack is tightly bunched but far behind, with Claude Opus 4.8 passing the 50% mark on 33 tasks and GPT 5.5 on 31. The gap is widest through the middle of the curve, around the 55% to 65% thresholds, where Fable 5 still clears roughly 65 to 70 tasks against only about 20 to 25 for the next-best model. Fable 5 is also the only model to fully reproduce any CLI task, resolving 6 of 120.
Methodology
Every task runs the same way. The model works in an offline sandbox with the source program and task instructions, and submits a target-language implementation plus a Dockerfile. The grader builds that image and runs a hidden behavior-test suite the model never sees. Before scoring, a separate anti-cheat checker inspects each submission and zeroes any that wrap the original executable, copy reference artifacts, or hardcode expected outputs. It also verifies that the submission is actually written in the requested target language, so a model cannot pass by leaving the program in its source language.
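The order of operations described above (anti-cheat first, pass rate second) can be sketched as follows. This is an assumption-laden illustration: the `Submission` fields and the `score` function are hypothetical, and the sketch takes the checker's verdicts as given rather than showing how the real checker detects each violation, which the text does not describe.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical verdicts from the anti-cheat checker.
    wraps_original: bool
    copies_reference: bool
    hardcodes_outputs: bool
    language: str  # language the submission is actually written in
    # Hypothetical results from running the hidden behavior tests.
    tests_passed: int
    tests_total: int


def score(sub: Submission, target_language: str) -> float:
    """Return 0.0 for any anti-cheat violation, else the hidden-test pass rate."""
    violation = (
        sub.wraps_original
        or sub.copies_reference
        or sub.hardcodes_outputs
        or sub.language != target_language
    )
    if violation:
        return 0.0
    return sub.tests_passed / sub.tests_total


honest = Submission(False, False, False, "Java", 150, 200)
wrapper = Submission(True, False, False, "Java", 200, 200)
wrong_language = Submission(False, False, False, "COBOL", 200, 200)

print(score(honest, "Java"))          # 0.75
print(score(wrapper, "Java"))         # 0.0
print(score(wrong_language, "Java"))  # 0.0
```

Note that the wrapper and the wrong-language submission score zero even though they would pass every hidden test, which is the point of checking before scoring.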
The CLI split is 30 open-source repositories spanning Python, Java, Kotlin, Rust, and C++. Each is migrated into its four non-source target languages (120 scored repository-language migrations in all), and every migration is graded against a hidden suite of roughly 100 to 400 behavior tests, exercised through the program’s command-line interface. The COBOL split is a 10-repository COBOL-to-Java test set (a mix of open-source COBOL programs and synthetic operational systems built for the benchmark) that stresses file formats, batch workflows, terminal interfaces, and persistent state.
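The CLI fan-out is simple to enumerate: each repository is paired with every language in the set except its own. In the sketch below the repository names and the even spread of source languages are invented for illustration; the text does not say how the 30 repositories are distributed across languages.

```python
LANGUAGES = ["Python", "Java", "Kotlin", "Rust", "C++"]


def migrations(repos):
    """Expand (name, source_language) pairs into (name, source, target) migrations,
    one per target language other than the source."""
    return [
        (name, source, target)
        for name, source in repos
        for target in LANGUAGES
        if target != source
    ]


# Hypothetical repositories; the even language split is an assumption.
repos = [(f"repo-{i}", LANGUAGES[i % len(LANGUAGES)]) for i in range(30)]

pairs = migrations(repos)
print(len(pairs))  # 120
```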
The headline score is hidden-test pass rate. A CLI repository’s score is the average across its four target languages, so all 30 repositories count equally despite the language fan-out. The Overall score keeps every source repository equally weighted, which gives the CLI split (30 repositories) three times the weight of COBOL (10). Code quality is reported as a separate diagnostic (a reviewer rating of readability, documentation, structure, and idiomatic style) and never affects the pass-rate score.
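The aggregation above can be sketched as a two-level average: first across a CLI repository's target languages, then across all source repositories. The function names and the pass rates here are made up for illustration.

```python
from statistics import mean


def repo_score(pass_rates_by_target):
    """Average a CLI repository's pass rates across its target languages."""
    return mean(pass_rates_by_target.values())


def overall_score(cli_repo_scores, cobol_repo_scores):
    """Weight every source repository equally, regardless of split."""
    return mean(list(cli_repo_scores) + list(cobol_repo_scores))


# Hypothetical data: 30 identical CLI repositories and 10 identical COBOL repositories.
one_cli_repo = repo_score({"Java": 0.50, "Kotlin": 0.60, "Rust": 0.40, "C++": 0.50})
cli_scores = [one_cli_repo] * 30
cobol_scores = [0.70] * 10

overall = overall_score(cli_scores, cobol_scores)
print(round(overall, 3))  # 0.55
```

With 30 CLI repositories and 10 COBOL repositories, the equal-weight mean is the same as 0.75 × the CLI average plus 0.25 × the COBOL average, which is where the three-to-one weighting comes from.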