Code Migration

Key Takeaways

Claude Opus 5 leads Overall at 57.5%, ahead of Claude Fable 5 (55.1%), GPT-5.6 Sol (52.9%), and Claude Opus 4.8 (47.2%).
Code Migration grades behavior, not code similarity: a migration’s score is the share of hidden behavior tests the rebuilt program passes. A separate anti-cheat check zeroes wrappers, copied reference artifacts, and wrong-language submissions.
The two task types reward different skills. Claude Fable 5 leads CLI migration at 60.1%, followed by Claude Opus 5 at 53.3%, but the COBOL-to-Java split is much tighter, with several models (including Opus 5, GPT-5.6 Sol, GPT 5.5, and GLM 5.2) tied at the top at 70.0%.
Correctness and code quality diverge: Claude Sonnet 5 writes the highest-rated code (averaging 4.9/5) while Claude Fable 5 passes the most CLI tests.

Background

Translating a working program from one language to another is a costly, high-stakes engineering task, whether the goal is porting a tool to a new ecosystem or modernizing legacy systems written in languages that are increasingly hard to staff.

COBOL-to-Java migration carries real economic weight: decades-old COBOL still runs core transaction systems at many of the world’s largest banks, insurers, and government agencies. These institutions are spending heavily on multi-year programs to modernize that code onto maintainable languages like Java. The work is slow, risky, and exactly the kind of task reliable automation could transform.

Migration between modern languages matters too, driven by the pursuit of better runtime performance, richer tooling and ecosystems, or the memory-safety guarantees of a language like Rust.

Code Migration measures how well today’s frontier models can automate both kinds of work (modernizing legacy COBOL to Java, and translating between modern languages) by taking an existing program and faithfully reproducing its behavior in the target language.

Results

Accuracy vs. Cost

Code Migration

Correctness vs. Code Quality

Code Quality vs. Accuracy

Code quality is the mean of four reviewer dimensions (readability, documentation, file structure, language idioms) over CLI tasks, scored independently of hidden-test pass rate.

Passing hidden tests and writing clean code are related but distinct: the strongest test-passers are not always the highest-rated implementations.

Difficulty by Target Language

CLI Accuracy by Target Language

Hidden-test pass rate per target language. Each of the 30 source repositories is migrated into the four non-source target languages.

The chart groups hidden-test pass rate by target language, with one bar per model in each group. Overall difficulty is tightly clustered: mean pass rates range from 21.90% on Rust to 23.13% on Java, with Python (22.94%), Kotlin (22.56%), and C++ (22.15%) between them. The strongest target varies by model; Opus 5, for example, performs best on Java at 55.58%.

Behavioral Test Pass-Rate Distribution

Each line is one model: the share of its tasks that pass at least the given fraction of hidden tests.

This graph shows how models perform at different hidden-test pass-rate thresholds. A threshold of 80%, for example, counts a task as resolved only if the rebuilt program passes at least 80% of its hidden behavior tests; a threshold of 100% requires reproducing the tested behavior exactly. A steep drop-off near the high thresholds means many of a model’s migrations get most of the tested behavior right but miss a few cases; a flat curve means its migrations are closer to all-or-nothing.

On the CLI split, Claude Fable 5 clears at least half of a task’s hidden tests on 73 of the 120 migrations, followed by Claude Opus 5 on 64 and GPT-5.6 Sol on 62. Fable 5’s lead widens at higher thresholds: 64 of its migrations pass at least 65% of their hidden tests, compared with 48 for Opus 5 and 44 for GPT-5.6 Sol. Fable 5 is also the only model to fully reproduce any CLI task, passing every hidden test on 6 of the 120 migrations.
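The threshold curve described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the example pass rates are hypothetical, and it assumes each task's result has already been reduced to a single pass rate between 0 and 1.

```python
from typing import Sequence


def share_resolved(pass_rates: Sequence[float], threshold: float) -> float:
    """Fraction of tasks whose hidden-test pass rate is at least `threshold`."""
    if not pass_rates:
        raise ValueError("pass_rates must not be empty")
    resolved = sum(1 for rate in pass_rates if rate >= threshold)
    return resolved / len(pass_rates)


def threshold_curve(
    pass_rates: Sequence[float], thresholds: Sequence[float]
) -> list[tuple[float, float]]:
    """One (threshold, share of tasks resolved) point per threshold."""
    return [(t, share_resolved(pass_rates, t)) for t in thresholds]


# Hypothetical per-task pass rates for one model (not benchmark data).
example_rates = [1.0, 0.9, 0.65, 0.5, 0.2]
curve = threshold_curve(example_rates, [0.5, 0.8, 1.0])
print(curve)  # [(0.5, 0.8), (0.8, 0.4), (1.0, 0.2)]
```

In this made-up example, four of five tasks clear the 50% threshold but only one clears 100%, which is the kind of drop-off the chart makes visible.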

Methodology

Every task runs the same way. The model works in an offline sandbox with the source program and task instructions, and submits a target-language implementation plus a Dockerfile. The grader builds that image and runs a hidden behavior-test suite the model never sees. Before scoring, a separate anti-cheat checker inspects each submission and zeroes any that wrap the original executable, copy reference artifacts, or hardcode expected outputs. It also verifies that the submission is actually written in the requested target language, so a model cannot pass by leaving the program in its source language.
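As a rough illustration of how the anti-cheat check interacts with the pass rate, the per-migration score might be modeled as below. This is a minimal sketch, not the benchmark's grader: the `Submission` type, its field names, and `migration_score` are hypothetical, and the handling of failed image builds is not modeled because it is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical record of one graded migration (field names are assumptions)."""

    tests_passed: int  # hidden behavior tests the rebuilt program passed
    tests_total: int  # size of the hidden suite for this migration
    flagged_by_anti_cheat: bool  # wrapper, copied artifact, hardcoded output, or wrong language


def migration_score(sub: Submission) -> float:
    """Hidden-test pass rate, or 0.0 when the anti-cheat check flagged the submission."""
    if sub.tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= sub.tests_passed <= sub.tests_total:
        raise ValueError("tests_passed must be between 0 and tests_total")
    if sub.flagged_by_anti_cheat:
        return 0.0
    return sub.tests_passed / sub.tests_total


honest = Submission(tests_passed=150, tests_total=200, flagged_by_anti_cheat=False)
wrapper = Submission(tests_passed=200, tests_total=200, flagged_by_anti_cheat=True)
print(migration_score(honest))   # 0.75
print(migration_score(wrapper))  # 0.0
```

The second example shows why the check runs before scoring: a submission that wraps the original executable could pass every hidden test and would still score zero.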

The CLI split consists of 30 open-source repositories spanning Python, Java, Kotlin, Rust, and C++. Each is migrated into its four non-source target languages (120 scored repository-language migrations in all), and every migration is graded against a hidden suite of roughly 100 to 400 behavior tests, exercised through the program’s command-line interface. The COBOL split is a 10-repository COBOL-to-Java test set (a mix of open-source COBOL programs and synthetic operational systems built for the benchmark) that stresses file formats, batch workflows, terminal interfaces, and persistent state.
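The CLI fan-out described above (30 repositories, four targets each, 120 migrations) can be enumerated as follows. The repository names and the even six-per-language split are placeholders chosen for illustration; the actual distribution of source languages is not stated here.

```python
LANGUAGES = ("Python", "Java", "Kotlin", "Rust", "C++")


def target_languages(source: str) -> list[str]:
    """The four languages a repository is migrated into: every language except its source."""
    if source not in LANGUAGES:
        raise ValueError(f"unknown source language: {source}")
    return [lang for lang in LANGUAGES if lang != source]


def enumerate_migrations(repo_sources: dict[str, str]) -> list[tuple[str, str, str]]:
    """One (repository, source language, target language) triple per scored migration."""
    return [
        (repo, source, target)
        for repo, source in repo_sources.items()
        for target in target_languages(source)
    ]


# Hypothetical stand-ins for the 30 repositories, cycling through source languages.
repo_sources = {f"repo-{i:02d}": LANGUAGES[i % len(LANGUAGES)] for i in range(30)}
migrations = enumerate_migrations(repo_sources)
print(len(migrations))  # 120
```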

The headline score is hidden-test pass rate. A CLI repository’s score is the average across its four target languages, so all 30 repositories count equally despite the language fan-out. The Overall score keeps every source repository equally weighted, which gives the CLI split (30 repositories) three times the weight of COBOL (10). Code quality is reported as a separate diagnostic (a reviewer rating of readability, documentation, structure, and idiomatic style) and never affects the pass-rate score.
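The aggregation can be sketched as below, together with a consistency check against the figures reported for Claude Opus 5 (53.3% on CLI, 70.0% on COBOL). The function names are hypothetical. Because per-repository scores are not listed here, the check substitutes each split's reported mean for every repository in that split, which leaves the overall mean unchanged; the inputs are rounded, so the result is approximate.

```python
from statistics import mean


def cli_repo_score(pass_rate_by_target: dict[str, float]) -> float:
    """A CLI repository's score: the average pass rate across its four target languages."""
    if len(pass_rate_by_target) != 4:
        raise ValueError("expected pass rates for exactly four target languages")
    return mean(list(pass_rate_by_target.values()))


def overall_score(cli_repo_scores: list[float], cobol_repo_scores: list[float]) -> float:
    """Overall score: every source repository weighted equally, regardless of split."""
    return mean([*cli_repo_scores, *cobol_repo_scores])


# Hypothetical pass rates (in percent) for one Python-source repository.
example_repo = cli_repo_score({"Java": 60.0, "Kotlin": 50.0, "Rust": 40.0, "C++": 50.0})
print(example_repo)  # 50.0

# Consistency check with the reported split-level scores for Claude Opus 5:
# (30 * 53.3 + 10 * 70.0) / 40 = 57.475, which rounds to 57.5.
opus5_overall = overall_score([53.3] * 30, [70.0] * 10)
print(round(opus5_overall, 1))  # 57.5
```

The result agrees with the 57.5% Overall figure and shows the 3:1 weighting in practice: a 16.7-point COBOL advantage moves the Overall score only about 4.2 points above the CLI score.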