Terminal-Bench 2.0

Takeaways

Gemini 3.1 Pro Preview (02/26) leads the way on performance, achieving 67.4% accuracy—a significant improvement over Gemini 3 Pro (11/25) (55.1%) and demonstrating strong capabilities in agentic coding tasks.
Claude Sonnet 4.6 takes second place with 59.55% accuracy, followed by Claude Opus 4.5 (Nonthinking) (58.43%).
GPT 5.2 and Grok 4 also demonstrate strong performance, as well as open-source offerings like MiniMax-M2.1 and GLM 4.7.

Background

Terminal-Bench 2.0 is an open-source benchmark that is designed to test a model’s ability to navigate and complete tasks in a sandboxed terminal environment. The official version of the benchmark highlights 89 tasks with unique categories ranging from model training to system administration. These tasks scale in difficulty from easy to hard.

We chose to include Terminal-Bench because a) it is increasingly common for it to be reported by model providers, b) it reflects the real-world terminal tasks expected of software engineers, and c) it is quite challenging, with no model scoring above 50% on the hard tasks upon its initial release. Furthermore, agentic systems like Claude Code, Codex, and Cursor now rely heavily on executing terminal commands correctly.

This benchmark was developed by the Terminal Bench community as an open-source effort; we’d like to thank the community for their efforts in building this benchmark and for helping us integrate it into our evaluation suite. If you’re interested in learning more about Terminal Bench or want to contribute to the project, visit tbench.ai

Below is an example task (you can find the full details for this task in the open-source task registry).

Please train a fasttext model on the yelp data in the data/ folder.

The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution.

The model should be saved as /app/model.bin

Results

Terminal-Bench 2.0

Methodology

All models were benchmarked using the Terminus 2 harness. Unless otherwise specified, we use identical configuration and methodology to Terminus 2. All results reported are pass@1.

On submission, we run the model against the provided pytests—a model must pass all pytests to get any credit for a task.

Unlike Terminus 1, Terminus 2 does not use structured outputs to enforce a response schema. Model queries returning an invalid or missing JSON are retried with a warning.

Comparison to Original Terminal-Bench

This benchmark is similar in structure to the original Terminal-Bench - it features 80 terminal-based tasks, many of which (like our example!) also appear in Terminal-Bench 2.0. The most substantial differences are in evaluation methododology:

We ran the original Terminal-Bench on an EC2 instance, using docker containers. Following the Laude implementation, we run Terminal-Bench 2.0 remotely using daytona.
We ran the original Terminal-Bench using a turn limit. Again, following the Laude implementation, we run Terminal-Bench 2.0 using a time limit instead.

Terminal-Bench 2.0

Takeaways

Background

Results

Methodology

Comparison to Original Terminal-Bench

Join our mailing list to receive benchmark updates