
Terminal-Bench 2.0


Updated: 1/23/2026

State-of-the-art set of difficult terminal-based tasks

Takeaways

Background

Terminal-Bench 2.0 is an open-source benchmark designed to test a model's ability to navigate and complete tasks in a sandboxed terminal environment. The official version of the benchmark comprises 89 tasks spanning categories that range from model training to system administration, with difficulty scaling from easy to hard.

We chose to include Terminal-Bench because a) it is increasingly common for it to be reported by model providers, b) it reflects the real-world terminal tasks expected of software engineers, and c) it is quite challenging, with no model scoring above 50% on the hard tasks upon its initial release. Furthermore, agentic systems like Claude Code, Codex, and Cursor now rely heavily on executing terminal commands correctly.

This benchmark was developed by the Terminal-Bench community as an open-source effort; we'd like to thank the community for building this benchmark and for helping us integrate it into our evaluation suite. If you're interested in learning more about Terminal-Bench or want to contribute to the project, visit tbench.ai.

Below is an example task (you can find the full details for this task in the open-source task registry).

Please train a fasttext model on the yelp data in the data/ folder.

The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution.

The model should be saved as /app/model.bin
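
For illustration only, here is a minimal sketch of one way an agent might approach this task. It assumes the Yelp reviews have already been converted to fastText's __label__ format in hypothetical files data/train.txt and data/valid.txt; the preprocessing step, file names, and hyperparameters are our assumptions, not part of the task statement.

```python
import os
import fasttext

# Train a supervised fastText classifier on the prepared Yelp data.
# (File names and hyperparameters below are illustrative assumptions.)
model = fasttext.train_supervised(
    input="data/train.txt",   # lines like "__label__5 great food ..."
    epoch=25,
    lr=0.5,
    wordNgrams=2,
    dim=100,
)

# Check accuracy on a held-out split; for single-label classification,
# precision@1 reported by model.test() equals accuracy.
n, precision, _ = model.test("data/valid.txt")
print(f"validation accuracy ~ {precision:.3f} on {n} examples")

# Quantize to shrink the model below the 150MB limit, then save it
# where the task expects it.
model.quantize(input="data/train.txt", retrain=True, cutoff=100000, qnorm=True)
model.save_model("/app/model.bin")
print(f"model size: {os.path.getsize('/app/model.bin') / 1e6:.1f} MB")
```

In practice, an agent typically iterates on hyperparameters and quantization settings until both the size and accuracy constraints are satisfied.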

Results

Methodology

All models were benchmarked using the Terminus 2 harness. Unless otherwise specified, we use the default Terminus 2 configuration and methodology. All results reported are pass@1.

When the agent submits its work, we run the provided pytests against the resulting environment; a model must pass every test to receive any credit for the task.
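
As an illustration of this all-or-nothing grading, a task's verifier might resemble the hypothetical pytest file below, modeled on the fastText example above; the real tests for each task are published in the open-source task registry.

```python
# Hypothetical checks in the style of a Terminal-Bench task test file.
# The actual tests and the private test split path are assumptions here.
import os

MODEL_PATH = "/app/model.bin"

def test_model_file_exists():
    assert os.path.exists(MODEL_PATH)

def test_model_under_size_limit():
    assert os.path.getsize(MODEL_PATH) < 150 * 1024 * 1024  # < 150MB

def test_model_accuracy():
    import fasttext
    model = fasttext.load_model(MODEL_PATH)
    _, precision, _ = model.test("/private/test.txt")  # hidden test split
    assert precision >= 0.62
```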

Unlike Terminus 1, Terminus 2 does not use structured outputs to enforce a response schema. Model queries that return invalid or missing JSON are retried with a warning.
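
A minimal sketch of that retry behavior, assuming a generic query_model callable and retry limit rather than the actual Terminus 2 implementation:

```python
import json

MAX_RETRIES = 3  # illustrative; the real harness's limit may differ

def query_with_json_retry(query_model, prompt):
    """Re-query the model with a warning whenever its reply is not valid JSON."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(MAX_RETRIES):
        reply = query_model(messages)
        try:
            return json.loads(reply)
        except (json.JSONDecodeError, TypeError):
            # Feed the bad reply back with a warning and try again.
            messages.append({"role": "assistant", "content": reply})
            messages.append({
                "role": "user",
                "content": "Warning: your last response was not valid JSON. "
                           "Please respond with a single JSON object.",
            })
    raise ValueError("Model failed to return valid JSON after retries.")
```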

Comparison to Original Terminal-Bench

This benchmark is similar in structure to the original Terminal-Bench, which features 80 terminal-based tasks, many of which (like our example!) also appear in Terminal-Bench 2.0. The most substantial differences are in evaluation methodology:

  • We ran the original Terminal-Bench on an EC2 instance using Docker containers. Following the Laude implementation, we run Terminal-Bench 2.0 remotely using Daytona.
  • We ran the original Terminal-Bench using a turn limit. Again, following the Laude implementation, we run Terminal-Bench 2.0 using a time limit instead.
