Benchmark

SWE-bench

Updated: 3/20/2026

Solving production software engineering tasks

Takeaways


[Chart: Instance Resolution by Model]

Background

SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.

The benchmark comprises 500 tasks, each representing a real-world GitHub issue from one of several open-source repositories and executed within an isolated Docker container. Models must generate a patch that resolves the issue, and success is determined by running the repository's unit tests against the patched code.
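Concretely, a task counts as resolved only if the patch fixes the tests the issue targets without breaking tests that previously passed. A minimal sketch of that check, following the FAIL_TO_PASS / PASS_TO_PASS test lists in SWE-bench's published task format (the helper name and result dictionary here are illustrative):

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test id -> True (passed) / False (failed).

    A patch resolves the task only if every test the fix is expected
    to repair now passes, and no previously passing test has broken.
    Missing tests (e.g. collection errors) count as failures.
    """
    return all(results.get(t, False) for t in fail_to_pass) and \
           all(results.get(t, False) for t in pass_to_pass)

# Illustrative: the patch fixes the target test and keeps the suite green.
results = {"test_bugfix": True, "test_existing": True}
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # True
```

The strict "and" is what makes partial fixes score zero: a patch that repairs the issue but regresses an unrelated test does not count.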

A notable complexity of SWE-bench is that it evaluates two things at once: the agentic harness and the underlying foundation model. As a result, foundation model labs adopt different methodologies when reporting results, and published scores are not always directly comparable. The benchmark's computational requirements also make results resource-intensive to reproduce.

To enable fair and consistent comparisons across foundation models, we use a minimal bash-tool-only agent harness. Models are given a single tool — bash — and must navigate, search, edit, and solve tasks using standard command-line tools. This puts the evaluation burden squarely on the model rather than the harness.

Results



Models that perform well on SWE-bench tend to be proficient with bash and standard command-line tools for code navigation and editing.

There is also a clear trend that closed-source models perform better on SWE-bench than open-source models. The clearest performance differences appear among tasks that take between 15 minutes and 1 hour to complete. Models that perform well on these tasks tend to score higher overall on the benchmark.



Tool Use

All models are given a single tool: bash. Models must use standard command-line tools (grep, find, sed, etc.) to navigate codebases, search for relevant files, and apply edits. This means tool usage differences between models reflect their command-line fluency and problem-solving strategy rather than how they interact with specialized harness tooling.
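Under this setup, the harness's entire tool surface can be as small as a single function that runs a command string in a shell and returns the combined output. A minimal sketch, where the timeout and output-truncation values are illustrative choices rather than the harness's actual settings:

```python
import subprocess

def run_bash(command: str, timeout: int = 60, max_output: int = 10_000) -> str:
    """Execute one bash command and return its combined stdout/stderr.

    The model only ever sees this text: navigation, search, and editing
    all happen through ordinary tools like grep, find, and sed.
    """
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
        )
        output = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        output = f"(command timed out after {timeout}s)"
    return output[:max_output]  # truncate long output before it reaches the model

print(run_bash("echo hello"))  # hello
```

Because the tool accepts arbitrary bash, differences in how models chain pipes, redirects, and in-place edits (e.g. `sed -i`) show up directly in their scores.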

Methodology

We use a minimal bash-tool-only agent harness, mini-swe-agent, for all evaluations. Models are given a single tool — bash — and a system prompt describing the task. They must use standard command-line tools to navigate the codebase, identify the relevant code, and produce a patch.
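The agent loop in such a harness is schematically simple: prompt the model, execute whatever command it returns, and feed the output back until the model signals it is done. A sketch with a stubbed model standing in for a real LLM call (the SUBMIT marker and the stub are illustrative, not mini-swe-agent's actual protocol):

```python
import subprocess

SUBMIT = "SUBMIT"  # illustrative end-of-task marker

def run_bash(command: str, timeout: int = 60) -> str:
    proc = subprocess.run(["bash", "-c", command],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def agent_loop(model, task: str, max_steps: int = 50) -> list:
    """Alternate model turns and bash executions until the model submits."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = model(transcript)      # in the real harness, an LLM chat call
        if command == SUBMIT:
            break
        transcript.append(run_bash(command))
    return transcript

# Stub model: runs one command, then submits.
def stub_model(transcript):
    return "echo inspected" if len(transcript) == 1 else SUBMIT

print(agent_loop(stub_model, "fix the bug")[-1].strip())  # inspected
```

Everything beyond this loop — step budgets, output truncation, patch extraction — is harness policy, which is exactly the surface area the minimal setup keeps small.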

We use SWE-bench Verified, a human-validated subset of the dataset released by OpenAI in August 2024. Each task in the split has been reviewed and validated by human annotators, resulting in a curated set of 500 high-quality test cases from the original benchmark.

All models have access to the same single tool, ensuring an apples-to-apples comparison. Each model runs with its provider's default configuration, except for the maximum output token limit, which we set to the highest value the provider supports.

All experiments are run on isolated cloud sandboxes. Latency is calculated starting from the first step the model takes within each task.

It may be possible to build better harnesses for a given model — for example, Anthropic has claimed their custom harness leads to a ten percentage point improvement in accuracy. However, our aim is to adopt a fair framework with which to evaluate all models.
