SWE-bench Benchmark


Takeaways

  • Despite notable progress, foundation models still fail to solve a large share of real-world coding problems, leaving significant room for improvement.

  • The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT 4.1 pass any of the >4 hour tasks (33% each).

  • Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, and maintains both excellent cost efficiency at $1.24 per test and fast completion times (426.52s).

  • Tool usage patterns reveal that models employ distinct strategies. o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).


Instance Resolution by Model

Background

SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.

The benchmark comprises 500 tasks, each executed within an isolated Docker container. These tasks represent real-world GitHub issues from various repositories. Models are provided with a suite of agentic tools and must generate a “patch” to resolve each issue. The success of a model’s solution is determined by running unit tests against the generated patch.
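
To make the pass/fail criterion concrete, here is a minimal sketch of how a single instance could be scored. This is not the official harness: it skips the Docker isolation, assumes pytest-style test IDs, and the helper names (`apply_patch`, `run_tests`, `is_resolved`) are illustrative rather than SWE-bench APIs.

```python
import subprocess
from pathlib import Path


def apply_patch(repo_dir: Path, patch_text: str) -> bool:
    """Apply the model-generated diff to a checkout of the target repo."""
    proc = subprocess.run(
        ["git", "apply"],          # reads the diff from stdin
        cwd=repo_dir,
        input=patch_text,
        text=True,
        capture_output=True,
    )
    return proc.returncode == 0


def run_tests(repo_dir: Path, test_ids: list[str]) -> bool:
    """Run the instance's designated unit tests; True only if they all pass."""
    proc = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
    return proc.returncode == 0


def is_resolved(repo_dir: Path, patch_text: str, fail_to_pass: list[str]) -> bool:
    """An instance counts as resolved only if the patch applies cleanly
    and the previously failing tests now pass."""
    return apply_patch(repo_dir, patch_text) and run_tests(repo_dir, fail_to_pass)
```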

A notable complexity of SWE-bench lies in its dual evaluation of both the agentic harness and the underlying foundation model. This leads to different methodologies adopted by foundation model labs when they report their results. Additionally, the benchmark’s computational requirements make it resource-intensive to reproduce results.

To enable fair and consistent comparisons across foundation models, we implemented a standardized evaluation framework and evaluated 10 popular models.

Results




Claude Sonnet 4 (Nonthinking) leads with 65.0% accuracy while offering excellent cost efficiency, averaging $1.24 per test with a latency of 426.52s.

o3 and GPT 4.1 follow with roughly 49% and 47% accuracy, respectively. While o3 and Claude Sonnet 4 (Nonthinking) are comparable in price, GPT 4.1 offers strong performance at even lower cost and the fastest latency (173.98s).

Gemini 2.5 Flash provides the best budget option at $0.15/$0.60 per million input/output tokens with a respectable 35.6% accuracy, while GPT-4o (2024-08-06) underperformed at 27.2% accuracy despite higher costs.
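
As a rough illustration of how per-test cost follows from token pricing, the snippet below applies per-million-token rates to token counts; the token counts used are hypothetical placeholders, not measured values from our runs.

```python
def cost_per_test(input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one test, given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m


# Hypothetical token counts (not measured values), priced at Gemini 2.5
# Flash's $0.15 / $0.60 per million input / output tokens.
print(round(cost_per_test(2_000_000, 50_000, 0.15, 0.60), 2))  # -> 0.33
```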

When analyzing performance by task complexity, all models struggle as estimated resolution time increases, but Claude Sonnet 4 (Nonthinking) maintains consistently better performance across difficulty levels. Notably, Gemini 2.5 Flash falls precipitously from 51.5% on simple tasks to just 4.8% on complex ones. On the hardest task category (>4 hours), only Claude Sonnet 4 (Nonthinking), o3, and GPT 4.1 mini achieved non-zero scores (33.3% each), while all other models failed completely.
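
The per-difficulty numbers above can be reproduced from per-instance results with a simple grouping, sketched below; the `difficulty` and `resolved` field names are assumptions about the results format, not part of the benchmark.

```python
from collections import defaultdict


def resolution_rate_by_difficulty(results: list[dict]) -> dict[str, float]:
    """Share of resolved instances per annotated difficulty bucket
    (e.g. '<15 min', '15 min - 1 hour', '1-4 hours', '>4 hours').

    Assumes each row carries a 'difficulty' label and a boolean 'resolved'
    flag; both field names are illustrative assumptions."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["difficulty"]] += 1
        wins[row["difficulty"]] += int(row["resolved"])
    return {bucket: wins[bucket] / totals[bucket] for bucket in totals}
```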




Tool Use Analysis

Within the SWE-agent harness, tools are organized into groups called bundles. The following table details which tools are contained within each bundle.

| Bundle | Tools |
| --- | --- |
| Default | create, goto, open, scroll_down, scroll_up |
| Search | find_file, search_file, search_dir |
| Edit/Replace | edit, insert |
| Submit | submit |
| Bash | bash |

The following visualization shows the distribution of tool usage across different models for each task, represented as percentages.


SWE-bench Tool Usage
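
For readers who want to reproduce this kind of breakdown, the sketch below maps individual tool calls onto the bundles from the table and converts the counts to percentages; the trajectory format it assumes (a list of steps with a `tool` field) is illustrative, not the SWE-agent log schema.

```python
from collections import Counter

# Mapping from individual tools to the bundles listed in the table above.
BUNDLES = {
    "create": "Default", "goto": "Default", "open": "Default",
    "scroll_down": "Default", "scroll_up": "Default",
    "find_file": "Search", "search_file": "Search", "search_dir": "Search",
    "edit": "Edit/Replace", "insert": "Edit/Replace",
    "submit": "Submit",
    "bash": "Bash",
}


def bundle_usage_percentages(trajectory: list[dict]) -> dict[str, float]:
    """Percentage of tool calls per bundle for one task's trajectory.

    Assumes each step is a dict with a 'tool' key naming the tool called;
    this is an illustrative format, not the SWE-agent log schema."""
    counts = Counter(BUNDLES.get(step["tool"], "Other") for step in trajectory)
    total = sum(counts.values()) or 1
    return {bundle: 100 * n / total for bundle, n in counts.items()}
```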

The tool usage patterns reveal distinct strategic approaches across models. o4 Mini demonstrates the most intensive search behavior, using search tools over 25,000 times per task on average, far more than any other model, which suggests a thorough exploration strategy before making changes. In contrast, Claude Sonnet 4 (Nonthinking) shows a more balanced approach, with moderate usage across all tool categories: approximately 9,000-10,000 default tool calls and far fewer search operations, indicating a more targeted problem-solving methodology.

Most models show relatively low usage of the Edit/Replace, Submit, and Bash tools compared to the Default and Search categories. These usage patterns align with the performance results, where o3’s exhaustive search approach and Claude Sonnet 4 (Nonthinking)’s balanced strategy both contribute to their strong accuracy scores, though at different computational costs.

Methodology

For the standardized harness, we used the one created by the SWE-agent team. The agentic tools provided included the ability to edit and search files, use bash commands, diff files, and more. You can find more information here: SWE Agent.

We used the SWE-bench Verified subset of the dataset. SWE-bench Verified is a human-validated subset of SWE-bench released by OpenAI in August 2024. Each task in the split has been carefully reviewed and validated by human experts, resulting in a curated set of 500 high-quality test cases from the original benchmark. You can find more information about the Verified split of the dataset here.
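
For reference, the Verified split can be loaded directly from the public Hugging Face release; the two fields printed below (`instance_id`, `repo`) are part of the standard SWE-bench schema.

```python
from datasets import load_dataset

# SWE-bench Verified is distributed as a single test split of 500 instances.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))                               # expected: 500
print(ds[0]["instance_id"], ds[0]["repo"])   # e.g. an issue from a popular OSS repo
```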

All models had access to the same tools to ensure a fair, apples-to-apples comparison. Models were run with the default configuration given by the provider, except for the max token limit, which was always set to the highest value possible. Due to the limited number of cache breakpoints per Anthropic API key, we used the rotating-key option provided by SWE-agent.

All experiments were run on an 8 vCPU, 32 GB RAM EC2 instance. Latency was measured starting from the first step the model took within each task.

Evaluated models were constrained to a maximum of 150 steps per task. This limit was determined by analyzing the highest step count needed to resolve instances in the “1-4 hours” difficulty category, with a 150% buffer added to ensure fair comparison across all models. We selected the “1-4 hours” difficulty category (comprising 42 instances) because it provided the best balance of complexity and variance, allowing us to capture a comprehensive range of step counts among resolved instances. Both large and small models were evaluated on this test set. A “step” is a single interaction turn within the SWE-agent framework.
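
One plausible reading of how the 150-step cap falls out of the “1-4 hours” analysis is sketched below; interpreting the buffer as a 1.5x multiplier on the observed maximum, and the example step counts themselves, are assumptions.

```python
import math


def step_limit(resolved_step_counts: list[int], buffer: float = 1.5) -> int:
    """Step cap derived from the hardest resolved '1-4 hours' instance.

    Treating the '150% buffer' as scaling the observed maximum by 1.5x is an
    assumption; the step counts passed in below are hypothetical."""
    return math.ceil(max(resolved_step_counts) * buffer)


print(step_limit([34, 61, 88, 100]))  # -> 150
```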

It may be possible to build a better harness than SWE-agent for a given model; for example, Anthropic has claimed that its custom harness leads to a ten percentage point improvement in accuracy. However, our aim was to adopt a fair framework with which to evaluate all models.
