Takeaways
- Foundation models still fail to consistently solve real-world coding problems despite impressive progress, leaving clear room for improvement.
- There is a significant performance improvement between the most recent and previous OpenAI models, with an overall 10% accuracy increase from o3 to GPT 5 Mini.
- GPT 5 leads by 3.8%, scoring the highest across all models. This is a slight improvement over Claude Sonnet 4 (Nonthinking), the previous best performer.
- Grok Code Fast delivers impressive results at much lower latency than other top-performing models. Optimized for coding tasks, it offers competitive accuracy while remaining noticeably snappier.
Background
SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.
The benchmark comprises 500 tasks, each executed within an isolated Docker container. These tasks represent real-world GitHub issues from various repositories. Models are provided with a suite of agentic tools and must generate a “patch” to resolve each issue. The success of a model’s solution is determined by running unit tests against the generated patch.
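The grading step described above can be sketched in a few lines. This is a simplified reconstruction, assuming SWE-bench's convention of tracking two test sets per task: previously failing tests that the patch must fix (FAIL_TO_PASS) and previously passing tests that must not regress (PASS_TO_PASS). The function name and the result format are illustrative, not the benchmark's actual API.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Decide whether a patch resolves a task.

    test_results maps a test name to its status string after the patch
    is applied. A task counts as resolved only if every previously
    failing test now passes AND no previously passing test regresses.
    """
    fixed = all(test_results.get(t) == "PASSED" for t in fail_to_pass)
    no_regressions = all(test_results.get(t) == "PASSED" for t in pass_to_pass)
    return fixed and no_regressions
```

Note that a patch fixing the reported issue while breaking an unrelated test still counts as a failure, which is part of what makes the benchmark demanding.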
A notable complexity of SWE-bench lies in its dual evaluation of both the agentic harness and the underlying foundation model. This leads to different methodologies adopted by foundation model labs when they report their results. Additionally, the benchmark’s computational requirements make it resource-intensive to reproduce results.
To enable fair and consistent comparisons across foundation models, we implemented a standardized evaluation framework and evaluated 13 popular models.
Results
GPT 5 leads with 68.8% accuracy, though it has the highest latency at 2328.15s.
Claude Sonnet 4 (Nonthinking) follows with 65.0% accuracy and moderate latency (426.52s), while Grok 4 achieves 58.6% accuracy with reasonable latency (704.78s).
Grok Code Fast performs almost as well as Grok 4, with an accuracy of 57.6%, but with a much lower latency of 264.68s.
Additionally, GPT 4.1 offers strong performance at 47.4% accuracy with the fastest latency (173.98s), while o3 achieves 49.8% accuracy with moderate latency (620.33s).
Gemini 2.5 Flash Preview (Nonthinking) provides respectable 35.6% accuracy with good latency (251.91s), while GPT 4o (2024-08-06) underperformed at 27.2% accuracy with moderate latency (197.58s).
When analyzing performance by task complexity, all models show declining performance as estimated resolution time increases. On simple tasks (<15 min), GPT 5 leads with 83.5% accuracy, followed by Claude Sonnet 4 (Nonthinking) at 80.9%. On the hardest task category (>4 hours), only Claude Sonnet 4 (Nonthinking), Grok 4, Grok Code Fast, GPT 4.1 Mini, and GPT 5 achieved non-zero scores (33.3% each), while all other models failed completely.
Tool Use Analysis
Within the SWE-agent harness, tools are organized into groups called bundles. The following table details which tools are contained within each bundle.
| Bundle | Tools |
|---|---|
| Default | create, goto, open, scroll_down, scroll_up |
| Search | find_file, search_file, search_dir |
| Edit/Replace | edit, insert |
| Submit | submit |
| Bash | bash |
The following visualization shows the distribution of tool usage across different models for each task, represented as percentages.
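The per-task percentages behind such a visualization can be computed with a simple aggregation. Below is a minimal sketch: the bundle map mirrors the table above, while the function name and the trajectory format (a list of tool names per task) are illustrative assumptions, not part of the SWE-agent API.

```python
from collections import Counter

# Tool-to-bundle map, mirroring the bundle table above.
BUNDLES = {
    "create": "Default", "goto": "Default", "open": "Default",
    "scroll_down": "Default", "scroll_up": "Default",
    "find_file": "Search", "search_file": "Search", "search_dir": "Search",
    "edit": "Edit/Replace", "insert": "Edit/Replace",
    "submit": "Submit", "bash": "Bash",
}

def bundle_usage_pct(tool_calls):
    """Return the share of tool calls per bundle, as percentages,
    for a single task trajectory (a list of tool names)."""
    counts = Counter(BUNDLES[t] for t in tool_calls)
    total = sum(counts.values())
    return {bundle: 100 * n / total for bundle, n in counts.items()}
```

Averaging these per-task dictionaries across all 500 tasks yields the per-model distribution shown in the chart.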
The tool usage patterns reveal distinct strategic approaches across models. o4 Mini leans heavily on the Default and Search tools. In contrast, Claude Sonnet 4 (Nonthinking) shows a more balanced approach, with moderate usage across all tool categories (approximately 9,000-10,000 Default tool calls and fewer search operations), indicating a more targeted problem-solving methodology.
Most models show relatively low usage of the Edit/Replace, Submit, and Bash tools compared to the Default and Search categories. These usage patterns align with the performance results: o3's exhaustive search approach and Claude Sonnet 4 (Nonthinking)'s balanced strategy both contribute to strong accuracy scores, though at different computational costs.
Methodology
For the standardized harness, we used the one created by the SWE-agent team. The agentic tools provided included the ability to edit and search files, use bash commands, diff files, and more. You can find more information here: SWE Agent.
We used the SWE-bench Verified subset of the dataset. SWE-bench Verified is a human-validated subset of SWE-bench released by OpenAI in August 2024. Each task in the split has been carefully reviewed and validated by human experts, resulting in a curated set of 500 high-quality test cases from the original benchmark. You can find more information about the Verified split of the dataset here.
All models had access to the same tools to ensure a fair, apples-to-apples comparison. Models were run with the provider's default configuration, except for the max token limit, which was always set to the highest value possible. Due to the limited number of cache breakpoints per Anthropic API key, we used the rotating-key option provided by SWE-agent.
All experiments were run on an 8 vCPU, 32GB RAM EC2 instance. Latency was measured starting from the first step the model took within each task.
Evaluated models were constrained to a maximum of 150 steps per task. This limit was determined by taking the highest step count needed to resolve instances in the “1-4 hours” difficulty category and adding a 150% buffer to ensure fair comparison across all models. We selected the “1-4 hours” category (comprising 42 instances) because it provided the optimal balance of complexity and variance, allowing us to capture a comprehensive range of step counts across resolved instances; a “step” is a single interaction turn within the SWE-agent framework. Both larger and smaller models were evaluated across this test set.
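The step-cap calculation above can be expressed concisely. This is a hypothetical reconstruction assuming the 150% buffer is additive on top of the observed maximum; the function name and input format are illustrative.

```python
import math

def step_cap(resolved_step_counts, buffer_pct=150):
    """Derive a per-task step limit: take the highest step count
    observed among resolved instances in the reference difficulty
    category, then add a percentage buffer on top."""
    return math.ceil(max(resolved_step_counts) * (1 + buffer_pct / 100))
```

Under this reading, a hardest resolved instance of 60 steps plus a 150% buffer yields the 150-step cap used in the evaluation.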
It may be possible to build a better harness than SWE-agent for a given model; for example, Anthropic has claimed their custom harness leads to a ten percentage point improvement in accuracy. However, our aim was to adopt a fair, uniform framework with which to evaluate all models.