Benchmark

SWE-bench

Updated: 3/20/2026

Solving production software engineering tasks

Takeaways


[Chart: Instance Resolution by Model]

Background

SWE-bench, introduced by Jimenez et al. in their seminal paper “Can Language Models Resolve Real-World GitHub Issues?”, has emerged as a prominent benchmark for evaluating Large Language Models (LLMs) in software engineering contexts.

The benchmark comprises 500 tasks, each representing a real-world GitHub issue from one of several open-source repositories and executed within an isolated Docker container. Models must generate a patch that resolves the issue, and success is determined by running the repository's unit tests against the patched code.
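Concretely, a task counts as resolved only if the patch fixes the tests the issue targets without breaking tests that previously passed. A minimal sketch of that check, following the FAIL_TO_PASS / PASS_TO_PASS test lists in SWE-bench's published task format (the helper name and result dictionary here are illustrative):

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test id -> True (passed) / False (failed).

    A patch resolves the task only if every test the fix is expected
    to repair now passes, and no previously passing test has broken.
    Missing tests (e.g. collection errors) count as failures.
    """
    return all(results.get(t, False) for t in fail_to_pass) and \
           all(results.get(t, False) for t in pass_to_pass)

# Illustrative: the patch fixes the target test and keeps the suite green.
results = {"test_bugfix": True, "test_existing": True}
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # True
```

The strict "and" is what makes partial fixes score zero: a patch that repairs the issue but regresses an unrelated test does not count.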

A notable complexity of SWE-bench is that it evaluates two things at once: the agentic harness and the underlying foundation model. As a result, foundation model labs adopt different methodologies when reporting results, and published scores are not always directly comparable. The benchmark's computational requirements also make results resource-intensive to reproduce.

To enable fair and consistent comparisons across foundation models, we use a minimal bash-tool-only agent harness. Models are given a single tool — bash — and must navigate, search, edit, and solve tasks using standard command-line tools. This puts the evaluation burden squarely on the model rather than the harness.

Results



Models that perform well on SWE-bench tend to be proficient with bash and standard command-line tools for code navigation and editing.

There is also a clear trend that closed-source models perform better on SWE-bench than open-source models. The clearest performance differences appear among tasks that take between 15 minutes and 1 hour to complete. Models that perform well on these tasks tend to score higher overall on the benchmark.



Tool Use

All models are given a single tool: bash. Models must use standard command-line tools (grep, find, sed, etc.) to navigate codebases, search for relevant files, and apply edits. This means tool usage differences between models reflect their command-line fluency and problem-solving strategy rather than how they interact with specialized harness tooling.
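Under this setup, the harness's entire tool surface can be as small as a single function that runs a command string in a shell and returns the combined output. A minimal sketch, where the timeout and output-truncation values are illustrative choices rather than the harness's actual settings:

```python
import subprocess

def run_bash(command: str, timeout: int = 60, max_output: int = 10_000) -> str:
    """Execute one bash command and return its combined stdout/stderr.

    The model only ever sees this text: navigation, search, and editing
    all happen through ordinary tools like grep, find, and sed.
    """
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
        )
        output = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        output = f"(command timed out after {timeout}s)"
    return output[:max_output]  # truncate long output before it reaches the model

print(run_bash("echo hello"))  # hello
```

Because the tool accepts arbitrary bash, differences in how models chain pipes, redirects, and in-place edits (e.g. `sed -i`) show up directly in their scores.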

Methodology

We use a minimal bash-tool-only agent harness, mini-swe-agent, for all evaluations. Models are given a single tool — bash — and a system prompt describing the task. They must use standard command-line tools to navigate the codebase, identify the relevant code, and produce a patch.
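The agent loop in such a harness is schematically simple: prompt the model, execute whatever command it returns, and feed the output back until the model signals it is done. A sketch with a stubbed model standing in for a real LLM call (the SUBMIT marker and the stub are illustrative, not mini-swe-agent's actual protocol):

```python
import subprocess

SUBMIT = "SUBMIT"  # illustrative end-of-task marker

def run_bash(command: str, timeout: int = 60) -> str:
    proc = subprocess.run(["bash", "-c", command],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def agent_loop(model, task: str, max_steps: int = 50) -> list:
    """Alternate model turns and bash executions until the model submits."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = model(transcript)      # in the real harness, an LLM chat call
        if command == SUBMIT:
            break
        transcript.append(run_bash(command))
    return transcript

# Stub model: runs one command, then submits.
def stub_model(transcript):
    return "echo inspected" if len(transcript) == 1 else SUBMIT

print(agent_loop(stub_model, "fix the bug")[-1].strip())  # inspected
```

Everything beyond this loop — step budgets, output truncation, patch extraction — is harness policy, which is exactly the surface area the minimal setup keeps small.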

We use SWE-bench Verified, a human-validated subset of the dataset released by OpenAI in August 2024. Each task in the split has been reviewed and validated by human annotators, resulting in a curated set of 500 high-quality test cases from the original benchmark.

All models have access to the same single tool, ensuring an apples-to-apples comparison. Each model runs with its provider's default configuration, except for the maximum output token limit, which we set to the highest value the provider supports.

All experiments are run on isolated cloud sandboxes. Latency is calculated starting from the first step the model takes within each task.

It may be possible to build better harnesses for a given model — for example, Anthropic has claimed their custom harness leads to a ten percentage point improvement in accuracy. However, our aim is to adopt a fair framework with which to evaluate all models.
