Key Takeaways
- OpenAI’s o4 Mini takes first place, answering about two-thirds of questions correctly. However, it solves only about a third of the hard problems, suggesting room for improvement on more difficult coding tasks.
- The other top performers on LiveCodeBench are flagship reasoning models like OpenAI’s o3, Claude Opus 4 (Thinking), Google’s Gemini 2.5 Pro Preview, DeepSeek R1, and xAI’s Grok 3 Mini Fast High Reasoning.
- Latency tends to be high, primarily due to long responses. This is especially true for Grok 3 Mini Fast High Reasoning (which reports the tokens used for reasoning) and Gemini 2.5 Pro Preview (which does not). We had to explicitly cap response length for both models to keep latency manageable.
Dataset and Context
LiveCodeBench is a programming benchmark designed to assess the capabilities of LLMs on competitive programming problems. LiveCodeBench continuously collects new coding problems from competitive programming platforms, providing a dynamic evaluation framework that evolves over time.
Problem Sources and Scale
This benchmark evaluates models on coding problems collected from three major competitive programming platforms:
- LeetCode: Industry-standard coding interview problems
- AtCoder: Japanese competitive programming platform known for algorithmic challenges
- Codeforces: International platform hosting regular programming contests
The newest version (v6) of the benchmark includes over 1,000 high-quality coding problems collected from May 2023 through 2025, with problems categorized into three difficulty levels: easy, medium, and hard. Each problem consists of a natural language problem statement, example input-output pairs, and hidden test cases used for evaluation.
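To make the structure concrete, a single problem can be pictured as a record like the one below. This is a hypothetical sketch in Python; the field names are ours, and the actual LiveCodeBench schema may differ.

```python
# Hypothetical sketch of a LiveCodeBench-style problem record.
# Field names are illustrative; the actual dataset schema may differ.
from dataclasses import dataclass, field


@dataclass
class CodingProblem:
    problem_id: str        # e.g. a LeetCode, AtCoder, or Codeforces identifier
    platform: str          # "leetcode" | "atcoder" | "codeforces"
    difficulty: str        # "easy" | "medium" | "hard"
    statement: str         # natural language problem description
    public_tests: list[tuple[str, str]] = field(default_factory=list)  # shown to the model
    hidden_tests: list[tuple[str, str]] = field(default_factory=list)  # used only for grading


example = CodingProblem(
    problem_id="demo-001",
    platform="leetcode",
    difficulty="easy",
    statement="Given a list of integers, return their sum.",
    public_tests=[("1 2 3", "6")],
    hidden_tests=[("-5 5", "0"), ("10", "10")],
)
```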
Evaluation Approach
The “Code Generation” task in LiveCodeBench evaluates models in a setting where:
- Models receive a problem statement with natural language description and example tests
- They must generate a syntactically correct Python solution
- Solutions are evaluated against hidden test cases for functional correctness (see the sketch below)
This approach tests not just syntax generation but also:
- Problem comprehension and decomposition
- Algorithm design and implementation
- Edge case handling
- Code efficiency considerations
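To make the evaluation loop concrete, the sketch below runs a candidate Python solution against a set of hidden test cases and checks functional correctness by comparing its stdout to the expected output. This is a deliberately simplified illustration (no sandboxing, crude time limits) rather than LiveCodeBench’s actual harness, and the test cases are toy examples.

```python
# Minimal sketch of functional-correctness evaluation: execute a candidate
# solution as a subprocess, feed it each test's stdin, compare its stdout.
# This is an illustrative simplification, not LiveCodeBench's real harness.
import subprocess
import sys

candidate_solution = """
print(sum(map(int, input().split())))
"""

hidden_tests = [  # (stdin, expected stdout) pairs, toy examples
    ("1 2 3", "6"),
    ("-5 5", "0"),
]


def passes_all_tests(solution_code: str, tests, time_limit_s: float = 5.0) -> bool:
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return False  # exceeded the per-test time limit
        if result.returncode != 0:
            return False  # runtime error in the candidate solution
        if result.stdout.strip() != expected.strip():
            return False  # wrong answer on this hidden test
    return True


print(passes_all_tests(candidate_solution, hidden_tests))  # True for this toy solution
```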
Why LiveCodeBench Matters
While benchmarks like HumanEval have become saturated (with many models achieving near-perfect scores), LiveCodeBench remains challenging due to:
- Higher problem complexity from real competitive programming
- Continuous addition of new problems mitigating overfitting and data contamination
- Diverse problem types requiring various algorithmic approaches
- Strict functional correctness evaluation with hidden test cases
Results
After accounting for price and latency, OpenAI’s o4 Mini stands out as the clear front-runner, boasting a unique combination of state-of-the-art performance and low latency. Particularly price-sensitive customers might instead opt for xAI’s Grok 3 Mini Fast High Reasoning or Google’s Gemini 2.5 Flash Preview (Nonthinking).
We see significant differences in model performance across the difficulty splits. Existing models are near-perfect on the easy split, perform well on the medium split, and struggle on the hard split.
We also find that performance on the medium and hard splits is correlated: models that do well on the medium problems also tend to do well on the hard problems. This may be unsurprising, but the same is less true of the easy split, where some models score well yet still struggle on the medium and hard problems.