Key Takeaways
- OpenAI’s o4 Mini takes first place, answering about two-thirds of questions correctly. However, it solves only about a third of the hard problems, suggesting room for improvement on more difficult coding tasks.
- The other top performers on LiveCodeBench are flagship reasoning models like OpenAI’s o3, Claude Opus 4 (Thinking), Google’s Gemini 2.5 Pro Preview, DeepSeek R1, and xAI’s Grok 3 Mini Fast High Reasoning.
- Latency tends to be high, primarily due to long responses. This is especially true for Grok 3 Mini Fast High Reasoning (which reports the tokens used for reasoning) and Gemini 2.5 Pro Preview (which does not). We had to explicitly cap response length for both models to keep latency manageable.
Dataset and Context
LiveCodeBench is a programming benchmark designed to assess the capabilities of LLMs on competitive programming problems. LiveCodeBench continuously collects new coding problems from competitive programming platforms, providing a dynamic evaluation framework that evolves over time.
Problem Sources and Scale
This benchmark evaluates models on coding problems collected from three major competitive programming platforms:
- LeetCode: Industry-standard coding interview problems
- AtCoder: Japanese competitive programming platform known for algorithmic challenges
- Codeforces: International platform hosting regular programming contests
The newest version (v6) of the benchmark includes over 1,000 high-quality coding problems collected from May 2023 through 2025, with problems categorized into three difficulty levels: easy, medium, and hard. Each problem consists of a natural language problem statement, example input-output pairs, and hidden test cases used for evaluation.
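To make the structure concrete, a single problem can be pictured as a record like the one below. This is a hypothetical sketch in Python; the field names are ours, and the actual LiveCodeBench schema may differ.

```python
# Hypothetical sketch of a LiveCodeBench-style problem record.
# Field names are illustrative; the actual dataset schema may differ.
from dataclasses import dataclass, field


@dataclass
class CodingProblem:
    problem_id: str        # e.g. a LeetCode, AtCoder, or Codeforces identifier
    platform: str          # "leetcode" | "atcoder" | "codeforces"
    difficulty: str        # "easy" | "medium" | "hard"
    statement: str         # natural language problem description
    public_tests: list[tuple[str, str]] = field(default_factory=list)  # shown to the model
    hidden_tests: list[tuple[str, str]] = field(default_factory=list)  # used only for grading


example = CodingProblem(
    problem_id="demo-001",
    platform="leetcode",
    difficulty="easy",
    statement="Given a list of integers, return their sum.",
    public_tests=[("1 2 3", "6")],
    hidden_tests=[("-5 5", "0"), ("10", "10")],
)
```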
Evaluation Approach
The “Code Generation” task in LiveCodeBench evaluates models in a setting where:
- Models receive a problem statement with natural language description and example tests
- They must generate a syntactically correct Python solution
- Solutions are evaluated against hidden test cases for functional correctness (see the sketch below)
This approach tests not just syntax generation but also:
- Problem comprehension and decomposition
- Algorithm design and implementation
- Edge case handling
- Code efficiency considerations
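To make the evaluation loop concrete, the sketch below runs a candidate Python solution against a set of hidden test cases and checks functional correctness by comparing its stdout to the expected output. This is a deliberately simplified illustration (no sandboxing, crude time limits) rather than LiveCodeBench’s actual harness, and the test cases are toy examples.

```python
# Minimal sketch of functional-correctness evaluation: execute a candidate
# solution as a subprocess, feed it each test's stdin, compare its stdout.
# This is an illustrative simplification, not LiveCodeBench's real harness.
import subprocess
import sys

candidate_solution = """
print(sum(map(int, input().split())))
"""

hidden_tests = [  # (stdin, expected stdout) pairs, toy examples
    ("1 2 3", "6"),
    ("-5 5", "0"),
]


def passes_all_tests(solution_code: str, tests, time_limit_s: float = 5.0) -> bool:
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return False  # exceeded the per-test time limit
        if result.returncode != 0:
            return False  # runtime error in the candidate solution
        if result.stdout.strip() != expected.strip():
            return False  # wrong answer on this hidden test
    return True


print(passes_all_tests(candidate_solution, hidden_tests))  # True for this toy solution
```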
Why LiveCodeBench Matters
While benchmarks like HumanEval have become saturated (with many models achieving near-perfect scores), LiveCodeBench remains challenging due to:
- Higher problem complexity from real competitive programming
- Continuous addition of new problems mitigating overfitting and data contamination
- Diverse problem types requiring various algorithmic approaches
- Strict functional correctness evaluation with hidden test cases
Results
After accounting for price and latency, OpenAI’s o4 Mini stands out as the clear front-runner, boasting a unique combination of state-of-the-art performance and low latency. Particularly price-sensitive customers might instead opt for xAI’s Grok 3 Mini Fast High Reasoning or Google’s Gemini 2.5 Flash Preview (Nonthinking).
We see significant differences in model performance across the difficulty splits. Existing models are near-perfect on the easy split, perform well on the medium split, and struggle on the hard split.
We also find that performance on the medium and hard splits is correlated: models that do well on the medium problems also tend to do well on the hard problems. This may be unsurprising, but the same is less true of the easy split, where some models score well yet still struggle on the medium and hard problems.