LiveCodeBench

10/16/2025

Our implementation of the LiveCodeBench benchmark

Dataset and Context

LiveCodeBench is a programming benchmark that assesses how well LLMs solve competitive programming problems. It continuously collects new coding problems from competitive programming platforms, so the evaluation set evolves over time.

Problem Sources and Scale

This benchmark evaluates models on coding problems collected from three major competitive programming platforms:

  • LeetCode: Industry-standard coding interview problems
  • AtCoder: Japanese competitive programming platform known for algorithmic challenges
  • Codeforces: International platform hosting regular programming contests

The newest version (v6) of the benchmark includes over 1,000 high-quality coding problems collected between May 2023 and 2025, categorized into three difficulty levels: easy, medium, and hard. Each problem consists of a natural-language problem statement, example input-output pairs, and hidden test cases used for evaluation.

Evaluation Approach

The “Code Generation” task in LiveCodeBench works as follows:

  1. Models receive a problem statement with a natural-language description and example tests
  2. They must generate a syntactically correct Python solution
  3. Solutions are evaluated against hidden test cases for functional correctness

This approach tests not just syntax generation but also:

  • Problem comprehension and decomposition
  • Algorithm design and implementation
  • Edge case handling
  • Code efficiency considerations
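
The evaluation steps above can be illustrated with a minimal sketch. It is not the harness used for this benchmark: the function names, the stdin/stdout test format, the whitespace-insensitive comparison, and the 5-second timeout are all assumptions made for illustration, and the toy problem is hypothetical. A real harness would also need sandboxing, which this sketch omits.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_solution(source: str, stdin_text: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run candidate source in a fresh interpreter.

    Returns stdout on success, or None if the program crashes or times out.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:  # syntax error or runtime exception
        return None
    return proc.stdout


def passes_all_tests(source: str, tests: List[Tuple[str, str]]) -> bool:
    """A solution counts as correct only if every hidden test passes."""
    for stdin_text, expected in tests:
        actual = run_solution(source, stdin_text)
        # Assumption: compare whitespace-separated tokens, ignoring layout.
        if actual is None or actual.split() != expected.split():
            return False
    return True


# Hypothetical toy problem: print the sum of two integers.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
hidden_tests = [("1 2\n", "3\n"), ("-5 5\n", "0\n")]
print(passes_all_tests(candidate, hidden_tests))
```

Requiring every test to pass, rather than awarding partial credit, mirrors the strict functional-correctness criterion described above; the timeout is one simple way to penalize inefficient solutions.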

Why LiveCodeBench Matters

While benchmarks like HumanEval have become saturated (with many models achieving near-perfect scores), LiveCodeBench remains challenging due to:

  • Higher problem complexity from real competitive programming
  • Continuous addition of new problems, which mitigates overfitting and data contamination
  • Diverse problem types requiring various algorithmic approaches
  • Strict functional correctness evaluation with hidden test cases
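
One way continuously collected problems can help against contamination is by restricting evaluation to problems published after a model's training cutoff. The sketch below shows that idea under stated assumptions: the `Problem` record, its field names, the example entries, and the cutoff date are all hypothetical placeholders, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    # Hypothetical record layout, for illustration only.
    problem_id: str
    platform: str       # e.g. "LeetCode", "AtCoder", "Codeforces"
    difficulty: str     # "easy", "medium", or "hard"
    release_date: date


def problems_after_cutoff(problems: List[Problem], cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Placeholder entries, not real benchmark problems.
problems = [
    Problem("p1", "LeetCode", "easy", date(2023, 6, 1)),
    Problem("p2", "AtCoder", "medium", date(2024, 3, 15)),
    Problem("p3", "Codeforces", "hard", date(2024, 11, 2)),
]

# Assumed training cutoff for some hypothetical model.
fresh = problems_after_cutoff(problems, cutoff=date(2024, 1, 1))
print([p.problem_id for p in fresh])
```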

Results

After accounting for price and latency, OpenAI’s o4 Mini stands out as the clear front-runner, combining state-of-the-art performance with low latency. Especially price-sensitive customers might instead opt for Alibaba’s Qwen 3 (235B) or Google’s Gemini 2.5 Flash Preview 4/17 (Nonthinking).

We see significant differences in model performance as a function of split. Existing models are near-perfect on the easy split, perform well on the medium split, and struggle on the hard split.

We also find that performance on the medium and hard splits is correlated: models that do well on the medium problems also tend to do well on the hard problems. The relationship is weaker for the easy split, since some models solve the easy problems reliably but still struggle on the medium and hard ones.
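
A claim like this can be checked by computing a correlation coefficient between per-split accuracies across models. The sketch below uses a hand-written Pearson correlation; the model names and accuracy values are made-up placeholders chosen only to make the code runnable, not results from this benchmark.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Placeholder accuracies per split; NOT real benchmark results.
scores: Dict[str, Dict[str, float]] = {
    "model_a": {"easy": 0.98, "medium": 0.80, "hard": 0.40},
    "model_b": {"easy": 0.97, "medium": 0.60, "hard": 0.20},
    "model_c": {"easy": 0.99, "medium": 0.40, "hard": 0.10},
    "model_d": {"easy": 0.90, "medium": 0.70, "hard": 0.30},
}


def split_column(split: str) -> List[float]:
    """Collect one split's accuracy for every model, in a fixed order."""
    return [scores[name][split] for name in sorted(scores)]


print("medium vs hard:", pearson(split_column("medium"), split_column("hard")))
print("easy vs hard:  ", pearson(split_column("easy"), split_column("hard")))
```

With real per-model accuracies in place of the placeholders, a medium-versus-hard coefficient near 1 alongside a noticeably lower easy-versus-hard coefficient would match the pattern described above.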
