Task Type:

IOI Benchmark

Last updated

1

Grok 4

★$⚡︎

26.2%

$3.00 / $15.00

4265.88 s

2

GPT 5

20.0%

$1.25 / $10.00

2608.25 s

3

Gemini 2.5 Pro

17.1%

$1.25 / $10.00

2964.80 s

4

Claude Opus 4.1 (Nonthinking)

15.2%

$15.00 / $75.00

6124.90 s

5

Claude Sonnet 4 (Nonthinking)

6.5%

$3.00 / $15.00

3253.12 s

6

o4 Mini

5.3%

$1.10 / $4.40

480.19 s

7

Claude Sonnet 4 (Thinking)

4.6%

$3.00 / $15.00

2216.29 s

8

Gemini 2.5 Flash (Nonthinking)

3.9%

$0.30 / $2.50

350.19 s

9

DeepSeek V3 (03/24/2025)

1.7%

$1.20 / $1.20

171.48 s

10

Kimi K2 Instruct

1.3%

$1.00 / $3.00

1551.10 s

Task type :

★Best Performing

$Best Budget

⚡︎Best Speed

Reasoning Model

Key Takeaways

Grok 4 wins convincingly, placing first on both the 2024 and 2025 exams.
Models struggle to write C++ at the level of the best high-school students – no models qualify for medals on either exam.
Only the largest and most expensive models even come close to placing. The only models to achieve >10% performance all cost at least $2 per question. Claude Opus 4.1 (Nonthinking) costs over $10 per question!
Consistency between performance on the 2024 and 2025 tests suggests that LLM labs aren’t currently training on the IOI, suggesting that this benchmark is relatively free from data contamination.

Benchmark Design

We designed our benchmark to imitate competition conditions as closely as possible.

Agent Harness

We adapted our open-source agent harness from our Finance Agent Benchmark by providing it with access to the following tools:

a c++ (v20) execution environment, which executes arbitrary code
a submission tool used for grading, which executes submitted code and provides a

These tools, particularly the submission tool, were based on the testing environment available to human contestants.

Scoring

We further designed our submission tool to match the grading process during the olympiad. Agents get up to 50 submissions, each of which are graded on a variety of subtasks. A competitor receives credit for a given subtask if any submission passes all tests on a given subtask. In particular, this means the final score can be higher than any individual submission, since credit for separate subtasks are combined by the grading system.

Results

We consider Grok 4 the unambiguous winner of our IOI, narrowly outcompeting GPT 5 on the 2025 exam and scoring 20+ points above all competitors in 2024. However, all models struggled on the task - none qualified for a medal or even surpassed the median student score!

We chose to evaluate on both 2024 and 2025 to check for data contamination, which we suspect to explain the performance decrease between those years on LiveCodeBench. By contrast, we find increased performance in 2025, which we attribute to an easier test - student scores increased commensurately between the two years.

Why IOI?

Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). However, advanced models are starting to saturate IMO, meaning it may no longer effectively differentiate between the capabilities of top-performing models. Reports also suggest the evaluation process faced coordination challenges, with AI companies seeking expedited validation mid-competition that may not reflect standard IMO assessment procedures.

The International Olympiad in Informatics (IOI) offers several advantages as an LLM benchmark. Unlike the IMO, the IOI is not yet saturated, providing clear differentiation between model capabilities. The competition features standardized and automated grading, ensuring objective evaluation without subjective scoring. Additionally, the IOI has real-world relevance as it tests C++ programming skills that are directly applicable to software development.

IOI Benchmark

Key Takeaways

Benchmark Design

Agent Harness

Scoring

Results

Why IOI?

Join our mailing list to receive benchmark updates on