The Public Standard for Real-World AI Performance

Generic benchmarks only go so far.
Vals AI evaluates models on the real tasks each industry relies on.

Vals Index
Finance
Healthcare
Math
AIME

Challenging national math exam given to top high-school students

MATH 500

Academic math benchmark on probability, algebra, and trigonometry

MGSM

A multilingual benchmark for mathematical questions.

Academic
Education
Coding
Proprietary

Updated 6/17/2026

Code Migration

21 models tested

Can language models reimplement working programs in another language?

Top Models

1. Claude Fable 5: 55.1%
2. Claude Opus 4.8: 47.3%
3. GPT 5.5: 45.2%
Academic

Updated 6/9/2026

IOI

55 models tested

International Olympiad in Informatics

Top Models

1. Claude Fable 5: 72.3%
2. GPT 5.4 (xhigh): 67.8%
3. GPT 5.2: 54.8%
Academic

Updated 6/17/2026

LiveCodeBench

122 models tested

Our implementation of the LiveCodeBench benchmark

Top Models

1. Claude Fable 5: 89.8%
2. Gemini 3.1 Pro Preview (02/26): 88.5%
3. GPT 5.2 Codex: 88.0%
Academic

Updated 6/17/2026

New

ProgramBench

24 models tested

Can language models rebuild programs from scratch?

Top Models

1. Claude Fable 5: 2.0%
2. Claude Opus 4.8: 1.0%
3. GPT 5.5: 0.5%
Academic

Updated 6/17/2026

SWE-bench Verified

64 models tested

Solving production software engineering tasks

Top Models

1. Claude Fable 5: 95.0%
2. Claude Opus 4.8: 88.6%
3. Claude Opus 4.8: 85.8%
Academic

Updated 6/17/2026

New

Terminal-Bench 2.1

35 models tested

State-of-the-art set of difficult terminal-based tasks

Top Models

1. Claude Fable 5: 80.5%
2. GPT 5.5: 76.4%
3. Gemini 3.5 Flash: 74.2%
Proprietary

Updated 6/17/2026

New

Vibe Code Bench v1.1

66 models tested

Can models build web applications from scratch?

Top Models

1. Claude Fable 5: 90.4%
2. Claude Opus 4.8: 82.7%
3. Claude Opus 4.8: 77.5%
Terminal-Bench 2.0

State-of-the-art set of difficult terminal-based tasks

Beta
Social Mobility