The Public Standard for Real World AI Performance
Generic benchmarks only go so far.
Vals AI evaluates models on
the real tasks each industry relies on.
Updated 6/17/2026
30
models tested
A benchmark of weighted performance across finance and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/10/2026
21
models tested
A benchmark of weighted performance across finance, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/17/2026
14
models tested
Tests an agent's ability to complete legal work using documents, spreadsheets, presentations, and file-system tools.
Top Models
Claude Fable 5
Claude Opus 4.8
GLM 5.2
Updated 6/17/2026
119
models tested
Evaluating language models on a wide range of open source legal reasoning tasks.
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Private question-answer benchmark over Canadian court cases.
Updated 6/17/2026
116
models tested
A private benchmark evaluating understanding of long-context credit agreements
Top Models
Claude Fable 5
Grok 4.3
GPT 5.5
Updated 6/17/2026
New
28
models tested
Evaluating agents on core financial analyst tasks
Top Models
Gemini 3.5 Flash
Claude Fable 5
Claude Opus 4.8
Updated 6/10/2026
80
models tested
Evaluating how well models read and understand tax certificates presented as images
Top Models
Claude Opus 4.7
Claude Opus 4.8
Gemini 3.1 Pro Preview (02/26)
Updated 6/17/2026
122
models tested
A Vals-created set of tax questions and responses
Top Models
Muse Spark
Claude Sonnet 4.6
Claude Fable 5
Updated 6/17/2026
68
models tested
Can models support the medical billing process?
Top Models
Gemini 3.1 Pro Preview (02/26)
Claude Fable 5
Gemini 3 Flash (12/25)
Updated 6/17/2026
65
models tested
Can models support doctors with their administrative work?
Top Models
Claude Fable 5
GPT 5.1
MiniMax-M3
Evaluating language model bias in medical questions.
Challenging national math exam given to top high-school students
Academic math benchmark on probability, algebra, and trigonometry
A multilingual benchmark for mathematical questions.
Updated 6/17/2026
116
models tested
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Models
Gemini 3.1 Pro Preview (02/26)
Claude Fable 5
GPT 5.5
Updated 6/17/2026
115
models tested
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 6/10/2026
76
models tested
Multimodal Multi-task Benchmark
Top Models
Claude Fable 5
Gemini 3.5 Flash
GPT 5.5
Updated 6/17/2026
21
models tested
Can language models reimplement working programs in another language?
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/9/2026
55
models tested
International Olympiad in Informatics
Top Models
Claude Fable 5
GPT 5.4 (xhigh)
GPT 5.2
Updated 6/17/2026
122
models tested
Our implementation of the LiveCodeBench benchmark
Top Models
Claude Fable 5
Gemini 3.1 Pro Preview (02/26)
GPT 5.2 Codex
Updated 6/17/2026
New
24
models tested
Can language models rebuild programs from scratch?
Top Models
Claude Fable 5
Claude Opus 4.8
GPT 5.5
Updated 6/17/2026
64
models tested
Solving production software engineering tasks
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Opus 4.8
Updated 6/17/2026
New
35
models tested
State-of-the-art set of difficult terminal-based tasks
Top Models
Claude Fable 5
GPT 5.5
Gemini 3.5 Flash
Updated 6/17/2026
New
66
models tested
Can models build web applications from scratch?
Top Models
Claude Fable 5
Claude Opus 4.8
Claude Opus 4.8
State-of-the-art set of difficult terminal-based tasks
Social Mobility
Updated 6/9/2026
Public Benefits Bench
13
models tested
Can AI help people navigate SNAP benefits?
Top Models
Claude Fable 5
Claude Opus 4.8
MiniMax-M3