The Public Standard for Real World AI Performance
Generic benchmarks only go so far.
Vals AI evaluates models on
the real tasks each industry relies on.
Updated 4/21/2026
41
models tested
A benchmark of weighted performance across finance, law, and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Claude Opus 4.6 (Thinking)
Updated 4/16/2026
28
models tested
A benchmark of weighted performance across finance, law, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Gemini 3.1 Pro Preview (02/26)
Updated 4/21/2026
48
models tested
A private question-and-answer benchmark over Canadian court cases.
Top Models
GPT 5.1
GPT 4.1
GPT 5 Mini
Updated 4/16/2026
116
models tested
Evaluating language models on a wide range of open-source legal reasoning tasks.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Gemini 3 Flash (12/25)
Updated 4/16/2026
97
models tested
A private benchmark evaluating understanding of long-context credit agreements.
Top Models
Kimi K2.5
Qwen 3 Max Thinking
Claude Opus 4.6 (Thinking)
Updated 4/20/2026
45
models tested
Evaluating agents on core financial analyst tasks.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Muse Spark
Updated 4/16/2026
69
models tested
Evaluating reading and understanding of tax certificates provided as images.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 4/16/2026
104
models tested
A Vals-created set of tax questions and reference responses.
Top Models
Muse Spark
Claude Sonnet 4.6
Claude Opus 4.6 (Thinking)
Updated 4/16/2026
51
models tested
Can models support the medical billing process?
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Claude Opus 4.7
Updated 4/16/2026
51
models tested
Can models support doctors with their administrative work?
Top Models
GPT 5.1
Claude Opus 4.6 (Nonthinking)
Claude Opus 4.6 (Thinking)
Evaluating language model bias in medical questions.
Updated 4/16/2026
96
models tested
A challenging national math exam given to top high-school students.
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.2
Muse Spark
Updated 4/21/2026
25
models tested
Can models write math proofs that are formally verified?
Top Models
GPT 5.4
Claude Opus 4.7
Claude Opus 4.6 (Thinking)
An academic math benchmark on probability, algebra, and trigonometry.
A multilingual benchmark for mathematical questions.
Updated 4/16/2026
99
models tested
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
GPT 5.2
Updated 4/16/2026
97
models tested
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Claude Opus 4.7
Updated 4/16/2026
66
models tested
A multimodal, multi-task benchmark.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Gemini 3 Pro (11/25)
Updated 4/16/2026
50
models tested
Problems from the International Olympiad in Informatics.
Top Models
GPT 5.4
GPT 5.2
GPT 5.3 Codex
Updated 4/16/2026
103
models tested
Our implementation of the LiveCodeBench benchmark.
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.2 Codex
GPT 5.3 Codex
Updated 4/16/2026
41
models tested
Solving production software engineering tasks.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
GPT 5.4
Updated 4/21/2026
53
models tested
A state-of-the-art set of difficult terminal-based tasks.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
GPT 5.3 Codex
Updated 4/21/2026
27
models tested
Can models build web applications from scratch?
Top Models
Claude Opus 4.7
GPT 5.4
GPT 5.3 Codex