Independent Evaluation, Unbiased Benchmarks

Testing AI on Real-World Tasks

We benchmark the world's leading AI models on rigorous, domain-specific tasks in finance, law, software, healthcare, and more. We run all of our own evaluations and create many of our benchmarks in-house.

Top-performing models from the Vals Index, covering a range of tasks across finance, coding, and law.


Vals Index

4/7/2026
Vals Index Scores:

1. Claude Sonnet 4.6 (Anthropic): 67.74%
2. Claude Opus 4.6 (Thinking) (Anthropic): 65.88%
3. Gemini 3.1 Pro Preview (02/26) (Google): 65.42%
4. GPT 5.4 (OpenAI): 64.77%
5. GPT 5.2 (OpenAI): 63.55%
6. GLM 5.1 (zAI): 63.17%
7. Claude Opus 4.5 (Thinking) (Anthropic): 62.93%
8. Gemini 3 Pro (11/25) (Google): 61.81%
9. GLM 5 (zAI): 61.41%
10. Gemini 3 Flash (12/25) (Google): 60.61%
11. GPT 5.1 (OpenAI): 60.38%
12. Claude Sonnet 4.5 (Thinking) (Anthropic): 59.88%
13. Kimi K2.5 (Moonshot AI): 59.61%
14. MiniMax-M2.7 (MiniMax): 59.58%
15. Qwen 3.5 Plus (Alibaba): 59.08%
16. GPT 5.4 Mini (OpenAI): 57.14%

Vals AI Updates

Fresh updates from our testing queue

04/07/26

GLM 5.1 takes the top open-weight spot on the Vals Index

Benchmarks

Accuracy and ranking on each benchmark (± is the reported margin of error):

63.17% ± 1.95 (rank 6 / 36)
51.55% ± 1.13 (rank 40 / 43)
64.45% ± 0.94 (rank 14 / 94)
57.66% ± 2.80 (rank 6 / 41)
41.60% ± 2.12 (rank 17 / 48)
72.27% ± 2.06 (rank 36 / 48)
22.00% ± 4.16 (rank 6 / 23)
71.19% ± 0.90 (rank 46 / 101)
91.88% ± 0.44 (rank 19 / 93)
84.52% ± 1.82 (rank 17 / 96)
Contact us by email at contact@vals.ai.
Proprietary Benchmarks (contact us to get access)
Academic Benchmarks

Read about our methodology.

Industry Leaderboard

Independent benchmarks for industry-specific AI performance.


Model Performance Over Time

Tracking how foundation models improve with each release

[Chart: Vals Index score by model release date, Feb '25 through Apr '26]

Qwen 3.5 Plus (Alibaba): 59.08%, released Feb '26
Claude Sonnet 4.5 (Thinking) (Anthropic): 59.88%, released Sep '25
Claude Opus 4.5 (Thinking) (Anthropic): 62.93%, released Nov '25
Claude Opus 4.6 (Thinking) (Anthropic): 65.88%, released Feb '26
Claude Sonnet 4.6 (Anthropic): 67.74%, released Feb '26
Trinity Large Thinking (Arcee AI): 42.60%, released Apr '26
Command A (Cohere): 19.20%, released Mar '25
DeepSeek V3.2 (Thinking) (DeepSeek): 37.58%, released Dec '25
Gemini 2.5 Pro (Google): 48.82%, released Jul '25
Gemini 3 Pro (11/25) (Google): 61.81%, released Nov '25
Gemini 3.1 Pro Preview (02/26) (Google): 65.42%, released Feb '26
MiniMax-M2.1 (MiniMax): 51.80%, released Dec '25
MiniMax-M2.5 (MiniMax): 53.77%, released Feb '26
MiniMax-M2.7 (MiniMax): 59.58%, released Mar '26
Mistral Large 3 (Mistral): 34.86%, released Dec '25
Kimi K2 Thinking (Moonshot AI): 50.97%, released Nov '25
Kimi K2.5 (Moonshot AI): 59.61%, released Jan '26
GPT OSS 120B (OpenAI): 36.07%, released Aug '25
GPT 5 (OpenAI): 56.10%, released Aug '25
GPT 5.1 (OpenAI): 60.38%, released Nov '25
GPT 5.2 (OpenAI): 63.55%, released Dec '25
GPT 5.4 (OpenAI): 64.77%, released Mar '26
Grok 4 (xAI): 54.65%, released Jul '25
Grok 4.20 (Reasoning) (xAI): 56.59%, released Mar '26
GLM 4.7 (zAI): 54.83%, released Dec '25
GLM 5 (zAI): 61.41%, released Feb '26
GLM 5.1 (zAI): 63.17%, released Apr '26