Public Enterprise LLM Benchmarks

Vals Index

Updated 2/19/2026

Vals Index

A benchmark reporting weighted performance across finance, law, and coding tasks, intended to show the potential impact that LLMs can have on the economy.

Top Model: Claude Sonnet 4.6

Number of models tested: 31

Updated 2/18/2026

Vals Multimodal Index

A benchmark reporting weighted performance across finance, law, coding, and education tasks, intended to show the potential impact that LLMs can have on the economy.

Top Model: Claude Sonnet 4.6

Number of models tested: 19

Finance Benchmarks

Updated 2/18/2026

CorpFin (v2)

A private benchmark evaluating understanding of long-context credit agreements

Top Model: Kimi K2.5

Number of models tested: 84

Updated 2/18/2026

Finance Agent v1.1

Evaluating agents on core financial analyst tasks

Top Model: Claude Sonnet 4.6

Number of models tested: 32

Updated 2/18/2026

MortgageTax

Evaluating how well models read and understand tax certificates provided as images

Top Model: Gemini 3 Pro (11/25)

Number of models tested: 60

Updated 2/18/2026

TaxEval (v2)

A Vals-created set of tax questions and responses

Top Model: Claude Sonnet 4.6

Number of models tested: 92

Healthcare Benchmarks

Updated 2/19/2026

New

MedCode

Can models support the medical billing process?

Top Model: Gemini 3 Flash (12/25)

Number of models tested: 40

Updated 2/19/2026

New

MedScribe

Can models support doctors with their administrative work?

Top Model: GPT 5.1

Number of models tested: 40

Updated 2/18/2026

MedQA

Evaluating language model bias in medical questions.

Top Model: o1

Number of models tested: 92

Math Benchmarks

Updated 2/18/2026

AIME

Challenging national math exam given to top high-school students

Top Model: GPT 5.2

Number of models tested: 84

Updated 2/18/2026

ProofBench

Can models write math proofs that are formally verified?

Top Model: Claude Opus 4.6 (Thinking)

Number of models tested: 16

Academic Benchmarks

Updated 2/18/2026

GPQA

Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.

Top Model: Gemini 3 Pro (11/25)

Number of models tested: 87

Updated 2/18/2026

MMLU Pro

Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.

Top Model: Gemini 3 Pro (11/25)

Number of models tested: 84

Updated 2/18/2026

MMMU

Multimodal Multi-task Benchmark

Top Model: Gemini 3 Flash (12/25)

Number of models tested: 57

Education Benchmarks

Updated 2/18/2026

SAGE

Student Assessment with Generative Evaluation

Top Model: Claude Opus 4.5 (Thinking)

Number of models tested: 39

Coding Benchmarks

Updated 2/17/2026

IOI

International Olympiad in Informatics

Top Model: GPT 5.2

Number of models tested: 44

Updated 2/17/2026

LiveCodeBench

Our implementation of the LiveCodeBench benchmark

Top Model: GPT 5.2 Codex

Number of models tested: 93

Updated 2/18/2026

SWE-bench

Solving production software engineering tasks

Top Model: Claude Opus 4.6 (Thinking)

Number of models tested: 53

Updated 2/19/2026

New

Terminal-Bench 2.0

State-of-the-art set of difficult terminal-based tasks

Top Model: Gemini 3.1 Pro Preview (02/26)

Number of models tested: 38

Updated 2/18/2026

New

Vibe Code Bench

Can models build web applications from scratch?

Top Model: GPT 5.2

Number of models tested: 17

Beta Benchmarks

Updated 12/23/2025

New

Poker Agent

Which model can make the most money playing poker?

Top Model: GPT 5.2

Number of models tested: 17

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.
