Public Enterprise LLM Benchmarks

Vals Index

Updated 11/18/2025

Vals Index

A benchmark of weighted performance across finance, law, and coding tasks, showing the potential impact LLMs can have on the economy (an illustrative weighting sketch follows this card).

Top Model: Claude Sonnet 4.5 (Thinking)
Number of models tested: 24
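
The page does not state the weighting itself, so as a rough illustration of what "weighted performance across tasks" means mechanically, here is a minimal Python sketch. The function name, category weights, and scores are all hypothetical, not Vals' published methodology.

```python
# Illustrative only: the weights and scores below are made up,
# not Vals' published methodology.
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-category benchmark scores into a single index.

    scores  -- score per category, e.g. {"finance": 0.71, ...}
    weights -- relative importance per category, on any positive scale;
               normalized here so the result is a weighted average.
    """
    total = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total

# Example with made-up numbers; equal weights reduce to a plain average.
print(weighted_index(
    {"finance": 0.71, "law": 0.64, "coding": 0.58},
    {"finance": 1.0, "law": 1.0, "coding": 1.0},
))  # ~0.643
```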

Updated 11/18/2025

Vals Multimodal Index (New)

A benchmark of weighted performance across finance, law, coding, and education tasks, showing the potential impact LLMs can have on the economy.

Top Model: Claude Sonnet 4.5 (Thinking)
Number of models tested: 12
Finance Benchmarks

Updated 11/17/2025

CorpFin (v2)

A private benchmark evaluating understanding of long-context credit agreements.

Top Model: Grok 4 Fast (Reasoning)
Number of models tested: 62

Updated 11/18/2025

Finance Agent

Evaluating agents on core financial analyst tasks.

Top Model: GPT 5.1
Number of models tested: 48

Updated 11/18/2025

MortgageTax

Evaluating reading and understanding of tax certificates provided as images.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 49

Updated 11/13/2025

TaxEval (v2)

A Vals-created set of tax questions with reference responses.

Top Model: Grok 3
Number of models tested: 78
Healthcare Benchmarks

Updated 11/13/2025

MedQA

Evaluating language model bias in medical questions.

Top Model: o1
Number of models tested: 74
Math Benchmarks

Updated 11/13/2025

AIME

Challenging national math exam given to top high-school students.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 68

Updated 11/13/2025

MGSM

A multilingual benchmark for mathematical questions.

Top Model: Claude Opus 4.1 (Thinking)
Number of models tested: 70

Updated 08/26/2025

MATH 500

Academic math benchmark on probability, algebra, and trigonometry.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 54
Academic Benchmarks

Updated 11/18/2025

GPQA

Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 70

Updated 11/13/2025

MMLU Pro

Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 68

Updated 11/13/2025

MMMU

A massive multi-discipline multimodal understanding benchmark.

Top Model: GPT 5.1
Number of models tested: 45
Education Benchmarks

Updated 11/18/2025

SAGE

Student Assessment with Generative Evaluation

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 26
Coding Benchmarks

Updated 11/13/2025

IOI

Based on the International Olympiad in Informatics.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 25

Updated 11/14/2025

LiveCodeBench

Our implementation of the LiveCodeBench benchmark.

Top Model: GPT 5 Mini
Number of models tested: 68

Updated 11/19/2025

SWE-bench

Solving production software engineering tasks.

Top Model: Gemini 3 Pro (11/25)
Number of models tested: 31

Updated 11/19/2025

Terminal-Bench

State-of-the-art set of difficult terminal-based tasks.

Top Model: Claude Sonnet 4.5 (Thinking)
Number of models tested: 35


Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.
