The Public Standard for Real World AI Performance
Generic benchmarks only go so far.
Vals AI evaluates models on
the real tasks each industry relies on.
Updated 4/21/2026
41
models tested
A benchmark of weighted performance across finance, law, and coding tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Claude Opus 4.6 (Thinking)
Updated 4/16/2026
28
models tested
A benchmark of weighted performance across finance, law, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Gemini 3.1 Pro Preview (02/26)
Updated 4/21/2026
48
models tested
A private question-and-answer benchmark over Canadian court cases.
Top Models
GPT 5.1
GPT 4.1
GPT 5 Mini
Updated 4/16/2026
116
models tested
Evaluating language models on a wide range of open-source legal reasoning tasks.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Gemini 3 Flash (12/25)
Updated 4/16/2026
97
models tested
A private benchmark evaluating understanding of long-context credit agreements.
Top Models
Kimi K2.5
Qwen 3 Max Thinking
Claude Opus 4.6 (Thinking)
Updated 4/20/2026
45
models tested
Evaluating agents on core financial analyst tasks.
Top Models
Claude Opus 4.7
Claude Sonnet 4.6
Muse Spark
Updated 4/16/2026
69
models tested
Evaluating reading and understanding of tax certificates provided as images.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Updated 4/16/2026
104
models tested
A Vals-created set of tax questions and reference responses.
Top Models
Muse Spark
Claude Sonnet 4.6
Claude Opus 4.6 (Thinking)
Updated 4/16/2026
51
models tested
Can models support the medical billing process?
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Claude Opus 4.7
Updated 4/16/2026
51
models tested
Can models support doctors with their administrative work?
Top Models
GPT 5.1
Claude Opus 4.6 (Nonthinking)
Claude Opus 4.6 (Thinking)
Evaluating language model bias in medical questions.
Updated 4/16/2026
96
models tested
A challenging national math exam given to top high-school students.
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.2
Muse Spark
Updated 4/21/2026
25
models tested
Can models write math proofs that are formally verified?
Top Models
GPT 5.4
Claude Opus 4.7
Claude Opus 4.6 (Thinking)
An academic math benchmark on probability, algebra, and trigonometry.
A multilingual benchmark for mathematical questions.
Updated 4/16/2026
99
models tested
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
GPT 5.2
Updated 4/16/2026
97
models tested
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Pro (11/25)
Claude Opus 4.7
Updated 4/16/2026
66
models tested
A multimodal, multi-task benchmark.
Top Models
Gemini 3.1 Pro Preview (02/26)
Gemini 3 Flash (12/25)
Gemini 3 Pro (11/25)
Updated 4/16/2026
50
models tested
Problems from the International Olympiad in Informatics.
Top Models
GPT 5.4
GPT 5.2
GPT 5.3 Codex
Updated 4/16/2026
103
models tested
Our implementation of the LiveCodeBench benchmark.
Top Models
Gemini 3.1 Pro Preview (02/26)
GPT 5.2 Codex
GPT 5.3 Codex
Updated 4/16/2026
41
models tested
Solving production software engineering tasks.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
GPT 5.4
Updated 4/21/2026
53
models tested
A state-of-the-art set of difficult terminal-based tasks.
Top Models
Claude Opus 4.7
Gemini 3.1 Pro Preview (02/26)
GPT 5.3 Codex
Updated 4/21/2026
27
models tested
Can models build web applications from scratch?
Top Models
Claude Opus 4.7
GPT 5.4
GPT 5.3 Codex