The Public Standard for Real-World AI Performance

Claude Fable 5

Claude Opus 4.8

GPT 5.5

Legal

Industry Partner

Updated 6/17/2026

Harvey's Legal Agent Benchmark

14

models tested

Tests an agent's ability to complete legal work using documents, spreadsheets, presentations, and file-system tools.

Top Models

Claude Fable 5

Claude Opus 4.8

GLM 5.2

Updated 6/17/2026

LegalBench

119

models tested

Evaluating language models on a wide range of open source legal reasoning tasks.

Top Models

Claude Fable 5

88.6%

Gemini 3.1 Pro Preview (02/26)

Gemini 3 Pro (11/25)

CaseLaw v2

Private question-answer benchmark over Canadian court cases.

Finance

Updated 6/17/2026

CorpFin v2

116

models tested

A private benchmark evaluating understanding of long-context credit agreements

Top Models

Claude Fable 5

Grok 4.3

GPT 5.5

Updated 6/17/2026

Finance Agent v2

28

models tested

Evaluating agents on core financial analyst tasks

Top Models

Gemini 3.5 Flash

Claude Fable 5

Claude Opus 4.8

Updated 6/10/2026

MortgageTax

80

models tested

Evaluating how well models read and understand tax certificates provided as images

Top Models

Claude Opus 4.7

Claude Opus 4.8

Gemini 3.1 Pro Preview (02/26)

69.4%

Updated 6/17/2026

TaxEval v2

122

models tested

A Vals-created set of tax questions and responses

Top Models

Muse Spark

Claude Sonnet 4.6

Claude Fable 5

Healthcare

Updated 6/17/2026

MedCode

68

models tested

Can models support the medical billing process?

Top Models

Gemini 3.1 Pro Preview (02/26)

Claude Fable 5

Gemini 3 Flash (12/25)

55.9%

Updated 6/17/2026

MedScribe

65

models tested

Can models support doctors with their administrative work?

Top Models

Claude Fable 5

GPT 5.1

MiniMax-M3

MedQA

Evaluating language model bias in medical questions.

Math

Updated 6/17/2026

ProofBench

43

models tested

Can models write math proofs that are formally verified?

Top Models

Claude Fable 5

Claude Opus 4.8

GPT 5.4 (xhigh)

AIME

Challenging national math exam given to top high-school students

MATH 500

Academic math benchmark on probability, algebra, and trigonometry

MGSM

A multilingual benchmark for mathematical questions.

Academic

Updated 6/17/2026

GPQA Diamond

116

models tested

Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.

Top Models

Gemini 3.1 Pro Preview (02/26)

Claude Fable 5

GPT 5.5

Updated 6/17/2026

MMLU Pro

115

models tested

Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.

Top Models

Claude Fable 5

91.5%

Gemini 3.1 Pro Preview (02/26)

Gemini 3 Pro (11/25)

Updated 6/10/2026

MMMU Pro

76

models tested

Multimodal Multi-task Benchmark

Top Models

Claude Fable 5

Gemini 3.5 Flash

GPT 5.5

Education

Updated 6/10/2026

SAGE

61

models tested

Student Assessment with Generative Evaluation

Top Models

Claude Opus 4.7

Gemma 4 31B IT

Claude Opus 4.8

Coding

Updated 6/17/2026

Code Migration

21

models tested

Can language models reimplement working programs in another language?

Top Models

Claude Fable 5

Claude Opus 4.8

GPT 5.5

Updated 6/9/2026

IOI

55

models tested

International Olympiad in Informatics

Top Models

Claude Fable 5

GPT 5.4 (xhigh)

GPT 5.2

Updated 6/17/2026

LiveCodeBench

122

models tested

Our implementation of the LiveCodeBench benchmark

Top Models

Claude Fable 5

89.8%

Gemini 3.1 Pro Preview (02/26)

GPT 5.2 Codex

Updated 6/17/2026

ProgramBench

24

models tested

Can language models rebuild programs from scratch?

Top Models

Claude Fable 5

Claude Opus 4.8

GPT 5.5

Updated 6/17/2026

SWE-bench Verified

64

models tested

Solving production software engineering tasks

Top Models

Claude Fable 5

Claude Opus 4.8

Claude Opus 4.8

Updated 6/17/2026

Terminal-Bench 2.1

35

models tested

State-of-the-art set of difficult terminal-based tasks

Top Models

Claude Fable 5

GPT 5.5

Gemini 3.5 Flash

Updated 6/17/2026

Vibe Code Bench v1.1

66

models tested

Can models build web applications from scratch?

Top Models

Claude Fable 5

Claude Opus 4.8

Claude Opus 4.8

Terminal-Bench 2.0

State-of-the-art set of difficult terminal-based tasks

Beta

Updated 12/23/2025

New

Poker Agent

17

models tested

Which model can make the most money playing poker?

Top Models

GPT 5.2

GPT 5

Gemini 3 Flash (12/25)

1100.2

Social Mobility