Vals Index
Updated 10/16/2025
Vals Multimodal Index
A benchmark consisting of weighted performance across finance, law, coding, and education tasks, showing the potential impact LLMs can have on the economy.
Top Model:
Claude Sonnet 4.5 (Thinking)
Number of models tested
8
Updated 10/29/2025
Vals Index
A benchmark consisting of weighted performance across finance, law, and coding tasks, showing the potential impact LLMs can have on the economy (a weighted-scoring sketch follows below).
Top Model:
Claude Sonnet 4.5 (Thinking)
Number of models tested
20
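The two index descriptions above mention weighted performance but do not state the weighting. The sketch below is only a hypothetical illustration of how per-domain scores might be combined into a single index score; the domain names, weights, and example values are assumptions, not the published Vals methodology.

# Hypothetical sketch: combining per-domain scores into one weighted index.
# Domains, weights, and example scores are illustrative assumptions only.
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted average of per-domain scores (0-100 scale)."""
    total = sum(weights.values())
    return sum(scores[d] * weights[d] for d in weights) / total

scores = {"finance": 62.0, "law": 71.5, "coding": 58.3}   # example accuracies
weights = {"finance": 1.0, "law": 1.0, "coding": 1.0}     # equal weights (assumed)
print(round(weighted_index(scores, weights), 1))          # -> 63.9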
Legal Benchmarks
Updated 11/07/2025
CaseLaw (v2)
A private question-answering benchmark over Canadian court cases.
Top Model:
GPT 4.1
Number of models tested
41
Updated 10/27/2025
LegalBench
Evaluating language models on a wide range of open-source legal reasoning tasks.
Top Model:
GPT 5
Number of models tested
89
Finance Benchmarks
Updated 11/07/2025
CorpFin (v2)
A private benchmark evaluating understanding of long-context credit agreements.
Top Model:
GPT 5
Number of models tested
59
Updated 11/07/2025
Finance Agent
Evaluating agents on core financial analyst tasks.
Top Model:
Claude Sonnet 4.5 (Thinking)
Number of models tested
45
Updated 10/16/2025
MortgageTax
Evaluating the reading and understanding of tax certificates presented as images.
Top Model:
Claude 3.7 Sonnet (Nonthinking)
Number of models tested
45
Updated 11/07/2025
TaxEval (v2)
A Vals-created set of tax questions and responses.
Top Model:
GPT 5 Mini
Number of models tested
75
Healthcare Benchmarks
Updated 10/27/2025
MedQA
Evaluating language model bias in medical questions.
Top Model:
o1
Number of models tested
70
Math Benchmarks
Updated 10/27/2025
AIME
A challenging national math exam given to top high-school students.
Top Model:
GPT 5
Number of models tested
64
Updated 10/27/2025
MGSM
A multilingual benchmark for mathematical questions.
Top Model:
Claude Opus 4.1 (Thinking)
Number of models tested
66
Updated 08/26/2025
MATH 500
An academic math benchmark covering probability, algebra, and trigonometry.
Top Model:
Grok 4
Number of models tested
54
Academic Benchmarks
Updated 11/07/2025
GPQA
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Top Model:
Grok 4
Number of models tested
67
Updated 10/27/2025
MMLU Pro
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Top Model:
Claude Opus 4.1 (Nonthinking)
Number of models tested
64
Updated 10/16/2025
MMMU
A multimodal, multi-task benchmark.
Top Model:
GPT 5
Number of models tested
42
Education Benchmarks
Updated 10/16/2025
SAGE
Student Assessment with Generative Evaluation
Top Model:
Gemini 2.5 Flash (7/17) (Nonthinking)
Number of models tested
22
Coding Benchmarks
Updated 10/27/2025
IOI
A benchmark based on the International Olympiad in Informatics.
Top Model:
Grok 4
Number of models tested
21
Updated 10/27/2025
LiveCodeBench
Our implementation of the LiveCodeBench benchmark.
Top Model:
GPT 5 Mini
Number of models tested
63
Updated 10/30/2025
SWE-bench
Solving production software engineering tasks.
Top Model:
Claude Sonnet 4.5 (Thinking)
Number of models tested
25
Updated 11/07/2025
Terminal-Bench
A state-of-the-art set of difficult terminal-based tasks.
Top Model:
Claude Sonnet 4.5 (Thinking)
Number of models tested
30