A private question-answer benchmark over Canadian court cases.
Updated 09/29/2025
Evaluating language models on a wide range of open-source legal reasoning tasks.
A private benchmark evaluating understanding of long-context credit agreements.
Evaluating agents on core financial analyst tasks.
Evaluating reading and understanding of tax certificates provided as images.
A Vals-created set of tax questions with reference responses.
Evaluating language model bias in medical questions.
A challenging national math exam given to top high-school students.
A multilingual benchmark for mathematical questions.
An academic math benchmark covering probability, algebra, and trigonometry.
Updated 08/26/2025
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
A multimodal, multi-task benchmark.
Based on the International Olympiad in Informatics.
Updated 09/30/2025
Our implementation of the LiveCodeBench benchmark.
Solving production software engineering tasks.
A state-of-the-art set of difficult terminal-based tasks.
Updated 09/27/2025