Benchmarks
A private question-answering benchmark over Canadian court cases.
Updated 05/30/2025
Benchmarking model performance on contract law tasks.
Evaluating language models on a wide range of open-source legal reasoning tasks.
Our completely new version of the CorpFin benchmark.
Evaluating agents on core financial analyst tasks.
Evaluating language models on mortgage tax certificates.
Our completely new version of the TaxEval benchmark.
Evaluating language model bias in medical questions.
An extremely challenging math exam given to students.
A multilingual benchmark for mathematical questions.
Academic math benchmark on probability, algebra, and trigonometry.
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
A multimodal, multi-task benchmark.
Our implementation of the LiveCodeBench benchmark.
Updated 06/16/2025
Solving production software engineering tasks.
Updated 06/13/2025