Private question-answer benchmark over Canadian court cases.
Updated 03/26/2025
Benchmarking model performance on contract law tasks
Evaluating language models on a wide range of open-source legal reasoning tasks.
Our completely new version of the CorpFin benchmark
Evaluating Language Models on Mortgage Tax Certificates
Updated 03/05/2025
Our completely new version of the TaxEval benchmark
Evaluating language model bias in medical questions.
Extremely challenging math exam given to students
A multilingual benchmark for mathematical questions.
Academic math benchmark covering probability, algebra, and trigonometry.
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Multimodal Multi-task Benchmark