Benchmarks
Private question-answer benchmark over Canadian court cases.
Updated 04/18/2025
Benchmarking model performance on contract law tasks.
Evaluating language models on a wide range of open source legal reasoning tasks.
Our completely new version of the CorpFin benchmark.
Finance Agent
Updated 04/22/2025
Evaluating Language Models on Mortgage Tax Certificates
Our completely new version of the TaxEval benchmark.
Evaluating language model bias in medical questions.
Extremely challenging math exam given to students
A multilingual benchmark for mathematical questions.
Academic math benchmark on probability, algebra, and trigonometry
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Academic multiple-choice benchmark covering 14 subjects including STEM, humanities, and social sciences.
Multimodal Multi-task Benchmark