Updates
Benchmark
08/18/2025
Our CaseLaw v2 Benchmark is live!
Our CaseLaw benchmark measures how well language models perform case law reasoning and legal document analysis. We refreshed the benchmark with harder, up-to-date questions, since the first version was becoming saturated.
From our evaluations, we found:
- GPT 4.1 maintained the top performance with 78.1% accuracy.
- GPT 5 Mini emerged as a strong second-place performer and had faster processing times; Grok 4 ranked third on the benchmark.
- A common failure mode was that models identified only parts of the relevant document sections and instead relied on their general knowledge, despite being instructed otherwise.
While top models performed well, many still struggled with the nuanced interpretation required for legal analysis. CaseLaw v2 highlights both current strengths and the work ahead for applying AI in legal workflows.
View Our CaseLaw v2 Benchmark
Benchmark
08/11/2025
Is your model smarter than a High-Schooler? Introducing our IOI Benchmark
Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). This suggests that advanced models are saturating the IMO, so we decided to test models on the International Olympiad in Informatics (IOI)!
From our evaluations, we found:
- Grok 4 wins convincingly, placing first on both the 2024 and 2025 exams.
- Models struggle to write C++ at the level of the best high-school students; no model qualifies for a medal on either exam.
- Only the largest and most expensive models even come close to placing. The only models to achieve >10% performance all cost at least $2 per question. Claude Opus 4.1 (Nonthinking) costs over $10 per question!
- Consistent performance across the 2024 and 2025 exams suggests that LLM labs aren't currently training on the IOI, meaning this benchmark is relatively free from data contamination.
View Our IOI Benchmark
Model
08/09/2025
Opus 4.1 (Thinking) Evaluated!
We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.
- On our private benchmarks, Claude Opus 4.1 (Thinking) lands squarely in the middle of the pack, barely making the top 10 on our TaxEval benchmark.
- On public benchmarks, however, Claude Opus 4.1 (Thinking) ranks in the top 10 on 6 of the benchmarks we evaluated. Notably, it takes 2nd place on MMLU Pro behind only Claude Opus 4.1 (Nonthinking) and claims 1st place on MGSM.
View Opus 4.1 (Thinking) Evaluated!
Latest Model Releases
GPT 5
Release date: 08/07/2025
GPT 5 Mini
Release date: 08/07/2025
GPT 5 Nano
Release date: 08/07/2025
Claude Opus 4.1 (Nonthinking)
Release date: 08/05/2025