Updates
Model
05/24/2025
Claude Sonnet 4 (non-thinking) evaluated on (almost) all benchmarks!
We just evaluated the recently released Claude Sonnet 4 (non-thinking) on all benchmarks except CorpFin, MortgageTax, and Finance Agent, which are still running. We’ll post updates as soon as we have them!
- Claude Sonnet 4 (non-thinking) achieves 76.9% accuracy on average, a 7.1% improvement over Anthropic’s previous flagship model, Claude 3.7 Sonnet. The newer model is also nearly twice as fast for the same price.
- Claude Sonnet 4 (non-thinking) excels on the MGSM benchmark, edging out Claude 3.7 Sonnet (Thinking) by a tenth of a percentage point.
- Claude Sonnet 4 (non-thinking) also achieves strong performance on our proprietary CaseLaw benchmark, outperforming all previous Anthropic models.
Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!
Model
05/09/2025
Mistral Medium 3 evaluated on all benchmarks!
We just evaluated Mistral Medium 3 on all benchmarks!
- Mistral Medium 3 demonstrates consistent performance across both public and proprietary benchmarks, scoring 68.7% overall accuracy. For its size and price, it posts strong results on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42).
- The model outperforms Llama 4 Maverick (63.3% accuracy) on most benchmarks, particularly excelling in MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).
- While impressive, Mistral Medium 3 still trails Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).
- For users seeking a balance of speed and performance, Mistral Medium 3 offers much lower latency (14.37s) than Qwen 3 235B (94.31s), making it suitable for applications that need faster response times while maintaining strong reasoning capabilities.
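The comparisons above can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the overall accuracy and latency figures quoted in this update; the variable and function names are illustrative and are not part of any evaluation tooling.

```python
# Figures quoted in the Mistral Medium 3 update (overall accuracy in %, latency in seconds).
overall_accuracy = {
    "Mistral Medium 3": 68.7,
    "Llama 4 Maverick": 63.3,
    "Qwen 3 235B": 81.0,
}
latency_seconds = {
    "Mistral Medium 3": 14.37,
    "Qwen 3 235B": 94.31,
}


def point_gap(model_a: str, model_b: str) -> float:
    """Accuracy of model_a minus accuracy of model_b, in percentage points."""
    return round(overall_accuracy[model_a] - overall_accuracy[model_b], 1)


def latency_ratio(slower: str, faster: str) -> float:
    """How many times longer the slower model takes than the faster one."""
    return latency_seconds[slower] / latency_seconds[faster]


gap_vs_llama = point_gap("Mistral Medium 3", "Llama 4 Maverick")
gap_vs_qwen = point_gap("Mistral Medium 3", "Qwen 3 235B")
ratio = latency_ratio("Qwen 3 235B", "Mistral Medium 3")

print(f"vs Llama 4 Maverick: {gap_vs_llama:+.1f} points")
print(f"vs Qwen 3 235B: {gap_vs_qwen:+.1f} points")
print(f"Qwen 3 235B latency is about {ratio:.1f}x that of Mistral Medium 3")
```

On these numbers, Mistral Medium 3 sits 5.4 points above Llama 4 Maverick and 12.3 points below Qwen 3 235B overall, while responding roughly 6.6 times faster than Qwen 3 235B.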
Model
05/05/2025
Google's Gemini 2.5 Flash evaluated on most benchmarks
We just evaluated Gemini 2.5 Flash Preview on most benchmarks.
- Gemini 2.5 Flash Preview is a lightweight alternative to Google’s flagship model, Gemini 2.5 Pro Exp. It runs at a fraction of the cost and latency, making it a more accessible option.
- Like Claude 3.7 Sonnet, Gemini 2.5 Flash Preview is a hybrid reasoning model, meaning it can adaptively choose how much to think before responding.
- Gemini 2.5 Flash Preview excels on LegalBench, coming second only to the flagship Gemini 2.5 Pro Exp (and outperforming its own thinking variant, Gemini 2.5 Flash Preview (Thinking), by 1%).
- We repeatedly ran into problems with Google’s API during evaluation, which prevented us from reporting full results. We’re working with a representative from the Gemini team to resolve those issues.
Latest Model Releases
- Claude Sonnet 4 (release date: 5/22/2025)
- Mistral Medium 3.1 (05/2025) (release date: 5/7/2025)
- Qwen 3 (235B) (release date: 4/28/2025)
- Gemini 2.5 Flash Preview (release date: 4/17/2025)