Updates
05/09/2025
Mistral Medium 3 evaluated on all benchmarks!
We just evaluated Mistral Medium 3 on all benchmarks!
- Mistral Medium 3 demonstrates consistent performance across both public and proprietary benchmarks, scoring 68.7% overall accuracy, with strong results for its size and price on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42).
- The model outperforms Llama 4 Maverick (63.3% accuracy) on most benchmarks, though it narrowly trails on MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).
- While impressive, Mistral Medium 3 still trails Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).
- For users seeking a balance of speed and performance, Mistral Medium 3 offers much lower latency (14.37s) than Qwen 3 235B (94.31s), making it suitable for applications that require faster response times while maintaining strong reasoning capabilities.
05/05/2025
Google's Gemini 2.5 Flash evaluated on most benchmarks
We just evaluated Gemini 2.5 Flash Preview on most benchmarks.
- Gemini 2.5 Flash Preview is a lightweight alternative to Google’s flagship model, Gemini 2.5 Pro Exp. It runs at a fraction of the cost and latency, making it a more accessible option.
- Like Claude 3.7 Sonnet, Gemini 2.5 Flash Preview is a hybrid reasoning model, meaning it can adaptively choose how much to think before responding.
- Gemini 2.5 Flash Preview excels on LegalBench, ranking second only to the flagship Gemini 2.5 Pro Exp (and outperforming its own thinking variant, Gemini 2.5 Flash Preview (Thinking), by 1%).
- We consistently had difficulty with Google’s API during evaluation, which prevented us from reporting full results. We’re working with a representative from the Gemini team to resolve those issues.
05/05/2025
Qwen 3 235B evaluations released!
We just evaluated Qwen 3 235B on all benchmarks!
- Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.
- With its “thinking allowed” approach, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4-mini, on mathematical reasoning tasks.
- Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly struggling on TaxEval, where it ranks #29 out of 43 evaluated models.
- This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvements could enhance its performance on domain-specific tasks.
Latest Model Releases
Mistral Medium 3 (05/2025)
Release date: 05/07/2025
Qwen 3 (235B)
Release date: 04/28/2025
Gemini 2.5 Flash Preview
Release date: 04/17/2025
Gemini 2.5 Flash Preview (Thinking)
Release date: 04/17/2025