Updates
Model
09/30/2025
Magistral 1.2 (Small and Medium) Evaluated
We evaluated Magistral Medium 1.2 (09/2025) and Magistral Small 1.2 (09/2025) and found that both models perform decently for their size, especially on coding tasks, though they also struggled on many benchmarks.
- Magistral Medium performs well on academic and coding benchmarks, placing in the top 20 on LiveCodeBench and AIME. However, the model struggles on our proprietary benchmarks, particularly MortgageTax and CaseLaw.
- Surprisingly, Magistral Small tends to do better on finance and academic benchmarks, most notably outperforming Medium on MortgageTax (+8.8%). The model also does well on LiveCodeBench and AIME. However, Small struggles on our proprietary CorpFin and CaseLaw benchmarks, along with GPQA and MMLU Pro.
- Much of the performance loss came from the models failing to output results in the required format.
The Medium model is priced at $2 / $5, and the Small at $0.5 / $1.5. The Small model has open weights, whereas the Medium model is only available via API.
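Assuming those prices follow the usual "$input / $output per million tokens" convention (not stated explicitly above), per-request cost works out as a simple weighted sum:

```python
# Sketch of per-request cost at the listed prices, assuming the common
# "$X input / $Y output per million tokens" convention (an assumption here).
PRICES = {
    "magistral-medium-1.2": (2.00, 5.00),  # ($/M input tokens, $/M output tokens)
    "magistral-small-1.2":  (0.50, 1.50),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the assumed per-million-token pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# E.g., a 10k-token prompt with a 2k-token response on Medium:
print(f"${request_cost('magistral-medium-1.2', 10_000, 2_000):.3f}")  # $0.030
```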
Model
09/29/2025
Sonnet 4.5 sets new SOTAs
We ran the recently released Claude Sonnet 4.5 (Thinking) on our benchmarks and found very strong performance:
- On Finance Agent, it beats the previous state-of-the-art by five percentage points.
- It also takes the #1 spot on SWE-Bench and Terminal Bench, beating out GPT-5 Codex.
- It ranks in the top 10 on the majority of our benchmarks, and it outperforms Claude Sonnet 4 (Thinking) on almost all of them.
- It has a 1-million-token context window when the flag "context-1m-2025-08-07" is enabled.
Overall, this model is extremely capable, at the same mid-range price point as its predecessor.
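One way the 1M-context flag mentioned above is typically enabled is via a beta request header. A sketch against Anthropic's Messages API, assuming the flag travels in the standard "anthropic-beta" header and that "claude-sonnet-4-5" is the model ID (both assumptions, not confirmed by this post):

```shell
# Sketch: request Claude Sonnet 4.5 with the 1M-token context beta enabled.
# $ANTHROPIC_API_KEY must be set; the model ID below is an assumption.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: context-1m-2025-08-07" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```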
Model
09/26/2025
Gemini 2.5 Flash (09/25) Models Evaluated
We evaluated the updated Gemini 2.5 Flash (Thinking) and Gemini 2.5 Flash Lite models and found the following:
- Flash improved over the previous version on Terminal Bench (+5%), GPQA (+17.2%), and our private Corp Fin benchmark (+4.4%). It also ranks #3/38 on MMMU and #6/20 on SWE-Bench (for the thinking model), while delivering this performance at half the cost of similar models.
- Flash Lite matches Flash on several public benchmarks, making it a very cost-effective option. However, Flash outperforms Lite by ~10% on our private benchmarks (CaseLaw v2, TaxEval, MortgageTax).
Overall, the latest update to Gemini 2.5 Flash is a highly efficient model that balances strong performance with low cost.
Latest Model Releases
Claude Sonnet 4.5 (Thinking)
Release date: 9/29/2025
Claude Sonnet 4.5 (Nonthinking)
Release date: 9/29/2025
Gemini 2.5 Flash (Thinking)
Release date: 9/25/2025
Gemini 2.5 Flash (Nonthinking)
Release date: 9/25/2025