Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking: few reflect real-world, domain-specific work. With Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Updates

Model

03/26/2025

DeepSeek V3 evaluated on all benchmarks!

We just evaluated DeepSeek V3 on all benchmarks!

  • DeepSeek V3 is DeepSeek’s latest model, with claimed generation speeds of 60 tokens/second (3x faster than V2) and an average accuracy of 73.9% across our benchmarks, 4.2% higher than previous DeepSeek versions.
  • DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
  • The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
  • It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.

View Models Page

Benchmark

03/26/2025

New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects

Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.

  • o1 achieved the highest overall accuracy at 77.7%, surpassing the worst-performing human experts (76.2%).
  • Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
  • Even the best models remain well below the best human experts (88.6%), highlighting room for further advancement.

View Benchmark

Model

03/24/2025

Command A evaluated on all benchmarks!

We just evaluated Command A on all benchmarks!

  • Command A is Cohere’s most efficient and performant model to date, specializing in agentic AI, multilingual tasks, and human-evaluated, real-life use cases.
  • On our proprietary benchmarks, Command A shows mixed performance: 23rd out of 28 models on TaxEval, but a respectable 10th out of 22 on CorpFin.
  • The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
  • However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).

View Models Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

Release date: 2/19/2025

View Model
Anthropic Claude 3.7 Sonnet

Release date: 2/19/2025

View Model
OpenAI O3 Mini

Release date: 1/31/2025

View Model
DeepSeek R1

Release date: 1/20/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.