Updates
Model
08/07/2025
GPT-5 Evaluated on Non-Agentic Benchmarks!
We evaluated OpenAI’s newly released GPT-5 family of models and found that GPT-5 achieves SOTA performance at a fraction of the cost of similarly performing models.
GPT-5 is the strongest of the three, with SOTA performance on the public LegalBench and AIME benchmarks.
On private benchmarks, GPT-5 and GPT-5 Mini place in the top 10 on all but CaseLaw. Most notably, GPT-5 Mini is the new SOTA model on TaxEval, at a substantially lower cost.
On public benchmarks, GPT-5 places in the top 5 and GPT-5 Mini in the top 10 on nearly every benchmark. The two models also have complementary strengths: GPT-5 is SOTA on LegalBench and AIME, while GPT-5 Mini is SOTA on LiveCodeBench.
Lastly, GPT-5 Nano delivers middle-of-the-pack performance across the board. It narrowly places in the top 10 on AIME, compared to GPT-5 and GPT-5 Mini, which top the charts.
View GPT-5 Results
Model
07/30/2025
Kimi K2 Instruct Evaluated on SWE-bench
In our SWE-bench evaluation, Kimi K2 Instruct achieved 34% accuracy, barely more than half of its published figure!
After investigating the model responses, we identified the following two sources of error (see the sketches after the list):
- The model struggles to use tools: it often includes tool calls in the response text itself. We replicated the issue on multiple popular inference providers. However, even discarding such errors only increases accuracy by around 2%.
- The model often gets stuck repeating itself, leading to unnecessarily long, incorrect responses. This is a common failure mode of models at zero temperature, though it’s most prevalent among thinking models.
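A minimal sketch of how a harness might flag the first failure mode, assuming OpenAI-style chat messages (a dict with "content" and "tool_calls" fields); the regex and the example string are illustrative, not the exact detector we used:

```python
import re

# Matches a flat JSON object naming a tool/function, e.g.
# {"name": "bash", "arguments": "ls"}, appearing inside assistant text.
INLINE_TOOL_CALL = re.compile(
    r'\{[^{}]*"(?:tool_name|function|name)"\s*:\s*"[^"]+"[^{}]*\}'
)

def has_inline_tool_call(message: dict) -> bool:
    """Flag responses that embed a tool call in plain text instead of
    emitting it through the structured tool_calls field."""
    if message.get("tool_calls"):  # a proper structured call is fine
        return False
    return bool(INLINE_TOOL_CALL.search(message.get("content") or ""))

# The failure mode we observed: the call shows up in the text itself.
bad = {"content": 'Let me run {"name": "bash", "arguments": "ls -la"}'}
print(has_inline_tool_call(bad))  # True
```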
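The second failure mode, degenerate repetition, can be screened with a simple n-gram duplication score; the window size and the example below are illustrative assumptions:

```python
def repetition_ratio(text: str, n: int = 8) -> float:
    """Fraction of word n-grams that are duplicates; values near 1.0
    indicate the model is looping on itself."""
    words = text.split()
    if len(words) <= n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A looping transcript scores near 1.0; ordinary prose scores near 0.
looping = "the fix is to retry " * 50
print(round(repetition_ratio(looping), 2))  # 0.98
```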
View Kimi K2 Results
Model
07/29/2025
NVIDIA Nemotron Super Evaluated
We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found that the Thinking variant substantially outperforms the Nonthinking variant, with the notable exception of our proprietary Contract Law benchmark.
- On Contract Law, Llama 3.3 Nemotron Super (Thinking) struggles, ranking in the bottom 10 models. Meanwhile, Llama 3.3 Nemotron Super (Nonthinking) lands in the top 3!
- On TaxEval and CaseLaw, Llama 3.3 Nemotron Super (Nonthinking) struggles significantly, while Llama 3.3 Nemotron Super (Thinking) sits solidly middle-of-the-pack.
- On public benchmarks, Llama 3.3 Nemotron Super (Nonthinking) performs abysmally across the board. Llama 3.3 Nemotron Super (Thinking) improves on all public benchmarks but still struggles, particularly on MGSM (35/46) and MMLU Pro (31/43).
- Overall, Llama 3.3 Nemotron Super (Thinking) shows substantial gains over Llama 3.3 Nemotron Super (Nonthinking): on AIME, its rank improves from 37/44 to 14/44, and on CaseLaw, accuracy increases by 12%. These results highlight the benefits of the reasoning mode.
View Nemotron Super Results
Latest Model Releases
GPT-5
Release date: 8/7/2025
GPT-5 Mini
Release date: 8/7/2025
GPT-5 Nano
Release date: 8/7/2025
Kimi K2 Instruct
Release date: 7/11/2025