Updates
Model
08/07/2025
GPT-5 Evaluated on Non-Agentic Benchmarks!
We evaluated OpenAI’s newly released GPT-5 family of models and found that GPT-5 achieves SOTA performance at a fraction of the cost of similarly performing models.
GPT-5 is the strongest of the three, with SOTA performance on the public LegalBench and AIME benchmarks.
On private benchmarks, GPT-5 and GPT-5 Mini place in the top 10 on all but CaseLaw. Most notably, GPT-5 Mini is the new SOTA model on TaxEval, at a substantially lower cost.
On public benchmarks, GPT-5 places in the top 5 and GPT-5 Mini in the top 10 on nearly every benchmark. The two models also have complementary strengths: GPT-5 is SOTA on LegalBench and AIME, while GPT-5 Mini is SOTA on LiveCodeBench.
Lastly, GPT-5 Nano delivers middle-of-the-pack performance across the board. It narrowly places in the top 10 on AIME, compared to GPT-5 and GPT-5 Mini, which top the charts.
View GPT-5 Results
Model
07/30/2025
Kimi K2 Instruct Evaluated on SWE-bench
In our SWE-bench evaluation, Kimi K2 Instruct achieved 34% accuracy, barely more than half of its published figure!
After investigating the model responses, we identified the following two sources of error (see the sketches after the list):
- The model struggles to use tools: it often includes tool calls in the response text itself. We replicated the issue on multiple popular inference providers. However, even discarding such errors only increases accuracy by around 2%.
- The model often gets stuck repeating itself, leading to unnecessarily long, incorrect responses. This is a common failure mode of models at zero temperature, though it’s most prevalent among thinking models.
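A minimal sketch of how a harness might flag the first failure mode, assuming OpenAI-style chat messages (a dict with "content" and "tool_calls" fields); the regex and the example string are illustrative, not the exact detector we used:

```python
import re

# Matches a flat JSON object naming a tool/function, e.g.
# {"name": "bash", "arguments": "ls"}, appearing inside assistant text.
INLINE_TOOL_CALL = re.compile(
    r'\{[^{}]*"(?:tool_name|function|name)"\s*:\s*"[^"]+"[^{}]*\}'
)

def has_inline_tool_call(message: dict) -> bool:
    """Flag responses that embed a tool call in plain text instead of
    emitting it through the structured tool_calls field."""
    if message.get("tool_calls"):  # a proper structured call is fine
        return False
    return bool(INLINE_TOOL_CALL.search(message.get("content") or ""))

# The failure mode we observed: the call shows up in the text itself.
bad = {"content": 'Let me run {"name": "bash", "arguments": "ls -la"}'}
print(has_inline_tool_call(bad))  # True
```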
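The second failure mode, degenerate repetition, can be screened with a simple n-gram duplication score; the window size and the example below are illustrative assumptions:

```python
def repetition_ratio(text: str, n: int = 8) -> float:
    """Fraction of word n-grams that are duplicates; values near 1.0
    indicate the model is looping on itself."""
    words = text.split()
    if len(words) <= n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A looping transcript scores near 1.0; ordinary prose scores near 0.
looping = "the fix is to retry " * 50
print(round(repetition_ratio(looping), 2))  # 0.98
```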
View Kimi K2 Results
Model
07/29/2025
NVIDIA Nemotron Super Evaluated
We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found that the Thinking variant substantially outperforms the Nonthinking variant, with the notable exception of our proprietary Contract Law benchmark.
- On Contract Law, Llama 3.3 Nemotron Super (Thinking) struggles, ranking in the bottom 10 models. Meanwhile, Llama 3.3 Nemotron Super (Nonthinking) lands in the top 3!
- On TaxEval and CaseLaw, Llama 3.3 Nemotron Super (Nonthinking) struggles significantly, while Llama 3.3 Nemotron Super (Thinking) sits solidly middle-of-the-pack.
- On public benchmarks, Llama 3.3 Nemotron Super (Nonthinking) performs abysmally across the board. Llama 3.3 Nemotron Super (Thinking) improves on all public benchmarks but still struggles, particularly on MGSM (35/46) and MMLU Pro (31/43).
- Overall, Llama 3.3 Nemotron Super (Thinking) shows substantial gains over Llama 3.3 Nemotron Super (Nonthinking): on AIME, its rank improves from 37/44 to 14/44, and on CaseLaw, accuracy increases by 12%. These results highlight the benefits of the reasoning mode.
View Nemotron Super Results
Latest Model Releases
GPT-5
Release date: 8/7/2025
GPT-5 Mini
Release date: 8/7/2025
GPT-5 Nano
Release date: 8/7/2025
Kimi K2 Instruct
Release date: 7/11/2025