
Public Enterprise LLM Benchmarks

Public model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.


07/22/2025

Kimi K2 Instruct Evaluated On Non-Agentic Benchmarks!

According to our evaluations, Kimi K2 Instruct is the new state-of-the-art open-source model.

The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.

Kimi K2 Instruct struggles on our proprietary benchmarks, failing to break the top 10 on any of them. It has particular trouble with legal tasks such as Case Law and Contract Law, and fares comparatively better on finance tasks such as CorpFin and Tax Eval.

The model offers solid value at $1.00 input / $3.00 output per million tokens: cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
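As a quick sanity check on those numbers, here is a minimal Python sketch of what a single job would cost under each price list. The workload size (2M input tokens, 0.5M output tokens) is an assumption for illustration; the per-million-token prices are the ones quoted above.

# Rough cost comparison for a hypothetical workload of 2M input
# and 0.5M output tokens; prices are $ per million tokens as quoted above.
PRICES = {
    "Kimi K2 Instruct": (1.00, 3.00),
    "DeepSeek R1 (on Together AI)": (3.00, 7.00),
    "Mistral Medium 3.1 (05/2025)": (0.40, 2.00),
}

def workload_cost(input_m, output_m, price_in, price_out):
    # Cost in dollars: token counts (in millions) times price per million.
    return input_m * price_in + output_m * price_out

for model, (price_in, price_out) in PRICES.items():
    print(f"{model}: ${workload_cost(2.0, 0.5, price_in, price_out):.2f}")
# -> Kimi K2 Instruct: $3.50
# -> DeepSeek R1 (on Together AI): $9.50
# -> Mistral Medium 3.1 (05/2025): $1.80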

We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!

View Kimi K2 Instruct Results


07/17/2025

Grok 4 on Tough Benchmarks

We evaluated Grok 4 on the Finance Agent, CorpFin, SWE-bench, and LegalBench benchmarks and found strong results, especially on our private benchmarks.

View Grok 4 Results


07/13/2025

Grok 4 Results (Continued)

In the launch livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found an even larger gap between public and private performance than on our text-only benchmarks: Grok 4 struggles to recognize images it has not seen before, highlighting the importance of high-quality private datasets for evaluating image recognition capabilities.

As we continue evaluating Grok 4, it keeps struggling on our private benchmarks. Its middling performance on Tax Eval (67.6%) and Mortgage Tax (57.5%) is consistent with our previous findings on private legal tasks like Case Law and Contract Law.

On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).

View Grok 4 Results


Latest Model Releases


Kimi K2 Instruct
Release date: 7/11/2025

Grok 4
Release date: 7/9/2025

Magistral Medium 3.1 (06/2025)
Release date: 6/10/2025

Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025
Join our mailing list to receive benchmark updates and stay up to date as new benchmarks and models are released.