Updates
Model
07/13/2025
Grok 4 Results (Continued)
In the livestream, Elon Musk called Grok 4 “partially blind.” We tested this claim on our two multimodal benchmarks, Mortgage Tax (private) and MMMU (public), and found a wider gap between public and private benchmark performance. Grok 4 struggles to recognize images it has not seen before, which highlights the importance of high-quality private datasets for evaluating image recognition capabilities.
As we evaluate Grok 4 on more of our benchmarks, the model continues to struggle on the private ones. Its middling performance on Tax Eval (67.6%) and Mortgage Tax (57.5%) is consistent with our earlier findings on private legal tasks such as Case Law and Contract Law.
On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).
Model
07/11/2025
Grok 4 Results (Continued)
We found that Grok 4 struggles on our private benchmarks, in contrast to its SOTA performance on AIME, Math 500, and GPQA.
Grok 4 delivers middle-of-the-pack performance on our private legal benchmarks. The model scores 80.6% on Case Law and 66.0% on Contract Law, underperforming Grok 3 Mini Fast Low Reasoning on both and Grok 3 Beta on Case Law. Notably, Grok 3 Beta remains our top performer on the Case Law benchmark.
On public benchmarks, Grok 4 barely cracks the top 10 on MedQA at 92.5%, narrowly outperforming Grok 2. On MGSM, it fails to break the top 10 with 90.9%. This contrasts with its SOTA performance on Math 500, suggesting that Grok 4 struggles more with language than with mathematical reasoning.
Model
07/09/2025
Grok 4 Early Results
We received early access to xAI’s latest model, Grok 4, and have run an initial set of smaller benchmarks. These early results show incredible performance: the model sets a new state of the art on the AIME, GPQA, and Math 500 benchmarks! Grok 4 is extremely capable at answering challenging math and science questions.
We are continuing to run our evaluations on our private benchmarks and will release results shortly.
Latest Model Releases
Grok 4
Release date: 7/9/2025
Magistral Medium 3.1 (06/2025)
Release date: 6/10/2025
Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025
Claude Opus 4 (Nonthinking)
Release date: 5/22/2025