
Public Enterprise LLM Benchmarks

Public model benchmarks are seriously lacking: they rarely reflect real-world work. With Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.


07/13/2025

Grok 4 Results (Continued)

In the launch livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found an even wider gap between public and private results: Grok 4 struggles to interpret images it has not seen before, underscoring the importance of high-quality private datasets for evaluating image-recognition capabilities.
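
As a rough illustration of what one of these multimodal checks involves, the sketch below sends a single image-plus-question item to Grok 4 and grades the reply. It assumes xAI's OpenAI-compatible API and an assumed model id of "grok-4"; the example item, the URL, and the exact-match grading are hypothetical stand-ins, not our actual harness or data.

# Minimal sketch of a multimodal benchmark query against Grok 4.
# Assumes xAI's OpenAI-compatible endpoint and model id; the item
# and the naive grading below are hypothetical illustrations.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_KEY")

# Hypothetical item standing in for a private benchmark question.
items = [
    ("https://example.com/form-1098.png",
     "What mortgage interest amount is reported in Box 1?",
     "$12,450.00"),
]

correct = 0
for image_url, question, expected in items:
    response = client.chat.completions.create(
        model="grok-4",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    answer = response.choices[0].message.content or ""
    correct += int(expected in answer)  # naive grading; real graders are looser

print(f"Accuracy: {correct / len(items):.1%}")

The point of running this over a private set is that the images are ones no public model has trained on, which is what makes the comparison against public results like MMMU informative.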

As we continue to evaluate Grok 4, the model keeps struggling on our private benchmarks. Its middling scores on Tax Eval (67.6%) and Mortgage Tax (57.5%) are consistent with our earlier findings on private legal tasks such as Case Law and Contract Law.

On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).

View Grok 4 Results


07/11/2025

Grok 4 Results (Continued)

We found that Grok 4 struggles on our private benchmarks, in contrast to its SOTA performance on public benchmarks like AIME, Math 500, and GPQA.

Grok 4 delivers middle-of-the-pack performance on our private legal benchmarks, scoring 80.6% on Case Law and 66.0% on Contract Law. It trails Grok 3 Mini Fast Low Reasoning on both tasks, and on Case Law it also trails Grok 3 Beta, which remains our top performer on that benchmark.

On public benchmarks, Grok 4 barely cracks the top 10 on MedQA at 92.5%, narrowly outperforming Grok 2. On MGSM, it fails to break the top 10 with 90.9%. Since MGSM poses grade-school math problems across many languages, the contrast with its SOTA Math 500 score suggests Grok 4 struggles more with linguistic than with mathematical reasoning.

View Grok 4 Results


07/09/2025

Grok 4 Early Results

We received early access to xAI’s latest model, Grok 4, and ran an initial set of smaller benchmarks. The early results are impressive: the model sets a new state of the art on AIME, GPQA, and Math 500. Grok 4 is extremely capable at answering challenging math and science questions.

We are continuing to run Grok 4 on our private benchmarks and will release those results shortly.

View Grok 4 Results


Latest Model Releases


Grok 4

Release date: 7/9/2025

View Model

Magistral Medium 3.1 (06/2025)

Release date: 6/10/2025

View Model

Claude Sonnet 4 (Nonthinking)

Release date: 5/22/2025

View Model

Claude Opus 4 (Nonthinking)

Release date: 5/22/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.