
Public Enterprise LLM Benchmarks

Public model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.


07/22/2025

Kimi K2 Instruct Evaluated On Non-Agentic Benchmarks!

According to our evaluations, Kimi K2 Instruct is the new state-of-the-art open-source model.

The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.

Kimi K2 Instruct struggles on our proprietary benchmarks, failing to break the top 10 on any of them. It has particular trouble with legal tasks such as Case Law and Contract Law, and fares comparatively better on finance tasks such as CorpFin and Tax Eval.

The model offers solid value at $1.00 input / $3.00 output per million tokens: cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
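As a quick sanity check on those numbers, here is a minimal Python sketch of what a single job would cost under each price list. The workload size (2M input tokens, 0.5M output tokens) is an assumption for illustration; the per-million-token prices are the ones quoted above.

# Rough cost comparison for a hypothetical workload of 2M input
# and 0.5M output tokens; prices are $ per million tokens as quoted above.
PRICES = {
    "Kimi K2 Instruct": (1.00, 3.00),
    "DeepSeek R1 (on Together AI)": (3.00, 7.00),
    "Mistral Medium 3.1 (05/2025)": (0.40, 2.00),
}

def workload_cost(input_m, output_m, price_in, price_out):
    # Cost in dollars: token counts (in millions) times price per million.
    return input_m * price_in + output_m * price_out

for model, (price_in, price_out) in PRICES.items():
    print(f"{model}: ${workload_cost(2.0, 0.5, price_in, price_out):.2f}")
# -> Kimi K2 Instruct: $3.50
# -> DeepSeek R1 (on Together AI): $9.50
# -> Mistral Medium 3.1 (05/2025): $1.80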

We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!

View Kimi K2 Instruct Results


07/17/2025

Grok 4 on Tough Benchmarks

We evaluated Grok 4 on the Finance Agent, CorpFin, SWE-bench, and LegalBench benchmarks and found strong results, especially on our private benchmarks.

View Grok 4 Results


07/13/2025

Grok 4 Results (Continued)

In the launch livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found an even larger gap between public and private performance than on our text-only benchmarks: Grok 4 struggles to recognize images it has not seen before, highlighting the importance of high-quality private datasets for evaluating image recognition capabilities.

As we continue evaluating Grok 4, it keeps struggling on our private benchmarks. Its middling performance on Tax Eval (67.6%) and Mortgage Tax (57.5%) is consistent with our previous findings on private legal tasks like Case Law and Contract Law.

On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).

View Grok 4 Results


Latest Model Releases


Kimi K2 Instruct
Release date: 7/11/2025

Grok 4
Release date: 7/9/2025

Magistral Medium 3.1 (06/2025)
Release date: 6/10/2025

Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025
Join our mailing list to receive benchmark updates and stay up to date as new benchmarks and models are released.