Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

04/15/2025

GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!

We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!

  • GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.

  • Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, 4/53) and MMLU Pro (80.5%, 6/33).

  • GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications, with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.

  • Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, 10/36) and MGSM (87.9%, 20/34).

  • Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, 30/33) and MGSM (69.8%, 32/34).
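The parenthesized figures in the bullets above, such as "(85.8%, 4/53)", appear to pair an accuracy with a leaderboard position; the "ranking near the bottom" wording for 30/33 suggests the second part reads as rank out of models evaluated. A minimal sketch of that reading in Python follows. The `Result` type, the `parse_result` helper, and the `top_fraction` property are hypothetical names introduced for illustration, not part of any Vals AI tooling.

```python
import re
from typing import NamedTuple


class Result(NamedTuple):
    """One benchmark annotation, e.g. "(85.8%, 4/53)". Hypothetical helper type."""
    accuracy: float  # accuracy in percent
    rank: int        # assumed: position on the leaderboard, 1 = best
    total: int       # assumed: number of models evaluated on that benchmark

    @property
    def top_fraction(self) -> float:
        # Fraction of the leaderboard at or above this rank (smaller is better).
        return self.rank / self.total


# Matches "(<number>%, <rank>/<total>)" anywhere in a string.
PATTERN = re.compile(r"\((\d+(?:\.\d+)?)%,\s*(\d+)/(\d+)\)")


def parse_result(text: str) -> Result:
    match = PATTERN.search(text)
    if match is None:
        raise ValueError(f"no '(accuracy%, rank/total)' annotation in {text!r}")
    return Result(float(match.group(1)), int(match.group(2)), int(match.group(3)))


# Figures copied from the announcement above.
results = {
    ("GPT 4.1", "CaseLaw"): parse_result("(85.8%, 4/53)"),
    ("GPT 4.1", "MMLU Pro"): parse_result("(80.5%, 6/33)"),
    ("GPT 4.1 Mini", "Math500"): parse_result("(88.8%, 10/36)"),
    ("GPT 4.1 Nano", "MMLU Pro"): parse_result("(62.3%, 30/33)"),
}

for (model, benchmark), r in results.items():
    print(f"{model} on {benchmark}: {r.accuracy}% accuracy, "
          f"rank {r.rank} of {r.total} (top {r.top_fraction:.0%})")
```

Under this reading, GPT 4.1 sits in roughly the top tenth of the CaseLaw leaderboard, while GPT 4.1 Nano falls in the bottom tenth on MMLU Pro.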

Model

04/11/2025

Grok 3 Beta and Mini Beta (High and Low Reasoning) evaluated on all benchmarks!

We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!

Model

04/07/2025

Llama 4 Maverick and Llama 4 Scout evaluated on all benchmarks!

We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

Anthropic Claude 3.7 Sonnet

OpenAI O3 Mini

DeepSeek R1
