Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking: few reflect real-world, domain-specific work. With Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Updates

Model

03/26/2025

DeepSeek V3 evaluated on all benchmarks!

We just evaluated DeepSeek V3 on all benchmarks!

  • DeepSeek V3 is DeepSeek’s latest model, with claimed generation speeds of 60 tokens/second (3x faster than V2) and an average accuracy of 73.9% across our benchmarks, 4.2% higher than previous DeepSeek versions.
  • DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
  • The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
  • It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.

View Models Page

Benchmark

03/26/2025

New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects

Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.

  • o1 achieved the highest overall accuracy at 77.7%, surpassing the worst-performing human experts (76.2%).
  • Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
  • Even the best models remain well below the best human experts (88.6%), highlighting room for further advancement.

View Benchmark

Model

03/24/2025

Command A evaluated on all benchmarks!

We just evaluated Command A on all benchmarks!

  • Command A is Cohere’s most efficient and performant model to date, specializing in agentic AI, multilingual tasks, and human-evaluated, real-life use cases.
  • On our proprietary benchmarks, Command A shows mixed performance: 23rd out of 28 models on TaxEval, but a respectable 10th out of 22 on CorpFin.
  • The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
  • However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).

View Models Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

Release date: 2/19/2025

View Model
Anthropic Claude 3.7 Sonnet

Release date: 2/19/2025

View Model
OpenAI O3 Mini

Release date: 1/31/2025

View Model
DeepSeek R1

Release date: 1/20/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.