Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Updates

Model

03/13/2025

Jamba 1.6 Large and Mini Evaluated on All Benchmarks

We just evaluated the Jamba 1.6 Large and Jamba 1.6 Mini models!

View Models Page

Benchmark

03/11/2025

Academic Benchmarks Released: GPQA, MMLU, AIME (2024 and 2025), Math 500, and MGSM

Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning and two evaluating general question-answering.

Unlike the results model providers report on these benchmarks, we applied a consistent methodology and prompt template across all models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page.
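To give a sense of what this looks like in practice, here is a minimal Python sketch of a shared-template harness. Everything in it is illustrative: the model IDs, the template text, and the query_model helper are assumptions, not our actual evaluation code.

```python
# A rough illustration of a shared-template evaluation harness.
# The model IDs, the template, and query_model are hypothetical
# stand-ins, not production code.

PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question. "
    "Respond with only the letter of the correct choice.\n\n"
    "Question: {question}\nChoices:\n{choices}\nAnswer:"
)

MODELS = ["model-a", "model-b", "model-c"]  # illustrative model IDs


def query_model(model: str, prompt: str) -> str:
    """Hypothetical provider-agnostic completion call; wire in each
    provider's SDK here."""
    raise NotImplementedError


def evaluate(dataset: list[dict]) -> dict[str, float]:
    """Score every model on the same dataset with the same prompts."""
    scores = {}
    for model in MODELS:
        correct = 0
        for item in dataset:
            # Every model sees the exact same rendered prompt.
            prompt = PROMPT_TEMPLATE.format(
                question=item["question"], choices=item["choices"]
            )
            if query_model(model, prompt).strip().upper() == item["answer"]:
                correct += 1
        scores[model] = correct / len(dataset)
    return scores
```

The point of the design is that each question is rendered from one template and sent unchanged to every model, so score differences reflect the models rather than the prompts.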

View Benchmarks

Benchmark

03/05/2025

New Multimodal Mortgage Tax Benchmark Released

We just released a new benchmark in partnership with Vontive!

  • The MortgageTax benchmark evaluates language models on extracting information from tax certificates.
  • It tests multimodal capabilities with 1,258 document images, including both machine-printed and handwritten content.
  • The benchmark includes two key tasks: semantic extraction (identifying year, parcel number, county) and numerical extraction (calculating annualized amounts); a scoring sketch appears below.

Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the rest of the top three models are from Anthropic as well.
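To make the two task types concrete, here is a hedged Python sketch of how scoring might work. The field names mirror the task list above; the record format and the use of math.isclose with a 1% relative tolerance are assumptions rather than the benchmark’s actual grading rules.

```python
import math

# Illustrative scoring for the two MortgageTax task types described
# above. Field names follow the task description; the record format
# and the 1% tolerance are assumptions.

SEMANTIC_FIELDS = ["year", "parcel_number", "county"]


def score_semantic(predicted: dict, expected: dict) -> float:
    """Exact-match accuracy over the semantic extraction fields."""
    hits = sum(
        str(predicted.get(field, "")).strip().lower()
        == str(expected[field]).strip().lower()
        for field in SEMANTIC_FIELDS
    )
    return hits / len(SEMANTIC_FIELDS)


def score_numerical(predicted: float, expected: float) -> bool:
    """Annualized-amount answers count as correct within an assumed
    1% relative tolerance."""
    return math.isclose(predicted, expected, rel_tol=0.01)
```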

View Benchmark

Latest Benchmarks

View All Benchmarks

Latest Model Releases

Anthropic Claude 3.7 Sonnet (Thinking)

Release date: 2/19/2025

View Model

Anthropic Claude 3.7 Sonnet

Release date: 2/19/2025

View Model

OpenAI O3 Mini

Release date: 1/31/2025

View Model

DeepSeek R1

Release date: 1/20/2025

View Model
Join our mailing list to stay up to date as new benchmarks and models are released.