Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Updates

News

12/11/2024

Refresh to Vals AI

We’ve just launched a redesign of this benchmarking website!

Apart from being easier on the eyes, this new version of the site is much more useful.

  1. Model cards are displayed on their own dedicated pages, showing results across all benchmarks.
  2. Every Benchmark page is time-stamped and updated with changelogs.
  3. Our Methodology page now shares more detail about our approach and plans.

Read about our Methodology

Model

11/10/2024

Results for the new 3.5 Sonnet (Upgraded) model

  • On LegalBench, it’s now exactly tied with GPT-4o, and it beats GPT-4o on CorpFin and CaseLaw.
  • It usually, but not always, performs slightly better than the previous version, for example on LegalBench (+1.3%), ContractLaw Overall (+0.5%), and CorpFin (+0.8%).
  • It regressed on some benchmarks, including TaxEval Free Response (-3.2%) and CaseLaw Overall (-0.1%).
  • Although it’s competitive with GPT-4o, it’s not yet at the level of o1 Preview, which claims the top spot on almost all of our leaderboards.

View Model

News

10/31/2024

Vals AI Legal Report Announced

Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.

The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it’s representative of real legal work.

The report will be published in early 2025.

View Announcement

Benchmarks

View All

Latest Model Releases

Claude 3.5 Sonnet

Release date : 10/22/2024

View Model
o1 Preview

Release date : 9/12/2024

View Model
GPT-4o

Release date : 8/6/2024

View Model
Llama 3.1 Instruct Turbo (405B)

Release date : 7/23/2024

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.