Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Updates

Model

02/03/2025

OpenAI's o3-mini Evaluated on All Benchmarks

We just evaluated OpenAI’s o3-mini model!

  • The model offers a good price-performance trade-off, placing near the top on our most recent proprietary benchmarks, such as TaxEval.
  • However, o3-mini seems to struggle with large context windows, performing poorly on the Max Fitting Context task of CorpFin. It tends to lose track of the question when the question is placed at the beginning of a large context window (roughly 150k tokens or more).

We have also run DeepSeek R1 on our CorpFin benchmark, where it takes the top spot, beating every other model we have tested.

View Model Page

Model

01/28/2025

DeepSeek R1 Evaluated on TaxEval, CaseLaw, ContractLaw

🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳

  • The model demonstrates strong reasoning ability, rivaling OpenAI’s o1 model on our Tax dataset.
  • However, R1 performs extremely poorly on ContractLaw and is only middling on CaseLaw. The model’s performance is not uniform, which suggests that task-specific evaluation should be done before adoption.
  • Overall, this large Chinese model shows impressive ability and further closes the gap between closed and open-source models.

View Model Page

Benchmark

01/27/2025

Two New Proprietary Benchmarks Released

We just released two new benchmarks!

  • We have released a completely new version of our CorpFin benchmark, with 1,200 expert-generated financial questions on very long-context documents (200-300 pages).
  • We have also released a completely new TaxEval benchmark, with more than 1,500 expert-reviewed tax questions.

We are also adding several new models, such as Grok 2 and Gemini 2.0 Flash Exp.

View Benchmarks

Benchmarks

View All

Latest Model Releases

OpenAI O3 Mini

Release date : 1/31/2025

View Model

DeepSeek R1

Release date : 1/20/2025

View Model

DeepSeek V3

Release date : 12/26/2024

View Model

o1

Release date : 12/17/2024

View Model