New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Standard model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Benchmark

08/18/2025

Our CaseLaw v2 Benchmark is live!

Our CaseLaw benchmark studies how well language models perform case law reasoning and legal document analysis. We refreshed the benchmark with harder, up-to-date questions because the first version was becoming saturated.

From our evaluations, we found:

  • GPT 4.1 maintained the top performance with 78.1% accuracy.
  • GPT 5 Mini emerged as a strong second-place performer with faster processing times; Grok 4 ranked third on the benchmark.
  • A common failure mode was models identifying only parts of the relevant document sections and falling back on their general knowledge despite being instructed otherwise.

While top models performed well, many still struggled with the nuanced interpretation required for legal analysis. CaseLaw v2 highlights both current strengths and the work ahead for applying AI in legal workflows.

View Our CaseLaw v2 Benchmark

Benchmark

08/11/2025

Is your model smarter than a high schooler? Introducing our IOI Benchmark

Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). With advanced models beginning to saturate the IMO, we decided to test them on the International Olympiad in Informatics (IOI)!

From our evaluations, we found:

  • Grok 4 wins convincingly, placing first on both the 2024 and 2025 exams.
  • Models struggle to write C++ at the level of the best high-school students – no model qualifies for a medal on either exam.
  • Only the largest and most expensive models even come close to placing. The only models to achieve >10% performance all cost at least $2 per question. Claude Opus 4.1 (Nonthinking) costs over $10 per question!
  • Consistent performance across the 2024 and 2025 exams suggests that LLM labs aren’t currently training on the IOI, so this benchmark is relatively free from data contamination.

View Our IOI Benchmark

Model

08/09/2025

Opus 4.1 (Thinking) Evaluated!

We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.

View Opus 4.1 (Thinking) Evaluated!

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

GPT 5

Release date: 8/7/2025

View Model

GPT 5 Mini

Release date: 8/7/2025

View Model

GPT 5 Nano

Release date: 8/7/2025

View Model

Claude Opus 4.1 (Nonthinking)

Release date: 8/5/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.