Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. At Vals AI, we report how language models perform on the industry-specific tasks where they will actually be used.

Benchmark

06/13/2025

SWE-bench results released

  • Foundation models still fail to solve real-world coding problems despite notable progress, highlighting remaining room for improvement.

  • The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT-4.1 pass any of the >4-hour tasks (33% each).

  • Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, and maintains both excellent cost efficiency at $1.24 per test and fast completion times (426.52s).

  • Tool usage patterns reveal models employ distinct strategies. o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).
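Per-model tool-usage profiles like the ones above could be produced by tallying tool invocations from a run log. A minimal sketch, assuming a hypothetical `(model, tool)` log format rather than Vals AI's actual schema:

```python
# Tally tool invocations per model from a (model, tool) event log.
# The log records and tool names below are illustrative stand-ins.
from collections import Counter

log = [
    ("o4 Mini", "search"), ("o4 Mini", "search"), ("o4 Mini", "edit"),
    ("Claude Sonnet 4", "edit"), ("Claude Sonnet 4", "bash"),
    ("Claude Sonnet 4", "search"),
]

profiles: dict[str, Counter] = {}
for model, tool in log:
    profiles.setdefault(model, Counter())[tool] += 1

for model, counts in profiles.items():
    print(model, dict(counts))
# o4 Mini {'search': 2, 'edit': 1}
# Claude Sonnet 4 {'edit': 1, 'bash': 1, 'search': 1}
```

Comparing counts per tool (searches vs. edits vs. shell commands) is what distinguishes a brute-force search strategy from a leaner, balanced one.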

Note that we run every model through the same evaluation harness to make direct comparisons between models, so the scores show relative performance, not each model’s best possible accuracy.
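The shared-harness idea can be sketched in a few lines (this is an illustrative toy, not Vals AI's actual code): every model answers the identical task list and is scored by the identical grading rule, so the resulting accuracies are directly comparable.

```python
# Minimal sketch of a uniform evaluation harness: same tasks, same
# grading function for every model, so scores measure relative performance.
from typing import Callable

def evaluate(models: dict[str, Callable[[str], str]],
             tasks: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Return (model, accuracy) pairs, best first, under one grading rule."""
    scores = []
    for name, ask in models.items():
        correct = sum(ask(prompt).strip() == answer for prompt, answer in tasks)
        scores.append((name, correct / len(tasks)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy stand-ins for model endpoints:
tasks = [("2+2", "4"), ("3*3", "9")]
models = {
    "model_a": lambda p: {"2+2": "4", "3*3": "9"}[p],
    "model_b": lambda p: "4",
}
print(evaluate(models, tasks))  # [('model_a', 1.0), ('model_b', 0.5)]
```

Because prompting and grading are held fixed, a model tuned with its own bespoke scaffold might score higher elsewhere; the harness trades that for apples-to-apples comparison.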

View Benchmarks Page

Benchmark

06/09/2025

LiveCodeBench: Models Struggle with Hard Competitive Programming Problems

Our results for LiveCodeBench are now live!

View Benchmark

Model

05/30/2025

Claude Opus 4 (Nonthinking) evaluated on our benchmarks; Sonnet 4 evaluated on the Finance Agent benchmark (FAB).

We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!

We found:

  • Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
  • Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24/62) and LegalBench (#8 vs. #32/67) but scored notably lower on ContractLaw (#16 vs. #2/69).
  • Opus 4 is expensive, with an output cost of $75.00/M tokens, 5x as much as Sonnet 4; against o3 it costs 1.5x on input and nearly 2x on output ($15/$75 vs. $10/$40).
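As a quick arithmetic check on the pricing comparison in the last bullet (list prices per 1M tokens as quoted there):

```python
# Price ratios from the quoted per-1M-token list prices.
opus_in, opus_out = 15.00, 75.00   # Claude Opus 4 ($15 in / $75 out)
o3_in, o3_out = 10.00, 40.00       # o3 ($10 in / $40 out)

print(opus_in / o3_in)    # 1.5   (input: 1.5x o3)
print(opus_out / o3_out)  # 1.875 (output: nearly 2x o3)
print(opus_out / 5)       # 15.0  (implied Sonnet 4 output price, from the 5x figure)
```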

We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for this model). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).

View Model Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

Claude Sonnet 4 (Nonthinking)

Release date: 5/22/2025

View Model
Claude Opus 4 (Nonthinking)

Release date: 5/22/2025

View Model
Claude Sonnet 4 (Thinking)

Release date: 5/22/2025

View Model
Claude Opus 4 (Thinking)

Release date: 5/22/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.