New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

09/30/2025

Magistral 1.2 (Small and Medium) Evaluated

We evaluated Magistral Medium 1.2 (09/2025) and Magistral Small 1.2 (09/2025), and found that both models perform decently for their size, especially on coding tasks. However, the models also struggled on many benchmarks.

  • Magistral Medium performs well on academic and coding benchmarks, placing in the top 20 on LiveCodeBench and AIME. However, the model struggles on our proprietary benchmarks, particularly MortgageTax and CaseLaw.
  • Surprisingly, Magistral Small tends to do better on finance and academic benchmarks, most notably outperforming Medium on MortgageTax (+8.8%). The model also does well on LiveCodeBench and AIME. However, Small struggles on our proprietary CorpFin and CaseLaw benchmarks, along with GPQA and MMLU Pro.
  • Much of the performance loss came from the models failing to output answers in the required format (see the sketch after this list).
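
To illustrate this failure mode, here is a minimal sketch of the kind of strict answer-format check an evaluation harness might apply; the `Answer: <letter>` pattern and the `grade` helper are hypothetical, not our actual grading logic.

```python
import re

# Hypothetical grader sketch: many harnesses extract the final answer with a
# strict pattern, so a correct answer in the wrong format still scores zero.
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D])\b")

def grade(model_output: str, correct_choice: str) -> bool:
    """Return True only if the output contains a parseable, correct answer."""
    match = ANSWER_PATTERN.search(model_output)
    if match is None:
        return False  # unparseable output is graded as incorrect
    return match.group(1) == correct_choice

print(grade("The correct option is clearly B.", "B"))  # False: format miss
print(grade("Answer: B", "B"))                         # True
```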

The Medium model is priced at $2 / $5 per million input/output tokens, and the Small at $0.50 / $1.50. The Small model has open weights, whereas the Medium model is only available via API.
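
To make those rates concrete, the sketch below estimates per-request cost, reading the prices as USD per million input/output tokens; the request sizes are made-up examples.

```python
# Cost-estimate sketch. Prices are USD per million input / output tokens,
# as quoted above; the token counts below are hypothetical.
PRICING = {
    "magistral-medium-1.2": (2.00, 5.00),
    "magistral-small-1.2": (0.50, 1.50),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a 4k-token prompt with a 1k-token completion:
print(f"${request_cost('magistral-medium-1.2', 4_000, 1_000):.4f}")  # $0.0130
print(f"${request_cost('magistral-small-1.2', 4_000, 1_000):.4f}")   # $0.0035
```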

View Model Results

Model

09/29/2025

Sonnet 4.5 sets new SOTAs

We ran the recently released Claude Sonnet 4.5 (Thinking) on our benchmarks and found very strong performance:

  • On Finance Agent, it beats the previous state-of-the-art by five percentage points.
  • It also takes the #1 spot on SWE-Bench and Terminal Bench, beating out GPT-5 Codex.
  • It is in the top 10 models on the majority of our benchmarks, and also showed better performance than Claude Sonnet 4 (Thinking) on almost all benchmarks.
  • It has a 1-million-token context window when the “context-1m-2025-08-07” beta flag is enabled (a minimal example follows this list).
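
For reference, here is a minimal sketch of enabling that flag with the Anthropic Python SDK by sending it as an `anthropic-beta` header; the exact model id string is an assumption, so check it against Anthropic's documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Opt in to the 1M-token context window via the beta flag named above.
response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model id; confirm in Anthropic docs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this filing..."}],
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
)
print(response.content[0].text)
```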

Overall, this model is extremely capable, at the same mid-range price point as its predecessor.

View Model Results

Model

09/26/2025

Gemini 2.5 Flash (09/25) Models Evaluated

We evaluated the updated Gemini 2.5 Flash (Thinking) and Gemini 2.5 Flash Lite models and found the following:

  • Compared to the previous version, Flash improved on Terminal Bench (+5%), GPQA (+17.2%), and our private CorpFin benchmark (+4.4%).

  • The thinking model ranks #3/38 on MMMU and #6/20 on SWE-Bench, while delivering this performance at roughly half the cost of comparable models.

  • Flash Lite matches Flash on several public benchmarks, making it a very cost-effective option. However, Flash outperforms Lite by ~10% on our private benchmarks (CaseLaw v2, TaxEval, MortgageTax).

Overall, the latest update to Gemini 2.5 Flash is a highly efficient model that balances strong performance with low cost.

View Gemini 2.5 Flash (09/25) Model Results


Latest Model Releases

View All Models

Claude Sonnet 4.5 (Thinking)

Release date: 9/29/2025

View Model

Claude Sonnet 4.5 (Nonthinking)

Release date: 9/29/2025

View Model

Gemini 2.5 Flash (Thinking)

Release date: 9/25/2025

View Model

Gemini 2.5 Flash (Nonthinking)

Release date: 9/25/2025

View Model
Join our mailing list to receive benchmark updates.

Stay up to date as new benchmarks and models are released.