New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

05/24/2025

Claude Sonnet 4 (non-thinking) evaluated on (almost) all benchmarks!

We just evaluated the recently released Claude Sonnet 4 (non-thinking) on all benchmarks except CorpFin, MortgageTax, and Finance Agent, which are still running. We’ll post updates as soon as we have them!

Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!


Model

05/09/2025

Mistral Medium 3 evaluated on all benchmarks!

We just evaluated Mistral Medium 3 on all benchmarks!

  • Mistral Medium 3 performs consistently across both public and proprietary benchmarks, scoring 68.7% overall accuracy, with results on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42) that are strong for its size and price.

  • The model outperforms Llama 4 Maverick (63.3% accuracy) in most benchmarks, particularly excelling in MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).

  • Despite these results, Mistral Medium 3 still trails Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).

  • For users seeking a balance of speed and performance, Mistral Medium 3 offers much lower latency (14.37s) than Qwen 3 235B (94.31s), making it suitable for applications that need faster response times while maintaining strong reasoning capabilities.
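
To make the trade-off in the bullets above concrete, here is a minimal Python sketch that computes the latency ratio and the overall accuracy gap from the figures quoted in this update. The dictionary layout and the `compare` helper are illustrative assumptions, not part of any Vals AI tooling.

```python
# Figures quoted in the update above: overall accuracy in %, latency in seconds.
# The structure and helper name are hypothetical, for illustration only.
models = {
    "Mistral Medium 3": {"accuracy": 68.7, "latency_s": 14.37},
    "Qwen 3 235B": {"accuracy": 81.0, "latency_s": 94.31},
}


def compare(fast, slow):
    """Return (latency ratio, accuracy gap in percentage points) of slow vs fast."""
    ratio = models[slow]["latency_s"] / models[fast]["latency_s"]
    gap = models[slow]["accuracy"] - models[fast]["accuracy"]
    return round(ratio, 2), round(gap, 1)


ratio, gap = compare("Mistral Medium 3", "Qwen 3 235B")
print(f"Qwen 3 235B: {ratio}x the latency, {gap} points higher overall accuracy")
```

On these numbers, Qwen 3 235B takes roughly six and a half times as long per response in exchange for about twelve points of overall accuracy.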


Model

05/05/2025

Google's Gemini 2.5 Flash evaluated on most benchmarks

We just evaluated Gemini 2.5 Flash Preview on most benchmarks.



Latest Model Releases


Claude Sonnet 4


Release date: 5/22/2025

Mistral Medium 3.1 (05/2025)


Release date: 5/7/2025

Qwen 3 (235B)


Release date: 4/28/2025

Gemini 2.5 Flash Preview


Release date: 4/17/2025
