New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

05/09/2025

Mistral Medium 3 evaluated on all benchmarks!

We just evaluated Mistral Medium 3 on all benchmarks!

  • Mistral Medium 3 performs consistently across both public and proprietary benchmarks, scoring 68.7% overall accuracy, with strong results for its size and price on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42).

  • The model outperforms Llama 4 Maverick (63.3% accuracy) on most benchmarks, although it trails on MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).

  • While impressive, Mistral Medium 3 still trails Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).

  • For users seeking a balance of speed and performance, Mistral Medium 3 offers lower latency (14.37s) than Qwen 3 235B (94.31s), making it suitable for applications that need faster response times while maintaining strong reasoning capabilities.
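The trade-off described in the bullets above can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the figures quoted in this update. The dictionary layout and the helper names `accuracy_gap` and `latency_ratio` are illustrative assumptions, not part of any Vals AI tooling.

```python
# Figures quoted in the bullets above: accuracy in percent, latency in seconds.
# The data layout and helper names are hypothetical, chosen for illustration.
mistral_medium_3 = {"Math500": 87.0, "AIME": 42.3, "latency_s": 14.37}
qwen_3_235b = {"Math500": 94.6, "AIME": 84.0, "latency_s": 94.31}


def accuracy_gap(model, reference, benchmark):
    """Percentage-point gap between the reference model and the model."""
    return round(reference[benchmark] - model[benchmark], 1)


def latency_ratio(slower, faster):
    """How many times longer the slower model takes to respond."""
    return round(slower["latency_s"] / faster["latency_s"], 2)


math500_gap = accuracy_gap(mistral_medium_3, qwen_3_235b, "Math500")
aime_gap = accuracy_gap(mistral_medium_3, qwen_3_235b, "AIME")
speedup = latency_ratio(qwen_3_235b, mistral_medium_3)

print(f"Math500 gap: {math500_gap} points")
print(f"AIME gap: {aime_gap} points")
print(f"Qwen 3 235B latency is {speedup}x that of Mistral Medium 3")
```

On these numbers, Mistral Medium 3 gives up 7.6 points on Math500 and 41.7 points on AIME in exchange for responses that arrive roughly 6.6 times faster.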

Model

05/05/2025

Google's Gemini 2.5 Flash evaluated on most benchmarks

We just evaluated Gemini 2.5 Flash Preview on most benchmarks.

Model

05/05/2025

Qwen 3 235B evaluations released!

We just evaluated Qwen 3 235B on all benchmarks!

  • Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.

  • With its “thinking allowed” approach, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4-mini, on mathematical reasoning tasks.

  • Qwen 3 shows limitations on proprietary benchmarks, particularly TaxEval, where it ranks #29 out of 43 evaluated models.

  • This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvements could enhance its performance on domain-specific tasks.

Latest Model Releases

Mistral Medium 3 (05/2025)

Release date: 5/7/2025

Qwen 3 (235B)

Release date: 4/28/2025

Gemini 2.5 Flash Preview

Release date: 4/17/2025

Gemini 2.5 Flash Preview (Thinking)

Release date: 4/17/2025
