New Finance Agent Benchmark Released

Updates

4/22/2025

Benchmark

Our new Finance Agent Benchmark is live!

  • Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
  • Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
  • Models are expected to use a set of four tools to search the web or the EDGAR database and parse the results to answer each question (a minimal sketch of such a tool-use loop follows this list).
  • Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
  • At the time of this benchmark’s release, o3 is the best-performing model, reaching 48.3% accuracy, but at an average cost of $3.69 per question.
  • It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower $1.05 per question.
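
As a rough illustration of the agentic setup described above, here is a minimal sketch of a tool-use loop in Python. The tool names (search_web, search_edgar, parse_document, calculate), the call_model helper, and the message format are illustrative assumptions, not the benchmark’s actual interfaces.

```python
import json

# Hypothetical stand-ins for the benchmark's four tools; the real tool names,
# arguments, and return formats are assumptions made for illustration.
def search_web(query: str) -> str:
    return f"[web results for: {query}]"

def search_edgar(company: str, form_type: str) -> str:
    return f"[EDGAR filings for {company}, form {form_type}]"

def parse_document(url: str) -> str:
    return f"[parsed text of {url}]"

def calculate(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy calculator, arithmetic only

TOOLS = {fn.__name__: fn for fn in (search_web, search_edgar, parse_document, calculate)}

def run_agent(question: str, call_model, max_steps: int = 10) -> str:
    """Minimal tool-use loop: at each step the model either requests a tool call
    or returns a final answer. `call_model` is a hypothetical provider wrapper."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=list(TOOLS))
        tool_call = reply.get("tool_call")
        if tool_call is None:
            return reply["content"]  # final answer
        result = TOOLS[tool_call["name"]](**tool_call["arguments"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "name": tool_call["name"], "content": result})
    return "no answer within the step budget"
```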

4/18/2025

Model

o3 and o4 Mini evaluated on all benchmarks!

We just evaluated o3 and o4 Mini on all benchmarks!

  • o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), GPQA (#1/35) and proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).

  • o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).

  • Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).

  • Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a nearly tenfold price difference that makes o4 Mini the more economical choice for many use cases.

4/15/2025

Model

GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!

We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!

  • GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.

  • Impressive performance on proprietary and academic benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, #4/53) and MMLU Pro (80.5%, #6/33).

  • GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications, with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.

  • Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, #10/36) and MGSM (87.9%, #20/34).

  • Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, #30/33) and MGSM (69.8%, #32/34).

4/11/2025

Model

Grok 3 Beta and Mini Beta (High and Low Reasoning) evaluated on all benchmarks!

We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!

4/7/2025

Model

Llama 4 Maverick and Llama 4 Scout evaluated on all benchmarks!

We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!

4/4/2025

Model

Mistral Small 3.1 (2503) evaluated on all benchmarks!

We just evaluated Mistral Small 2503 on all benchmarks!

  • Mistral Small 3.1 is Mistral AI’s latest small model, achieving an average accuracy of 61.4% across all benchmarks with a latency of 6.52s, faster than GPT-4o Mini (9.89s) and Llama 3.3 70B (7.67s).
  • Despite its compact size, Mistral Small outperforms Claude 3.5 Haiku (60.2%) in overall accuracy while offering performance competitive with GPT-4o Mini (62.8%).
  • The model excels on MGSM with 85.4% accuracy, comparable to Claude Haiku (85.9%) but behind Llama 3.3 70B’s impressive 91.3%.
  • Like Claude Haiku, the model struggles with AIME (both 3.5%), well behind GPT-4o Mini (11.5%) and Llama 3.3 70B (16.6%).

3/28/2025

Model

Gemini 2.5 Pro Exp evaluated on all benchmarks!

We just evaluated Gemini 2.5 Pro Exp on all benchmarks!

  • Gemini 2.5 Pro Exp is Google’s latest experimental model and the new State-of-the-Art, achieving an impressive average accuracy of 82.3% across all benchmarks with a latency of 24.68s.
  • The model ranks #1 on many of our benchmarks including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
  • It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
  • Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).

3/26/2025

Model

DeepSeek V3 evaluated on all benchmarks!

We just evaluated DeepSeek V3 on all benchmarks!

  • DeepSeek V3 is DeepSeek’s latest model; the company reports generation speeds of 60 tokens/second, 3x faster than V2. It reaches an average accuracy of 73.9% on our benchmarks, 4.2% better than previous versions.
  • DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
  • The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
  • It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.

3/26/2025

Benchmark

New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects

Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.

  • o1 achieved the highest overall accuracy at 77.7%, surpassing the performance of the worst human experts (76.2%).
  • Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
  • Even the best models remain well below the performance of the best human experts (88.6%), highlighting opportunities for further advancement.

3/24/2025

Model

Command A evaluated on all benchmarks!

We just evaluated Command A on all benchmarks!

  • Command A is Cohere’s most efficient and performant model to date, specializing in agentic AI, multilingual tasks, and human-evaluated real-life use cases.
  • On our proprietary benchmarks, Command A shows mixed performance, ranking 23rd out of 28 models on TaxEval but a respectable 10th out of 22 models on CorpFin.
  • The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
  • However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).

3/13/2025

Model

Jamba 1.6 Large and Mini Evaluated on All Benchmarks.

We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!

3/11/2025

Benchmark

Academic Benchmarks Released: GPQA, MMLU, AIME (2024 and 2025), Math 500, and MGSM

Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning, and two on general question-answering.

Unlike results released by model providers on these benchmarks, we applied a consistent methodology and prompt template across models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page.
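
As a hedged illustration of what a consistent prompt template across models can look like, the sketch below renders every question through one shared template and scores each model on identical prompts. The template text, dataset format, and query_model helper are assumptions for illustration, not our actual evaluation harness.

```python
# Illustrative only: one shared template rendered identically for every model,
# so score differences reflect the model rather than the prompt.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question. "
    "Respond with only the letter of the correct option.\n\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Answer:"
)

def build_prompt(question: str, options: list[str]) -> str:
    formatted = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=formatted)

def evaluate(models, dataset, query_model):
    """Score every model on the same prompts; `query_model` is a hypothetical
    wrapper around each provider's API. Dataset items hold question/options/answer."""
    scores = {}
    for model in models:
        correct = 0
        for item in dataset:
            prompt = build_prompt(item["question"], item["options"])
            prediction = query_model(model, prompt).strip().upper()[:1]
            correct += prediction == item["answer"]
        scores[model] = correct / len(dataset)
    return scores
```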

3/5/2025

Benchmark

New Multimodal Mortgage Tax Benchmark Released

We just released a new benchmark in partnership with Vontive!

  • The MortgageTax benchmark evaluates language models on extracting information from tax certificates.
  • It tests multimodal capabilities with 1258 document images, including both computer-written and handwritten content.
  • The benchmark includes two key tasks: semantic extraction (identifying year, parcel number, county) and numerical extraction (calculating annualized amounts); a rough sketch of what such a schema might look like follows this list.
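
The structure below is a rough sketch of a target schema covering both tasks; the field names, types, and the annualize helper are illustrative assumptions, not the benchmark’s actual extraction format.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical target schema for a single tax-certificate image; the real
# benchmark's field names and formats may differ.
@dataclass
class TaxCertificateExtraction:
    # Semantic extraction
    tax_year: int
    parcel_number: str
    county: str
    # Numerical extraction
    annualized_amount: Decimal

def annualize(amount: Decimal, periods_per_year: int) -> Decimal:
    """Toy helper: convert a per-period tax amount into an annual figure."""
    return amount * periods_per_year

example = TaxCertificateExtraction(
    tax_year=2024,
    parcel_number="123-456-789",
    county="Alameda",
    annualized_amount=annualize(Decimal("1520.50"), periods_per_year=2),
)
```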

Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the rest of the top three models are also from Anthropic.

2/27/2025

News

Vals Legal AI Report Released

We just released the VLAIR! Whereas our previous benchmarks study foundation model performance, here we investigate the ability of the most popular legal AI products to perform real-world legal tasks.

To build a large, high-quality dataset, we worked with some of the top global law firms, including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and Paul Hastings, among others. This is the first benchmark in which we also collected a human baseline against which we measure performance.

In sum, this enabled us to study how these legal AI systems perform on practical tasks and, especially, how the work of generative AI tools compares to that of a human lawyer.

Read the report for full results.

2/25/2025

Model

Anthropic's Claude 3.7 Sonnet Evaluated on All Benchmarks.

We just evaluated Anthropic’s Claude 3.7 Sonnet model!

  • We evaluated the model with Thinking Disabled on all benchmarks. It shows great performance and reaches second place on CorpFin, just behind its Thinking Enabled counterpart.
  • We also evaluated the model with Thinking Enabled. Unlike most models that excel in specific areas, Anthropic’s Claude 3.7 Sonnet (Thinking) demonstrates remarkable consistency, achieving top-tier performance across all evaluated benchmarks. The remaining two benchmarks are currently in progress due to their higher token requirements.

We have also run Google’s Gemini 2.0 Flash Thinking Exp and Gemini 2.0 Pro Exp on most benchmarks.

2/3/2025

Model

OpenAI's o3-mini Evaluated on All Benchmarks.

We just evaluated OpenAI’s o3-mini model!

  • The model shows a good price-performance trade-off, ranking near the top on our most recent proprietary benchmarks, such as TaxEval.
  • However, o3-mini seems to struggle with large context windows, performing poorly on the Max Fitting Context task of CorpFin. It tends to lose track of the question when it is provided at the beginning of a large context window (around 150k tokens or more); a sketch of this kind of position probe appears at the end of this entry.

We have also run DeepSeek R1 on our CorpFin benchmark, on which it reaches the top place, beating all other models we have tested.
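
To illustrate the long-context failure mode noted above, the sketch below builds two variants of the same prompt: one with the question placed before roughly 150k tokens of filler document text, and one with the question placed after it. The 4-characters-per-token estimate and the helper names are assumptions for illustration, not the construction of the CorpFin Max Fitting Context task.

```python
# Rough probe for position sensitivity in long contexts: ask the same question
# with ~150k tokens of document text either after or before it. Token counts
# are approximated at ~4 characters per token; this is not the CorpFin setup.
def build_probe(question: str, document: str, target_tokens: int = 150_000):
    filler = document * (target_tokens * 4 // max(len(document), 1) + 1)
    filler = filler[: target_tokens * 4]
    question_first = f"{question}\n\n{filler}\n\nAnswer the question above."
    question_last = f"{filler}\n\n{question}"
    return question_first, question_last

def compare(call_model, question: str, document: str, expected: str) -> dict:
    """Hypothetical check: does answer quality drop when the question comes first?"""
    first, last = build_probe(question, document)
    return {
        "question_first_correct": expected.lower() in call_model(first).lower(),
        "question_last_correct": expected.lower() in call_model(last).lower(),
    }
```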

1/28/2025

Model

DeepSeek R1 Evaluated on TaxEval, CaseLaw, ContractLaw

🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳

  • The model demonstrates strong reasoning ability, rivaling OpenAI’s o1 model on our TaxEval dataset.
  • However, R1 performs extremely poorly on ContractLaw and only middling on CaseLaw. The model’s performance is not uniform, suggesting that task-specific evaluation should be done before adoption.
  • Overall, this large Chinese model shows impressive ability and further closes the gap between closed and open-source models.

1/27/2025

Benchmark

Two New Proprietary Benchmarks Released

We just released two new benchmarks!

  • We have released a completely new version of our CorpFin benchmark, with 1200 expert-generated financial questions on very long context documents (200-300 pages).
  • We have also released a completely new TaxEval benchmark, with more than 1500 expert-reviewed tax questions.

We are also releasing results for several new models, such as Grok 2 and Gemini 2.0 Flash Exp.

1/27/2025

Benchmark

New Medical Benchmark Released

Vals AI and Graphite Digital partnered to release the first medical benchmark on Vals AI.

This report offers the first exhaustive, third-party evaluation of more than 15 of the most popular LLMs on graduate-level medical questions.

We assessed models under two conditions, unbiased and bias-injected questions, measuring the models’ general accuracy and their ability to handle racial bias in medical contexts.

The top-performing model was OpenAI’s o1 Preview, and the best value was Meta’s Llama 3.1 70B.

Read the full report to find out more!

12/11/2024

News

Refresh to Vals AI

We’ve just implemented a re-design of this benchmarking website!

Apart from being easier on the eyes, this new version of the site is much more useful.

  1. Model cards are displayed on their own dedicated pages, showing results across all benchmarks.
  2. Every Benchmark page is time-stamped and updated with changelogs.
  3. Our Methodology page now shares more details around our approach and plan.

11/10/2024

Model

Results for the new Claude 3.5 Sonnet (Latest) model

  • On LegalBench, it is now exactly tied with GPT-4o, and it beats GPT-4o on CorpFin and CaseLaw.
  • It usually, but not always, performs a few percentage points better than the previous version, for example on LegalBench (+1.3%), ContractLaw Overall (+0.5%), and CorpFin (+0.8%).
  • There are some instances where it experienced a performance regression, including TaxEval Free Response (-3.2%) and CaseLaw Overall (-0.1%).
  • Although it is competitive with GPT-4o, it is still not at the level of OpenAI’s o1, which continues to claim the top spots on almost all of our leaderboards.

10/31/2024

News

Vals AI Legal Report Announced

Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.

The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it is representative of real legal work.

The report will be published in early 2025.

Join our mailing list to stay up to date as new benchmarks and models are released.