4/22/2025
Benchmark
Our new Finance Agent Benchmark is live!
- Our new Finance Agent Benchmark evaluates AI agents’ ability to perform tasks expected of entry-level financial analysts.
- Developed in collaboration with industry experts, it includes 537 questions covering skills like simple retrieval, market research, and projections.
- The models must use a set of four tools to search the web and the SEC's EDGAR database, then parse the results to answer each question (a rough sketch of this agent loop follows the list below).
- Current AI models do not exceed 50% accuracy, highlighting the need for further development before reliable deployment in the finance industry.
- At the time of this benchmark's release, o3 is the best-performing model, reaching 48.3% accuracy at an average cost of $3.69 per question.
- It is followed closely by Claude 3.7 Sonnet (Thinking), which reaches 44.1% accuracy at a much lower $1.05 per question.
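For readers curious what the agentic setup looks like in practice, below is a minimal sketch of the kind of tool-calling loop the benchmark implies. The tool names, the `Tool` wrapper, and the `call_model` stub are illustrative assumptions, not the benchmark's actual harness.

```python
# Minimal sketch of a tool-calling agent loop. The tool names and the
# call_model() stub are illustrative assumptions, not the benchmark's harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]

def search_web(query: str) -> str:
    """Placeholder: a real tool would hit a web-search API and return snippets."""
    return f"web results for: {query}"

def search_edgar(query: str) -> str:
    """Placeholder: a real tool would query SEC EDGAR full-text search."""
    return f"EDGAR filings matching: {query}"

TOOLS = [
    Tool("search_web", "Search the open web", search_web),
    Tool("search_edgar", "Search SEC EDGAR filings", search_edgar),
]

def call_model(transcript: list[str], tools: list[Tool]) -> dict:
    """Stub for the model under evaluation; a real harness would call an LLM API
    and parse its tool-call or final-answer output."""
    if len(transcript) == 1:
        return {"tool": "search_edgar", "input": transcript[0]}
    return {"answer": f"stub answer based on: {transcript[-1]}"}

def run_agent(question: str, max_steps: int = 5) -> str:
    """Loop: at each step the model either calls one of the tools or answers."""
    transcript = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        action = call_model(transcript, TOOLS)
        if "answer" in action:
            return action["answer"]
        tool = next(t for t in TOOLS if t.name == action["tool"])
        transcript.append(f"{tool.name} -> {tool.run(action['input'])}")
    return "no answer within step budget"

print(run_agent("What was Company X's FY2023 revenue?"))
```

A real harness would also score the final answer against an expert reference and meter the per-question cost; this sketch only shows the control flow.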
4/18/2025
Model
o3 and o4 Mini evaluated on all benchmarks!
We just evaluated o3 and o4 Mini on all benchmarks!
- o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), and GPQA (#1/35), as well as proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).
- o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).
- Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).
- Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a nearly tenfold price difference that makes o4 Mini the more economical choice for many use cases.
4/15/2025
Model
GPT 4.1, 4.1 Mini, and 4.1 Nano evaluated on all benchmarks!
We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!
- GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.
- Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, 4/53) and MMLU Pro (80.5%, 6/33).
- GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.
- Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, 10/36) and MGSM (87.9%, 20/34).
- Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, 30/33) and MGSM (69.8%, 32/34).
4/11/2025
Model
Grok 3 Beta and Mini Beta (High and Low Reasoning) evaluated on all benchmarks!
We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!
- Grok 3 Beta delivers impressive results with a 78.1% average accuracy across benchmarks and a snappy 15.52s latency.
- Dominates proprietary benchmarks! Grok 3 Beta ranks #1 on three key benchmarks: CorpFin (69.1%), CaseLaw (88.1%), and TaxEval (78.8%).
- Grok 3 Mini Fast Beta (High Reasoning) surprises with an even higher average accuracy of 81.6% despite being a smaller model.
- Mathematical prowess! Grok 3 Mini Fast Beta (High Reasoning) ranks #2 on Math500 (94.2%) and #3 on AIME (85.0%).
4/7/2025
Model
Llama 4 Maverick and Llama 4 Scout evaluated on all benchmarks!
We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!
- Llama 4 Scout achieves an average accuracy of 61.5% with a latency of 7.13 seconds, placing it in a tie with Mistral Small 3.1 (03/2025) (61.5%) and just behind Cohere's Command A (63.5%).
- Llama 4 Maverick sits at 67.0% accuracy with a latency of 7.72 seconds, ranking just behind Anthropic's Claude 3.5 Sonnet (69.9%) and DeepSeek V3 (03/24/2025) (74.7%).
- Both models excel on public benchmarks, with Maverick achieving top rankings in MMMU (4/17), MGSM (4/28), GPQA (5/27), and MMLU Pro (5/27), while Scout delivers strong results in MMMU (10/17), MortgageTax (10/18), and AIME (11/26).
- However, these models show a significant gap between their impressive public benchmark performance and mediocre results on private benchmarks, particularly struggling with TaxEval (Maverick: 28/34, Scout: 32/34), ContractLaw (Maverick: 37/54, Scout: 43/54), and MedQA (Maverick: 32/32, Scout: 30/32).
4/4/2025
Model
Mistral Small 3.1 (2503) evaluated on all benchmarks!
We just evaluated Mistral Small 3.1 (2503) on all benchmarks!
- Mistral Small 3.1 is Mistral AI's latest small model, achieving an average accuracy of 61.4% across all benchmarks with a latency of 6.52s, faster than GPT-4o Mini (9.89s) and Llama 3.3 70B (7.67s).
- Despite its compact size, Mistral Small outperforms Claude 3.5 Haiku (60.2%) in overall accuracy while remaining competitive with GPT-4o Mini (62.8%).
- The model excels on MGSM with 85.4% accuracy, comparable to Claude Haiku (85.9%) but behind Llama 3.3 70B’s impressive 91.3%.
- Like Claude Haiku, the model struggles with AIME (both 3.5%), well behind GPT-4o Mini (11.5%) and Llama 3.3 70B (16.6%).
3/28/2025
Model
Gemini 2.5 Pro Exp evaluated on all benchmarks!
We just evaluated Gemini 2.5 Pro Exp on all benchmarks!
- Gemini 2.5 Pro Exp is Google’s latest experimental model and the new State-of-the-Art, achieving an impressive average accuracy of 82.3% across all benchmarks with a latency of 24.68s.
- The model ranks #1 on many of our benchmarks including CorpFin, Math500, LegalBench, GPQA, MMLU Pro, and MMMU.
- It excels in academic benchmarks, with standout performances on Math500 (95.2%), MedQA (93.0%), and MGSM (92.2%).
- Gemini 2.5 Pro Exp demonstrates strong legal reasoning capabilities with 86.1% accuracy on CaseLaw and 83.6% on LegalBench, though it scores lower on ContractLaw (64.7%).
3/26/2025
Model
DeepSeek V3 evaluated on all benchmarks!
We just evaluated DeepSeek V3 on all benchmarks!
- DeepSeek V3 is DeepSeek's latest model, boasting speeds of 60 tokens/second and claiming to be 3x faster than V2, with an average accuracy of 73.9% (4.2% better than the previous version).
- DeepSeek V3 performs comparably to, and slightly better than, Claude 3.7 Sonnet (71.7%).
- The model demonstrates strong legal capabilities, scoring particularly well on CaseLaw and LegalBench, though it scores lower on ContractLaw.
- It shows impressive academic versatility with top-tier performance on MGSM, Math500, and MedQA.
3/26/2025
Benchmark
New Multimodal Benchmark: MMMU Evaluates Visual Reasoning Across 30 Subjects
Today, we're releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models' ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
- o1 achieved the highest overall accuracy at 77.7%, surpassing the performance of the worst human experts (76.2%).
- Claude 3.7 Sonnet (Thinking) delivers performance nearly identical to o1 at a more favorable price point.
- Even the best models remain well below the performance of the best human experts (88.6%), highlighting opportunities for further advancement.
3/24/2025
Model
Command A evaluated on all benchmarks!
We just evaluated Command A on all benchmarks!
- Command A is Cohere's most efficient and performant model to date, built for agentic AI, multilingual tasks, and real-world use cases validated by human evaluation.
- On our proprietary benchmarks, Command A shows mixed performance, ranking 23rd out of 28 models on TaxEval but a good 10th out of 22 models on CorpFin.
- The model performs better on some academic benchmarks, scoring 78.7% on LegalBench (9th place) and 86.8% on MGSM (13th place).
- However, it struggles with AIME (13.3%, 12th place) and GPQA (29.3%, 18th place).
3/13/2025
Model
Jamba 1.6 Large and Mini Evaluated on All Benchmarks.
We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!
- Jamba 1.6 Large and Jamba 1.6 Mini are the latest versions of the open-source Jamba models, developed by AI21 Labs.
- On our private benchmarks, Jamba 1.6 Large shows reasonable performance, placing 16th out of 27 models on TaxEval with 65.3% accuracy, beating GPT-4o Mini and Claude 3.5 Haiku.
- However, neither model is competitive on public benchmarks: they take the last two places on AIME and GPQA.
3/11/2025
Benchmark
Academic Benchmarks Released: GPQA, MMLU, AIME (2024 and 2025), Math 500, and MGSM
Today, we've released five new academic benchmarks on our site: three evaluating mathematical reasoning and two evaluating general question-answering.
Unlike the results released by model providers, ours were produced with a consistent methodology and prompt template across all models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark's page.
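As a rough illustration of what a consistent methodology and prompt template means in practice, the sketch below formats every question with one shared template and grades every model with the same rule. The template text and the `query` stub are assumptions for illustration, not our production harness.

```python
# Illustrative sketch: the template text and query() stub are assumptions,
# not the actual evaluation harness.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question. "
    "Reply with the letter of the correct option only.\n\n"
    "Question: {question}\nOptions:\n{options}"
)

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question with the single shared template."""
    formatted = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=formatted)

def query(model: str, prompt: str) -> str:
    """Stub so the sketch runs; a real harness calls each provider's API."""
    return "A"

def evaluate(models: list[str], dataset: list[dict]) -> dict[str, float]:
    """Score every model with exactly the same prompts and the same grading rule."""
    scores = {}
    for model in models:
        correct = sum(
            query(model, build_prompt(item["question"], item["options"]))
            .strip().upper().startswith(item["label"])
            for item in dataset
        )
        scores[model] = correct / len(dataset)
    return scores
```

Holding the template and the grader fixed across models is what makes the cross-model rankings directly comparable.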
3/5/2025
Benchmark
New Multimodal Mortgage Tax Benchmark Released
We just released a new benchmark in partnership with Vontive!
- The MortgageTax benchmark evaluates language models on extracting information from tax certificates.
- It tests multimodal capabilities with 1258 document images, including both computer-written and handwritten content.
- The benchmark includes two key tasks: semantic extraction (identifying year, parcel number, county) and numerical extraction (calculating annualized amounts); a rough sketch of such a record follows below.
- Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the other models in the top three are also from Anthropic.
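To make the two task types concrete, here is a hedged sketch of what a per-certificate answer record might look like. The field names and the annualization rule (per-installment amount times installments per year) are illustrative assumptions, not the benchmark's published schema.

```python
# Hypothetical record layout: field names and the annualization rule are
# illustrative assumptions, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class TaxCertificateExtraction:
    # Semantic extraction targets
    tax_year: int
    parcel_number: str
    county: str
    # Numerical extraction targets
    installment_amount: float   # amount billed per period
    installments_per_year: int  # e.g. 2 for semi-annual billing

    @property
    def annualized_amount(self) -> float:
        """Annualized tax: per-installment amount times installments per year."""
        return self.installment_amount * self.installments_per_year

# Example: a semi-annual installment of $1,234.50 annualizes to $2,469.00.
record = TaxCertificateExtraction(2024, "123-456-789", "Example County", 1234.50, 2)
assert record.annualized_amount == 2469.00
```

Scoring would then compare each extracted field against expert-labeled ground truth, with the semantic and numerical fields reported separately.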
2/27/2025
News
Vals Legal AI Report Released
We just released the VLAIR! Whereas our previous benchmarks study foundation model performance, here we investigate the ability of the most popular legal AI products to perform real-world legal tasks.
To build a large, high-quality dataset, we worked with some of the top global law firms, including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and Paul Hastings, among others. This is also the first benchmark for which we collected a human baseline against which to measure performance.
In sum, this enabled us to study how these legal AI systems perform on practical tasks, and especially how the work of generative AI tools compares to that of a human lawyer.
Read the report for full results.
2/25/2025
Model
Anthropic's Claude 3.7 Sonnet Evaluated on All Benchmarks.
We just evaluated Anthropic’s Claude 3.7 Sonnet model!
- We evaluated the model with Thinking Disabled on all benchmarks. It shows great performance, reaching second place on CorpFin, just behind its Thinking-Enabled counterpart.
- We also evaluated the model with Thinking Enabled. Unlike most models that excel in specific areas, Anthropic’s Claude 3.7 Sonnet (Thinking) demonstrates remarkable consistency, achieving top-tier performance across all evaluated benchmarks. The remaining two benchmarks are currently in progress due to their higher token requirements.
We have also run Gemini 2.0 Flash Thinking Exp and Gemini 2.0 Pro Exp on most benchmarks.
2/3/2025
Model
OpenAI's o3-mini Evaluated on All Benchmarks.
We just evaluated OpenAI’s o3-mini model!
- The model shows a good price-performance trade-off, reaching near-top placements on our most recent proprietary benchmarks, such as TaxEval.
- However, o3-mini seems to struggle with large context windows, performing poorly on the Max Fitting Context task of CorpFin. It tends to lose track of the question when the question is provided at the beginning of a large context window (around 150k tokens or more); a sketch of this kind of placement probe follows below.
We have also run DeepSeek R1 on our CorpFin benchmark, on which it reaches the top place, beating all other models we have tested.
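The kind of probe that surfaces this failure mode can be sketched as follows; the filler construction and the `ask` stub are assumptions for illustration, not the CorpFin Max Fitting Context task itself.

```python
# Illustrative probe: the filler construction and ask() stub are assumptions,
# not the actual CorpFin Max Fitting Context task.
def build_prompt(question: str, context: str, question_first: bool) -> str:
    """Place the question either before or after a very long context block."""
    return f"{question}\n\n{context}" if question_first else f"{context}\n\n{question}"

def ask(prompt: str) -> str:
    """Stub so the sketch runs; a real probe calls the model under test."""
    return "stub answer"

def placement_probe(question: str, context: str) -> dict[str, str]:
    """Compare answers when the question leads vs. trails a very long context."""
    return {
        "question_first": ask(build_prompt(question, context, question_first=True)),
        "question_last": ask(build_prompt(question, context, question_first=False)),
    }

# Long filler standing in for a 200-300 page filing.
long_context = "Lorem ipsum. " * 50_000
print(placement_probe("What was total revenue in FY2023?", long_context))
```

If accuracy drops sharply in the question-first arrangement, the model is losing the question as the context grows, which is the behavior we observed with o3-mini.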
1/28/2025
Model
DeepSeek R1 Evaluated on TaxEval, CaseLaw, ContractLaw
🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳
- The model demonstrates its strong reasoning ability, rivaling OpenAI's o1 model on our TaxEval dataset.
- However, R1 performs extremely poorly on ContractLaw and only middling on CaseLaw. The model's performance is not uniform, suggesting task-specific evaluation should be done before adoption.
- Overall, this large Chinese model shows impressive ability and further closes the gap between closed-source and open-source models.
1/27/2025
Benchmark
Two New Proprietary Benchmarks Released
We just released two new benchmarks!
- We have released a completely new version of our CorpFin benchmark, with 1200 expert-generated financial questions on very long context documents (200-300 pages).
- We have also released a completely new TaxEval benchmark, with more than 1500 expert-reviewed tax questions.
We are also releasing results for several new models, such as Grok 2 and Gemini 2.0 Flash Exp.
1/27/2025
Benchmark
New Medical Benchmark Released
Vals AI and Graphite Digital partnered to release the first medical benchmark on Vals AI.
This report offers the first third-party, highly exhaustive evaluation of over 15 of the most popular LLMs on graduate-level medical questions.
We assessed models under two conditions: unbiased and bias-injected questions, measuring the models’ general accuracy and the ability to handle racial bias in medical contexts.
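As a rough sketch of this two-condition setup (the pairing structure and the `grade` stub are illustrative assumptions, not the report's actual protocol):

```python
# Illustrative only: the pairing structure and grade() stub are assumptions,
# not the report's actual protocol.
def grade(question: str, reference: str) -> int:
    """Stub grader so the sketch runs; a real harness queries the model and
    returns 1 for a correct answer, 0 otherwise."""
    return 1

def paired_accuracy(items: list[dict]) -> dict[str, float]:
    """Each item carries the same clinical question in a neutral form and a
    bias-injected form; accuracy is tracked separately per condition."""
    totals = {"unbiased": 0, "bias_injected": 0}
    for item in items:
        totals["unbiased"] += grade(item["unbiased_question"], item["answer"])
        totals["bias_injected"] += grade(item["bias_injected_question"], item["answer"])
    return {cond: count / len(items) for cond, count in totals.items()}
```

The gap between the two per-condition accuracies is then a simple proxy for how much injected bias degrades a model's answers.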
Our top-performing model was OpenAI's o1 Preview, and the best value was Meta's Llama 3.1 70B.
Read the full report to find out more!
12/11/2024
News
Refresh to Vals AI
We’ve just implemented a re-design of this benchmarking website!
Apart from being easier on the eyes, this new version of the site is much more useful.
- Model cards are displayed on their own dedicated pages, showing results across all benchmarks.
- Every Benchmark page is time-stamped and updated with changelogs.
- Our Methodology page now shares more details around our approach and plan.
11/10/2024
Model
Results for the new 3.5 Sonnet (Latest) model
- On LegalBench, it's now exactly tied with GPT-4o, and it beats 4o on CorpFin and CaseLaw.
- It usually, but not always, performs a few percentage points better than the previous version: for example, on LegalBench (+1.3%), ContractLaw Overall (+0.5%), and CorpFin (+0.8%).
- There are some instances where it experienced a performance regression, including TaxEval Free Response (-3.2%) and CaseLaw Overall (-0.1%).
- Although it's competitive with 4o, it's still not at the level of OpenAI's o1, which still claims the top spots on almost all of our leaderboards.
10/31/2024
News
Vals AI Legal Report Announced
Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.
The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it's representative of real legal work.
The report will be published in early 2025.