10/16/2025
We evaluated Claude Haiku 4.5 (Thinking) and found strong performance:
Overall, Haiku 4.5 sits firmly on the Pareto frontier.
10/8/2025
Our new Student Assessment with Generative Evaluation (SAGE) benchmark evaluates the ability of language models to grade handwritten student work at the undergraduate level.
We find significant room for improvement in models’ capabilities here, and are excited for the SAGE benchmark to enable continued model development for education!
9/30/2025
We evaluated Magistral Medium 1.2 (09/2025) and Magistral Small 1.2 (09/2025), and found that both models perform decently for their size, especially on coding tasks. However, the models also struggled on many benchmarks.
The Medium model is priced at $2 / $5 (input/output per million tokens), and the Small at $0.50 / $1.50. The Small model has open weights, whereas the Medium model is only available via API.
9/29/2025
We ran the recently-released Claude Sonnet 4.5 (Thinking) on our benchmarks, and found very strong performance:
Overall, this model is extremely capable, at the same mid-range price point as its predecessor.
9/26/2025
We evaluated Gemini 2.5 Flash Preview (9/25) (Thinking), along with the Flash Lite model, and found the following:
Compared to the previous version, Flash improved on Terminal Bench (+5%), GPQA (+17.2%), and our private Corp Fin benchmark (+4.4%). It also ranks #3/38 on MMMU and #6/20 on SWE-Bench (for the thinking model), while delivering this performance at half the cost of similar models.
Flash Lite matches Flash on several public benchmarks, making it a very cost-effective option. However, Flash outperforms Lite by ~10% on our private benchmarks (CaseLaw v2, TaxEval, Mortgage Tax).
Overall, the latest update to Gemini 2.5 Flash balances strong performance with low cost, delivering competitive results at a fraction of the price of other foundation models.
9/25/2025
We evaluated Qwen 3 Max and found the following:
Qwen 3 Max breaks the top 20 on 5 benchmarks, with leading open source scores on LiveCodeBench, MMLU Pro, and GPQA.
Overall scores are similar to Qwen 3 Max Preview, with gains on just three benchmarks: FAB (+15%), IOI (+8%), and AIME (+17.2%). However, Qwen 3 Max Preview far outperformed Max on LegalBench (78.9% vs. 38.1%).
Our results were lower than Alibaba’s reported AIME25 score by 11.6%, though LiveCodeBench results aligned closely.
Overall, Qwen 3 Max is a solid, middle-of-the-pack model. It shows incremental gains over Qwen 3 Max Preview on FAB, IOI, and AIME, while underperforming on LegalBench.
9/24/2025
We evaluated GPT 5 Codex across Terminal Bench, SWE-bench, IOI, and LiveCodeBench, finding the following:
GPT 5 Codex is optimized for agentic coding, particularly within OpenAI’s Codex offering. For standardization, we used the same prompts and templates as with other models when running our evaluations.
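As a rough illustration of what that standardization means in practice, here is a minimal, hypothetical sketch (the template wording and task fields are ours, not our exact prompts): every model receives the identical rendered prompt for a given task, so score differences come from the model rather than from prompt tuning.

```python
# Minimal sketch of a shared prompt template (hypothetical wording and fields).
from dataclasses import dataclass

CODING_TEMPLATE = (
    "You are a software engineer working on the repository {repo}.\n"
    "Issue description:\n{issue}\n\n"
    "Return a patch in unified diff format that resolves the issue."
)

@dataclass
class CodingTask:
    repo: str
    issue: str

def build_prompt(task: CodingTask) -> str:
    # The same rendered prompt is sent to GPT 5 Codex and to every other
    # model we evaluate; only the model behind the API call changes.
    return CODING_TEMPLATE.format(repo=task.repo, issue=task.issue)
```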
9/20/2025
We evaluated Grok 4 Fast (Reasoning) and Grok 4 Fast (Non-Reasoning), and found the following:
Overall, this is a very performant model from the xAI team for its price and latency.
9/18/2025
We evaluated the top foundation AI models on Terminal Bench and found that GPT 5 claimed the #1 spot, scoring 48.8% across all 80 tasks.
Terminal Bench tests AI agents’ ability to perform real-world tasks using only the terminal.
Overall, models struggle on this benchmark. Even the latest flagship models fail to break 50% average accuracy overall.
Additionally, we found that performance drops sharply as tasks get harder: average accuracy falls from 63% on easy tasks to 16% on hard tasks.
Common failure modes we observed included models not waiting for a process to finish before sending the next command, missing edge cases, and crashing the terminal entirely.
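As a concrete illustration of that first failure mode, a terminal agent needs to block on each command (ideally with a timeout) before issuing the next one. The sketch below uses hypothetical commands and is not taken from the benchmark harness.

```python
# Hypothetical sketch: run one command at a time and wait for it to finish
# (with a timeout) instead of firing the next command against stale output.
import subprocess

def run_and_wait(command: str, timeout_s: int = 180) -> str:
    """Run a shell command, block until it exits, and return combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout + result.stderr

# An agent that sends `./run_tests.sh` before `make build` has finished will
# act on incomplete output; waiting on each step avoids that failure mode.
build_log = run_and_wait("make build")      # hypothetical build step
test_log = run_and_wait("./run_tests.sh")   # hypothetical test step
```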
Latency was also high across the board, with models taking a couple of minutes per task at the low end and up to three minutes at the high end.
9/08/2025
We evaluated Qwen 3 Max Preview on our benchmarks. Despite the model’s large size, we found the performance did not live up to the hype.
On our benchmarks, it was generally in the middle of the pack: not in the top 5 on any benchmark, and outside the top 20 on most. On Finance Agent, it managed only 17% accuracy.
Qwen 3 Max Preview did have comparatively strong performance on MGSM and GPQA, but these benchmarks are saturated, and incremental gains here do not signify meaningful differences in model intelligence.
This model is not open source, even though open weights are one of the main benefits of the Qwen series. It is also more expensive than its open-source counterpart, Qwen 3 (235B), yet often performs worse.
Alibaba has currently released only the non-reasoning version of Max Preview. We’re excited to benchmark the reasoning version when it’s available, which may improve results on benchmarks like AIME.
8/29/2025
We evaluated Z.ai’s GLM 4.5 model and found the following:
GLM 4.5 definitely still has room for improvement. We’re looking forward to seeing how open-source models continue to progress, but for now there is still a long way to go.
8/27/2025
We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, but significantly worse than xAI’s flagship model Grok 4 in general. Our findings are below:
Grok Code Fast is a snappier (and cheaper) model optimized for coding. While it leaves significant room for improvement relative to other frontier models, including xAI’s Grok 4, it performs competitively on practical coding tasks while offering clear latency and cost benefits.
8/26/2025
GPT 5 achieved the highest overall accuracy on SWE-Bench, attaining an impressive 68.8%!
The released results come from running the model with the following settings:
Evaluated across 500 benchmark instances spanning four difficulty-based task categories, GPT 5 ranked first in every category except the “>4 hours” group, where it was one of four models tied at a 33% completion rate on the most challenging tasks.
These results demonstrate that GPT 5 represents a significant advancement over previous OpenAI models.
8/18/2025
Our CaseLaw benchmark studies how well language models can perform case law reasoning and legal document analysis. We refreshed the benchmark with harder, up-to-date questions, since the first version was becoming saturated.
From our evaluations, we found:
While top models performed well, many still struggled with the nuanced interpretation required for legal analysis. CaseLaw v2 highlights both current strengths and the work ahead for applying AI in legal workflows.
8/11/2025
Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). This suggests that advanced models are saturating IMO, so we decided to test models on the International Olympiad in Informatics (IOI)!
From our evaluations, we found:
8/9/2025
We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.
On our private benchmarks, Claude Opus 4.1 (Thinking) lands squarely in the middle of the pack — barely making the top 10 on our TaxEval benchmark.
On public benchmarks, however, Claude Opus 4.1 (Thinking) ranks in the top 10 on 6 of the benchmarks we evaluated. Notably, it takes 2nd place on MMLU Pro behind only Claude Opus 4.1 (Nonthinking) and claims 1st place on MGSM.
8/8/2025
We just released results for Claude Opus 4.1 (Nonthinking) and found that, despite achieving top spots on MMLU Pro and MGSM, it performs only marginally better than Claude Opus 4 (Nonthinking) across almost all of our benchmarks (<2% performance gain).
On our private benchmarks, Opus 4.1 fails to place among the top 10 models. On public benchmarks, however, the model breaks the top 10 on 5 of the 9 public benchmarks we evaluated. This signals the need for more private benchmarks to evaluate meaningful differences between models and gauge true performance.
8/7/2025
We evaluated OpenAI’s newly-released GPT 5 family of models and found that GPT 5 achieves SOTA performance for a fraction of the cost compared to similarly performing models.
GPT 5 is the strongest of the three, with SOTA performance on the public LegalBench and AIME benchmarks.
On private benchmarks, GPT 5 and GPT 5 Mini achieve top 10 performance on all but CaseLaw. Most notably, GPT 5 Mini is the new SOTA model on TaxEval at a substantially lower cost.
On public benchmarks, GPT 5 places top 5 and GPT 5 Mini places top 10 on nearly everything. Further, we found the two models have complementary strengths - GPT 5 is SOTA on LegalBench and AIME, while GPT 5 Mini is SOTA on LiveCodeBench.
Lastly, GPT 5 Nano achieves middle-of-the-pack performance across the board. It narrowly places in the top 10 on AIME, compared to GPT 5 and GPT 5 Mini, which top the charts.
7/30/2025
Our SWE-bench evaluation of Kimi K2 Instruct achieved 34% accuracy, barely more than half of Kimi’s published figures!
After investigating the model responses, we identified the following two sources of error:
7/29/2025
We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found the Thinking variant substantially outperforms the Nonthinking variant, with the significant exception of our proprietary Contract Law benchmark.
On Contract Law, Llama 3.3 Nemotron Super (Thinking) struggles, ranking in the bottom 10 models. Meanwhile, Llama 3.3 Nemotron Super (Nonthinking) lands in the top 3!
On TaxEval and CaseLaw, Llama 3.3 Nemotron Super (Nonthinking) struggles significantly, while Llama 3.3 Nemotron Super (Thinking) sits solidly middle-of-the-pack.
On public benchmarks, Llama 3.3 Nemotron Super (Nonthinking) performs abysmally across the board. Llama 3.3 Nemotron Super (Thinking) improves on all public benchmarks but still struggles, particularly on MGSM (35/46) and MMLU Pro (31/43).
Llama 3.3 Nemotron Super (Thinking) shows substantial gains over Llama 3.3 Nemotron Super (Nonthinking): on AIME, its rank improves from 37/44 to 14/44, and on Case Law, accuracy increases by 12%. These results highlight the benefits of the reasoning model.
7/22/2025
We found that Kimi K2 Instruct is the new state-of-the-art open-source model according to our evaluations.
The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.
On our proprietary benchmarks, however, Kimi K2 Instruct fails to break the top 10 on any of them. We noticed it particularly struggles with legal tasks such as Case Law and Contract Law but performs comparatively better on finance tasks such as Corp Fin and TaxEval.
The model offers solid value at $1.00 input/$3.00 output per million tokens, which is cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!
7/17/2025
We evaluated Grok 4 on the Finance Agent, CorpFin, SWE-bench, and LegalBench benchmarks and found strong results, especially on our private benchmarks.
On CorpFin, the model achieves state-of-the-art performance, placing first by the largest margin of any model in the top 10!
Grok 4 ranks in the top 10 for model performance on the Finance Agent benchmark.
On LegalBench, Grok 4 places second behind Gemini 2.5 Pro Preview, illustrating potential saturation on this public legal benchmark.
On SWE-bench, Grok 4 places second only to Claude Sonnet 4 (Nonthinking) and shows a 15% improvement over previous Grok 4 results. Though Grok 4 uses tools 50% less often than Claude Sonnet 4 (Nonthinking), this does not translate into a lower overall cost.
7/13/2025
In the livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found a sizeable gap between its public and private benchmark results. Grok 4 struggles to recognize unseen images, highlighting the importance of high-quality private datasets for evaluating image recognition capabilities.
As we continue to evaluate Grok 4 on our benchmarks, the model continues to struggle on our private ones. The middling performance on Tax Eval (67.6%) and Mortgage Tax (57.5%) is consistent with previous findings on our private legal tasks like Case Law and Contract Law.
On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).
7/11/2025
We found that Grok 4 struggles on our private benchmarks, in contrast to SOTA performance on AIME, Math 500, and GPQA.
Grok 4 delivers middle-of-the-pack performance on our private legal benchmarks. The model scores 80.6% on Case Law and 66.0% on Contract Law, underperforming Grok 3 Mini Fast Low Reasoning on both and Grok 3 on Case Law. Notably, Grok 3 remains our top performer on the Case Law benchmark.
On public benchmarks, Grok 4 barely cracks the top 10 on MedQA at 92.5%, narrowly outperforming Grok 2. On MGSM, it fails to break the top 10 with 90.9%. This contrasts with its SOTA performance on Math 500, suggesting Grok 4 struggles more with language than with mathematical reasoning.
7/9/2025
We received early access to xAI’s latest Grok 4 and ran an initial set of smaller benchmarks. These early results show incredible performance: the model sets a new state-of-the-art on the AIME, GPQA, and Math 500 benchmarks! Grok 4 is extremely capable at answering challenging math and science questions.
We are continuing to run our evaluations on our private benchmarks and will release results shortly.
6/13/2025
Foundation models still fail to solve real-world coding problems despite notable progress, highlighting remaining room for improvement.
The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT 4.1 pass any of the >4 hour tasks (33% each).
Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, and maintains both excellent cost efficiency at $1.24 per test and fast completion times (426.52s).
Tool usage patterns reveal models employ distinct strategies. o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).
Note that we run every model through the same evaluation harness to make direct comparisons between models, so the scores show relative performance, not each model’s best possible accuracy.
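For readers who want a mental model of that harness, here is a minimal, hypothetical sketch (the function and field names are ours): every model sees the same tasks, the same prompts, and the same grader, so the resulting scores are directly comparable rather than each model’s best-tuned number.

```python
# Minimal sketch of a shared evaluation harness (hypothetical names).
from typing import Callable, TypedDict

class Task(TypedDict):
    prompt: str                      # identical prompt for every model
    grader: Callable[[str], bool]    # identical grading logic for every model

def evaluate(model_call: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks the model answers correctly."""
    correct = 0
    for task in tasks:
        answer = model_call(task["prompt"])    # only the model under test varies
        correct += int(task["grader"](answer))
    return correct / len(tasks)

# Example usage: scores = {name: evaluate(call, tasks) for name, call in models.items()}
```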
5/30/2025
We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!
We found:
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).
5/27/2025
We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!
The full writeups are linked in the comments. The final verdict on the Claude 4 family’s strengths will come from Opus 4, so stay tuned for those results!
5/25/2025
We just evaluated Claude Sonnet 4 (Nonthinking) on all benchmarks!
Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!
5/9/2025
We just evaluated Mistral Medium 3 on all benchmarks!
Mistral Medium 3 demonstrates consistent performance across both public and proprietary benchmarks, scoring 68.7% overall accuracy with strong results on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42) given its size and price.
The model outperforms Llama 4 Maverick (63.3% overall accuracy) on most benchmarks, though it trails slightly on MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).
While impressive, Mistral Medium 3 still trails behind Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).
For users seeking speed-performance balance, Mistral Medium 3 offers good latency (14.37s) compared to Qwen 3 235B (94.31s), making it suitable for applications requiring faster response times while maintaining strong reasoning capabilities.
5/5/2025
We just evaluated Gemini 2.5 Flash Preview (Nonthinking) on most benchmarks.
5/5/2025
We just evaluated Qwen 3 235B on all benchmarks!
Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.
With its “thinking” mode enabled, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4 Mini, on mathematical reasoning tasks.
Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly struggling on TaxEval where it ranks #29 out of 43 evaluated models.
This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvements could enhance its performance on domain-specific tasks.
4/22/2025
4/18/2025
We just evaluated o3 and o4 Mini on all benchmarks!
o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), GPQA (#1/35) and proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).
o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).
Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).
Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a nearly tenfold price difference that makes o4 Mini the more economical choice for many use cases.
4/15/2025
We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!
GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.
Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, #4/53) and MMLU Pro (80.5%, #6/33).
GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.
Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, #10/36) and MGSM (87.9%, #20/34).
Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, #30/33) and MGSM (69.8%, #32/34).
4/11/2025
We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!
Grok 3 Beta delivers impressive results with a 78.1% average accuracy across benchmarks and a snappy 15.52s latency.
Dominates proprietary benchmarks! Grok 3 Beta ranks #1 on three key benchmarks: CorpFin (69.1%), CaseLaw (88.1%), and TaxEval (78.8%).
Grok 3 Mini Fast Beta (High Reasoning) surprises with an even higher average accuracy of 81.6% despite being a smaller model.
Mathematical prowess! Grok 3 Mini Fast Beta (High Reasoning) takes the #2 spot (94.2%) on Math500 and the #3 spot (85.0%) on AIME.
4/7/2025
We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!
4/4/2025
We just evaluated Mistral Small 2503 on all benchmarks!
3/28/2025
We just evaluated Gemini 2.5 Pro Exp on all benchmarks!
3/26/2025
We just evaluated DeepSeek V3 on all benchmarks!
3/26/2025
Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
3/24/2025
We just evaluated Command A on all benchmarks!
3/13/2025
We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!
3/11/2025
Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning, and two on general question-answering.
Unlike the results released by model providers on these benchmarks, ours were produced with a consistent methodology and prompt template across models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page:
3/5/2025
We just released a new benchmark in partnership with Vontive!
Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the other top 3 models are all from Anthropic.
2/27/2025
We just released the VLAIR! Whereas our previous benchmarks study foundation model performance, this report investigates the ability of the most popular legal AI products to perform real-world legal tasks.
To build a large, high-quality dataset, we worked with some of the top global law firms, including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and Paul Hastings, among others. This is also the first benchmark for which we collected a human baseline against which to measure performance.
In sum, this enabled us to study how these legal AI systems perform on practical tasks, and especially how the work of generative AI tools compares to that of a human lawyer.
Read the report for full results.
2/25/2025
We just evaluated Anthropic’s Claude 3.7 Sonnet (Nonthinking) model!
We have also run Gemini 2.0 Flash Thinking Exp and Gemini 2.0 Pro Exp on most benchmarks.
2/3/2025
We just evaluated OpenAI’s o3-mini model!
We have also run DeepSeek R1 on our CorpFin benchmark, on which it reaches the top place, beating all other models we have tested.
1/28/2025
🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳
1/27/2025
We just released two new benchmarks!
We are also releasing results for several new models, such as Grok 2 and Gemini 2.0 Flash Exp.
1/27/2025
Vals AI and Graphite Digital partnered to release the first medical benchmark on Vals AI.
This report offers the first third-party, highly exhaustive evaluation of over 15 of the most popular LLMs on graduate-level medical questions.
We assessed models under two conditions, unbiased and bias-injected questions, measuring their general accuracy and their ability to handle racial bias in medical contexts.
Our top-performing model was OpenAI’s o1 Preview, and the best value was Meta’s Llama 3.1 70B.
Read the full report to find out more!
12/11/2024
We’ve just implemented a re-design of this benchmarking website!
Apart from being easier on the eyes, this new version of the site is much more useful.
11/10/2024
10/31/2024
Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.
The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it is representative of real legal work.
The report will be published in early 2025.