10/16/2025
We evaluated Claude Haiku 4.5 (Thinking) and found strong performance:
Overall, Haiku 4.5 sits firmly on the Pareto frontier.
10/8/2025
Our new Student Assessment with Generative Evaluation (SAGE) benchmark evaluates the ability of language models to grade handwritten student work at the undergraduate level.
We find significant room for improvement in models’ capabilities here, and are excited for the SAGE benchmark to enable continued model development for education!
9/30/2025
We evaluated Magistral Medium 1.2 (09/2025) and Magistral Small 1.2 (09/2025), and found that both models perform decently for their size, especially on coding tasks. However, the models also struggled on many benchmarks.
The Medium model is priced at $2 / $5 (input/output per million tokens), and the Small at $0.50 / $1.50. The Small model has open weights, whereas the Medium model is only available via API.
9/29/2025
We ran the recently-released Claude Sonnet 4.5 (Thinking) on our benchmarks, and found very strong performance:
Overall, this model is extremely capable, at the same mid-range price point as its predecessor.
9/26/2025
We evaluated Gemini 2.5 Flash Preview (9/25) (Thinking), along with the Flash Lite model, and found the following:
Compared to the previous version, Flash improved on Terminal Bench (+5%), GPQA (+17.2%), and our private Corp Fin benchmark (+4.4%). It also ranks #3/38 on MMMU and #6/20 on SWE-Bench (for the thinking model), while delivering this performance at half the cost of similar models.
Flash Lite matches Flash on several public benchmarks, making it a very cost-effective option. However, Flash outperforms Lite by ~10% on our private benchmarks (CaseLaw v2, TaxEval, Mortgage Tax).
Overall, the latest update to Gemini 2.5 Flash balances strong performance with low cost, delivering competitive results at a fraction of the price of other foundation models.
9/25/2025
We evaluated Qwen 3 Max and found the following:
Qwen 3 Max breaks the top 20 on 5 benchmarks, with leading open source scores on LiveCodeBench, MMLU Pro, and GPQA.
Overall scores are similar to Qwen 3 Max Preview, with gains on just three benchmarks: FAB (+15%), IOI (+8%), and AIME (+17.2%). However, Qwen 3 Max Preview far outperformed Max on LegalBench (78.9% vs. 38.1%).
Our results were lower than Alibaba’s reported AIME25 score by 11.6%, though LiveCodeBench results aligned closely.
Overall, Qwen 3 Max is a solid, middle-of-the-pack model. It shows incremental gains over Qwen 3 Max Preview on FAB, IOI, and AIME, while underperforming on LegalBench.
9/24/2025
We evaluated GPT 5 Codex across Terminal Bench, SWE-bench, IOI, and LiveCodeBench, finding the following:
GPT 5 Codex is optimized for agentic coding, particularly within OpenAI’s Codex offering. For standardization, we used the same prompts and templates as with other models when running our evaluations.
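As a rough illustration of what that standardization means in practice, here is a minimal, hypothetical sketch (the template wording and task fields are ours, not our exact prompts): every model receives the identical rendered prompt for a given task, so score differences come from the model rather than from prompt tuning.

```python
# Minimal sketch of a shared prompt template (hypothetical wording and fields).
from dataclasses import dataclass

CODING_TEMPLATE = (
    "You are a software engineer working on the repository {repo}.\n"
    "Issue description:\n{issue}\n\n"
    "Return a patch in unified diff format that resolves the issue."
)

@dataclass
class CodingTask:
    repo: str
    issue: str

def build_prompt(task: CodingTask) -> str:
    # The same rendered prompt is sent to GPT 5 Codex and to every other
    # model we evaluate; only the model behind the API call changes.
    return CODING_TEMPLATE.format(repo=task.repo, issue=task.issue)
```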
9/20/2025
We evaluated Grok 4 Fast (Reasoning) and Grok 4 Fast (Non-Reasoning), and found the following:
Overall, this is a very performant model from the xAI team for its price and latency.
9/18/2025
We evaluated the top foundation AI models on Terminal Bench and found that GPT 5 claimed the #1 spot, scoring 48.8% across all 80 tasks.
Terminal Bench tests AI agents’ ability to perform real-world tasks using only the terminal.
Overall, models struggle on this benchmark. Even the latest flagship models fail to break 50% average accuracy overall.
Additionally, we found that performance drops sharply as tasks get harder: average accuracy falls from 63% on easy tasks to 16% on hard tasks.
Common failure modes we observed included models not waiting for a process to finish before sending the next command, missing edge cases, and crashing the terminal entirely.
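As a concrete illustration of that first failure mode, a terminal agent needs to block on each command (ideally with a timeout) before issuing the next one. The sketch below uses hypothetical commands and is not taken from the benchmark harness.

```python
# Hypothetical sketch: run one command at a time and wait for it to finish
# (with a timeout) instead of firing the next command against stale output.
import subprocess

def run_and_wait(command: str, timeout_s: int = 180) -> str:
    """Run a shell command, block until it exits, and return combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout + result.stderr

# An agent that sends `./run_tests.sh` before `make build` has finished will
# act on incomplete output; waiting on each step avoids that failure mode.
build_log = run_and_wait("make build")      # hypothetical build step
test_log = run_and_wait("./run_tests.sh")   # hypothetical test step
```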
Latency was also high across the board, with models taking a couple of minutes per task at the low end and up to three minutes at the high end.
9/08/2025
We evaluated Qwen 3 Max Preview on our benchmarks. Despite the model’s large size, we found the performance did not live up to the hype.
On our benchmarks, it was generally in the middle of the pack: not in the top 5 on any benchmark, and outside the top 20 on most. On Finance Agent, it managed only 17% accuracy.
Qwen 3 Max Preview did have comparatively strong performance on MGSM and GPQA, but these benchmarks are saturated, and incremental gains here do not signify meaningful differences in model intelligence.
This model is not open source, even though open weights are one of the main benefits of the Qwen series. It is also more expensive than its open-source counterpart, Qwen 3 (235B), yet often performs worse.
Alibaba has currently released only the non-reasoning version of Max Preview. We’re excited to benchmark the reasoning version when it’s available, which may improve results on benchmarks like AIME.
8/29/2025
We evaluated Z.ai’s GLM 4.5 model and found the following:
GLM 4.5 definitely still has room for improvement. We’re looking forward to seeing how open-source models continue to progress, but for now there is still a long way to go.
8/27/2025
We evaluated xAI’s Grok Code Fast on three of our coding benchmarks and found it to be much faster (and cheaper) for practical coding tasks, but significantly worse than xAI’s flagship model Grok 4 in general. Our findings are below:
Grok Code Fast is a snappier (and cheaper) model optimized for coding. While it leaves significant room for improvement relative to other frontier models, including xAI’s Grok 4, it performs competitively on practical coding tasks while offering clear latency and cost benefits.
8/26/2025
GPT 5 achieved the highest overall accuracy on SWE-Bench, attaining an impressive 68.8%!
The released results come from running the model with the following settings:
Evaluated across 500 benchmark instances spanning four difficulty-based task categories, GPT 5 ranked first in every category except the “>4 hours” group, where it was one of four models tied at a 33% completion rate on the most challenging tasks.
These results demonstrate that GPT 5 represents a significant advancement over previous OpenAI models.
8/18/2025
Our CaseLaw benchmark studies how well language models can perform case law reasoning and legal document analysis. We refreshed the benchmark with harder, up-to-date questions, since the first version was becoming saturated.
From our evaluations, we found:
While top models performed well, many still struggled with the nuanced interpretation required for legal analysis. CaseLaw v2 highlights both current strengths and the work ahead for applying AI in legal workflows.
8/11/2025
Recently, top LLM labs like OpenAI and Google reported that their models achieved gold medals on the International Mathematical Olympiad (IMO). This suggests that advanced models are saturating IMO, so we decided to test models on the International Olympiad in Informatics (IOI)!
From our evaluations, we found:
8/9/2025
We just evaluated Claude Opus 4.1 (Thinking) on our non-agentic benchmarks. While it placed in the top 10 on 6 of our public benchmarks, its performance on our private benchmarks was fairly mediocre.
On our private benchmarks, Claude Opus 4.1 (Thinking) lands squarely in the middle of the pack — barely making the top 10 on our TaxEval benchmark.
On public benchmarks, however, Claude Opus 4.1 (Thinking) ranks in the top 10 on 6 of the benchmarks we evaluated. Notably, it takes 2nd place on MMLU Pro behind only Claude Opus 4.1 (Nonthinking) and claims 1st place on MGSM.
8/8/2025
We just released results for Claude Opus 4.1 (Nonthinking) and found that, despite achieving top spots on MMLU Pro and MGSM, it performs only marginally better than Claude Opus 4 (Nonthinking) across almost all of our benchmarks (<2% performance gain).
On our private benchmarks, Opus 4.1 fails to place among the top 10 models. On public benchmarks, however, the model breaks the top 10 on 5 of the 9 public benchmarks we evaluated. This signals the need for more private benchmarks to evaluate meaningful differences between models and gauge true performance.
8/7/2025
We evaluated OpenAI’s newly-released GPT 5 family of models and found that GPT 5 achieves SOTA performance for a fraction of the cost compared to similarly performing models.
GPT 5 is the strongest of the three, with SOTA performance on the public LegalBench and AIME benchmarks.
On private benchmarks, GPT 5 and GPT 5 Mini achieve top 10 performance on all but CaseLaw. Most notably, GPT 5 Mini is the new SOTA model on TaxEval at a substantially lower cost.
On public benchmarks, GPT 5 places top 5 and GPT 5 Mini places top 10 on nearly everything. Further, we found the two models have complementary strengths - GPT 5 is SOTA on LegalBench and AIME, while GPT 5 Mini is SOTA on LiveCodeBench.
Lastly, GPT 5 Nano achieves middle-of-the-pack performance across the board. It narrowly places in the top 10 on AIME, compared to GPT 5 and GPT 5 Mini, which top the charts.
7/30/2025
Our SWE-bench evaluation of Kimi K2 Instruct achieved 34% accuracy, barely more than half of Kimi’s published figures!
After investigating the model responses, we identified the following two sources of error:
7/29/2025
We evaluated Llama 3.3 Nemotron Super (Nonthinking) and Llama 3.3 Nemotron Super (Thinking) and found the Thinking variant substantially outperforms the Nonthinking variant, with the significant exception of our proprietary Contract Law benchmark.
On Contract Law, Llama 3.3 Nemotron Super (Thinking) struggles, ranking in the bottom 10 models. Meanwhile, Llama 3.3 Nemotron Super (Nonthinking) lands in the top 3!
On TaxEval and CaseLaw, Llama 3.3 Nemotron Super (Nonthinking) struggles significantly, while Llama 3.3 Nemotron Super (Thinking) sits solidly middle-of-the-pack.
On public benchmarks, Llama 3.3 Nemotron Super (Nonthinking) performs abysmally across the board. Llama 3.3 Nemotron Super (Thinking) improves on all public benchmarks but still struggles, particularly on MGSM (35/46) and MMLU Pro (31/43).
Llama 3.3 Nemotron Super (Thinking) shows substantial gains over Llama 3.3 Nemotron Super (Nonthinking): on AIME, its rank improves from 37/44 to 14/44, and on Case Law, accuracy increases by 12%. These results highlight the benefits of the reasoning model.
7/22/2025
We found that Kimi K2 Instruct is the new state-of-the-art open-source model according to our evaluations.
The model cracks the top 10 on Math500 and LiveCodeBench, narrowly beating out DeepSeek R1 on both. On other public benchmarks, however, Kimi K2 Instruct delivers middle-of-the-pack performance.
On our proprietary benchmarks, however, Kimi K2 Instruct fails to break the top 10 on any of them. We noticed it particularly struggles with legal tasks such as Case Law and Contract Law but performs comparatively better on finance tasks such as Corp Fin and TaxEval.
The model offers solid value at $1.00 input/$3.00 output per million tokens, which is cheaper than DeepSeek R1 ($3.00/$7.00, both as hosted on Together AI) but more expensive than Mistral Medium 3.1 (05/2025) ($0.40/$2.00).
We’re currently evaluating the model on SWE-bench, on which Kimi’s reported accuracy would top our leaderboard. Looking forward to seeing whether the model can live up to the hype!
7/17/2025
We evaluated Grok 4 on the Finance Agent, CorpFin, SWE-bench, and LegalBench benchmarks and found strong results, especially on our private benchmarks.
On CorpFin, the model achieves state-of-the-art performance, placing first by the largest margin of any model in the top 10!
Grok 4 ranks in the top 10 for model performance on the Finance Agent benchmark.
On LegalBench, Grok 4 places second behind Gemini 2.5 Pro Preview, illustrating potential saturation on this public legal benchmark.
On SWE-bench, Grok 4 places second only to Claude Sonnet 4 (Nonthinking) and shows a 15% improvement over previous Grok 4 results. Though Grok 4 uses tools 50% less often than Claude Sonnet 4 (Nonthinking), this does not translate into a lower overall cost.
7/13/2025
In the livestream, Elon Musk called Grok 4 “partially blind”. We tested this claim on our two multimodal benchmarks (Mortgage Tax and MMMU) and found a sizeable gap between its public and private benchmark results. Grok 4 struggles to recognize unseen images, highlighting the importance of high-quality private datasets for evaluating image recognition capabilities.
As we continue to evaluate Grok 4 on our benchmarks, the model continues to struggle on our private ones. The middling performance on Tax Eval (67.6%) and Mortgage Tax (57.5%) is consistent with previous findings on our private legal tasks like Case Law and Contract Law.
On public benchmarks, Grok 4 achieves top-10 performance on both MMLU Pro (85.3%) and MMMU (76.5%).
7/11/2025
We found that Grok 4 struggles on our private benchmarks, in contrast to SOTA performance on AIME, Math 500, and GPQA.
Grok 4 delivers middle-of-the-pack performance on our private legal benchmarks. The model scores 80.6% on Case Law and 66.0% on Contract Law, underperforming Grok 3 Mini Fast Low Reasoning on both and Grok 3 on Case Law. Notably, Grok 3 remains our top performer on the Case Law benchmark.
On public benchmarks, Grok 4 barely cracks the top 10 on MedQA at 92.5%, narrowly outperforming Grok 2. On MGSM, it fails to break the top 10 with 90.9%. This contrasts with its SOTA performance on Math 500, suggesting Grok 4 struggles more with language than with mathematical reasoning.
7/9/2025
We received early access to xAI’s latest Grok 4 and ran an initial set of smaller benchmarks. These early results show incredible performance: the model sets a new state-of-the-art on the AIME, GPQA, and Math 500 benchmarks! Grok 4 is extremely capable at answering challenging math and science questions.
We are continuing to run our evaluations on our private benchmarks and will release results shortly.
6/13/2025
Foundation models still fail to solve real-world coding problems despite notable progress, highlighting remaining room for improvement.
The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT 4.1 pass any of the >4 hour tasks (33% each).
Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, and maintains both excellent cost efficiency at $1.24 per test and fast completion times (426.52s).
Tool usage patterns reveal models employ distinct strategies. o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).
Note that we run every model through the same evaluation harness to make direct comparisons between models, so the scores show relative performance, not each model’s best possible accuracy.
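For readers who want a mental model of that harness, here is a minimal, hypothetical sketch (the function and field names are ours): every model sees the same tasks, the same prompts, and the same grader, so the resulting scores are directly comparable rather than each model’s best-tuned number.

```python
# Minimal sketch of a shared evaluation harness (hypothetical names).
from typing import Callable, TypedDict

class Task(TypedDict):
    prompt: str                      # identical prompt for every model
    grader: Callable[[str], bool]    # identical grading logic for every model

def evaluate(model_call: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks the model answers correctly."""
    correct = 0
    for task in tasks:
        answer = model_call(task["prompt"])    # only the model under test varies
        correct += int(task["grader"](answer))
    return correct / len(tasks)

# Example usage: scores = {name: evaluate(call, tasks) for name, call in models.items()}
```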
5/30/2025
We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!
We found:
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).
5/27/2025
We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!
The full writeups are linked in the comments. The final verdict on the Claude 4 family’s strengths will come from Opus 4, so stay tuned for those results!
5/25/2025
We just evaluated Claude Sonnet 4 (Nonthinking) on all benchmarks!
Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!
5/9/2025
We just evaluated Mistral Medium 3 on all benchmarks!
Mistral Medium 3 demonstrates consistent performance across both public and proprietary benchmarks, scoring 68.7% overall accuracy with strong results on CaseLaw (84.9%, #6/59) and Math500 (87.0%, #17/42) given its size and price.
The model outperforms Llama 4 Maverick (63.3% overall accuracy) on most benchmarks, though it trails slightly on MGSM (91.6% vs 92.5%) and MMLU Pro (74.4% vs 79.4%).
While impressive, Mistral Medium 3 still trails behind Qwen 3 235B (81.0% accuracy) on several academic benchmarks, particularly Math500 (87.0% vs 94.6%) and AIME (42.3% vs 84.0%).
For users seeking speed-performance balance, Mistral Medium 3 offers good latency (14.37s) compared to Qwen 3 235B (94.31s), making it suitable for applications requiring faster response times while maintaining strong reasoning capabilities.
5/5/2025
We just evaluated Gemini 2.5 Flash Preview (Nonthinking) on most benchmarks.
5/5/2025
We just evaluated Qwen 3 235B on all benchmarks!
Qwen 3 235B demonstrates exceptional math reasoning capabilities, ranking #3 on Math500, #5 on AIME, and #3 on MGSM.
With its “thinking” mode enabled, Qwen 3 outperforms several prominent closed-source reasoning models, including Claude 3.7 Sonnet and o4 Mini, on mathematical reasoning tasks.
Private benchmark challenges: Qwen 3 shows limitations on proprietary benchmarks, particularly struggling on TaxEval where it ranks #29 out of 43 evaluated models.
This evaluation showcases Qwen 3’s strong specialized reasoning capabilities while highlighting areas where further improvements could enhance its performance on domain-specific tasks.
4/22/2025
4/18/2025
We just evaluated o3 and o4 Mini on all benchmarks!
o3 achieved the #1 overall accuracy ranking on our benchmarks, with exceptional performance on complex reasoning tests like MMMU (#1/22), MMLU Pro (#1/35), GPQA (#1/35) and proprietary benchmarks like TaxEval (#1/42) and CorpFin (#2/35).
o4 Mini achieved the second-highest accuracy across our benchmarks (82.8%), driven by strong performance on public math tests like MGSM (#1/36), MMMU (#2/22), and Math500 (#4/38).
Legal benchmark weaknesses: Both models demonstrated significant weaknesses on our proprietary legal benchmarks, with lower ranks on ContractLaw (o3: #34/62, o4 Mini: #14/62) and CaseLaw (o3: #15/55, o4 Mini: #18/55).
Cost-effectiveness comparison: With similar performance levels, cost becomes a key differentiator. o4 Mini costs $4.40 per million output tokens, compared to $40.00 for o3, a nearly tenfold price difference that makes o4 Mini the more economical choice for many use cases.
4/15/2025
We just evaluated GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Nano on all benchmarks!
GPT 4.1 delivers impressive results with a 75.5% average accuracy across benchmarks.
Impressive performance on proprietary benchmarks! GPT 4.1 is now the leader on CorpFin (71.2%) and shows strong performance on CaseLaw (85.8%, #4/53) and MMLU Pro (80.5%, #6/33).
GPT 4.1 Nano and GPT 4.1 Mini bring AI to time-sensitive applications with outstanding latencies of only 3.62s and 6.60s respectively, while still achieving 59.1% and 75.1% average accuracy.
Compact but capable! Despite its size, GPT 4.1 Mini performs admirably on Math500 (88.8%, #10/36) and MGSM (87.9%, #20/34).
Size versus performance tradeoff: The smaller models do show lower performance on some complex tasks, with GPT 4.1 Nano ranking near the bottom on MMLU Pro (62.3%, #30/33) and MGSM (69.8%, #32/34).
4/11/2025
We just evaluated Grok 3 Beta, Grok 3 Mini Fast Beta (High Reasoning), and Grok 3 Mini Fast Beta (Low Reasoning) on all benchmarks!
Grok 3 Beta delivers impressive results with a 78.1% average accuracy across benchmarks and a snappy 15.52s latency.
Dominates proprietary benchmarks! Grok 3 Beta ranks #1 on three key benchmarks: CorpFin (69.1%), CaseLaw (88.1%), and TaxEval (78.8%).
Grok 3 Mini Fast Beta (High Reasoning) surprises with an even higher average accuracy of 81.6% despite being a smaller model.
Mathematical prowess! Grok 3 Mini Fast Beta (High Reasoning) takes the #2 spot (94.2%) on Math500 and the #3 spot (85.0%) on AIME.
4/7/2025
We just evaluated Llama 4 Maverick and Llama 4 Scout on all benchmarks!
4/4/2025
We just evaluated Mistral Small 2503 on all benchmarks!
3/28/2025
We just evaluated Gemini 2.5 Pro Exp on all benchmarks!
3/26/2025
We just evaluated DeepSeek V3 on all benchmarks!
3/26/2025
Today, we’re releasing results from MMMU (Massive Multi-discipline Multimodal Understanding), a comprehensive evaluation of AI models’ ability to reason across multiple modalities, spanning 30 subjects in 6 major disciplines.
3/24/2025
We just evaluated Command A on all benchmarks!
3/13/2025
We just evaluated Jamba 1.6 Large and Jamba 1.6 Mini models!
3/11/2025
Today, we’ve released five new academic benchmarks on our site: three evaluating mathematical reasoning, and two on general question-answering.
Unlike the results released by model providers on these benchmarks, ours were produced with a consistent methodology and prompt template across models, ensuring an apples-to-apples comparison. You can find detailed information about our evaluation approach on each benchmark’s page:
3/5/2025
We just released a new benchmark in partnership with Vontive!
Claude 3.7 Sonnet leads the pack with 80.6% accuracy, and the other top 3 models are all from Anthropic.
2/27/2025
We just released the VLAIR! Whereas our previous benchmarks study foundation model performance, this report investigates the ability of the most popular legal AI products to perform real-world legal tasks.
To build a large, high-quality dataset, we worked with some of the top global law firms, including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and Paul Hastings, among others. This is also the first benchmark for which we collected a human baseline against which to measure performance.
In sum, this enabled us to study how these legal AI systems perform on practical tasks, and especially how the work of generative AI tools compares to that of a human lawyer.
Read the report for full results.
2/25/2025
We just evaluated Anthropic’s Claude 3.7 Sonnet (Nonthinking) model!
We have also run Gemini 2.0 Flash Thinking Exp and Gemini 2.0 Pro Exp on most benchmarks.
2/3/2025
We just evaluated OpenAI’s o3-mini model!
We have also run DeepSeek R1 on our CorpFin benchmark, on which it reaches the top place, beating all other models we have tested.
1/28/2025
🐳 We just evaluated DeepSeek’s R1 model on three of our private datasets! 🐳
1/27/2025
We just released two new benchmarks!
We are also releasing results for several new models, such as Grok 2 and Gemini 2.0 Flash Exp.
1/27/2025
Vals AI and Graphite Digital partnered to release the first medical benchmark on Vals AI.
This report offers the first third-party, highly exhaustive evaluation of over 15 of the most popular LLMs on graduate-level medical questions.
We assessed models under two conditions, unbiased and bias-injected questions, measuring their general accuracy and their ability to handle racial bias in medical contexts.
Our top-performing model was OpenAI’s o1 Preview, and the best value was Meta’s Llama 3.1 70B.
Read the full report to find out more!
12/11/2024
We’ve just implemented a re-design of this benchmarking website!
Apart from being easier on the eyes, this new version of the site is much more useful.
11/10/2024
10/31/2024
Vals AI and Legaltech Hub are partnering with leading law firms and top legal AI vendors to conduct a first-of-its-kind benchmark.
The study will evaluate the platforms across eight legal tasks, including Document Q&A, Legal Research, and EDGAR Research. All data will be collected from the law firms to ensure it is representative of real legal work.
The report will be published in early 2025.