AIME Benchmark

Key Takeaways

  • Grok 4 is the new top-performing model on AIME, achieving a remarkable 90.6% accuracy, significantly ahead of all other models.
  • o3 Mini and Gemini 2.5 Pro Exp follow in second and third place, with 86.5% and 85.8% accuracy, respectively.
  • o3 and Grok 3 Mini Fast High Reasoning also demonstrate strong performance, both scoring above 85%.
  • Despite these advances, AIME remains an unsaturated and highly challenging benchmark, with no model achieving perfect accuracy.
  • As the AIME questions and answers are publicly available, there is a risk that models may have been exposed to them during pretraining. Notably, models tend to perform better on older (2024) questions compared to the newer 2025 set, raising questions about data contamination and true generalization.

Results

To visualize the performance differences among models, we provide the following scatter plots illustrating accuracy versus latency and price.

AIME


Analysis of the results indicates that correctly answered questions were distributed across the different models, suggesting that no single model has developed a comprehensive approach to solving these problems. This reinforces the notion that current AI models remain limited in their ability to solve advanced mathematical problems consistently.


Background

The American Invitational Mathematics Examination (AIME) is a prestigious, invite-only mathematics competition for high-school students who score in the top 5% of the AMC 12 mathematics exam. Each test consists of 15 questions of increasing difficulty, and the answer to every question is a single integer from 0 to 999. The median score has historically been between 4 and 6 correct answers out of the 15 possible. Two versions of the test are given every year (thirty questions total). You can view the questions from previous years on the AIME website.

This examination serves as a crucial gateway for students aiming to qualify for the USA Mathematical Olympiad (USAMO). In general, the test is extremely challenging, and covers a wide range of mathematical topics, including algebra, geometry, and number theory.

The results clearly illustrate that no current model has yet mastered this benchmark, although several achieve strong performance.


Methodology

For this benchmark, we used the sixty questions from the 2024 and 2025 editions of the test (two versions per year, fifteen questions each), modelling our approach after the repository from the GAIR NLP Lab.

To minimize parsing errors, we instructed the models with the following prompt template.

Please reason step by step, and put your final answer within \boxed{}

{Question}

The answer was then extracted from the boxed section and compared to the ground truth.

Although a few questions included an image or diagram, all of the information needed to solve the problem was present in the question text, so we did not include these images.

Reducing variance

Given the small size of this benchmark, we ran each model 8 times on both AIME 2024 and AIME 2025 to reduce variance. We averaged the pass@1 performance across all runs for each model.
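The averaging described above amounts to computing pass@1 for each run and taking the mean across runs. A minimal sketch, with a function name of our choosing rather than anything from the benchmark code:

```python
def mean_pass_at_1(run_results: list[list[bool]]) -> float:
    """Average pass@1 over repeated runs.

    run_results[i][j] is True iff the model answered question j
    correctly on run i (e.g. 8 runs over the 60 AIME questions).
    """
    per_run = [sum(run) / len(run) for run in run_results]
    return sum(per_run) / len(per_run)


# Toy example with 2 runs over 4 questions:
# run 1 scores 3/4 = 0.75, run 2 scores 2/4 = 0.5, mean = 0.625
score = mean_pass_at_1([
    [True, True, False, True],
    [True, False, False, True],
])
```

Because every run covers the same question set, averaging per-run accuracies is equivalent to pooling all question attempts, but reporting it per run makes the run-to-run variance easy to inspect as well.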


Highest-Performing Models

The top-performing model on this benchmark is Grok 4, which achieved a remarkable 90.6% accuracy. o3 Mini and Gemini 2.5 Pro Exp follow in second and third place, respectively.

Grok 4

Grok 4

Release date: 7/9/2025

Accuracy :

90.6%

Latency :

133.2s

Cost :

$3.00 / $15.00

  • Grok 4 achieved the highest accuracy on the benchmark at 90.6%.
  • This model set a new State-of-the-Art for AIME, outperforming all other models by a significant margin.
  • The cost per run is relatively high ($3.00 in / $15.00 out), and latency is also substantial at 133.2s.



o3 Mini

o3 Mini

Release date: 1/31/2025

Accuracy :

86.5%

Latency :

154.6s

Cost :

$1.10 / $4.40

  • Ranked second with 86.5% accuracy.
  • Despite being a Mini model, it still had a noticeably high latency (154.6s) and a moderate cost ($1.10 in / $4.40 out).


