Key Takeaways
- GPT-5 is the new top-performing model on AIME, at 93.4% accuracy.
- Nine of the top ten models are reasoning models, underscoring the efficacy of reasoning for difficult math and coding problems.
- Many of the questions are extremely resource-intensive for top models: using more than 30,000 reasoning tokens on a single question is common, and generation often takes upwards of 15 minutes for the hardest questions.
- As the AIME questions and answers are publicly available, there is a risk that models may have been exposed to them during pretraining. Notably, models tend to perform better on the older 2024 questions than on the newer 2025 set, raising questions about data contamination and true generalization.
Results
To visualize the performance differences among models, we provide the following scatter plots illustrating accuracy versus latency and price.
[Scatter plots: accuracy vs. latency and accuracy vs. price]
Background
The American Invitational Mathematics Examination (AIME) is a prestigious, invite-only mathematics competition for high-school students who score in the top 5% on the AMC 12 mathematics exam. It consists of 15 questions of increasing difficulty, and the answer to every question is a single integer from 0 to 999. The median score is historically between 4 and 6 questions correct out of the 15 possible. Two versions of the test are given every year (thirty questions total). You can view the questions from previous years on the AIME website.
This examination serves as a crucial gateway for students aiming to qualify for the USA Mathematical Olympiad (USAMO). The test is extremely challenging and covers a wide range of mathematical topics, including algebra, geometry, and number theory.
The results clearly illustrate that no current model has yet mastered this benchmark, although several achieve strong performance.
Methodology
For this benchmark, we used the thirty questions from each of the 2024 and 2025 exams (sixty questions total), modeling our approach after the repository from the GAIR NLP Lab.
To minimize parsing errors, we instructed the models with the following prompt template:
Please reason step by step, and put your final answer within \boxed{}
{Question}
The answer was then extracted from the boxed section and compared to the ground truth.
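To make this step concrete, here is a minimal sketch of how the boxed answer could be parsed and scored. The function names are ours, and the actual extraction logic in the GAIR NLP Lab repository may handle additional edge cases (for example, nested braces) differently.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    # Find every \boxed{...} in the response and keep the last one,
    # since the final answer typically appears at the end of the reasoning.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def is_correct(response: str, ground_truth: int) -> bool:
    # AIME answers are integers from 0 to 999, so compare as integers
    # (this also tolerates leading zeros such as "073").
    answer = extract_boxed_answer(response)
    try:
        return answer is not None and int(answer) == ground_truth
    except ValueError:
        return False
```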
Although a few questions included an image or diagram, all of the information needed to solve the problem was present in the question text, so we did not include these images.
Reducing Variance
Given the small size of this benchmark, we ran each model eight times on both AIME 2024 and AIME 2025 to reduce variance, and averaged the pass@1 performance across all runs for each model.
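As a sketch of the averaging step, assuming each run yields a list of per-question correct/incorrect flags (the data structure and function name below are ours, not taken from the repository):

```python
from statistics import mean

def average_pass_at_1(run_results: list[list[bool]]) -> float:
    # Pass@1 for a single run is the fraction of questions answered
    # correctly; the reported score is the mean of that fraction
    # across the eight runs.
    per_run_accuracy = [mean(run) for run in run_results]
    return mean(per_run_accuracy)
```

When every run covers the same set of questions, averaging per-run accuracies is equivalent to pooling all graded attempts and taking the overall fraction correct.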