Key Takeaways
- Gemini 2.5 Pro Exp is the best performing model by a significant margin on this benchmark, achieving a 95.2% accuracy.
- o3 takes second place with 94.6% accuracy, narrowly beating out Grok 3 Mini Fast Beta High Reasoning.
- With the best models performing at over 90% accuracy, we may soon reach the limit of what this benchmark can distinguish between model capabilities. Since the questions are public, it may be time to create a new, private math benchmark that is not part of any pre-training corpus.
Methodology
This benchmark is an adaptation of the MATH benchmark, first published in "Measuring Mathematical Problem Solving With the MATH Dataset". The MATH benchmark is commonly reported when new models are released.
We sample 500 diverse problems from this benchmark - spanning topics like probability, algebra, trigonometry, and geometry. The questions are designed to test a model’s ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly.
Unlike the original paper, which fine-tuned models to produce LaTeX output, we used the following prompt template to ensure the models produce outputs in the correct format.
Answer the following math question, given in LaTeX format, clearly and concisely, and present the final answer as \(\boxed{x}\), where x is the fully simplified solution.
Example:
**Question:** \(\int_0^1 (3x^2 + 2x) \,dx\)
**Solution:** \(\int (3x^2 + 2x) \,dx = x^3 + x^2 + C\) Evaluating from 0 to 1: \((1^3 + 1^2) - (0^3 + 0^2) = 1 + 1 - 0 = 2 \boxed{2}\)
Now, solve the following question: {question}
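The post does not show the evaluation harness itself; a minimal sketch of filling the template could look like the following (hypothetical harness code; `PROMPT_TEMPLATE` abbreviates the full template shown above):

```python
# Hypothetical sketch: building the evaluation prompt from the template above.
# PROMPT_TEMPLATE abbreviates the full template; {{x}} escapes literal braces
# so that str.format only substitutes the {question} slot.
PROMPT_TEMPLATE = (
    "Answer the following math question, given in LaTeX format, clearly and "
    "concisely, and present the final answer as \\(\\boxed{{x}}\\), where x is "
    "the fully simplified solution.\n\n"
    "Now, solve the following question: {question}"
)

def build_prompt(question: str) -> str:
    """Insert one benchmark question into the prompt template."""
    return PROMPT_TEMPLATE.format(question=question)
```

The same prompt is sent to every model, so differences in accuracy reflect the models rather than the prompting.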
We also used the parsing logic from the PRM800K dataset grader. We found it much more reliable at extracting and evaluating model output: compared to the parsing logic from the original MATH paper, it is robust to differing formats and mathematical formulations.
All models were evaluated with temperature set to 0, except for reasoning models that force a particular temperature (for example, Claude 3.7 requires a temperature of 1, and o1 does not accept a temperature parameter at all).
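These per-model exceptions can be captured in a small lookup; the following is a hypothetical sketch (the model identifiers are illustrative, not the exact API names used in the benchmark):

```python
# Hypothetical per-model sampling settings reflecting the rules above.
FORCED_TEMPERATURE = {"claude-3-7-sonnet": 1.0}  # models that fix their temperature
NO_TEMPERATURE = {"o1"}                          # models that reject the parameter

def sampling_params(model: str) -> dict:
    """Return the sampling kwargs for a given model: temperature 0 by default,
    a forced value where required, or nothing if unsupported."""
    if model in NO_TEMPERATURE:
        return {}
    return {"temperature": FORCED_TEMPERATURE.get(model, 0.0)}
```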
Highest Performing Models
Gemini 2.5 Pro Exp
Release date: 3/25/2025
Accuracy: 95.2%
Latency: 25.8s
Cost: $1.25 / $10.00 (input / output, per 1M tokens)
- Gemini 2.5 Pro Exp was the best performing model on this benchmark, achieving a 95.2% accuracy.
- That being said, this is only a marginal improvement over other, similar models such as o3 and Grok 3.
Results
We display the accuracy versus the price of each model.
In general, the models performed well on the MATH benchmark: the top models solved over 90% of the questions correctly. Because this benchmark is public and has been out for several years, it raises the question of whether the models have been pre-trained on the test data.
Some models performed similarly to others but were much more verbose. For example, DeepSeek R1 used 26,242 characters for one problem, while o3 mini solved it in around 10,000 characters (roughly 2,700 tokens).
Here’s a glimpse of o3 mini’s concise approach (post-reasoning):
We are given three vertices of a parallelogram:
A = (5, 3), B = (6, 8), C = (7, 4),
and we want to find the fourth vertex D = (x, y) with the extra condition that x > 7.
A key property of a parallelogram is that its diagonals bisect each other. This means the midpoints of the two diagonals are the same.
There are several ways to assign the vertices as opposite pairs, but only one choice will yield x > 7. Let's determine D by pairing B and C as opposite vertices and, consequently, A and D as the other pair.
1. Compute the midpoint of diagonal joining B and C:
Midpoint of BC = ((6 + 7)/2, (8 + 4)/2) = (13/2, 12/2) = (6.5, 6).
2. Since A and D are the opposite vertices of the other diagonal, their midpoint must equal (6.5, 6):
Midpoint of AD = ((5 + x)/2, (3 + y)/2) = (6.5, 6).
3. Set up the equations:
(5 + x)/2 = 6.5 ⟹ 5 + x = 13 ⟹ x = 8,
(3 + y)/2 = 6 ⟹ 3 + y = 12 ⟹ y = 9.
Thus, D = (8, 9) and indeed x = 8 is greater than 7.
Finally, we find x + y:
x + y = 8 + 9 = 17.
The final answer is \boxed{17}.
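The midpoint argument in o3 mini's solution is easy to check numerically; here is a small verification sketch:

```python
# Verify the parallelogram solution above: the diagonals bisect each other,
# so midpoint(B, C) must equal midpoint(A, D).
A, B, C = (5, 3), (6, 8), (7, 4)

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

m = midpoint(B, C)                       # (6.5, 6.0)
# Solve midpoint(A, D) = m for D: D = 2m - A.
D = (2 * m[0] - A[0], 2 * m[1] - A[1])
print(D, D[0] + D[1])                    # (8.0, 9.0) 17.0
```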
Despite o3 mini reportedly having fewer parameters (200 billion vs. 671 billion for DeepSeek R1), its efficiency highlights that more parameters don't always equate to higher-quality answers.
Example
We provide an example question from the dataset, along with each model's output. Each model chose a different path to solve the same problem, and you can follow the logical deductions the models made between the steps of their attempts.
Q
Compute \(\frac{1}{\cos^2 10^\circ} + \frac{1}{\sin^2 20^\circ} + \frac{1}{\sin^2 40^\circ}\)
A
We wish to simplify
S = 1⁄cos²10° + 1⁄sin²20° + 1⁄sin²40°.
Notice that cos 10° = sin 80° (since sin (90°–θ) = cos θ). Thus, we can rewrite the first term as
1⁄cos²10° = 1⁄sin²80°.
So the sum becomes
S = 1⁄sin²20° + 1⁄sin²40° + 1⁄sin²80°.
A useful identity is
csc²θ = 1 + cot²θ,
so we may write
S = [1 + cot²20°] + [1 + cot²40°] + [1 + cot²80°] = 3 + (cot²20° + cot²40° + cot²80°).
A numerical check (or a derivation using product–sum identities) shows that
cot²20° + cot²40° + cot²80° = 8.
(For example, one may verify using a calculator that cot 20° ≈ 2.7475 so its square is ≈7.547, cot 40° ≈ 1.1918 (square ≈1.420), and cot 80° ≈ 0.1763 (square ≈0.031) whose sum is approximately 8.)
Thus, we have
S = 3 + 8 = 11.
Therefore, the fully simplified solution is
\boxed{11}.
INCORRECT
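A quick numerical check confirms the grader's verdict: the model's cotangent sum is actually about 9, not 8 (its own quoted values, 7.547 + 1.420 + 0.031, sum to roughly 9), so the expression evaluates to 12 rather than 11:

```python
import math

deg = math.radians

# Evaluate the original expression and the model's intermediate cotangent sum.
total = (1 / math.cos(deg(10)) ** 2
         + 1 / math.sin(deg(20)) ** 2
         + 1 / math.sin(deg(40)) ** 2)
cot_sum = sum(1 / math.tan(deg(a)) ** 2 for a in (20, 40, 80))

print(round(total, 6), round(cot_sum, 6))  # 12.0 9.0
```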