MATH 500 Benchmark

| Rank | Provider | Model | Accuracy | Cost (input / output, per 1M tokens) | Latency |
|---|---|---|---|---|---|
| 1 | Google | Gemini 2.5 Pro Exp | 95.2% | $1.25 / $10.00 | 25.83 s |
| 2 | OpenAI | o3 | 94.6% | $10.00 / $40.00 | 16.59 s |
| 3 | xAI | Grok 3 Mini Fast Beta High Reasoning | 94.2% | $0.60 / $4.00 | 22.77 s |
| 4 | OpenAI | o4 Mini | 94.2% | $1.10 / $4.40 | 12.54 s |
| 5 | DeepSeek | DeepSeek R1 | 92.2% | $8.00 / $8.00 | 156.47 s |
| 6 | OpenAI | o3 Mini | 91.8% | $1.10 / $4.40 | 14.36 s |
| 7 | Anthropic | Claude 3.7 Sonnet (Thinking) | 91.6% | $3.00 / $15.00 | 94.24 s |
| 8 | OpenAI | o1 | 90.4% | $15.00 / $60.00 | 23.54 s |
| 9 | xAI | Grok 3 Beta | 89.8% | $3.00 / $15.00 | 9.09 s |
| 10 | DeepSeek | DeepSeek V3 (03/24/2025) | 88.6% | $1.20 / $1.20 | 12.54 s |
| 11 | Google | Gemini 2.0 Flash (001) | 88.0% | $0.10 / $0.40 | 3.37 s |
| 12 | OpenAI | GPT 4.1 mini | 88.0% | $0.40 / $1.60 | 5.76 s |
| 13 | OpenAI | GPT 4.1 | 87.2% | $2.00 / $8.00 | 12.53 s |
| 14 | Meta | Llama 4 Maverick | 85.2% | $0.27 / $0.85 | 7.47 s |
| 15 | Google | Gemini 2.0 Flash Thinking Exp | 84.6% | $0.10 / $0.70 | 11.80 s |
| 16 | Google | Gemini 1.5 Pro (002) | 82.8% | $1.25 / $5.00 | 5.02 s |
| 17 | DeepSeek | DeepSeek V3 | 80.4% | $0.90 / $0.90 | 7.94 s |
| 18 | OpenAI | GPT 4.1 nano | 80.2% | $0.10 / $0.40 | 3.37 s |
| 19 | Meta | Llama 4 Scout | 79.2% | $0.18 / $0.59 | 10.52 s |
| 20 | Google | Gemini 1.5 Flash (002) | 78.8% | $0.07 / $0.30 | 2.65 s |
| 21 | xAI | Grok 2 | 78.4% | $2.00 / $10.00 | 20.44 s |
| 22 | Anthropic | Claude 3.7 Sonnet | 76.8% | $3.00 / $15.00 | 5.53 s |
| 23 | Cohere | Command A | 76.2% | $2.50 / $10.00 | 8.66 s |
| 24 | OpenAI | GPT 4o (2024-08-06) | 75.2% | $2.50 / $10.00 | 12.29 s |
| 25 | Mistral | Mistral Large (11/2024) | 74.4% | $2.00 / $6.00 | 9.93 s |
| 26 | OpenAI | GPT 4o (2024-11-20) | 74.0% | $2.50 / $10.00 | 12.80 s |
| 27 | Meta | Llama 3.3 Instruct Turbo (70B) | 73.4% | $0.88 / $0.88 | 5.41 s |
| 28 | OpenAI | GPT 4o Mini | 72.6% | $0.15 / $0.60 | 6.30 s |
| 29 | Anthropic | Claude 3.5 Sonnet Latest | 72.4% | $3.00 / $15.00 | 4.63 s |
| 30 | Meta | Llama 3.1 Instruct Turbo (405B) | 71.4% | $3.50 / $3.50 | 45.30 s |
| 31 | Mistral | Mistral Small (02/2024) | 70.6% | $0.20 / $0.60 | 4.89 s |
| 32 | xAI | Grok 3 Mini Fast Beta Low Reasoning | 70.2% | $0.60 / $4.00 | 9.89 s |
| 33 | Mistral | Mistral Small 3.1 (03/2025) | 68.4% | $0.07 / $0.30 | 6.41 s |
| 34 | Meta | Llama 3.1 Instruct Turbo (70B) | 65.0% | $0.88 / $0.88 | 9.20 s |
| 35 | Anthropic | Claude 3.5 Haiku Latest | 64.2% | $1.00 / $5.00 | 5.13 s |
| 36 | AI21 Labs | Jamba 1.6 Large | 54.8% | $2.00 / $8.00 | 13.01 s |
| 37 | Meta | Llama 3.1 Instruct Turbo (8B) | 44.4% | $0.18 / $0.18 | 5.84 s |
| 38 | AI21 Labs | Jamba 1.6 Mini | 25.4% | $0.20 / $0.40 | 4.86 s |

Key Takeaways

  • Gemini 2.5 Pro Exp is the best performing model on this benchmark by a significant margin, achieving 95.2% accuracy.
  • o3 takes second place with 94.6% accuracy, narrowly beating out Grok 3 Mini Fast Beta High Reasoning.
  • With the best models scoring above 90% accuracy, we may soon reach the limit of the capability differences this benchmark can measure. Since the questions are public, it may be time to create a new, private math benchmark that is not part of any pre-training corpus.

Methodology

This benchmark is an adaptation of the MATH benchmark, first published in Measuring Mathematical Problem Solving With the MATH Dataset. MATH scores are commonly reported alongside new model releases.

We sample 500 diverse problems from this benchmark - spanning topics like probability, algebra, trigonometry, and geometry. The questions are designed to test a model’s ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly.
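
As a concrete illustration, the sampling step might look like the sketch below; the file path, field names, and seed are assumptions for this example, not the benchmark's actual code.

```python
import json
import random

# Hypothetical sketch: draw a fixed, seeded sample of 500 problems from a
# local copy of the MATH test set. Path and field names are illustrative.
with open("math_test.jsonl") as f:
    problems = [json.loads(line) for line in f]

rng = random.Random(42)                # a fixed seed keeps the 500-problem set stable
sample = rng.sample(problems, 500)
print({p["subject"] for p in sample})  # e.g. algebra, geometry, probability
```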

Unlike the original paper, which fine-tuned models to produce LaTeX output, we used the following prompt template to ensure the models produce outputs in the correct format.

Answer the following math question, given in LaTeX format, clearly and concisely, and present the final answer as \(\boxed{x}\), where x is the fully simplified solution.

Example:
**Question:** \(\int_0^1 (3x^2 + 2x) \,dx\)
**Solution:** \(\int (3x^2 + 2x) \,dx = x^3 + x^2 + C\) Evaluating from 0 to 1: \((1^3 + 1^2) - (0^3 + 0^2) = 1 + 1 - 0 = 2 \boxed{2}\)

Now, solve the following question: {question}
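
Concretely, each problem is substituted into the {question} slot before being sent to a model. A minimal sketch, where PROMPT_TEMPLATE abbreviates the prompt above and the final call to a provider SDK is omitted:

```python
# Minimal sketch of filling the template; PROMPT_TEMPLATE abbreviates the
# prompt shown above (the worked example is elided for brevity).
PROMPT_TEMPLATE = (
    "Answer the following math question, given in LaTeX format, clearly and "
    "concisely, and present the final answer as \\(\\boxed{{x}}\\), where x "
    "is the fully simplified solution.\n\n"
    "Now, solve the following question: {question}"
)

def build_prompt(question: str) -> str:
    return PROMPT_TEMPLATE.format(question=question)

prompt = build_prompt(r"Evaluate \(\int_0^1 (3x^2 + 2x)\,dx\).")
```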

We also used the parsing logic from the PRM800K grader. We found it much more reliable at extracting and evaluating the model's output - it was robust to differing formats and mathematical formulations, compared to the parsing logic from the original MATH paper.
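
The core of that extraction step is pulling out the contents of the final \boxed{...} while respecting nested braces. A simplified sketch of the idea (not the PRM800K grader's actual code):

```python
def extract_boxed(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...}, handling nested
    braces such as \\boxed{\\frac{1}{2}}. Returns None if unbalanced."""
    marker = r"\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in completion[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces

assert extract_boxed(r"so the answer is \boxed{\frac{1}{2}}.") == r"\frac{1}{2}"
```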

All models were evaluated with temperature set to 0, except for reasoning models that constrain it: Claude 3.7 Sonnet in thinking mode requires a temperature of 1, and o1 does not accept a temperature parameter.
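
In code, that exception handling might look like the sketch below; the model-name checks are illustrative assumptions, not our exact identifiers.

```python
# Hypothetical per-model sampling settings reflecting the constraints above.
def sampling_params(model: str, thinking: bool = False) -> dict:
    if model.startswith("o1"):
        return {}                    # o1 does not accept a temperature parameter
    if model.startswith("claude-3-7") and thinking:
        return {"temperature": 1.0}  # extended thinking requires temperature 1
    return {"temperature": 0.0}      # default for all other models
```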


Highest Performing Models

Gemini 2.5 Pro Exp

Release date: 3/25/2025
Accuracy: 95.2%
Latency: 25.8 s
Cost: $1.25 / $10.00 (input / output, per 1M tokens)

  • Gemini 2.5 Pro Exp was the best performing model on this benchmark, achieving 95.2% accuracy.
  • That said, this is only a marginal improvement over similar models such as o3 and Grok 3.



Results

We display the accuracy versus the price of each model.

[Figure: MATH 500 accuracy vs. price per model]
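
A minimal version of this chart can be reproduced from the leaderboard numbers; the sketch below plots three representative models, with input price on a log scale.

```python
import matplotlib.pyplot as plt

# Three representative (input price per 1M tokens, accuracy) points
# taken from the leaderboard above.
points = {
    "Gemini 2.0 Flash (001)": (0.10, 88.0),
    "o3 Mini": (1.10, 91.8),
    "o3": (10.00, 94.6),
}
for name, (price, acc) in points.items():
    plt.scatter(price, acc)
    plt.annotate(name, (price, acc))
plt.xscale("log")
plt.xlabel("Input price ($ per 1M tokens)")
plt.ylabel("MATH 500 accuracy (%)")
plt.show()
```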

In general, the models performed well on the MATH benchmark - the top models solved over 90% of questions correctly. As this benchmark is public and has been available for several years, it raises the question of whether models have been pre-trained on the test data.

Some models performed similarly to others but were much more verbose. For example, DeepSeek R1 used 26,242 characters on one problem. In contrast, o3 Mini solved it in just around 10,000 characters (about 2,700 tokens).

Here's a glimpse of o3 Mini's concise approach (post-reasoning):

We are given three vertices of a parallelogram:
 A = (5, 3), B = (6, 8), C = (7, 4),
and we want to find the fourth vertex D = (x, y) with the extra condition that x > 7.

A key property of a parallelogram is that its diagonals bisect each other. This means the midpoints of the two diagonals are the same.

There are several ways to assign the vertices as opposite pairs, but only one choice will yield x > 7. Let's determine D by pairing B and C as opposite vertices and, consequently, A and D as the other pair.

1. Compute the midpoint of diagonal joining B and C:
 Midpoint of BC = ((6 + 7)/2, (8 + 4)/2) = (13/2, 12/2) = (6.5, 6).

2. Since A and D are the opposite vertices of the other diagonal, their midpoint must equal (6.5, 6):
 Midpoint of AD = ((5 + x)/2, (3 + y)/2) = (6.5, 6).

3. Set up the equations:
 (5 + x)/2 = 6.5  ⟹  5 + x = 13  ⟹  x = 8,
 (3 + y)/2 = 6   ⟹  3 + y = 12  ⟹  y = 9.

Thus, D = (8, 9) and indeed x = 8 is greater than 7.

Finally, we find x + y:
 x + y = 8 + 9 = 17.

The final answer is \boxed{17}.
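
The arithmetic is easy to verify: since the diagonals bisect each other, B + C = A + D, so D = B + C - A componentwise. A two-line check:

```python
# Verify the parallelogram solution: midpoint(B, C) == midpoint(A, D),
# which is equivalent to D = B + C - A componentwise.
A, B, C = (5, 3), (6, 8), (7, 4)
D = (B[0] + C[0] - A[0], B[1] + C[1] - A[1])
assert D == (8, 9) and D[0] > 7
print(D[0] + D[1])  # 17
```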

Despite o3 Mini reportedly having far fewer parameters than DeepSeek R1 (200 billion vs. 671 billion), its efficiency highlights that more parameters don't always equate to higher-quality answers.

Example

We provide an example question from the dataset, along with the outputs from each model. Each model chose a different path to solve the same problem, and you can follow the logical deductions made between each step of their attempts.

Question:

Compute \(\frac{1}{\cos^2 10^\circ} + \frac{1}{\sin^2 20^\circ} + \frac{1}{\sin^2 40^\circ}\)

Response:

We wish to simplify
S = 1⁄cos²10° + 1⁄sin²20° + 1⁄sin²40°.
Notice that cos 10° = sin 80° (since sin (90°–θ) = cos θ). Thus, we can rewrite the first term as

1⁄cos²10° = 1⁄sin²80°.

So the sum becomes

S = 1⁄sin²20° + 1⁄sin²40° + 1⁄sin²80°.

A useful identity is
csc²θ = 1 + cot²θ,
so we may write
S = (1 + cot²20°) + (1 + cot²40°) + (1 + cot²80°) = 3 + (cot²20° + cot²40° + cot²80°).

A numerical check (or a derivation using product–sum identities) shows that
cot²20° + cot²40° + cot²80° = 8.
(For example, one may verify using a calculator that cot 20° ≈ 2.7475 so its square is ≈7.547, cot 40° ≈ 1.1918 (square ≈1.420), and cot 80° ≈ 0.1763 (square ≈0.031) whose sum is approximately 8.)

Thus, we have
S = 3 + 8 = 11.

Therefore, the fully simplified solution is
\boxed{11}.

INCORRECT
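
The slip is in the response's own "numerical check": recomputing it (our addition, not part of the model's output) gives a cotangent-squared sum of about 9, not 8, so the correct value is 3 + 9 = 12 rather than 11.

```python
import math

# Recompute the model's "numerical check": cot²20° + cot²40° + cot²80°.
def cot2(deg: float) -> float:
    rad = math.radians(deg)
    return (math.cos(rad) / math.sin(rad)) ** 2

total = cot2(20) + cot2(40) + cot2(80)
print(round(total, 4))      # 9.0, not 8 as the response claimed
print(round(3 + total, 4))  # 12.0, the correct answer (the model gave 11)
```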
