Key Takeaways
- o4 Mini leads the leaderboard on MGSM, demonstrating superior multilingual mathematical reasoning.
- Most evaluated models achieve high accuracy, and their scores are tightly clustered with only marginal differences, suggesting the benchmark is approaching saturation.
- Claude 3.7 Sonnet Thinking and DeepSeek V3 also perform well.
- All models perform better on the English version of the benchmark, underscoring the influence of English-dominated pre-training data.
Background
The Multilingual Grade School Math Benchmark (MGSM) is an academic evaluation benchmark designed to assess the ability of language models to solve grade-school math problems in multiple languages. Derived from the well-known GSM8K dataset—which consists of 8.5K high-quality, diverse math word problems—the MGSM benchmark features a subset of 250 problems that have been carefully translated by human annotators into 10 typologically diverse languages (including underrepresented languages such as Bengali, Telugu, and Swahili). MGSM scores are commonly reported by model providers alongside new model releases.
Introduced in the paper Language Models are Multilingual Chain-of-Thought Reasoners by Shi et al. (2022) and supported by subsequent research on multilingual evaluation, the dataset not only measures a model’s numerical and reasoning capabilities but also its proficiency in processing linguistic variations across different scripts and cultural contexts.
Methodology
To ensure reproducible and fair comparisons, all models were evaluated on MGSM using the grading scripts from OpenAI's simple-evals GitHub repository. This keeps the prompt format and testing environment consistent across all languages.
We used the following prompt template to query each model:
```
Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".

{Question}
```
For non-English evaluations, the prompt template was translated into the target language while preserving this structure. All models were tested with a temperature setting of 0.
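To make the grading step concrete, the sketch below shows how an evaluation loop in this style can be assembled: format the template above, collect a completion (assumed to be sampled at temperature 0), and exact-match the integer on the final "Answer:" line. This is a minimal sketch rather than simple-evals' actual code; the names `query_model`, `extract_answer`, and `ANSWER_PATTERN` are illustrative assumptions, not the repository's API.

```python
import re

# Minimal sketch of MGSM-style exact-match grading in the spirit of
# OpenAI's simple-evals. The names below are illustrative assumptions,
# not the repository's actual API.

PROMPT_TEMPLATE = (
    'Solve this math problem. Give the reasoning steps before giving the '
    'final answer on the last line by itself in the format of "Answer:". '
    'Do not add anything other than the integer answer after "Answer:".'
    "\n\n{question}"
)

# An optionally negative, possibly comma-separated integer after "Answer:".
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*(-?[\d,]+)")

def extract_answer(completion: str) -> str | None:
    """Return the integer following the last 'Answer:' in the completion."""
    matches = ANSWER_PATTERN.findall(completion)
    # The prompt requires the answer on the final line, so keep the last
    # match if the model emitted several.
    return matches[-1].replace(",", "") if matches else None

def is_correct(completion: str, target: str) -> bool:
    predicted = extract_answer(completion)
    return predicted is not None and predicted == target.replace(",", "")

def evaluate(examples, query_model) -> float:
    """Accuracy over (question, target) pairs. query_model is any callable
    that sends a prompt to a model (here assumed to run at temperature 0)."""
    correct = sum(
        is_correct(query_model(PROMPT_TEMPLATE.format(question=q)), target)
        for q, target in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    # Stub model for demonstration only.
    fake_model = lambda prompt: "3 + 4 = 7\nAnswer: 7"
    print(evaluate([("What is 3 + 4?", "7")], fake_model))  # -> 1.0
```

Taking the last regex match mirrors the prompt's requirement that the answer appear alone on the final line, so numbers appearing earlier in the reasoning steps are ignored.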
Highest-Performing Models
o4 Mini
- Release date: 4/16/2025
- Accuracy: 93.4%
- Latency: 7.5s
- Cost: $1.10 / $4.40 (input / output, per 1M tokens)
- Exhibited robust performance across all languages, including less-resourced ones such as Telugu.
- Was notably cheaper than peers such as Claude 3.7 Sonnet.
Results
The evaluation of MGSM reveals the following:
- Among the top models, performance is very tightly clustered, with only marginal differences between them.
- Cheaper models perform comparably to more expensive ones, suggesting that premium models are not necessarily better for problems of this difficulty.
- Despite high overall accuracy, all models drop noticeably on the non-English languages. Performance was lowest in Bengali, where the best result was 90.4% accuracy.
These results indicate that while state-of-the-art models excel in mathematical reasoning, language-specific performance discrepancies persist, likely due to the imbalance in training data across languages. For users seeking a model with strong multilingual mathematical capabilities, the benchmark provides a range of well-rounded options.
The high scores also suggest that models are approaching saturation on this benchmark, and it may soon be unable to distinguish between models' multilingual and mathematical reasoning capabilities.