Key Takeaways
- o4 Mini leads the leaderboard on MGSM, demonstrating superior multilingual mathematical reasoning.
- Most evaluated models achieve high accuracy, and their scores are tightly clustered with only marginal differences, suggesting the benchmark is approaching saturation.
- Claude 3.7 Sonnet Thinking and DeepSeek V3 also perform well.
- All models perform better on the English version of the benchmark, underscoring the influence of English-dominated pre-training data.
Background
The Multilingual Grade School Math Benchmark (MGSM) is an academic evaluation benchmark designed to assess the ability of language models to solve grade-school math problems in multiple languages. Derived from the well-known GSM8K dataset—which consists of 8.5K high-quality, diverse math word problems—the MGSM benchmark features a subset of 250 problems that have been carefully translated by human annotators into 10 typologically diverse languages (including underrepresented languages such as Bengali, Telugu, and Swahili). MGSM scores are commonly reported by model providers alongside new model releases.
Introduced in the paper Language Models are Multilingual Chain-of-Thought Reasoners by Shi et al. (2022) and supported by subsequent research on multilingual evaluation, the dataset not only measures a model’s numerical and reasoning capabilities but also its proficiency in processing linguistic variations across different scripts and cultural contexts.
Methodology
To ensure reproducible and fair comparisons, all models were evaluated on MGSM using the grading scripts from OpenAI's simple-evals GitHub repository. This keeps the prompt format and testing environment consistent across all languages.
We used the following prompt template to query each model:
```
Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".

{Question}
```
For non-English evaluations, the prompt template was translated into the target language while preserving this structure. All models were tested with a temperature setting of 0.
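To make the grading step concrete, the sketch below shows how an evaluation loop in this style can be assembled: format the template above, collect a completion (assumed to be sampled at temperature 0), and exact-match the integer on the final "Answer:" line. This is a minimal sketch rather than simple-evals' actual code; the names `query_model`, `extract_answer`, and `ANSWER_PATTERN` are illustrative assumptions, not the repository's API.

```python
import re

# Minimal sketch of MGSM-style exact-match grading in the spirit of
# OpenAI's simple-evals. The names below are illustrative assumptions,
# not the repository's actual API.

PROMPT_TEMPLATE = (
    'Solve this math problem. Give the reasoning steps before giving the '
    'final answer on the last line by itself in the format of "Answer:". '
    'Do not add anything other than the integer answer after "Answer:".'
    "\n\n{question}"
)

# An optionally negative, possibly comma-separated integer after "Answer:".
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*(-?[\d,]+)")

def extract_answer(completion: str) -> str | None:
    """Return the integer following the last 'Answer:' in the completion."""
    matches = ANSWER_PATTERN.findall(completion)
    # The prompt requires the answer on the final line, so keep the last
    # match if the model emitted several.
    return matches[-1].replace(",", "") if matches else None

def is_correct(completion: str, target: str) -> bool:
    predicted = extract_answer(completion)
    return predicted is not None and predicted == target.replace(",", "")

def evaluate(examples, query_model) -> float:
    """Accuracy over (question, target) pairs. query_model is any callable
    that sends a prompt to a model (here assumed to run at temperature 0)."""
    correct = sum(
        is_correct(query_model(PROMPT_TEMPLATE.format(question=q)), target)
        for q, target in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    # Stub model for demonstration only.
    fake_model = lambda prompt: "3 + 4 = 7\nAnswer: 7"
    print(evaluate([("What is 3 + 4?", "7")], fake_model))  # -> 1.0
```

Taking the last regex match mirrors the prompt's requirement that the answer appear alone on the final line, so numbers appearing earlier in the reasoning steps are ignored.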
Highest-Performing Models
o4 Mini
- Release date: 4/16/2025
- Accuracy: 93.4%
- Latency: 7.5s
- Cost: $1.10 / $4.40 (input / output, per 1M tokens)
- Exhibited robust performance across all languages, including less-resourced ones such as Telugu.
- Was notably cheaper than peers such as Claude 3.7 Sonnet.
Results
The evaluation of MGSM reveals the following:
- Among the top models, performance is very tightly clustered, with only marginal differences between them.
- Cheaper models perform comparably to more expensive ones, suggesting that premium models are not necessarily better for problems of this difficulty.
- Despite high overall accuracy, all models drop noticeably on the non-English languages. Performance was lowest in Bengali, where the best result was 90.4% accuracy.
These results indicate that while state-of-the-art models excel in mathematical reasoning, language-specific performance discrepancies persist, likely due to the imbalance in training data across languages. For users seeking a model with strong multilingual mathematical capabilities, the benchmark provides a range of well-rounded options.
The high scores also suggest that models are approaching saturation on this benchmark, and it may soon be unable to distinguish between models' multilingual and mathematical reasoning capabilities.