We evaluated Gemini 3.1 Pro Preview (02/26) across our full benchmark suite. Here are the key takeaways:
- It is the 3rd best model on both the Vals Index and the Vals Multimodal Index.
- It ranks first on several of our benchmarks, including AIME, GPQA, Live Code Bench, Terminal Bench 2, LegalBench, MMMU, MMLU Pro, and MedCode.
- The model shows a dramatic improvement over Gemini 3 Pro (11/25) on Case Law (v2), jumping from rank #50 (53.4% accuracy) to rank #11 (65.6% accuracy), a gain of 12.2 percentage points.
- It performs slightly worse than Gemini 3 Pro (11/25) on SWE-bench, achieving 69.6% accuracy.
- It also scores 59.72 on our Finance Agent, placing it 3rd, ahead of both Gemini 3 Pro (11/25) and GPT 5.2.
One notable point to call out is that the model achieves this performance at a lower cost than models like Claude Opus 4.6 (Thinking), Claude Sonnet 4.6, GPT 5.2, and o3.
Evaluations were run with a temperature of 1.0 and a “high” thinking level, via the official Google API.
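For readers who want to reproduce a single call under the same settings, here is a minimal sketch using the google-genai Python SDK. The model ID, the `thinking_level` field name, and the prompt are assumptions for illustration and may not match the exact preview identifiers we used.

```python
# Minimal sketch of one request with the evaluation settings described above.
# Assumes the google-genai Python SDK; model ID and thinking_level are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",      # assumed ID for the 02/26 preview
    contents="Solve: what is 17 * 24?",  # stand-in for a benchmark prompt
    config=types.GenerateContentConfig(
        temperature=1.0,                 # temperature used in our runs
        thinking_config=types.ThinkingConfig(
            thinking_level="high",       # "high" thinking level, as in our runs
        ),
    ),
)
print(response.text)
```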
Congratulations to the Google team on another outstanding model!