Key Takeaways
- Harmonic’s Aristotle achieved the best performance by a significant margin, obtaining 71%. This is ~2x the score of the best foundation model with tools.
- Foundation models struggle to produce formally verified proofs of advanced undergraduate- and graduate-level mathematical problems.
- The best foundation model, Claude Opus 4.5 (Thinking), scores 36%, with accuracy dropping off rapidly after that. The second-best model, Gemini 3 Pro (11/25), scores 20%, followed by Claude Sonnet 4.5 (Thinking) at 19%.
- Results highlight that while tools like Aristotle are already quite capable of assisting humans with advanced mathematics and formalization, substantial gaps in reliable formal reasoning remain across foundation models, pointing to significant room for improvement.
Background
Large language models now routinely achieve impressive scores on mathematical contests such as the AIME, IMO, and Putnam. These results suggest substantial progress in mathematical reasoning, but they come with an important caveat: natural-language solutions can mask subtle logical errors, unstated assumptions, or invalid steps that are difficult and time-consuming for humans to audit.
To address this gap, ProofBench evaluates whether models can translate mathematical reasoning into machine-checkable proofs. We use Lean 4, a proof assistant and programming language designed for formal mathematics, in which every logical inference must be formally verified (i.e., machine-checkable). A solution either compiles or it does not; there is no partial credit for plausible reasoning, which makes for a more objective evaluation of mathematical reasoning ability.
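For reference, the toy snippet below (the theorem name `toy_add_comm` is ours, and nothing here is a benchmark problem) shows what a machine-checked Lean 4 proof looks like: it compiles because the kernel verifies every inference, while an incomplete or invalid argument fails to compile rather than earning partial credit.

```lean
import Mathlib

-- Toy example: Lean accepts this proof because the kernel checks every
-- inference; deleting or weakening a step would produce a compile error
-- rather than a "partially correct" result.
theorem toy_add_comm (a b : ℕ) : a + b = b + a := by
  exact Nat.add_comm a b
```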
In addition, mathematicians are increasingly using Lean 4 to formalize their work. Therefore, models that perform well on this benchmark will be valuable in assisting with mathematical research in the future.
Success on this benchmark therefore tests whether models can maintain logical correctness across chains of reasoning and whether they can work effectively with modern formal mathematics. Strong performance would indicate genuine progress toward AI systems that can assist mathematicians with formalization, verification, and large-scale proof development.
Results
To visualize the performance differences among models, we provide scatter plots illustrating accuracy versus latency and cost.
Benchmark Overview
Each benchmark task pairs a natural-language mathematical problem with its corresponding Lean 4 formal statement. Given both, a model must construct a formal proof that is accepted by the Lean proof checker. The benchmark measures proof success directly, providing a clear and unforgiving signal of correctness. Problems are drawn from advanced undergraduate and graduate-level sources, including qualifying exams and standard textbooks. They span a broad range of modern mathematics:
- Probability theory and stochastic analysis
- Measure theory
- Real and functional analysis
- Algebra and commutative algebra
- Algebraic geometry
- Number theory
- Set theory, logic, and model theory
By focusing on higher levels of abstraction rather than high-school contest problems (e.g. AIME), ProofBench provides a complementary signal to existing math benchmarks and better reflects the challenges faced in real mathematical research.
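To make the task format concrete, here is a sketch of the shape of a task pair; the statement and the name `illustrative_task` are our own, and the problem is far easier than, and not drawn from, the benchmark. A natural-language statement is paired with a Lean 4 formalization whose proof is left as `sorry` for the model to fill in.

```lean
import Mathlib

/- Illustrative task shape only (not a benchmark item).
   Natural-language statement: "In a group where every element squares to
   the identity, multiplication is commutative."
   The model must replace `sorry` with a proof the Lean checker accepts. -/
theorem illustrative_task {G : Type*} [Group G]
    (h : ∀ g : G, g * g = 1) : ∀ a b : G, a * b = b * a := by
  sorry
```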
Tool Use Analysis
Within the ProofBench harness, models have access to tools organized into the following bundles:
| Bundle | Tools |
|---|---|
| Search | lean_loogle (Mathlib search) |
| Code Execution | lean_run_code (Lean code execution + feedback) |
| Submission | submit_proof (final proof submission) |
Models may iteratively search the Mathlib library and test partial proofs or code in a Lean 4 environment before submitting a final solution. The submission tool is invoked exactly once per problem. The visualization below summarizes how different models allocate effort between search and code execution during problem solving.
One observation in tool-use patterns is that models that focus more on executing code generally seem to perform better. For instance, Claude Opus 4.5 (Thinking) quickly starts executing code and acting on the feedback, using the search tool mainly to find lemmas it realizes it needs while constructing its solution. A common failure mode we observed is a model spiraling into a long series of search calls, looking for a theorem or notion that does not exist and repeatedly querying variations of it in the hope of finding it.
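As a purely illustrative sketch of this code-execution loop (the harness payload format is not reproduced here, and the snippet is not from an actual transcript), a model might test a partial proof like the following through lean_run_code: the `sorry` placeholder keeps the skeleton compiling so the remaining goal is reported back, after which the model can search Mathlib for the lemma it still needs.

```lean
import Mathlib

-- Hypothetical intermediate attempt of the kind a model might run for
-- feedback; `sorry` keeps the file compiling (with a warning) so the
-- open goal is reported back.
example (a b : ℝ) (h : a ≤ b) : a - b ≤ 0 := by
  have key : a - b ≤ b - b := sub_le_sub_right h b
  -- remaining goal: a - b ≤ 0; still need to rewrite b - b as 0
  sorry
```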
Methodology
The benchmark consists of two splits: a public split and a private test split, each containing 100 problems. For every problem, we provide:
- A natural-language statement of the theorem (no informal proof attached; models must generate the proof)
- A carefully vetted Lean 4 formalization of the statement
Models are allowed up to 40 interaction turns per problem, enabling iterative development with tool feedback. No access to the informal proof is provided.
Note on evaluating Aristotle: To evaluate Aristotle, a formal-math system from Harmonic AI, we provided the full Lean 4 statement file with the natural-language statement included as a comment at the top of the file. We emphasize that Aristotle uses its own internal system and harness, not the evaluation harness we provide to other models. Additionally, the time reported for Aristotle reflects solve time only and does not include queue time.
Data Collection Process and Examples
Prior benchmarks have faced numerous challenges in ensuring that the formal statements were correct; statements can be misformalized in many ways, particularly when autoformalized by models. To avoid these issues, problems were selected and formalized by a team of Lean 4 experts, including PhD students and specialists in the relevant domains. Statements and formalizations underwent a secondary review before being included in the benchmark. We also collected natural-language proofs for the problems; while the main scores reported on this page do not give models access to these proofs, we plan to run ablations that include them.
For examples of the questions in the benchmark, see the panel below.
Model Output Examples
Q: Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the "continuity" property Bₙ ∈ F, Bₙ ↓ ∅, μ(Bₙ) < ∞ → μ(Bₙ) → 0 must be countably additive if μ(Ω) < ∞.

Lean Formalization:

```lean
import Mathlib

open TopologicalSpace Filter MeasureTheory ProbabilityTheory Function

open scoped NNReal ENNReal MeasureTheory ProbabilityTheory Topology

namespace MeasureTheory

variable {Ω : Type*} [MeasurableSpace Ω]

theorem dembo_1_1_15 {μ : Set Ω → ℝ≥0}
    (hAdd :
```