ProofBench

Updated: 1/30/2026

Can models write math proofs that are formally verified?

Key Takeaways

  • Harmonic’s Aristotle achieved the best performance by a significant margin, obtaining 71%. This is ~2x the score of the best foundation model with tools.
  • Foundation models struggle to produce formally verified proofs of advanced undergraduate- and graduate-level mathematical problems.
  • The best foundation model, Claude Opus 4.5 (Thinking), scores 36%, with accuracy dropping off rapidly after that. The second-best model is Gemini 3 Pro (11/25) at 20%, followed by Claude Sonnet 4.5 (Thinking) at 19%.
  • Results highlight that while tools like Aristotle are already quite capable of assisting humans with advanced mathematics and formalization, substantial gaps in reliable formal reasoning remain across foundation models, pointing to significant room for improvement.

Background

Large language models now routinely achieve impressive scores on mathematical contests such as the AIME, IMO, and Putnam. These results suggest substantial progress in mathematical reasoning. However, they come with an important caveat: natural-language solutions can mask subtle logical errors, unstated assumptions, or invalid steps that are difficult and time-consuming for humans to audit.

To address this gap, ProofBench evaluates whether models can translate mathematical reasoning into machine-checkable proofs. We use Lean 4, a proof assistant and programming language designed for formal mathematics, in which every logical inference must be formally verified (i.e., machine-checkable). A solution either compiles or it does not; there is no partial credit for plausible reasoning. This makes for a more objective evaluation of mathematical reasoning.
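
As a toy illustration (not a benchmark item), the snippet below shows this all-or-nothing character: the first declaration is fully checked by the Lean kernel, while the second leaves a sorry and is therefore not accepted as a verified proof.

Lean 4
import Mathlib

-- Accepted: every inference is checked by the Lean kernel.
theorem double_eq_two_mul (n : ℕ) : n + n = 2 * n := by
  ring

-- Not accepted as a verified proof: the `sorry` leaves the goal unproven,
-- and Lean warns that the declaration depends on it.
theorem double_eq_two_mul' (n : ℕ) : n + n = 2 * n := by
  sorry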

In addition, mathematicians are increasingly using Lean 4 to formalize their work. Therefore, models that perform well on this benchmark will be valuable in assisting with mathematical research in the future.

Success on this benchmark therefore requires models to maintain logical correctness across entire chains of reasoning and to work effectively with modern formal mathematics. Strong performance would indicate genuine progress toward AI systems that can assist mathematicians with formalization, verification, and large-scale proof development.

Results

To visualize the performance differences among models, we provide scatter plots illustrating accuracy versus latency and cost.

[Figure: ProofBench accuracy vs. latency and cost]

Benchmark Overview

Each benchmark task pairs a natural-language mathematical problem with its corresponding Lean 4 formal statement. Given both, a model must construct a formal proof that is accepted by the Lean proof checker. The benchmark measures proof success directly, providing a clear and unforgiving signal of correctness. Problems are drawn from advanced undergraduate and graduate-level sources, including qualifying exams and standard textbooks. They span a broad range of modern mathematics:

  • Probability theory and stochastic analysis
  • Measure theory
  • Real and functional analysis
  • Algebra and commutative algebra
  • Algebraic geometry
  • Number theory
  • Set theory, logic, and model theory

By focusing on higher levels of abstraction rather than high-school contest problems (e.g. AIME), ProofBench provides a complementary signal to existing math benchmarks and better reflects the challenges faced in real mathematical research.
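
To make the task format concrete, the sketch below shows roughly what a benchmark item looks like (a simplified, hypothetical example; the statement and theorem name are not drawn from the benchmark): the model receives the natural-language statement together with a vetted formalization ending in sorry, and must replace the sorry with a complete proof.

Lean 4
import Mathlib

/- Natural-language statement (provided alongside the formalization):
   Show that the sum of two even natural numbers is even. -/
theorem example_task (m n : ℕ) (hm : Even m) (hn : Even n) : Even (m + n) := by
  sorry  -- the model must replace this with a proof accepted by Lean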

Tool Use Analysis

Within the ProofBench harness, models have access to tools organized into the following bundles:

Bundle            Tools
Search            lean_loogle (Mathlib search)
Code Execution    lean_run_code (Lean code execution + feedback)
Submission        submit_proof (final proof submission)

Models may iteratively search the Mathlib library and test partial proofs or code in a Lean 4 environment before submitting a final solution. The submission tool is invoked exactly once per problem. The visualization below summarizes how different models allocate effort between search and code execution during problem solving.
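
As a hypothetical illustration of this loop (not an actual trace from a benchmark run), a model might test a partial attempt like the one below with lean_run_code; Lean's feedback reports the goal left open by the sorry, which the model can then try to close, possibly after a lean_loogle search for a suitable lemma.

Lean 4
import Mathlib

-- Hypothetical intermediate attempt sent to lean_run_code.
-- Lean reports the goal that remains after `constructor` and `linarith`,
-- telling the model what still has to be proved before submission.
example (a b : ℝ) (ha : 0 ≤ a) (hb : 0 ≤ b) : 0 ≤ a + b ∧ 0 ≤ a * b := by
  constructor
  · linarith
  · sorry  -- remaining goal: 0 ≤ a * b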

[Figure: ProofBench Tool Usage]

One pattern in tool use is that models that focus more on executing code generally perform better. For instance, we observed that Claude Opus 4.5 (Thinking) quickly starts executing code and getting feedback, using the search tool mainly to find lemmas it realizes it needs while constructing its solution. A common failure mode is a model spiraling into a long series of search calls, looking for a theorem or notion that does not exist and continuing to query variations of it in the hope of finding it.

Methodology

The benchmark consists of two splits: a public split and a private test split, each containing 100 problems. For every problem, we provide:

  • A natural-language statement of the theorem (no informal proof attached; models must generate the proof)
  • A carefully vetted Lean 4 formalization of the statement

Models are allowed up to 40 interaction turns per problem, enabling iterative development with tool feedback. No access to the informal proof is provided.

Note on evaluating Aristotle: Aristotle, a formal-math system from Harmonic AI, was given the full Lean 4 statement file, with the natural-language statement included as a comment at the top of the file. We emphasize that Aristotle uses its own internal harness, not the evaluation harness we provide to other models. Additionally, the reported time reflects only solve time and does not include queue time.

Data Collection Process and Examples

Prior benchmarks have faced numerous challenges in ensuring that the formalizations of the statements were correct. Statements can be misformalized in a number of ways, particularly if autoformalized by models. To avoid these issues, problems were selected and formalized by a team of Lean 4 experts, including PhD students and experts in their respective domains. Statements and formalizations underwent a secondary review before being included in the benchmark. We also collected natural-language proofs for the problems; while the main scores reported on this page do not give models access to these proofs, we plan to conduct ablations that include them.
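
As a constructed toy example of the kind of misformalization this review is meant to catch (not an item from the benchmark), a statement can become vacuously provable when a hypothesis can never hold:

Lean 4
import Mathlib

-- Intended claim: every positive natural number is at least 1.
-- Misformalized version: the hypothesis `h : n < 0` is impossible for a
-- natural number, so the theorem is vacuously true and no longer tests
-- the intended mathematics.
theorem vacuous_formalization (n : ℕ) (h : n < 0) : 1 ≤ n := by
  omega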

For examples of the questions in the benchmark, see the panel below.

Model Output Examples

Question:

Prove that a finitely additive non-negative set function μ on a measurable space (Ω, F) with the "continuity" property Bₙ ∈ F, Bₙ ↓ ∅, μ(Bₙ) < ∞ → μ(Bₙ) → 0 must be countably additive if μ(Ω) < ∞.

Lean Formalization:
Lean 4
import Mathlib

open TopologicalSpace Filter MeasureTheory ProbabilityTheory Function

open scoped NNReal ENNReal MeasureTheory ProbabilityTheory Topology

namespace MeasureTheory

variable {Ω : Type*} [MeasurableSpace Ω]

theorem dembo_1_1_15 {μ : Set Ω → ℝ≥0}
    (hAdd : ∀ A B, MeasurableSet A → MeasurableSet B → Disjoint A B → μ (A ∪ B) = μ A + μ B)
    (hCont : ∀ B : ℕ → Set Ω,
      (∀ n, MeasurableSet (B n)) → Antitone B → (⋂ n, B n) = ∅ →
      Tendsto (fun n => μ (B n)) atTop (𝓝 0))
    {A : ℕ → Set Ω} (hA : ∀ n, MeasurableSet (A n)) (hDisj : Pairwise (Disjoint on A)) :
    μ (⋃ n, A n) = ∑' n, μ (A n) := by
  sorry

end MeasureTheory
Lean 4
by
  have hEmpty : μ ∅ = 0 := by
    have h := hAdd ∅ ∅ MeasurableSet.empty MeasurableSet.empty disjoint_bot_left
    simp only [Set.union_empty] at h
    have : μ ∅ + μ ∅ = μ ∅ := h.symm
    have h3 : μ ∅ + μ ∅ - μ ∅ = μ ∅ - μ ∅ := by rw [this]
    simp at h3; exact h3
  set B := fun n => ⋃ k ≥ n, A k with hBdef
  have hB_meas : ∀ n, MeasurableSet (B n) := fun n =>
    MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)
  have hB_anti : Antitone B := by
    intro m n hmn
    apply Set.biUnion_subset_biUnion_left
    intro k hk; exact le_trans hmn hk
  have hB_inter : ⋂ n, B n = ∅ := by
    rw [Set.eq_empty_iff_forall_not_mem]
    intro x hx
    simp only [Set.mem_iInter] at hx
    have h0 := hx 0; rw [hBdef] at h0
    simp only [Set.mem_iUnion, ge_iff_le] at h0
    obtain ⟨k, -, hxk⟩ := h0
    have hk1 := hx (k + 1); rw [hBdef] at hk1
    simp only [Set.mem_iUnion, ge_iff_le] at hk1
    obtain ⟨j, hjk, hxj⟩ := hk1
    have hkj : k ≠ j := by omega
    exact Set.disjoint_iff.mp (hDisj hkj) ⟨hxk, hxj⟩
  have hBtends : Tendsto (fun n => μ (B n)) atTop (𝓝 0) := hCont B hB_meas hB_anti hB_inter
  have hUnion : ∀ n, (⋃ i, A i) = (⋃ k < n, A k) ∪ B n := by
    intro n; ext x
    simp only [Set.mem_union, Set.mem_iUnion, exists_prop, hBdef]
    constructor
    · rintro ⟨k, hxk⟩
      by_cases hkn : k < n
      · left; exact ⟨k, hkn, hxk⟩
      · right; exact ⟨k, not_lt.mp hkn, hxk⟩
    · rintro (⟨k, -, hxk⟩ | ⟨k, -, hxk⟩) <;> exact ⟨k, hxk⟩
  have hDisjParts : ∀ n, Disjoint (⋃ k < n, A k) (B n) := by
    intro n; rw [Set.disjoint_iff]
    intro x ⟨hx1, hx2⟩
    simp only [Set.mem_iUnion, exists_prop, hBdef] at hx1 hx2
    obtain ⟨k1, hk1n, hxk1⟩ := hx1
    obtain ⟨k2, hk2n, hxk2⟩ := hx2
    have hne : k1 ≠ k2 := by omega
    exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk1, hxk2⟩
  have hFinMeas : ∀ n, MeasurableSet (⋃ k < n, A k) := fun n =>
    MeasurableSet.biUnion (Set.to_countable _) (fun k _ => hA k)
  have hFinAdd : ∀ n, μ (⋃ k < n, A k) = ∑ k ∈ Finset.range n, μ (A k) := by
    intro n
    induction n with
    | zero => simp [hEmpty]
    | succ n ih =>
      rw [Finset.sum_range_succ, ← ih]
      have hunion : (⋃ k < n + 1, A k) = (⋃ k < n, A k) ∪ A n := by
        ext x; simp only [Set.mem_iUnion, Set.mem_union, exists_prop]
        constructor
        · rintro ⟨k, hkn, hxk⟩
          by_cases hkn' : k < n
          · left; exact ⟨k, hkn', hxk⟩
          · have heq : k = n := by omega
            right; rw [heq] at hxk; exact hxk
        · rintro (⟨k, hkn, hxk⟩ | hxn)
          · exact ⟨k, Nat.lt_succ_of_lt hkn, hxk⟩
          · exact ⟨n, Nat.lt_succ_self n, hxn⟩
      have hdisj : Disjoint (⋃ k < n, A k) (A n) := by
        rw [Set.disjoint_iff]; intro x ⟨hx1, hx2⟩
        simp only [Set.mem_iUnion, exists_prop] at hx1
        obtain ⟨k, hkn, hxk⟩ := hx1
        have hne : k ≠ n := by omega
        exact Set.disjoint_iff.mp (hDisj hne) ⟨hxk, hx2⟩
      rw [hunion]; exact hAdd _ _ (hFinMeas n) (hA n) hdisj
  have hKey : ∀ n, μ (⋃ i, A i) = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by
    intro n
    calc μ (⋃ i, A i) = μ ((⋃ k < n, A k) ∪ B n) := by rw [hUnion n]
      _ = μ (⋃ k < n, A k) + μ (B n) := hAdd _ _ (hFinMeas n) (hB_meas n) (hDisjParts n)
      _ = (∑ k ∈ Finset.range n, μ (A k)) + μ (B n) := by rw [hFinAdd n]
  have hPartialSums : Tendsto (fun n => ∑ k ∈ Finset.range n, μ (A k)) atTop (𝓝 (μ (⋃ i, A i))) := by
    have h1 : ∀ n, (∑ k ∈ Finset.range n, μ (A k)) = μ (⋃ i, A i) - μ (B n) := by
      intro n
      have := hKey n
      have h2 : μ (⋃ i, A i) - μ (B n) = ∑ k ∈ Finset.range n, μ (A k) + μ (B n) - μ (B n) := by
        rw [← this]
      simp at h2; exact h2.symm
    simp_rw [h1]
    have : Tendsto (fun n => μ (⋃ i, A i) - μ (B n)) atTop (𝓝 (μ (⋃ i, A i) - 0)) :=
      Tendsto.sub tendsto_const_nhds hBtends
    simp at this; exact this
  have hHasSum : HasSum (fun n => μ (A n)) (μ (⋃ i, A i)) := by
    rw [NNReal.hasSum_iff_tendsto_nat]; exact hPartialSums
  exact hHasSum.tsum_eq.symm

Verdict: CORRECT (the proof was accepted by the Lean checker).

Question:

Show that an A-scheme is separated (over A) if and only if it is separated over ℤ.

Lean Formalization:
Lean 4
import Mathlib

open AlgebraicGeometry

variable {A : Type*} {X : Scheme} [CommRing A]
  (struct_X : X ⟶ Spec (CommRingCat.of A))

theorem algebraicGeometry_vakil_ch_11_2_E :
    IsSeparated struct_X ↔ IsSeparated (instOverTerminalScheme X).hom := by
  sorry
Lean 4
by
  simpa using
    (show IsSeparated struct_X ↔ IsSeparated (CategoryTheory.Limits.terminal.from X) from by
      constructor
      · intro h
        have hterm :
            IsSeparated
              (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) := by
          infer_instance
        have hcomp :
            IsSeparated
              (CategoryTheory.CategoryStruct.comp struct_X
                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=
          (CategoryTheory.MorphismProperty.IsStableUnderComposition.comp_mem
            (P := @AlgebraicGeometry.IsSeparated)
            struct_X (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))) h hterm)
        simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using hcomp
      · intro h
        have hcomp :
            IsSeparated
              (CategoryTheory.CategoryStruct.comp struct_X
                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) := by
          simpa [CategoryTheory.Limits.terminal.comp_from struct_X] using h
        haveI :
            IsSeparated
              (CategoryTheory.CategoryStruct.comp struct_X
                (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A)))) :=
          hcomp
        exact
          AlgebraicGeometry.IsSeparated.of_comp struct_X
            (CategoryTheory.Limits.terminal.from (Spec (CommRingCat.of A))))

Verdict: CORRECT (the proof was accepted by the Lean checker).

Acknowledgements

We thank Kexing Ying, Vasilii Nesterov, Janitha Aswedige, Bingyu Xia, Elif Uskuplu, and Yizheng Zhu for contributing problem formalizations. We also thank Alex Gu for valuable feedback that improved the benchmark design. Finally, we are grateful to the teams behind lean-lsp-mcp, as well as Loogle.


Citation (BibTeX)

@misc{ravi2026proofbench,
title        = {ProofBench: Can Models Write Math Proofs That Are Formally Verified?},
author       = {Ravi, Nikil and Nashold, Langston and Krishnan, Rayan},
year         = {2026},
month        = jan,
howpublished = {Vals AI},
url          = {https://vals.ai/benchmarks/proof-bench},
note         = {Accessed: 2026-01-30}
}
