CaseLaw (v2) Benchmark

Key Takeaways

  • GPT 4.1 ranked #1, with 78.1% accuracy.
  • GPT 5 Mini emerged as a strong second-place performer with 77.5% accuracy, demonstrating excellent legal reasoning capabilities with faster processing times.
  • Grok 4 achieved 76.2% accuracy, showing robust performance on complex legal analysis tasks.
  • A common issue across models was identifying only part of the relevant document sections and relying on general knowledge rather than the specific document context (despite being instructed otherwise).
  • While there is still significant room for improvement, the top-performing models demonstrated high accuracy, making them well-suited for these tasks.

Context

In our CaseLaw v2 benchmark report, we present our study of using LLMs for litigation work, focusing on case law drawn from public court systems.

Case law in the US and Canada is challenging to use at scale: strict licensing requirements prevent LLMs from being trained on it, and organizations that have amassed substantial case histories provide access only for legal use at significant cost.

Testing models on this data offers several benefits. First, practicing lawyers will query models with references to recent cases that foundation models and applications have not been exposed to during training. Second, the majority of legal LLM evaluations have focused solely on US law; our study expands to explore LLM application in international legal systems.

In collaboration with the legaltech startup Jurisage, we announce the creation of our latest dataset, “CaseLaw v2”. Building on our previous CaseLaw benchmark, we curated harder, more up-to-date questions, providing an enhanced evaluation of large language models’ capabilities in legal document analysis and case law reasoning. Models were scoring almost 90% on the original CaseLaw benchmark, nearing saturation; on CaseLaw v2, no model exceeds 80% accuracy.

This dataset remains private and contains cases from recent court decisions beyond models’ training cutoff dates. The evaluation tasks test real-world legal reasoning capabilities, based on feedback from legal practitioners. The benchmark tests models along seven dimensions:

  • retrieving the most important cases for a given query
  • answering questions over multiple documents
  • providing multi-point answers that have several components
  • performing calculations to arrive at an answer
  • reading over tables
  • working chronologically over some input data
  • understanding terms of art relevant to case law research

The benchmark contains tests that require answering based on a single relevant case, as well as more complex questions that require models to refer to multiple cases. In total, it contains 300 tests in our validation split, and 104 tests in our test split.
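
To make the task structure concrete, here is a minimal sketch of what a single test item might look like when loaded. The field names, JSONL layout, and `load_split` helper are illustrative assumptions, since the actual CaseLaw v2 dataset and schema remain private.

```python
# Illustrative only: the real CaseLaw v2 dataset is private, so these field
# names and the JSONL layout are assumptions, not the actual schema.
import json
from dataclasses import dataclass

@dataclass
class CaseLawTest:
    question: str          # e.g. a precedent-relevance or terms-of-art question
    documents: list[str]   # full text of one or more recent court decisions
    dimension: str         # which of the seven tested dimensions it targets
    reference_answer: str  # gold answer informed by practitioner feedback

def load_split(path: str) -> list[CaseLawTest]:
    """Load one split, e.g. the 300-item validation or 104-item test split."""
    with open(path, encoding="utf-8") as f:
        return [CaseLawTest(**json.loads(line)) for line in f]
```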

Legal document analysis presents unique challenges due to complex legal language, precise interpretation requirements, and the need to identify relevant precedents—making it ideal for testing advanced language models. Each evaluation presents models with a substantial amount of legal information in the form of documents, requiring sophisticated analysis.

Questions test both extractive capabilities (finding specific information) and reasoning abilities (understanding legal implications and relationships). For example, the benchmark includes questions such as the ones below:

Example Question 1: From these X cases, which one(s) are most relevant as a precedent for Y situation?

Example Question 2: From the cases provided, what is meant by X? Answer the question only with the relevant excerpt or multiple excerpts from the documents, and do not include additional description or explanation.

Other questions in the benchmark follow similar patterns across the dimensions listed above.
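
As a rough illustration of how a document-grounded, extractive question like Example Question 2 might be posed, the snippet below assembles a prompt from the case documents. The wording and structure are assumptions for illustration, not the exact prompt or harness used in the benchmark.

```python
# Illustrative prompt assembly; the instruction echoes Example Question 2
# above but is not the benchmark's actual harness or prompt.
def build_extractive_prompt(question: str, documents: list[str]) -> str:
    doc_blocks = "\n\n".join(
        f"[Document {i + 1}]\n{text}" for i, text in enumerate(documents)
    )
    return (
        "You are assisting with case law research. Use only the provided "
        "documents; do not rely on outside knowledge.\n\n"
        f"{doc_blocks}\n\n"
        f"Question: {question}\n"
        "Answer only with the relevant excerpt or excerpts from the documents, "
        "and do not include additional description or explanation."
    )
```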


Highest Quality Models

GPT 4.1

Release date: 4/14/2025
Accuracy: 78.1%
Latency: 35.0s
Cost: $2.00 / $8.00 (input / output per 1M tokens)

  • GPT-4.1 achieved the highest performance with 78.1% accuracy, though the competitive gap has narrowed with newer models.
  • Excelled at identifying relevant case citations and understanding complex legal relationships within lengthy documents.
  • Demonstrated strong performance across both extractive and analytical legal reasoning tasks.



GPT 5 Mini

Release date: 8/7/2025
Accuracy: 77.5%
Latency: 24.4s
Cost: $0.25 / $2.00 (input / output per 1M tokens)

  • GPT-5 Mini achieved strong second-place performance with 77.5% accuracy, offering excellent value with faster processing times.
  • Demonstrated sophisticated legal reasoning capabilities while maintaining efficiency in document analysis.
  • Provided consistent performance across different types of legal tasks with notably lower latency than top-tier models.



Grok 4

Release date: 7/9/2025
Accuracy: 76.2%
Latency: 40.1s
Cost: $3.00 / $15.00 (input / output per 1M tokens)

  • Grok-4 performed well with 76.2% accuracy, showing strong reasoning capabilities for complex legal analysis.
  • Demonstrated particular strength in understanding relationships between different legal concepts within lengthy documents.
  • Provides thorough legal analysis despite higher processing latency.



These results highlight the sophisticated reasoning capabilities required for legal document analysis, with leading models achieving similar accuracy scores.

CaseLaw v2 results: accuracy vs. efficiency across evaluated models

The performance spread highlights the complexity of legal reasoning tasks, with frontier models clearly outperforming alternatives by up to ~25%. The accuracy vs. efficiency analysis shows varied approaches to legal reasoning. GPT 4.1 provides the highest accuracy, while GPT 5 Mini offers an excellent balance of performance and speed, and Grok 3 delivers reliable results with good efficiency.
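
Assuming the cost figures above are list prices per million input and output tokens, a back-of-the-envelope trade-off can be sketched from the headline numbers. The per-query token counts below are assumptions, since actual cost depends on how many case documents each test supplies.

```python
# Back-of-the-envelope comparison from the headline numbers in this report.
# Token counts are assumptions; prices are treated as USD per 1M tokens.
MODELS = {
    # name: (accuracy, latency_s, usd_per_1m_input, usd_per_1m_output)
    "GPT 4.1":    (0.781, 35.0, 2.00, 8.00),
    "GPT 5 Mini": (0.775, 24.4, 0.25, 2.00),
    "Grok 4":     (0.762, 40.1, 3.00, 15.00),
}

ASSUMED_INPUT_TOKENS = 50_000   # lengthy case documents per query (assumption)
ASSUMED_OUTPUT_TOKENS = 1_000   # short grounded answer (assumption)

for name, (acc, latency, p_in, p_out) in MODELS.items():
    cost = (ASSUMED_INPUT_TOKENS * p_in + ASSUMED_OUTPUT_TOKENS * p_out) / 1e6
    print(f"{name}: {acc:.1%} accuracy, {latency:.1f}s latency, ~${cost:.2f}/query")
```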

The results show that advanced reasoning models perform well at these complex legal tasks; however, there is significant room for improvement, and many models struggle with the nuanced interpretation required for accurate legal analysis. This benchmark thus provides valuable insights for organizations considering incorporating AI into legal workflows.
