CaseLaw (v2) Benchmark

Key Takeaways

  • GPT 4.1 ranked #1, with 78.1% accuracy.
  • GPT 5 Mini emerged as a strong second-place performer with 77.5% accuracy, demonstrating excellent legal reasoning capabilities with faster processing times.
  • Grok 4 achieved 76.2% accuracy, showing robust performance on complex legal analysis tasks.
  • A common issue across models was identifying only part of the relevant document sections and relying on general knowledge rather than the specific document context (despite being instructed otherwise).
  • While there is still significant room for improvement, the top-performing models demonstrated high accuracy, making them well-suited for these tasks.

Context

In our CaseLaw v2 benchmark report, we present our study of using LLMs for litigation work, focusing on case law drawn from public court systems.

Case law in the US and Canada is challenging to use at scale: strict licensing requirements prevent LLMs from being trained on it, and organizations that have amassed substantial case histories provide access only for legal use at significant cost.

Testing models on this data offers several benefits. First, practicing lawyers will query models with references to recent cases that foundation models and applications have not been exposed to during training. Second, the majority of legal LLM evaluations have focused solely on US law; our study expands to explore LLM application in international legal systems.

In collaboration with the legaltech startup Jurisage, we announce the creation of our latest dataset, “CaseLaw v2”. Building on our previous CaseLaw benchmark, we curated harder, more up-to-date questions, providing an enhanced evaluation of large language models’ capabilities in legal document analysis and case law reasoning. Models were scoring almost 90% on the original CaseLaw benchmark, nearing saturation; on CaseLaw v2, no model exceeds 80% accuracy.

This dataset remains private and contains cases from recent court decisions beyond models’ training cutoff dates. The evaluation tasks test real-world legal reasoning capabilities, based on feedback from legal practitioners. The benchmark tests models along seven dimensions:

  • retrieving the most important cases for a given query
  • answering questions over multiple documents
  • providing multi-point answers that have several components
  • performing calculations to arrive at an answer
  • reading over tables
  • working chronologically over some input data
  • understanding terms of art relevant to case law research

The benchmark contains tests that require answering based on a single relevant case, as well as more complex questions that require models to refer to multiple cases. In total, it contains 300 tests in our validation split, and 104 tests in our test split.
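
To make the task structure concrete, here is a minimal sketch of what a single test item might look like when loaded. The field names, JSONL layout, and `load_split` helper are illustrative assumptions, since the actual CaseLaw v2 dataset and schema remain private.

```python
# Illustrative only: the real CaseLaw v2 dataset is private, so these field
# names and the JSONL layout are assumptions, not the actual schema.
import json
from dataclasses import dataclass

@dataclass
class CaseLawTest:
    question: str          # e.g. a precedent-relevance or terms-of-art question
    documents: list[str]   # full text of one or more recent court decisions
    dimension: str         # which of the seven tested dimensions it targets
    reference_answer: str  # gold answer informed by practitioner feedback

def load_split(path: str) -> list[CaseLawTest]:
    """Load one split, e.g. the 300-item validation or 104-item test split."""
    with open(path, encoding="utf-8") as f:
        return [CaseLawTest(**json.loads(line)) for line in f]
```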

Legal document analysis presents unique challenges due to complex legal language, precise interpretation requirements, and the need to identify relevant precedents—making it ideal for testing advanced language models. Each evaluation presents models with a substantial amount of legal information in the form of documents, requiring sophisticated analysis.

Questions test both extractive capabilities (finding specific information) and reasoning abilities (understanding legal implications and relationships). For example, the benchmark includes questions such as the ones below:

Example Question 1: From these X cases, which one(s) are most relevant as a precedent for Y situation?

Example Question 2: From the cases provided, what is meant by X? Answer the question only with the relevant excerpt or multiple excerpts from the documents, and do not include additional description or explanation.

Other questions in the benchmark follow similar patterns across the dimensions listed above.
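
As a rough illustration of how a document-grounded, extractive question like Example Question 2 might be posed, the snippet below assembles a prompt from the case documents. The wording and structure are assumptions for illustration, not the exact prompt or harness used in the benchmark.

```python
# Illustrative prompt assembly; the instruction echoes Example Question 2
# above but is not the benchmark's actual harness or prompt.
def build_extractive_prompt(question: str, documents: list[str]) -> str:
    doc_blocks = "\n\n".join(
        f"[Document {i + 1}]\n{text}" for i, text in enumerate(documents)
    )
    return (
        "You are assisting with case law research. Use only the provided "
        "documents; do not rely on outside knowledge.\n\n"
        f"{doc_blocks}\n\n"
        f"Question: {question}\n"
        "Answer only with the relevant excerpt or excerpts from the documents, "
        "and do not include additional description or explanation."
    )
```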


Highest Quality Models

GPT 4.1

Release date: 4/14/2025
Accuracy: 78.1%
Latency: 35.0s
Cost: $2.00 / $8.00 (input / output per 1M tokens)

  • GPT-4.1 achieved the highest performance with 78.1% accuracy, though the competitive gap has narrowed with newer models.
  • Excelled at identifying relevant case citations and understanding complex legal relationships within lengthy documents.
  • Demonstrated strong performance across both extractive and analytical legal reasoning tasks.



GPT 5 Mini

Release date: 8/7/2025
Accuracy: 77.5%
Latency: 24.4s
Cost: $0.25 / $2.00 (input / output per 1M tokens)

  • GPT-5 Mini achieved strong second-place performance with 77.5% accuracy, offering excellent value with faster processing times.
  • Demonstrated sophisticated legal reasoning capabilities while maintaining efficiency in document analysis.
  • Provided consistent performance across different types of legal tasks with notably lower latency than top-tier models.



Grok 4

Release date: 7/9/2025
Accuracy: 76.2%
Latency: 40.1s
Cost: $3.00 / $15.00 (input / output per 1M tokens)

  • Grok-4 performed well with 76.2% accuracy, showing strong reasoning capabilities for complex legal analysis.
  • Demonstrated particular strength in understanding relationships between different legal concepts within lengthy documents.
  • Provides thorough legal analysis despite higher processing latency.



These results highlight the sophisticated reasoning capabilities required for legal document analysis, with leading models achieving similar accuracy scores.

CaseLaw v2 results: accuracy vs. efficiency across evaluated models

The performance spread highlights the complexity of legal reasoning tasks, with frontier models clearly outperforming alternatives by up to ~25%. The accuracy vs. efficiency analysis shows varied approaches to legal reasoning. GPT 4.1 provides the highest accuracy, while GPT 5 Mini offers an excellent balance of performance and speed, and Grok 3 delivers reliable results with good efficiency.
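
Assuming the cost figures above are list prices per million input and output tokens, a back-of-the-envelope trade-off can be sketched from the headline numbers. The per-query token counts below are assumptions, since actual cost depends on how many case documents each test supplies.

```python
# Back-of-the-envelope comparison from the headline numbers in this report.
# Token counts are assumptions; prices are treated as USD per 1M tokens.
MODELS = {
    # name: (accuracy, latency_s, usd_per_1m_input, usd_per_1m_output)
    "GPT 4.1":    (0.781, 35.0, 2.00, 8.00),
    "GPT 5 Mini": (0.775, 24.4, 0.25, 2.00),
    "Grok 4":     (0.762, 40.1, 3.00, 15.00),
}

ASSUMED_INPUT_TOKENS = 50_000   # lengthy case documents per query (assumption)
ASSUMED_OUTPUT_TOKENS = 1_000   # short grounded answer (assumption)

for name, (acc, latency, p_in, p_out) in MODELS.items():
    cost = (ASSUMED_INPUT_TOKENS * p_in + ASSUMED_OUTPUT_TOKENS * p_out) / 1e6
    print(f"{name}: {acc:.1%} accuracy, {latency:.1f}s latency, ~${cost:.2f}/query")
```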

The results show that advanced reasoning models perform well at these complex legal tasks; however, there is significant room for improvement, and many models struggle with the nuanced interpretation required for accurate legal analysis. This benchmark thus provides valuable insights for organizations considering incorporating AI into legal workflows.
