LegalBench Benchmark


Key Takeaways

  • The highest-performing model is Google’s Gemini 2.5 Pro Exp with a score of 83.6%.
  • Google’s Gemini 2.5 Flash models follow close behind, along with reasoning models from OpenAI, xAI, and Anthropic.
  • Aside from reasoning models, GPT-4.1 (as well as its mini and nano variants) performed well, as did DeepSeek V3. However, DeepSeek R1, the reasoning variant, performed poorly.
  • A given model’s performance can vary dramatically across different legal tasks. There is still significant room for improvement for these models to perform well on legal tasks.

Context

There has been considerable effort to measure language model performance on academic tasks and in chatbot settings, but these high-level benchmarks are not directly applicable to specific industry use cases. Here we start to remedy this by reporting our application-specific findings and live leaderboard results on LegalBench, a large crowd-sourced collection of legal reasoning tasks. The dataset is quite comprehensive, covering six major categories:

  • Issue-spotting, where a model must determine if a fact has relevance to a particular law or legal entity.
  • Rule-recall, where a model must identify a relevant rule or state its characteristics.
  • Rule-conclusion, where a model must predict a legal outcome.
  • Rule-application, where a model must analyze how a rule was applied to reach a conclusion.
  • Interpretation, where a model must parse and understand legal text.
  • Rhetorical understanding, where a model must determine whether a legal argument performs a certain function.
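
Nearly all of these tasks are multiple-choice and expect a single word or short phrase as the answer, so task-level scoring essentially reduces to exact-match accuracy over the model’s normalized output. Below is a minimal sketch of such a scoring loop; the prompt wording and the query_model callable are our own illustrations, not LegalBench’s published harness.

```python
def normalize(answer: str) -> str:
    """Lowercase and strip quotes/punctuation so that 'Yes.' matches 'yes'."""
    return answer.strip().strip(".\"'").lower()


def score_task(examples, query_model) -> float:
    """Exact-match accuracy over a list of {'text': ..., 'label': ...} examples.

    `query_model` is any callable that takes a prompt string and returns the
    model's raw text output (an API call, a local model, etc.).
    """
    correct = 0
    for example in examples:
        prompt = f"{example['text']}\nAnswer with a single word."
        prediction = query_model(prompt)
        correct += int(normalize(prediction) == normalize(example["label"]))
    return correct / len(examples)
```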

Overall Results

Overall, Gemini 2.5 Pro is the winner by a small margin, with strong performance across every question type.

Gemini 2.5 Pro Exp

Release date: 3/25/2025
Accuracy: 83.6%
Latency: 3.5s
Cost: $1.25 / $10.00 (input / output, per 1M tokens)

  • Gemini 2.5 Pro is the new state of the art, with 83.6% accuracy.
  • It also achieves a relatively low latency of 3.51s for a reasoning model.


Grok 3 Mini Fast High Reasoning

Release date: 4/9/2025
Accuracy: 82.0%
Latency: 4.9s
Cost: $0.60 / $4.00 (input / output, per 1M tokens)

  • The latest Grok 3 Mini Fast High Reasoning model gets 82.0% accuracy.
  • Despite a higher latency of 4.92s, it comes at a cheaper price point than Gemini 2.5 Pro.

The results per question type are summarized in the graph below.

[Figure: LegalBench accuracy by question type]

Across all tasks, performance of the best models was generally between 80% and 90% accuracy. Although Gemini 2.5 Pro took the top placement overall, several other models scored #1 on specific categories. For example, Grok 3 Mini Fast Beta scored #1 on “Issue” tasks, and GPT-4o Mini snagged the top spot for “Rhetoric”.



Model Output Example

We show some example outputs from one of the LegalBench free-response tasks (Rule QA) here. The models were asked: “Where in the Federal Rules of Civil Procedure are notice requirements described?”

This question is tricky because both Rules 4 and 5 talk about the service process and notice requirements. However, Rule 4 is the main rule that defines the first notice in a proceeding, whereas Rule 5 gives the procedures for filing pleadings, etc.

This question is also difficult because it is implicit in the question that the “notice” being asked about is notice of a suit. GPT correctly determines that the question is about the initial notice, whereas Opus does not. Gemini also gets the answer wrong, though it is much more terse than the other two models. Despite its strong overall performance, Llama 3.1 makes the same mistake as Opus, citing Rule 5 instead of Rule 4. Interestingly, o1-preview is far more verbose than the other models on this question.

It’s important to note that this is a single question from a very large evaluation dataset and is not representative of overall performance.

Q: Where in the Federal Rules of Civil Procedure are notice requirements described?

A:

The notice requirements in federal civil litigation are primarily described in Rule 5 of the Federal Rules of Civil Procedure (FRCP). Specifically: 1. Rule 5(a) outlines the service requirements for pleadings and other papers, stating that unless the rules provide otherwise, each paper must be served on every party. 2. Rule 5(b) describes how service can be made, including the methods of service and the permissible means of electronic service. 3. Rule 5(c) states that all papers after the complaint required to be served must be filed with the court within a reasonable time after service. 4. Rule 5(d) covers the filing requirements, including electronic filing and the filing of discovery materials. Additionally, other rules in the FRCP may contain specific notice requirements for certain situations, such as Rule 4 for serving a summons and complaint, Rule 6 for computing and extending time, and Rule 55 for default judgments.

INCORRECT


Additional Notes

Gemini

The Gemini documentation and error handling were extremely poor. Even after turning content moderation down to the least restrictive setting, the API frequently returned an “unsafe content” error. In some cases we could work around this because the error payload included the supposedly unsafe generation.

Often, we’d get an index out-of-bounds error originating from within Gemini’s own Python SDK rather than a more meaningful error message. After further debugging, we believe this to be an additional layer of content moderation that is not exposed to the user in any capacity.

In general, Gemini applies much stricter content moderation to its output than most other models, even with Content Blocking set to the least restrictive setting. Because of this, the model was not able to successfully produce outputs for many tests in the learned_hands tasks and others; these cases are treated as failures. The Gemini Pro results may improve considerably if the content moderation is better calibrated.
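
For reference, here is a minimal sketch of how blocked or errored generations can be scored as failures when calling Gemini through the google.generativeai Python SDK. The model ID, safety categories, and helper name are illustrative, not taken from our actual harness.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Ask for the least restrictive content blocking on every category.
model = genai.GenerativeModel(
    "gemini-2.5-pro-exp-03-25",
    safety_settings={
        "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
        "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_NONE",
        "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_NONE",
    },
)


def generate_or_fail(prompt: str) -> str | None:
    """Return the model's text, or None if the generation was blocked or errored."""
    try:
        response = model.generate_content(prompt)
        return response.text  # accessing .text raises if the output was blocked
    except (ValueError, IndexError):
        # ValueError: no usable candidate; IndexError: errors surfaced from
        # inside the SDK. Both cases are counted as failures in the benchmark.
        return None
```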

Gemini Pro pricing is per character, not per token. We go by the pricing listed here, and assume an average of 4 characters per token.
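
In other words, the per-character rate is multiplied by roughly 4 to get an approximate per-token rate before computing cost. A small sketch of that conversion follows; the rate values passed in would come from the published price list, and the function names are our own.

```python
CHARS_PER_TOKEN = 4  # assumed average, per the note above


def per_token_price(price_per_char: float) -> float:
    """Convert a per-character price into an approximate per-token price."""
    return price_per_char * CHARS_PER_TOKEN


def estimate_cost(prompt_tokens: int, output_tokens: int,
                  input_price_per_char: float, output_price_per_char: float) -> float:
    """Approximate the dollar cost of one request from its token counts."""
    return (prompt_tokens * per_token_price(input_price_per_char)
            + output_tokens * per_token_price(output_price_per_char))
```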

Claude 2

Almost all LegalBench tasks are multiple-choice and expect a single word or choice as the output. Claude 2 had extreme difficulty producing outputs in this format. Even when explicit instructions were provided, such as “Don’t include an explanation for your choice” or “Answer in a single word only”, the model reliably produced a paragraph-length output with some explanation.

To give the model a chance, we wrote a custom parser for Claude 2. We asked it to produce outputs in a JSON format with ‘explanation’ and ‘answer’ keys, then extracted the ‘answer’ key. We did not do this for any subsequent Claude models; Claude 3.0 and 3.5 were evaluated normally.
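
A simplified sketch of what such a parser can look like is below; the actual parser is not published, and the function name and fallback behavior here are our own choices.

```python
import json
import re


def parse_claude2_answer(raw_output: str) -> str:
    """Extract the 'answer' field from a JSON-formatted Claude 2 response.

    Claude 2 was prompted to reply with a JSON object containing
    'explanation' and 'answer' keys; the explanation is discarded.
    Falls back to the raw text if no parsable JSON object is found.
    """
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match:
        try:
            return str(json.loads(match.group(0))["answer"]).strip()
        except (json.JSONDecodeError, KeyError):
            pass
    return raw_output.strip()
```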

When asked for a single-word response, the Claude 3 models still struggled to consistently follow these directions. These problems were resolved once we provided a system prompt setting out these expectations.
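
For illustration, this is roughly how a system prompt with those expectations is supplied through the Anthropic Python SDK; the prompt wording and model ID here are examples rather than the exact ones used in the evaluation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are answering multiple-choice legal questions. "
    "Reply with the answer only, as a single word, with no explanation."
)


def ask_claude(question: str, model: str = "claude-3-opus-20240229") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=16,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text.strip()
```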
