Key Takeaways
- The highest-performing model is GPT 5, with an accuracy of 84.6%, followed by Gemini 2.5 Pro Exp and Grok 4.
- A given model’s performance can vary dramatically across different legal tasks. These models still have significant room for improvement before they perform well in law.
Results
The results per question type are summarized in the graph below.
Model Output Example
We show example outputs from one of the LegalBench free-response tasks (Rule QA) here. The models were asked, “Where in the Federal Rules of Civil Procedure are notice requirements described?”
This question is tricky because both Rule 4 and Rule 5 address service of process and notice requirements. However, Rule 4 is the main rule that defines the first notice in a proceeding, whereas Rule 5 gives the procedures for serving and filing pleadings and other papers.
The question is also difficult because it implicitly concerns “notice” of a suit. GPT correctly determines that the question refers to the initial notice, whereas Opus does not. Interestingly, Gemini also gets the answer wrong, but its response is much more terse than those of the other two models. Despite performing best overall, Llama 3.1 makes the same mistake as Opus, citing Rule 5 instead of Rule 4.
It is important to note that this is only one question from a very large evaluation data set and is not representative of overall performance.
Q
Where in the Federal Rules of Civil Procedure are notice requirements described?
A
The notice requirements in federal civil litigation are primarily described in Rule 5 of the Federal Rules of Civil Procedure (FRCP). Specifically:
1. Rule 5(a) outlines the service requirements for pleadings and other papers, stating that unless the rules provide otherwise, each paper must be served on every party.
2. Rule 5(b) describes how service can be made, including the methods of service and the permissible means of electronic service.
3. Rule 5(c) states that all papers after the complaint required to be served must be filed with the court within a reasonable time after service.
4. Rule 5(d) covers the filing requirements, including electronic filing and the filing of discovery materials.
Additionally, other rules in the FRCP may contain specific notice requirements for certain situations, such as Rule 4 for serving a summons and complaint, Rule 6 for computing and extending time, and Rule 55 for default judgments.
INCORRECT
Dataset and Context
There has been considerable effort to measure language model performance on academic tasks and in chatbot settings, but these high-level benchmarks are not applicable to specific industry use cases. Here we start to remedy this by reporting our application-specific findings and live leaderboard results on LegalBench, a large crowd-sourced collection of legal reasoning tasks. The data set is quite comprehensive, covering six major categories:
- Issue-spotting, where a model must determine if a fact has relevance to a particular law or legal entity.
- Rule-recall, where a model must identify a relevant rule or state its characteristics.
- Rule-conclusion, where a model must predict a legal outcome.
- Rule-application, where a model must analyze how a rule was applied to reach a conclusion.
- Interpretation, where a model must parse and understand legal text.
- Rhetorical understanding, where a model must determine whether a legal argument performs a certain function.
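To make the idea of per-category results concrete, here is a minimal Python sketch of how graded answers could be rolled up into one accuracy per category and a single overall number. It is an illustration only: the record format, the function names `accuracy_by_category` and `macro_average`, and the sample data are all hypothetical, and averaging the categories equally (a macro average) is an assumption, since the text above does not say how the overall accuracy is aggregated.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction correct} from (category, is_correct) pairs.

    The (category, is_correct) record shape is a hypothetical format
    chosen for this sketch, not the benchmark's actual output schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


def macro_average(per_category):
    """Average the category accuracies with equal weight per category.

    Equal weighting is an assumption; a leaderboard could just as well
    weight by the number of questions or by task.
    """
    return sum(per_category.values()) / len(per_category)


# Made-up graded answers, for illustration only.
records = [
    ("rule-recall", True),
    ("rule-recall", False),
    ("interpretation", True),
    ("interpretation", True),
]

per_category = accuracy_by_category(records)
overall = macro_average(per_category)
print(per_category, overall)
```

With equal weighting, a small category counts as much as a large one, which is one reason a model's overall score can hide large swings between individual legal tasks.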