ContractLaw Benchmark




Key Takeaways

  • Llama 3.1 405B was the top-performing model, achieving 75.2% accuracy and setting a new SOTA on this task. All of the Llama 3.1 models perform particularly well on the extraction task.
  • Claude 3 Opus was second. It showed particular strength in determining whether contract language was in accordance with firm standards and in suggesting corrections.
  • o1 Mini performed the best of the OpenAI models, although all of the GPT-4 models were clustered relatively closely (likely within random variation). Surprisingly, o1 Preview performed worse than the others, mainly due to its poor performance on matching tasks.
  • Overall, language models are reasonably capable of handling contract law-related questions for documents of this type, and we will likely continue to see improvement as new models are released.

Dataset and Context

There has been considerable effort to measure language model performance on academic tasks and in chatbot settings, but these high-level benchmarks are contrived and do not transfer to specific industry use cases. Further, model performance results released by LLM providers are highly biased: they are often manufactured to show state-of-the-art results.

Here we start to remedy this by reporting our third-party, application-specific findings and live leaderboard results on the ContractLaw dataset, which was created in collaboration with SpeedLegal. The dataset consists of three task types, each evaluated across a range of contract types. The tasks are as follows.

Extraction: The model is asked to retrieve the part of the contract that relates to a given legal term. It must understand the term being searched for and extract the phrase or sentence that relates to it. Example extraction terms include “Non-Competition Covenant” and “Governing Law”.
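
For illustration only, here is a minimal sketch of how an extraction-style prompt could be assembled. The benchmark's actual prompt wording is not published, so the template and function name below are assumptions.

```python
# Hypothetical prompt template for the extraction task; the benchmark's
# real prompt wording is not published, so treat this as illustrative.
def build_extraction_prompt(contract_text: str, term: str) -> str:
    return (
        "You are a lawyer reviewing a contract.\n"
        f'Find the phrase or sentence that relates to the term "{term}" '
        "and respond with the exact text from the contract.\n\n"
        f"Contract:\n{contract_text}"
    )

prompt = build_extraction_prompt(
    "This Agreement shall be governed by the laws of the State of Delaware.",
    "Governing Law",
)
```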

Matching: The model is given an excerpt of a contract along with a standard text and must determine whether the excerpt upholds the expected standard. When lawyers review legal contracts, they determine whether the language is within their client's expectations; statements that are too risky or non-standard should be identified and corrected before contracts are signed. Here, the model was asked whether a given statement should be flagged.
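
One natural way to frame matching is as a binary classification with a fixed response convention; the FLAG/OK format in this sketch is our assumption, not the benchmark's published output format.

```python
# Sketch of the matching task as a yes/no decision. The FLAG/OK response
# convention is an assumption made for this example.
def build_matching_prompt(excerpt: str, standard_text: str) -> str:
    return (
        "You are a lawyer comparing contract language against a standard.\n"
        "Reply FLAG if the excerpt is risky or non-standard relative to "
        "the standard text; otherwise reply OK.\n\n"
        f"Standard text:\n{standard_text}\n\nContract excerpt:\n{excerpt}"
    )

def should_flag(model_response: str) -> bool:
    return model_response.strip().upper().startswith("FLAG")
```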

Correction: Given an excerpt of contract text and a standard text, the model is asked to correct the excerpt to meet the standard. This is the fix a lawyer might write before sending a revised contract to the opposing party for review. All three tasks were evaluated over five contract types: Non-Disclosure Agreements (NDA), Data Processing Agreements (DPA), Master Service Agreements (MSA), Sales Agreements, and Employment Agreements.
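
Taken together, each benchmark item can be thought of as a small record combining a task, a contract type, and the relevant texts. The schema below is our own sketch, not SpeedLegal's actual data format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical item schema for a dataset like ContractLaw; the field names
# are ours and do not reflect SpeedLegal's actual format.
@dataclass
class ContractLawItem:
    task: str                                # "extraction", "matching", or "correction"
    contract_type: str                       # e.g. "NDA", "DPA", "MSA"
    contract_text: str                       # contract or excerpt under review
    term: Optional[str] = None               # search term for extraction items
    standard_text: Optional[str] = None      # standard for matching/correction items
    reference_answer: Optional[str] = None   # gold extraction, flag, or fix
```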


Results

[Interactive ContractLaw leaderboard charts]

Model Output Examples

In the following example, we asked each model to take a contract and suggest a correction in keeping with the provided standard text. With each question, we also provided the model with a few in-context examples of ideal corrections.

The challenge with this task is to adapt the existing contract language in a way that is in keeping with the standard. Simply replacing the text with the standard text does not suffice. Models must understand the nuance of the clauses to form a good correction.

For the question asked, the answer we were looking for was “This Agreement shall continue for a period of three (3) years from the Effective Date or until such time as a definitive agreement(s) is entered into by the Parties with respect to the Purpose, whichever occurs first.”

In particular, it is important for the correction to describe that the agreement should continue for a period of three years OR until a definitive agreement is reached. Both parts of this logical statement must be included.
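
To make that criterion concrete, a minimal, hypothetical check for this particular item might verify that both parts of the disjunction appear in the suggested fix. The real grading system is more nuanced, so this is a sketch only.

```python
import re

# Hypothetical pass/fail check for this one item: a passing correction must
# mention both the three-year term and the definitive-agreement alternative.
# The benchmark's actual grading system is more nuanced than this sketch.
def passes_item_check(fix: str) -> bool:
    text = fix.lower()
    has_term = bool(re.search(r"three\s*\(3\)\s*years|three years", text))
    has_alternative = "definitive agreement" in text
    return has_term and has_alternative

failing_fix = ("This Agreement, and all obligations thereof unless otherwise "
               "stated in the relevant provisions, shall continue for a "
               "period of three (3) years from the Effective Date.")
print(passes_item_check(failing_fix))  # False: the OR branch is missing
```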

In the example, we see that Gemini Pro 1.0 and GPT-4 are able to produce an answer like this, while Opus simply reproduces the standard text. Llama 3.1 405B also produces an accurate suggested fix, though a more verbose one (this is still a pass under our grading system).

Q

You are a lawyer reviewing an NDA contract text. Please correct the contract text to match the criteria/information included in the standard text. Respond with a provision suggested fix.

A

Response:

Provision Suggested Fix: This Agreement, and all obligations thereof unless otherwise stated in the relevant provisions, shall continue for a period of three (3) years from the Effective Date.

INCORRECT
