Our Methodology

We aim to provide the most accurate and robust measure of the capability and reliability of AI systems in real tasks.

Overview

Our benchmarks measure the capability and reliability of AI models and agents in realistic tasks. In contrast with contrived exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more. Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and push models to their limits.

Task Design

Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:

  • Tool-Use: How well can models call the right tools to solve problems?
  • Multiple Modalities: How well can models handle images, tabular data, files, and other modalities beyond text?
  • Reasoning: Models are increasingly trained to output reasoning before answering; do these capabilities actually improve real-world utility?
  • Long-Context Capabilities: Can models reason over long contexts, such as extensive legal documents or large codebases?
  • Long-Horizon Tasks: Can models autonomously work on tasks that take minutes, hours, or longer?
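
As a minimal illustration of how a single benchmark sample might encode several of these dimensions at once, consider the sketch below. The schema and every field name in it are hypothetical and chosen for clarity, not a description of our actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkSample:
    """Hypothetical schema for one evaluation sample (illustrative only)."""
    prompt: str                                              # task instruction given to the model
    attachments: list[str] = field(default_factory=list)     # images, files, tables, etc.
    allowed_tools: list[str] = field(default_factory=list)   # tools the agent may call
    max_turns: int = 1                                        # >1 for multi-turn / long-horizon tasks
    reference_answer: str | None = None                       # used by strict accuracy checks
    grading_rubric: str | None = None                         # used by LLM-as-a-judge grading


# Example: a long-context, tool-using finance task
sample = BenchmarkSample(
    prompt="Summarize the covenant changes between the two credit agreements.",
    attachments=["agreement_2022.pdf", "agreement_2024.pdf"],
    allowed_tools=["search_document", "extract_table"],
    max_turns=8,
    grading_rubric="Award full credit only if all covenant changes are identified.",
)
```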

Public and Private Sets

A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results. We therefore rely on private benchmarking; to preserve transparency and fairness, most of our benchmarks come with three datasets:

  • Public Validation Set: A completely open dataset, to provide transparency in the types of samples we use for evaluation.
  • Private Validation Set: A larger, privately held dataset that we license to companies for their own internal validation. We provide statistical evidence that performance on this set is correlated with performance on our test set (see the sketch after this list).
  • Test Set: This dataset remains private at all times and is the only dataset used for the benchmarks we publish. Keeping it private prevents leakage into the training data of foundation models.
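
The correlation claim above can be checked with standard statistical tests. The following is a minimal sketch, assuming we have per-model accuracy scores on both the private validation set and the test set; the scores and variable names are placeholders, not our internal tooling.

```python
import numpy as np
from scipy import stats

# Hypothetical per-model accuracies on the two datasets (same model order).
validation_scores = np.array([0.62, 0.71, 0.58, 0.80, 0.74, 0.67])
test_scores       = np.array([0.60, 0.69, 0.55, 0.78, 0.75, 0.66])

# Pearson correlation and its p-value: a high r with a small p-value is
# evidence that validation-set performance tracks test-set performance.
r, p_value = stats.pearsonr(validation_scores, test_scores)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# Spearman rank correlation checks that the *ranking* of models is preserved,
# which is often what matters when comparing models.
rho, p_rank = stats.spearmanr(validation_scores, test_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_rank:.4f}")
```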

Metrics and Evaluation

Benchmarks often report only accuracy numbers; however, it is important to consider factors such as efficiency, cost, time taken per test, failure modes, and more. Our evaluation framework provides detailed insights into model performance through multiple metrics:

  • Accuracy: Evaluates the correctness of model outputs for each task and benchmark. This includes strict accuracy checks, as well as rubric-based LLM-as-a-judge accuracy metrics.
  • Latency: Measures how long a model takes to return a complete response.
  • Cost: Analyzes the operational cost of running each model through its API provider.
  • Additional quantitative and qualitative insights: For each benchmark, we also report further information, including but not limited to tool-use statistics, qualitative analysis of error modes, and comparisons between models. This goes beyond the raw benchmark numbers and helps contextualize model performance.

This information enables us to offer a more comprehensive, holistic view of model performance, including accuracy, reliability, efficiency, and qualitative insights.
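
To make these metrics concrete, here is a minimal sketch of how per-sample results might be rolled up into accuracy, latency, and cost summaries. The record format, field names, and numbers are assumptions for illustration, not our actual pipeline.

```python
import statistics

# Hypothetical per-sample results collected during a benchmark run.
results = [
    {"correct": True,  "latency_s": 3.2, "cost_usd": 0.012},
    {"correct": False, "latency_s": 7.9, "cost_usd": 0.031},
    {"correct": True,  "latency_s": 4.1, "cost_usd": 0.018},
    {"correct": True,  "latency_s": 2.7, "cost_usd": 0.009},
]

accuracy = sum(r["correct"] for r in results) / len(results)
median_latency = statistics.median(r["latency_s"] for r in results)
# Crude index-based 95th-percentile latency, enough for a sketch.
p95_latency = sorted(r["latency_s"] for r in results)[int(0.95 * (len(results) - 1))]
total_cost = sum(r["cost_usd"] for r in results)

print(f"accuracy:        {accuracy:.1%}")
print(f"median latency:  {median_latency:.1f} s")
print(f"p95 latency:     {p95_latency:.1f} s")
print(f"total cost:      ${total_cost:.3f}")
```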

Evaluating Not Just Models, but Agentic Systems and Scaffolds

Since LLMs are often deployed inside agentic systems with general scaffolds, and as part of larger workflows or products, it is important to design evaluations that measure the capabilities of these systems as a whole. Our benchmarks test crucial aspects of this, such as tool-calling, multi-turn flows, coding skills, and computer use.
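
As a rough sketch of what evaluating such a system involves, the loop below drives one multi-turn, tool-using episode and records the agent's behaviour. It assumes a generic chat client with tool-calling; the `client` interface, tool names, and return format are illustrative assumptions, not our production harness.

```python
def run_agent_episode(client, task, tools, max_turns=10):
    """Drive one multi-turn episode and record tool-call behaviour.

    `client` is assumed to expose a generic chat call that can return tool
    calls; `tools` maps tool names to Python callables. Both are hypothetical.
    """
    messages = [{"role": "user", "content": task["prompt"]}]
    tool_calls_made = []

    for _ in range(max_turns):
        response = client.chat(messages=messages, tools=list(tools))
        if response.tool_calls:
            for call in response.tool_calls:
                tool_calls_made.append(call.name)
                result = tools[call.name](**call.arguments)  # execute the requested tool
                messages.append({"role": "tool", "name": call.name, "content": str(result)})
        else:
            # No tool call: treat the message as the final answer.
            return {
                "answer": response.content,
                "turns_used": len(tool_calls_made),
                "tool_calls": tool_calls_made,
            }

    # Ran out of turns without a final answer: counts as a long-horizon failure.
    return {"answer": None, "turns_used": max_turns, "tool_calls": tool_calls_made}
```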

In the future, our benchmarks will also evolve to test not only the agentic systems we design, but also custom, user-provided scaffolds and products.

These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.


References

[1] https://arxiv.org/abs/2410.08385

[2] https://arxiv.org/abs/2407.07565
