Independent platform committed to advancing the future of Gen AI
through unbiased benchmarks and scalable evaluation infrastructure for
labs and engineering teams.
Popular benchmarks for reporting model performance today are seriously lacking.
They are based on contrived academic datasets; it is far more relevant to study
how models perform on the industry-specific tasks where they will actually be
used.
Live leaderboards are often compromised. Researchers release datasets openly,
but this data then leaks into pre-training corpora, rendering the evaluation
results inaccurate. Bad actors fine-tune their models on evaluation sets,
making openly hosted leaderboards irrelevant.
The results posted by the companies building the models are biased. Each time a
large language model provider shares results for a new model, it does so with
cherry-picked demo examples or with an evaluation regimen the model has been
optimized to perform well on.
Our Plans
To address these problems, we are building custom benchmarks for specific tasks that mimic real industry use cases. To avoid dataset leakage, we keep the data we use private and secure. We review these models as a neutral third party, providing unbiased evaluations without cherry-picked tasks. We work closely with researchers and industry members, but intend our reports to be accessible to general audiences.
We are continually expanding the scope of our benchmarks to include more domains and task types, while evaluating more language model methods as they are made available. Reach out if you have an interest in contributing or have any ideas we should consider.
Vals AI Platform
We use our own evaluation infrastructure to create these benchmarks. It allows us to collect review criteria from subject-matter experts, then run evaluations of any LLM at scale. Not only can this platform expose model performance on these general domains, it can also evaluate any LLM application on task-specific data. We are currently extending early access to this platform on a case-by-case basis. If this is of interest, check out our platform.
Overview
Our benchmarks measure the capability and reliability of AI models and agents in realistic tasks. In contrast with contrived exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more.
Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and push models to their limits.
Task Design
Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:
Tool-Use: How well can models call the right tools to solve problems?
Multiple Modalities: How well can models handle images, tabular data, files, and other modalities beyond text?
Reasoning: Models are increasingly trained to output reasoning before answering; do these capabilities actually improve real-world utility?
Long-Context Capabilities: Can models reason over long contexts, such as extensive legal documents or large codebases?
Long-Horizon Tasks: Can models autonomously work on tasks that take minutes, hours, or longer?
Public and Private Sets
A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results.
Thus, we offer private benchmarking. For transparency and fairness, most of our benchmarks come with three dataset splits:
Public Validation Set: A completely open dataset, to provide transparency in the types of samples we use for evaluation.
Private Validation Set: A larger, privately held dataset, which we license for companies to do their own internal validation. We provide (statistical) proof that this is correlated with our test suite.
Test Set: This dataset remains private at all times, and is the only dataset that is used for the benchmarks we publish. It is private to prevent leakage into the training dataset of foundation models.
Metrics and Evaluation
Benchmarks often report only accuracy numbers; however, it is important to consider factors such as efficiency, cost, time taken per test, failure modes, and more.
Our evaluation framework provides detailed insights into model performance through multiple metrics:
Accuracy: Evaluates the correctness of model outputs for each task and benchmark. This includes strict accuracy checks, as well as rubric-based LLM-as-a-judge accuracy metrics.
Latency: Measures the response time of models in returning a complete response.
Cost: Analyzes the operational cost of running each model from an API provider.
Additional quantitative and qualitative insights: For each benchmark, we also provide further information, including but not limited to statistics on tool use, qualitative insights into the nature of errors, and comparisons between models. This goes beyond the raw benchmark numbers and helps contextualize model performance.
This information enables us to offer a more comprehensive, holistic view of model performance, including accuracy, reliability, efficiency, and qualitative insights.
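As a minimal illustration of how per-item results can roll up into the metrics above, the sketch below aggregates accuracy, latency, and cost from individual test records. The `ItemResult` record and `summarize` function are hypothetical names for illustration, not part of the actual platform:

```python
from dataclasses import dataclass

@dataclass
class ItemResult:
    correct: bool      # did the output pass the accuracy check for this item?
    latency_s: float   # wall-clock time to return a complete response
    cost_usd: float    # API cost incurred for this item

def summarize(results: list[ItemResult]) -> dict[str, float]:
    """Aggregate per-item records into benchmark-level metrics."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }
```

In practice the accuracy field would be filled by either a strict check or a rubric-based LLM-as-a-judge grader, but the aggregation step is the same.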
Error Bars
We report standard errors alongside benchmark scores to reflect statistical uncertainty.
Our methodology depends on how the benchmark is structured:
Single-run benchmarks
For benchmarks evaluated once, we follow standard uncertainty reporting practice, as suggested by "Adding Error Bars to Evals:
A Statistical Approach to Language Model Evaluations" by Evan Miller [3]. Error bars are computed as the standard error of the mean (SEM) over instance-level scores.
In particular, let $x_1, \ldots, x_n$ be instance-level scores. The standard error of the mean (SEM), using the sample standard deviation, is given by:

$$\mathrm{SE} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean.
These error bars capture measurement uncertainty in the benchmark itself. They do not reflect variability across prompts, seeds, deployment settings, or the stochastic nature of LLM generation.
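The SEM above can be sketched in a few lines of Python; this is a direct transcription of the formula, not the platform's actual implementation:

```python
import math

def sem(scores: list[float]) -> float:
    """Standard error of the mean over instance-level scores,
    using the sample standard deviation (n - 1 denominator)."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return math.sqrt(sample_var / n)
```

For example, on binary scores `[0, 1, 0, 1]` this gives a SEM of $\sqrt{1/12} \approx 0.289$.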
Multiple-run benchmarks (AIME, CaseLaw)
When a benchmark includes multiple independent runs, we compute the SEM over the per-run average scores, estimating uncertainty over runs rather than over individual instances.
Let $a_1, \ldots, a_R$ be the average scores from each of $R$ independent runs, with overall average $\bar{a} = \frac{1}{R} \sum_{r=1}^{R} a_r$.
The standard error over runs is then given by:

$$\mathrm{SE}_{\text{runs}} = \frac{\sigma}{\sqrt{R}} = \sqrt{\frac{1}{R(R-1)} \sum_{r=1}^{R} (a_r - \bar{a})^2}$$
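This is the same SEM computation, applied to run-level averages instead of instance-level scores. A sketch (illustrative, not the production code):

```python
import math

def se_over_runs(run_scores: list[float]) -> float:
    """SEM over per-run average scores a_1 .. a_R."""
    R = len(run_scores)
    a_bar = sum(run_scores) / R
    sample_var = sum((a - a_bar) ** 2 for a in run_scores) / (R - 1)
    return math.sqrt(sample_var / R)
```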
Composite benchmarks
For benchmarks that combine multiple tasks, we propagate uncertainty from each component using weighted variance pooling.
Let the component standard errors be $\mathrm{SE}_1, \ldots, \mathrm{SE}_K$ with weights $w_1, \ldots, w_K$.
The propagated standard error is then:

$$\mathrm{SE}_{\text{overall}} = \frac{\sqrt{\sum_{k=1}^{K} w_k^2 \, \mathrm{SE}_k^2}}{\sum_{k=1}^{K} w_k}$$
In all cases, we use standard statistical definitions of the standard error of the mean, with sample standard deviation where applicable.
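The weighted variance pooling above can be transcribed directly; again, a sketch of the formula rather than the platform's implementation:

```python
import math

def pooled_se(component_ses: list[float], weights: list[float]) -> float:
    """Propagate component standard errors into an overall standard error
    for a weighted composite score, via weighted variance pooling."""
    numerator = math.sqrt(
        sum(w * w * se * se for w, se in zip(weights, component_ses))
    )
    return numerator / sum(weights)
```

With equal weights this reduces to the usual error propagation for an unweighted average of component scores.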
Evaluating not just models, but agentic systems and scaffolds
Since LLMs are often used as part of agentic systems with general scaffolds, and as components of larger workflows or products, it is important to design
evaluations that measure the capabilities of these systems as a whole. Our benchmarks test crucial aspects of this, such as tool-calling,
multi-turn flows, coding skills, and computer use.
In the future, our benchmarks will also evolve to test not only agentic systems we design, but also custom user-provided scaffolds and products.
These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.