Independent platform committed to advancing the future of Gen AI
through unbiased benchmarks and scalable evaluation infrastructure for
labs and engineering teams.
Popular benchmarks for reporting model performance today are seriously lacking.
They are based on contrived academic datasets; it is far more relevant to study
how models perform on the industry-specific tasks where they will actually be
used.
Live leaderboards are often compromised. Researchers release datasets openly,
but this data then leaks into pre-training corpora, rendering the evaluation
results inaccurate. Bad actors fine-tune their models on evaluation sets,
making openly hosted leaderboards irrelevant.
The results posted by the companies building the models are biased. Each time a
large language model provider shares results for a new model, it does so with
cherry-picked demo examples or with an evaluation regimen the model has been
optimized to perform well on.
Our Plans
To address these problems, we are building custom benchmarks for specific tasks that mimic real industry use cases. To avoid dataset leakage, we keep the data we use private and secure. We review these models as a neutral third party, providing unbiased evaluations without cherry-picked tasks. We work closely with researchers and industry members, but intend our reports to be accessible to general audiences.
We are continually expanding the scope of our benchmarks to include more domains and task types, while evaluating more language model methods as they are made available. Reach out if you have an interest in contributing or have any ideas we should consider.
Vals AI Platform
We use our own evaluation infrastructure to create these benchmarks. It allows us to collect review criteria from subject-matter experts, then run evaluations of any LLM at scale. Not only can this platform expose model performance on these general domains, it can also evaluate any LLM application on task-specific data. We are currently extending early access to this platform on a case-by-case basis. If this is of interest, check out our platform.
Overview
Our benchmarks measure the capability and reliability of AI models and agents in realistic tasks. In contrast with contrived exam-style benchmarks, we focus on economically valuable and scientifically important domains—finance, healthcare, math, coding, and more.
Developed in collaboration with domain experts, our datasets are carefully curated to be of extremely high quality and push models to their limits.
Task Design
Our benchmarks reflect the complexity of real-world tasks, which necessitates evaluating multiple types of capabilities:
Tool-Use: How well can models call the right tools to solve problems?
Multiple Modalities: How well can models handle images, tabular data, files, and other modalities beyond text?
Reasoning: Models are increasingly trained to output reasoning before answering; do these capabilities actually improve real-world utility?
Long-Context Capabilities: Can models reason over long contexts, such as extensive legal documents or large codebases?
Long-Horizon Tasks: Can models autonomously work on tasks that take minutes, hours, or longer?
Public and Private Sets
A major problem with evaluations of AI models is test-set leakage [1]. Benchmark data can contaminate training sets either directly or through synthetic data [2], undermining the validity of reported results.
Thus, we offer private benchmarking. For transparency and fairness, most of our benchmarks come with three dataset splits:
Public Validation Set: A completely open dataset, to provide transparency in the types of samples we use for evaluation.
Private Validation Set: A larger, privately held dataset, which we license for companies to do their own internal validation. We provide (statistical) proof that this is correlated with our test suite.
Test Set: This dataset remains private at all times, and is the only dataset that is used for the benchmarks we publish. It is private to prevent leakage into the training dataset of foundation models.
Metrics and Evaluation
Benchmarks often report only accuracy numbers; however, it is important to consider factors such as efficiency, cost, time taken per test, failure modes, and more.
Our evaluation framework provides detailed insights into model performance through multiple metrics:
Accuracy: Evaluates the correctness of model outputs for each task and benchmark. This includes strict accuracy checks, as well as rubric-based LLM-as-a-judge accuracy metrics.
Latency: Measures the response time of models in returning a complete response.
Cost: Analyzes the operational cost of running each model from an API provider.
Additional quantitative and qualitative insights: For each benchmark, we also provide further information, including but not limited to statistics on tool use, qualitative insights into the nature of errors, and comparisons between models. This goes beyond the raw benchmark numbers and helps contextualize model performance.
This information enables us to offer a more comprehensive, holistic view of model performance, including accuracy, reliability, efficiency, and qualitative insights.
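As a minimal illustration of how per-item results can roll up into the metrics above, the sketch below aggregates accuracy, latency, and cost from individual test records. The `ItemResult` record and `summarize` function are hypothetical names for illustration, not part of the actual platform:

```python
from dataclasses import dataclass

@dataclass
class ItemResult:
    correct: bool      # did the output pass the accuracy check for this item?
    latency_s: float   # wall-clock time to return a complete response
    cost_usd: float    # API cost incurred for this item

def summarize(results: list[ItemResult]) -> dict[str, float]:
    """Aggregate per-item records into benchmark-level metrics."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }
```

In practice the accuracy field would be filled by either a strict check or a rubric-based LLM-as-a-judge grader, but the aggregation step is the same.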
Error Bars
We report standard errors alongside benchmark scores to reflect statistical uncertainty.
Our methodology depends on how the benchmark is structured:
Single-run benchmarks
For benchmarks evaluated once, we follow standard uncertainty reporting practice, as suggested by "Adding Error Bars to Evals:
A Statistical Approach to Language Model Evaluations" by Evan Miller [3]. Error bars are computed as the standard error of the mean (SEM) over instance-level scores.
In particular, let $x_1, \ldots, x_n$ be instance-level scores. The standard error of the mean (SEM), using the sample standard deviation, is given by:

$$\mathrm{SE} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean.
These error bars capture measurement uncertainty in the benchmark itself. They do not reflect variability across prompts, seeds, deployment settings, or the stochastic nature of LLM generation.
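The SEM above can be sketched in a few lines of Python; this is a direct transcription of the formula, not the platform's actual implementation:

```python
import math

def sem(scores: list[float]) -> float:
    """Standard error of the mean over instance-level scores,
    using the sample standard deviation (n - 1 denominator)."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return math.sqrt(sample_var / n)
```

For example, on binary scores `[0, 1, 0, 1]` this gives a SEM of $\sqrt{1/12} \approx 0.289$.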
Multiple-run benchmarks (AIME, CaseLaw)
When a benchmark includes multiple independent runs, we compute the SEM over the per-run average scores, estimating uncertainty over runs rather than over individual instances.
Let $a_1, \ldots, a_R$ be the average scores from each of $R$ independent runs, with overall average $\bar{a} = \frac{1}{R} \sum_{r=1}^{R} a_r$.
The standard error over runs is then given by:

$$\mathrm{SE}_{\text{runs}} = \frac{\sigma}{\sqrt{R}} = \sqrt{\frac{1}{R(R-1)} \sum_{r=1}^{R} (a_r - \bar{a})^2}$$
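This is the same SEM computation, applied to run-level averages instead of instance-level scores. A sketch (illustrative, not the production code):

```python
import math

def se_over_runs(run_scores: list[float]) -> float:
    """SEM over per-run average scores a_1 .. a_R."""
    R = len(run_scores)
    a_bar = sum(run_scores) / R
    sample_var = sum((a - a_bar) ** 2 for a in run_scores) / (R - 1)
    return math.sqrt(sample_var / R)
```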
Composite benchmarks
For benchmarks that combine multiple tasks, we propagate uncertainty from each component using weighted variance pooling.
Let the component standard errors be $\mathrm{SE}_1, \ldots, \mathrm{SE}_K$ with weights $w_1, \ldots, w_K$.
The propagated standard error is then:

$$\mathrm{SE}_{\text{overall}} = \frac{\sqrt{\sum_{k=1}^{K} w_k^2 \, \mathrm{SE}_k^2}}{\sum_{k=1}^{K} w_k}$$
In all cases, we use standard statistical definitions of the standard error of the mean, with sample standard deviation where applicable.
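The weighted variance pooling above can be transcribed directly; again, a sketch of the formula rather than the platform's implementation:

```python
import math

def pooled_se(component_ses: list[float], weights: list[float]) -> float:
    """Propagate component standard errors into an overall standard error
    for a weighted composite score, via weighted variance pooling."""
    numerator = math.sqrt(
        sum(w * w * se * se for w, se in zip(weights, component_ses))
    )
    return numerator / sum(weights)
```

With equal weights this reduces to the usual error propagation for an unweighted average of component scores.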
Evaluating not just models, but agentic systems and scaffolds
Since LLMs are often used as part of agentic systems with general scaffolds, and as components of larger workflows or products, it is important to design
evaluations that measure the capabilities of these systems as a whole. Our benchmarks test crucial aspects of this, such as tool-calling,
multi-turn flows, coding skills, and computer use.
In the future, our benchmarks will also evolve to test not only agentic systems we design, but also custom user-provided scaffolds and products.
These benchmarks ensure comprehensive evaluation of AI systems, addressing their growing utility in real-world applications and their ability to function autonomously in larger systems.