Today, we are excited to announce Vals Public Sector: AI evaluation for government.
Vals has worked alongside leading AI labs and domain experts to answer one question: how well do these AI models actually perform on real-world tasks? To help answer this, we’ve built industry-standard benchmarks across the hardest domains, from finance to software engineering. Now, model evaluations do more than measure performance; they’ve become one of the strongest enablers of AI adoption.
AI adoption in government is already accelerating. More than 70% of public servants use AI, and the public increasingly meets its government through AI across public services like benefits and education. Government is investing billions of dollars into AI every year to keep pace. Taxpayers deserve a real return on that investment: proof that this spending is actually improving how government serves the public.
Yet, at every level, government keeps asking questions that go unanswered: How well does this model handle a benefits decision or a warfighter scenario? Does it abide by our existing laws? Which model should we buy, and what policies do we adopt? Without answers to these questions, officials are asked to accept unknown risks. From the Pentagon to the local DMV, promising AI use cases stall in pilots instead of serving the mission. Government needs objective assurances, not the promises of vendors.
These assurances can only come from independent, rigorous, third-party evaluation. Independence is what makes a benchmark trustworthy, and trustworthy benchmarks help government buy the best AI for the mission and make informed AI policy.
That’s where Vals Public Sector comes in. We work closely with the leading AI labs to keep up with the frontier while remaining independent. We’re proud to bring that proven track record to the public sector:
- Benchmark infrastructure, powered by Valkyrie, our open-source evaluation framework that enables complex, agentic evaluations.
- Expert-built benchmarks spanning some of the hardest fields: finance, law, education, software engineering, and public benefits (to name a few). Trusted and cited by the leading AI labs.
- Frontier risk measurement that gives government an independent read on model capabilities and the dangers they pose to the public.
We look forward to partnering with domain experts across the US, with our allies, and with vendors to measure how AI performs where the stakes are highest. To lead this work, we’ve brought on Glenn Parham as head of Public Sector. He’s spent his career bringing AI into national security missions at the Pentagon and in big tech, and knows firsthand why government needs rigorous, independent AI evaluation.
Vals can build and run benchmarks almost anywhere, even in the most sensitive national security settings. And to be closer to the institutions we serve, we’ve opened an office in Washington, DC. We’re here to help government measure AI with confidence.
Trust is earned. Evaluation is how you earn it.
To learn more, visit vals.ai/gov.
