
Measuring Healthcare AI Where It Actually Works

AI Models Excel at Clinical Documentation But Struggle With Medical Billing

Rayan Krishnan 02/23/2026

New benchmarks reveal 88% accuracy on clinical notes versus 56% on diagnosis coding

We recently released two new healthcare benchmarks: MedScribe and MedCode. The headline finding is a 32-percentage-point performance gap between clinical documentation and medical coding. That gap reflects something real about where AI stands in healthcare workflows today.

Why evaluate specific healthcare workflows?

Existing evaluations have largely measured general medical knowledge, not performance on the specific tasks where AI systems will actually be deployed. As healthcare organizations deploy AI scribes and coding assistants at scale, the relevant question is whether these systems work for the workflows they are being sold into.

MedScribe and MedCode are the first rigorous evaluations of AI performance on real-world healthcare workflows using de-identified patient records.

What the benchmarks measure

MedScribe evaluates clinical documentation. It tests whether AI can generate SOAP notes from doctor-patient conversations, using 100 expert-developed rubrics across 80 transcripts. Physicians spend roughly twice as much time on documentation as on direct patient care, and it is a leading cause of burnout. The benchmark was designed to capture what a clinical reviewer would care about, not just surface-level text similarity.
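MedScribe's rubric format is not reproduced here, but the general shape of rubric-based note evaluation can be sketched as follows. Everything in this snippet is hypothetical: the rubric structure, the substring check standing in for expert review, and the sample note are illustrative only, not the benchmark's actual method.

```python
# Hypothetical sketch of rubric-based SOAP note scoring (NOT the MedScribe
# implementation): each rubric item names a fact the note must contain,
# and the note's score is the fraction of items satisfied.

def score_note(note: str, rubric: list[dict]) -> float:
    """Score a generated note against binary rubric items.

    Substring matching is a crude stand-in for the clinical-reviewer
    judgment a real rubric would encode.
    """
    if not rubric:
        return 0.0
    satisfied = sum(
        1 for item in rubric
        if item["required_phrase"].lower() in note.lower()
    )
    return satisfied / len(rubric)

note = (
    "S: Patient reports worsening headaches over two weeks.\n"
    "O: BP 142/90. Neuro exam normal.\n"
    "A: Probable tension headache; hypertension, uncontrolled.\n"
    "P: Lifestyle counseling; follow up in 2 weeks."
)
rubric = [
    {"required_phrase": "headaches"},
    {"required_phrase": "BP 142/90"},
    {"required_phrase": "follow up"},
]
print(score_note(note, rubric))  # → 1.0
```

Binary per-item checks like these are what make rubric scores comparable across models, in contrast to text-similarity metrics that reward surface overlap with a reference note.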

MedCode evaluates medical billing. It tests whether AI can assign ICD-10-CM diagnosis codes across 2,755 samples. Unlike documentation, medical coding is precise and rule-bound. The task requires knowledge of complex coding conventions, condition hierarchies, and documentation requirements — closer to structured legal reasoning than to summarization.

Both benchmarks were developed in partnership with Protege, who curated the evaluation-ready data, with advisory from Professor Pranav Rajpurkar at Harvard Medical School. All samples were independently annotated by certified professional coders. Patient-level holdouts were built in to ensure the benchmarks measure real-world performance rather than memorization.

Results

On MedScribe, leading models achieve 85–88% accuracy. GPT 5.1 topped the leaderboard at 88.09%, followed by Claude Opus 4.6 at 86.74%.

On MedCode, results drop sharply. Most leading models cluster between 49% and 53%, with Gemini 3 Flash leading at 55.92%. Performance also varied by condition type: models handled physical conditions like diabetes and hypertension reasonably well but struggled with mental health diagnoses, where coding conventions require more layered clinical judgment.

Why the gap exists

Clinical documentation is fundamentally a summarization and organization task. Current language models handle this well. They can track context across a conversation, identify clinically relevant information, and produce coherent structured output.

Medical coding is a different class of problem. Assigning the correct ICD-10-CM code requires knowing which code takes precedence across conditions, how to handle comorbidities, and which of several similar codes applies to a specific clinical scenario. The rules are hierarchical, precise, and unforgiving. A wrong code affects reimbursement, compliance, and audit exposure.
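Two of those conventions can be made concrete with a toy example: Excludes1 rules (pairs of codes that may never be reported together) and combination codes (a single code that takes precedence over its separate components). The rule tables and validator below are deliberately tiny and illustrative, not real coding guidance; real ICD-10-CM guidelines, for instance, also instruct coders to report the CKD stage code alongside the diabetes combination code, which this sketch omits.

```python
# Toy illustration of rule-governed ICD-10-CM checks. Real convention
# tables (published by CMS/NCHS) are far larger; this is NOT a coding
# engine, just the class of logic involved.

# Excludes1: codes that can never be billed together.
EXCLUDES1 = {
    "E11.9": {"E10.9"},  # type 2 diabetes excludes type 1 diabetes
}

# Combination codes that take precedence over separate component codes.
COMBINATION = {
    # (condition, complication) -> combination code
    ("E11.9", "N18.3"): "E11.22",  # T2DM with chronic kidney disease
}

def validate_codes(codes: list[str]) -> list[str]:
    """Apply toy precedence and mutual-exclusion rules to a code list."""
    result = list(codes)
    # Combination codes replace their separate components
    # (simplified: real guidelines may still require additional codes).
    for (a, b), combo in COMBINATION.items():
        if a in result and b in result:
            result = [c for c in result if c not in (a, b)]
            result.append(combo)
    # Excludes1: reject pairs that can never be reported together.
    for code in result:
        for excluded in EXCLUDES1.get(code, set()):
            if excluded in result:
                raise ValueError(f"{code} and {excluded} are Excludes1")
    return result

print(validate_codes(["E11.9", "N18.3"]))  # → ['E11.22']
```

Even this toy version shows why coding resists the pattern-matching that works for summarization: the correct output depends on interactions between codes, not on any one code in isolation.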

The 32-point gap between documentation and coding reflects that distinction.

What this means

These benchmarks measure AI performance on the specific workflows where they will be deployed, not general medical knowledge. The results show AI is ready to meaningfully assist with clinical documentation, but medical coding requires further development before reliable automation. For hospitals evaluating these systems, that distinction matters.

For developers, the gap points to where the harder engineering problems remain. Documentation improvements have diminishing returns. Reliable coding requires models capable of precise, rule-governed reasoning at scale.

For hospitals, the benchmarks provide objective performance data tied to real healthcare operations, financial processes, and compliance requirements rather than generalized capability claims.

Both benchmarks are publicly available at vals.ai/benchmarks with full methodology and model-by-model performance data.
