New benchmarks reveal 88% accuracy on clinical notes versus 56% on diagnosis coding
We recently released two new healthcare benchmarks: MedScribe and MedCode. The headline finding is a 32-percentage-point performance gap between clinical documentation and medical coding. That gap reflects something real about where AI stands in healthcare workflows today.
Why evaluate specific healthcare workflows?
Existing evaluations have largely measured general medical knowledge, not performance on the specific tasks where AI systems will actually be deployed. As healthcare organizations deploy AI scribes and coding assistants at scale, the relevant question is whether these systems work for the workflows they are being sold into.
MedScribe and MedCode are the first rigorous evaluations of AI performance on real-world healthcare workflows using de-identified patient records.
What the benchmarks measure
MedScribe evaluates clinical documentation. It tests whether AI can generate SOAP notes from doctor-patient conversations, using 100 expert-developed rubrics across 80 transcripts. Physicians spend roughly twice as much time on documentation as on direct patient care, and the burden is a leading cause of burnout. The benchmark was designed to capture what a clinical reviewer would care about, not just surface-level text similarity.
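Rubric-based scoring of this kind can be sketched in a few lines. The sketch below is hypothetical, not MedScribe's actual implementation: real rubric items are expert-written clinical criteria, whereas the toy rubric here uses simple keyword predicates as stand-ins.

```python
# Hypothetical sketch of rubric-based note scoring (not MedScribe's code).
# Each rubric item is a named predicate applied to the generated SOAP note;
# the score is the fraction of criteria the note satisfies.
from typing import Callable

Rubric = dict[str, Callable[[str], bool]]

def score_note(note: str, rubric: Rubric) -> float:
    """Return the fraction of rubric criteria the note satisfies."""
    if not rubric:
        return 0.0
    passed = sum(1 for check in rubric.values() if check(note))
    return passed / len(rubric)

# Toy rubric: keyword checks standing in for expert-written criteria.
toy_rubric: Rubric = {
    "mentions_chief_complaint": lambda n: "chief complaint" in n.lower(),
    "has_assessment_section": lambda n: "assessment" in n.lower(),
    "has_plan_section": lambda n: "plan" in n.lower(),
}

note = "Chief complaint: headache. Assessment: tension headache. Plan: rest."
print(score_note(note, toy_rubric))  # 1.0
```

A per-sample score like this averages naturally across transcripts, which is one way a benchmark-level accuracy figure can be produced.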
MedCode evaluates medical billing. It tests whether AI can assign ICD-10-CM diagnosis codes across 2,755 samples. Unlike documentation, medical coding is precise and rule-bound. The task requires knowledge of complex coding conventions, condition hierarchies, and documentation requirements — closer to structured legal reasoning than to summarization.
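Because coding is exact-answer by nature, scoring it can be as strict as set equality against the gold annotation. The sketch below is an assumption about how such a metric could work, not MedCode's published methodology; the ICD-10-CM codes shown are real, but the samples are invented.

```python
# Hypothetical sketch of scoring predicted ICD-10-CM codes against gold
# annotations (not MedCode's actual methodology). Each sample carries a
# set of codes; a prediction counts only on an exact set match, which
# reflects how unforgiving coding accuracy is.

def exact_match_accuracy(predictions: list[set[str]],
                         gold: list[set[str]]) -> float:
    assert len(predictions) == len(gold)
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

# E11.9 (type 2 diabetes without complications) vs E11.22 (type 2 diabetes
# with diabetic chronic kidney disease): a plausible but wrong code scores
# zero for the whole sample.
gold = [{"E11.22", "N18.3"}, {"I10"}]
preds = [{"E11.9", "N18.3"}, {"I10"}]
print(exact_match_accuracy(preds, gold))  # 0.5
```

Under a metric like this, "close" codes earn no partial credit, which helps explain why coding scores sit so far below documentation scores.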
Both benchmarks were developed in partnership with Protege, who curated the evaluation-ready data, with advisory from Professor Pranav Rajpurkar at Harvard Medical School. All samples were independently annotated by certified professional coders. Patient-level holdouts were built in to ensure the benchmarks measure real-world performance rather than memorization.
Results
On MedScribe, leading models achieve 85–88% accuracy. GPT 5.1 topped the leaderboard at 88.09%, followed by Claude Opus 4.6 at 86.74%.
On MedCode, results drop sharply. Most leading models cluster between 49% and 53%, with Gemini 3 Flash leading at 55.92%. Performance also varied by condition type: models handled physical conditions like diabetes and hypertension reasonably well but struggled with mental health diagnoses, where coding conventions require more layered clinical judgment.
Why the gap exists
Clinical documentation is fundamentally a summarization and organization task. Current language models handle this well. They can track context across a conversation, identify clinically relevant information, and produce coherent structured output.
Medical coding is a different class of problem. Assigning the correct ICD-10-CM code requires knowing which code takes precedence across conditions, how to handle comorbidities, and which of several similar codes applies to a specific clinical scenario. The rules are hierarchical, precise, and unforgiving. A wrong code affects reimbursement, compliance, and audit exposure.
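One such convention is that a combination code takes precedence over coding two related conditions separately. The toy illustration below is a deliberately simplified assumption, not real coding software or complete ICD-10-CM guidance: the rule table contains a single made-up entry, and real guidelines involve documentation linkage and stage-specific codes.

```python
# Toy illustration of "combination code takes precedence" (simplified;
# not real coding software or complete ICD-10-CM guidance). The single
# rule below is a hypothetical stand-in for the full convention set.
COMBINATION_RULES = {
    # coding this pair separately is wrong -> use the combination code
    frozenset({"E11.9", "N18.9"}): "E11.22",  # type 2 diabetes + CKD
}

def apply_combination_rules(codes: set[str]) -> set[str]:
    """Collapse separately coded related conditions into a combination code."""
    out = set(codes)
    for pair, combo in COMBINATION_RULES.items():
        if pair <= out:
            out -= pair      # the unlinked pair does not stand on its own
            out.add(combo)   # the combination code takes precedence
    return out

print(apply_combination_rules({"E11.9", "N18.9", "I10"}))
# {"E11.22", "I10"} (set order may vary)
```

Even this toy version shows why the task resists shallow pattern matching: the correct output depends on interactions between codes, not on each code in isolation.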
The 32-point gap between documentation and coding reflects that distinction.
What this means
These benchmarks measure AI performance on the specific workflows where it will be deployed, not general medical knowledge. The results show AI is ready to meaningfully assist with clinical documentation, but medical coding needs further development before it can be reliably automated. For hospitals evaluating these systems, that distinction matters.
For developers, the gap points to where the harder engineering problems remain. Documentation improvements have diminishing returns. Reliable coding requires models capable of precise, rule-governed reasoning at scale.
For hospitals, the benchmarks provide objective performance data tied to real healthcare operations, financial processes, and compliance requirements rather than generalized capability claims.
Both benchmarks are publicly available at vals.ai/benchmarks with full methodology and model-by-model performance data.