Partners in Evaluation
Key Takeaways
- AI models are being deployed to automate medical coding, but current systems still struggle with accuracy: even the best model barely exceeds 55% accuracy on this critical healthcare task.
- Gemini 3 Flash (12/25) takes the lead in overall performance, but its advantage narrows when performance is evaluated at the individual code level, suggesting limited clinical understanding.
- Models perform better on physical conditions like diabetes and hypertension, but struggle significantly with mental health diagnoses and other complex categories, raising concerns about reliability in real-world clinical settings.
Results
Gemini 3 Flash leads with 55.9% accuracy while offering one of the strongest cost profiles. However, the field remains competitive, with GPT-5.1 close behind at 52.7%, followed by a cluster of models in the 49-52% range.
Background
The process of accurate medical coding is extraordinarily complex. There are tens of thousands of diagnosis codes, each with detailed rules dictating sequencing, modifiers, and payer-specific requirements [1]. These rules vary across patient demographics, medical providers, and geographical jurisdictions, making coding extremely difficult. Any error, even a seemingly minor one, can lead to claim denials and lost revenue for the hospital [2]. While AI coding systems are being adopted to improve both accuracy and financial outcomes, there is little to no oversight or evaluation of how these systems perform under real-world conditions [3].
Through our MedCode benchmark, we evaluate AI systems’ ability to perform medical coding under realistic documentation constraints. In collaboration with Protege, we created a dataset of 2,755 primary and secondary diagnosis codes assigned to de-identified patient records, enabling coding evaluation across entire hospitalization stays from admission through discharge.
Methodology
The International Classification of Diseases (ICD) is the global standard for classifying diseases, symptoms, and injuries, enabling consistent reporting across health systems. We evaluate AI models on the ICD-10-CM standard, the official coding standard mandated by the Centers for Medicare & Medicaid Services (CMS), because it is publicly available with comprehensive, up-to-date documentation.
Our dataset consists of de-identified patient records containing discharge summaries supported by corresponding progress notes and consult notes throughout the duration of the patient’s hospital stay. We matched the three-digit ZIP code of each patient’s stay to geographical jurisdictions and other demographic information through reference tables, allowing us to obtain key information necessary for accurate coding without compromising patient privacy. Each sample was independently annotated by two certified professional coders, who assigned primary and secondary diagnosis codes based on the ICD-10-CM standard. They were permitted to use Codify, a professional tool for code lookup and validation, but were not allowed to use AI models. Differences between coder annotations were resolved collaboratively to ensure consistency and high quality.
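For illustration, the sketch below shows how a three-digit ZIP prefix can be joined against a reference table to recover coding-relevant geography without exposing full addresses; the table and column names are hypothetical and do not reflect the actual reference data used.

```python
# Illustrative sketch of joining three-digit ZIP prefixes to a jurisdiction
# reference table; table and column names are hypothetical.
import pandas as pd

# De-identified stays keyed only by a three-digit ZIP prefix.
stays = pd.DataFrame({
    "stay_id": ["A001", "A002"],
    "zip3": ["021", "941"],
})

# Reference table mapping ZIP3 prefixes to coding-relevant geography.
zip3_reference = pd.DataFrame({
    "zip3": ["021", "941"],
    "state": ["MA", "CA"],
    "cms_region": ["Region 1", "Region 9"],
})

# Left join keeps every stay, even if a ZIP3 prefix has no reference row.
enriched = stays.merge(zip3_reference, on="zip3", how="left")
print(enriched)
```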
Each model was given the de-identified patient records and prompted to output the corresponding primary and secondary diagnosis codes following the ICD-10-CM standard. All models were evaluated at temperature 1 with a maximum output of 30k tokens.
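As a rough sketch of this step, the snippet below assembles a prompt for one stay and extracts ICD-10-CM-shaped codes from free-text model output; the prompt wording and the extraction regex are illustrative assumptions, not the exact harness used for the benchmark.

```python
# Minimal sketch of the prompting-and-parsing step; prompt wording and the
# code-extraction regex are illustrative assumptions.
import re

# ICD-10-CM codes start with a letter (U excluded here only for brevity is
# not assumed; 'U' codes are allowed), then two alphanumerics, then an
# optional '.' plus up to four alphanumerics.
ICD10CM_CODE = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

def build_prompt(record_text: str) -> str:
    """Assemble the instruction given to a model for one hospital stay."""
    return (
        "You are a certified medical coder. Read the de-identified record "
        "below and return the primary diagnosis code followed by all "
        "secondary diagnosis codes, using the ICD-10-CM standard.\n\n"
        f"{record_text}"
    )

def extract_codes(model_output: str) -> list[str]:
    """Pull ICD-10-CM-shaped codes out of free-text model output."""
    return ICD10CM_CODE.findall(model_output.upper())

# Example: parsing a hypothetical model response.
print(extract_codes("Primary: I10. Secondary: E11.9, N17.9, F32.A"))
# ['I10', 'E11.9', 'N17.9', 'F32.A']
```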
Additional Findings
We break down model performance across several dimensions: primary versus secondary diagnoses, the most frequently occurring codes, ICD-10 chapters, and levels of code precision.
Primary vs Secondary Pass Rates
All models perform better at predicting primary diagnosis codes than secondary codes. Interestingly, very few errors stem from misclassifying a code as primary versus secondary. The majority of errors occur when models completely miss a diagnosis code that appears in the rubric.
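The toy example below shows one way such an error breakdown can be computed per stay; the data structures and helper are hypothetical rather than the benchmark's internal scoring format.

```python
# Illustrative error breakdown for one stay: a rubric code can be correct,
# present but assigned the wrong primary/secondary role, or missing entirely.
# Data structures are hypothetical, not the benchmark's internal format.

def categorize_errors(rubric: dict[str, str], predicted: dict[str, str]) -> dict[str, list[str]]:
    """Both inputs map each ICD-10-CM code to 'primary' or 'secondary'."""
    buckets: dict[str, list[str]] = {"correct": [], "wrong_role": [], "missing": []}
    for code, role in rubric.items():
        if code not in predicted:
            buckets["missing"].append(code)
        elif predicted[code] == role:
            buckets["correct"].append(code)
        else:
            buckets["wrong_role"].append(code)
    return buckets

rubric = {"I10": "primary", "E11.9": "secondary", "F32.A": "secondary"}
predicted = {"I10": "primary", "E11.9": "primary"}  # F32.A missed entirely
print(categorize_errors(rubric, predicted))
# {'correct': ['I10'], 'wrong_role': ['E11.9'], 'missing': ['F32.A']}
```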
Most Frequent Codes
No model consistently leads across the top 10 most frequently occurring diagnosis codes. This inconsistency suggests brittle generalization rather than robust clinical understanding.
Models generally perform well on physiological conditions such as type 2 diabetes (E11.9), hypertension (I10), and acute kidney failure (N17.9), but they struggle with mental-health diagnoses like F32.A (depression). GPT-5.1 is a notable exception, correctly identifying F32.A in 80% of cases.
Pass Rates by Chapter
This pattern extends to the chapter level. Models perform better on codes tied to physical conditions, while accuracy drops sharply for chapters like F (Mental, Behavioral, and Neurodevelopmental disorders) and Z (Factors influencing health status and contact with health services).
Codes from chapter O (pregnancy and childbirth) were excluded because obfuscation in the dataset made them unreliable to evaluate.
Pass Rates by Code Precision
We also investigated how models perform when diagnosis codes are evaluated at greater precision. The plot below shows pass rates at different code levels. For example, for the code F32.A, the chapter is F (mental and behavioral disorders), the category is F32 (depressive episode), and the subcategory is the full code F32.A (depression, unspecified).
The accuracy drop-off with increased precision is expected. However, even when evaluating by category alone, the best models still perform poorly, indicating that the primary cause of failure is completely missing the diagnosis, not mislabelling the specificity.
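A minimal sketch of this level-wise scoring is shown below, under the simplifying assumption that the chapter can be read off the leading letter and the category from the first three characters (as in the F32.A example above); the helper names are hypothetical.

```python
# Hypothetical helpers for scoring a predicted code set at the three
# precision levels described above: chapter letter, three-character
# category, and the full code.

def precision_levels(code: str) -> dict[str, str]:
    return {
        "chapter": code[0],        # e.g. 'F'
        "category": code[:3],      # e.g. 'F32'
        "subcategory": code,       # e.g. 'F32.A'
    }

def matches_at(level: str, predicted: set[str], gold_code: str) -> bool:
    """True if any predicted code matches the gold code at the given level."""
    gold = precision_levels(gold_code)[level]
    return any(precision_levels(p)[level] == gold for p in predicted)

predicted = {"F32.9", "I10"}
for level in ("chapter", "category", "subcategory"):
    print(level, matches_at(level, predicted, "F32.A"))
# chapter True, category True, subcategory False
```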
Citations
[1] ICD.Codes. (2018, January 19). ICD-9 to ICD-10 explained. https://web.archive.org/web/20180119120210/https://icd.codes/articles/icd9-to-icd10-explained
[2] BlueBrix Health. (2025, March 12). The hidden costs of coding errors: How accurate medical coding boosts revenue. https://bluebrix.health/blogs/the-hidden-costs-of-coding-errors-how-accurate-medical-coding-boosts-revenue
[3] Angus, D. C., Lee, S. I., Beam, A. L., Denecke, K., Farahani, N. I., Kohane, I. S., Matheny, M. E., Sendak, M. P., Shah, N. H., Steinhubl, S. R., & Topol, E. J. (2025). AI, health, and health care today and tomorrow: The JAMA summit report on artificial intelligence. JAMA, 333(14), 1279–1287. https://doi.org/10.1001/jama.2025.2437