We’re releasing MedScribe, a new benchmark evaluating AI systems on clinical SOAP note generation, built in collaboration with Harvard Medical School and Protege.
- GPT 5.1 takes the lead, exceeding 88% accuracy. The Claude family also demonstrates strong performance.
- Results suggest that models can produce structurally sound, clinically relevant medical notes and can meaningfully assist scribes in their workflow.
- Most models perform marginally worse on the Plan section, where even small errors can affect care continuity and billing accuracy.