Automated Patient Note Grading: Examining Scoring Reliability and Feasibility

William F. Bond, Jianing Zhou, Suma Bhat, Yoon Soo Park, Rebecca A. Ebert-Allen, Rebecca L. Ruger, Rachel Yudkowsky

Research output: Contribution to journal › Article › peer-review


Purpose: Scoring postencounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning allow application of automated short answer grading (ASAG) for this task. This retrospective study evaluated psychometric characteristics and reliability of an ASAG system for PNs and factors contributing to implementation, including feasibility and case-specific phrase annotation required to tune the system for a new case.

Method: PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, applying a feed-forward neural networks algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes were compared across ASAG-faculty (4 cases, 80 pairings) and ASAG-nonfaculty (2 cases, 40 pairings). Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa.

Results: ASAG scores demonstrated sufficient variability in differentiating learner PN performance and high IRR between machine and human ratings. Across all items the ASAG-faculty scoring mean kappa was .83 (SE ±.02). The ASAG-nonfaculty pairings kappa was .83 (SE ±.02). The ASAG scoring demonstrated high item discrimination. Internal consistency reliability values at the case level ranged from a Cronbach's alpha of .65 to .77. Faculty time cost to train and supervise nonfaculty raters for 4 cases was approximately $1,856. Faculty cost to tune the ASAG system was approximately $928.

Conclusions: NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.
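
For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of how an agreement kappa and Cronbach's alpha can be computed. It is not the authors' analysis code. It assumes unweighted Cohen's kappa for two raters (the abstract does not say which kappa variant was used), and the function names and sample ratings are hypothetical illustrations, not study data.

```python
from collections import Counter
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: the abstract does not state which kappa variant was used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def cronbach_alpha(scores):
    """Cronbach's alpha; rows are notes (learners), columns are items."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 rows")
    # Variance of each item (column) across notes.
    item_vars = [variance(col) for col in zip(*scores)]
    # Variance of the total score per note.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical item-level ratings (1 = credit, 0 = no credit).
    machine = [1, 1, 0, 0]
    human = [1, 1, 0, 1]
    print(cohen_kappa(machine, human))
    print(cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]]))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is generally preferred over simple percent agreement when comparing machine and human item-level ratings.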

Original language: English (US)
Pages (from-to): S90-S97
Journal: Academic Medicine
Issue number: 11
State: Published - Nov 1 2023
Externally published: Yes

ASJC Scopus subject areas

  • Education

