Arvind Rajan, University of North Carolina School of Medicine
Seth McKenzie Alexander, Vanderbilt University Medical Center
Christina L. Shenvi, University of North Carolina School of Medicine
Purpose
This study evaluates the effectiveness of GPT-4o, a large language model (LLM), in grading short-answer questions (SAQs) from case-based learning (CBL) medical exams. The primary aim was to assess whether LLM grading is non-inferior to faculty grading, with secondary analyses of grading precision, performance across cognitive levels defined by Bloom’s Taxonomy, and the relationship between question complexity and LLM performance.
Methods
A total of 1,450 SAQ responses were collected from 58 medical student exams, each containing 25 questions. Faculty-assigned scores were compared with those generated by GPT-4o, which was deployed in a zero-shot capacity using a standardized prompt and rubric. The Wilcoxon signed-rank test (chosen because of the non-normal distribution of the data) and the intraclass correlation coefficient (ICC) were used to evaluate the non-inferiority and the precision of LLM grading, respectively. Questions were stratified by Bloom’s Taxonomy level, and multivariate regression assessed the relationship between perceived question complexity and grading discrepancies.
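For illustration, the paired comparison described above could be sketched in Python as follows. This is a minimal sketch, not the study's actual analysis code: the example scores, column names, and the use of scipy and pingouin are assumptions.

```python
# Illustrative sketch (not the authors' code): paired comparison of faculty
# and GPT-4o scores via a Wilcoxon signed-rank test and an intraclass
# correlation coefficient (ICC). Data values and column names are hypothetical.
import pandas as pd
import pingouin as pg
from scipy.stats import wilcoxon

# Hypothetical wide-format data: one row per SAQ response.
scores = pd.DataFrame({
    "response_id": range(8),
    "faculty": [2.0, 1.5, 3.0, 2.5, 1.0, 3.0, 2.0, 1.5],
    "gpt4o":   [2.0, 1.0, 2.5, 2.5, 1.5, 3.0, 1.5, 2.0],
})

# Wilcoxon signed-rank test on the paired, non-normally distributed scores.
stat, p_value = wilcoxon(scores["faculty"], scores["gpt4o"])

# ICC requires long format: one row per (response, rater) pair.
long_scores = scores.melt(id_vars="response_id",
                          value_vars=["faculty", "gpt4o"],
                          var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long_scores, targets="response_id",
                         raters="rater", ratings="score")

print(f"Wilcoxon signed-rank p = {p_value:.4f}")
print(icc[["Type", "ICC"]])
```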
Results
GPT-4o grading was non-inferior to faculty grading overall (p = 0.8422), with a high ICC (0.993) indicating excellent precision. Discrepancies between LLM- and faculty-assigned grades emerged at specific cognitive levels: the LLM scored significantly lower on “Understanding” questions (p < 0.0001) and significantly higher on “Evaluating” questions (p = 0.0095). LLM scores correlated strongly with faculty scores for easier questions but showed greater variation for more complex questions (R² = 0.6199, p < 0.0001).
Conclusions
GPT-4o demonstrates potential as a preliminary grading tool for SAQs, with the capacity to reduce grading time while maintaining accuracy. However, human oversight remains essential for nuanced or complex responses. Incorporating artificial intelligence into medical education assessments could improve grading efficiency, allowing faculty to devote more time to teaching and feedback while maintaining educational standards.