Ahmet Guven, Medical College of Georgia at Augusta University
A.J. Kleinheksel, Medical College of Georgia at Augusta University
Thaddeus Carson, Medical College of Georgia at Augusta University
PURPOSE
Entrustable professional activity (EPA) ratings are widely used to assess workplace-based performance, yet they
may be influenced by clerkship context, rater severity, and cohort-level
changes. Meaningful comparisons across clerkships and years require
adjustment for differences in student ability, rating difficulty, and
rotation sequencing. This study examined the stability and fairness of EPA
scoring using vertically scaled ability estimates and mixed-effects modeling.
METHODS
EPA ratings were collected across five clerkships from 2023 to 2025. A graded
response model was vertically scaled using common EPAs as anchors to place
all students on a unified ability metric. EPA residuals (observed minus
expected ratings based on ability) were computed to isolate non-ability
variation. For each EPA, mixed-effects models included Clerkship, Year, and
their interaction as fixed effects, with random intercepts for Student and
Rater. Variance components quantified the proportion of variability
attributable to students, raters, and residual factors. Interaction plots
visualized scoring patterns across clerkships and years.
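
The residual step (observed minus expected rating given ability) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a standard logistic graded response model with one discrimination parameter per EPA, ordered thresholds, and rating categories coded from 1 upward. The function names and all parameter values are hypothetical.

```python
import numpy as np


def grm_expected_rating(theta, a, b, min_category=1):
    """Expected rating under a logistic graded response model.

    theta: ability estimate(s) on the common (vertically scaled) metric
    a: discrimination parameter for the EPA (hypothetical value)
    b: ordered thresholds, length K-1 for K rating categories (hypothetical)
    min_category: numeric code of the lowest rating category (assumed 1)
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k) for each threshold, shape (n, K-1)
    p_ge = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # For ordered categories, E[X] = lowest category + sum of P(X >= k)
    return min_category + p_ge.sum(axis=1)


def epa_residuals(observed, theta, a, b, min_category=1):
    """Residual = observed rating minus model-expected rating."""
    observed = np.asarray(observed, dtype=float)
    return observed - grm_expected_rating(theta, a, b, min_category)
```

With the hypothetical parameters a = 1.0 and thresholds (-1, 0, 1), a student at theta = 0 has an expected rating of 2.5, so an observed rating of 3 gives a residual of 0.5. Residuals of this kind would then serve as the outcome in the per-EPA mixed-effects models.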
RESULTS
EPA implementation varied across clerkships and years. Significant
clerkship effects were present in 10 of 12 EPAs, indicating substantial
context-dependent scoring. Nine EPAs demonstrated year-level drift. Six EPAs
(1, 2, 3, 4, 9, 10) showed significant interactions, revealing unstable
scoring patterns across cohorts. EPAs 12 and 13 showed stable but persistent
clerkship differences; EPA 7 showed minimal clerkship variation but modest
year drift. Residual variance accounted for 78%–97% of total variance and
rater variance for 3%–20%, while student variance was near zero, indicating
that scoring behavior was driven primarily by rater and situational factors.
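
The variance proportions reported above are each component's share of the summed variance components. A minimal sketch of that arithmetic follows. The function name and the input values in the usage note are hypothetical and are not the study's estimates.

```python
def variance_shares(student_var, rater_var, residual_var):
    """Return each variance component as a proportion of the total.

    Inputs are variance component estimates from a fitted mixed-effects
    model with random intercepts for Student and Rater.
    """
    components = {
        "student": float(student_var),
        "rater": float(rater_var),
        "residual": float(residual_var),
    }
    if any(value < 0 for value in components.values()):
        raise ValueError("Variance components must be non-negative.")
    total = sum(components.values())
    if total <= 0:
        raise ValueError("Total variance must be positive.")
    return {name: value / total for name, value in components.items()}
```

For instance, hypothetical components of 0.0 (student), 0.2 (rater), and 0.8 (residual) give shares of 0%, 20%, and 80%.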
CONCLUSION
EPA ratings exhibited pervasive context dependence and temporal drift after
adjusting for student ability through vertical scaling. These findings
underscore the need for improved standardization of EPA implementation and
enhanced rater calibration to strengthen fairness and validity.