Name
Instant Feedback on Feedback: Usage of Large Language Models to Automate Quality Assessment of Narrative Feedback in Diagnostic Radiology Residency
EPAs in the Era of CBME
Date & Time
Monday, June 8, 2026, 1:30 PM - 1:45 PM
Location Name
Oglethorpe H
Speakers
Authors
Benjamin Kwan, Queen's University
Zier Zhou, Queen's University
Christina Rogoza, Queen's University
Nick Rogoza, Logike
Ingrid de Vries, Queen's University
Andrew Chung, Queen's University
Presentation Topic(s)
Technology and Innovation
Description
PURPOSE
Narrative feedback in competency-based medical education (CBME) is critical
for guiding resident development; however, the quality of narrative feedback
in structured assessment tasks remains underexplored. Current evaluation
processes are labor-intensive and time-consuming, requiring significant human
effort to analyze and interpret comments. Innovative solutions using AI and
large language models (LLMs) offer the potential to automate the evaluation
of narrative feedback.
This study investigated the performance of GPT-4o and GPT-4.1 in assessing
the quality of narrative feedback comments in Entrustable Professional
Activity (EPA) assessments using the Quality of Assessment for Learning
(QuAL) instrument within a Diagnostic Radiology CBME residency program.
METHODS
A dataset of 2,766 feedback comments from 20 residents (2019–2024) was
analyzed. Baseline zero-shot performance of GPT-4o and GPT-4.1 was
established, followed by fine-tuning using the training subset. Models were
evaluated on a held-out test set using F1 scores and per-class accuracy
across three dimensions: Evidence, Suggestion, and Connection. Human
inter-rater reliability was calculated to benchmark model performance.
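As a rough illustration of the two metrics named above, the sketch below computes per-class accuracy and an F1 score for one dimension. Everything in it is an assumption made for illustration: the abstract does not state how F1 was averaged (macro averaging is used here), the 0 to 3 label range for Evidence is assumed, and the `human` and `model` label lists are invented toy data, not study results.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its items that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in sorted(totals)}

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical Evidence ratings (assumed 0-3 scale); not data from the study.
human = [0, 1, 1, 2, 2, 3, 3, 3]
model = [0, 1, 2, 2, 1, 3, 3, 3]

print(per_class_accuracy(human, model))  # {0: 1.0, 1: 0.5, 2: 0.5, 3: 1.0}
print(macro_f1(human, model))            # 0.75
```

In this toy case the only errors are confusions between the two middle categories, which is the kind of pattern per-class accuracy exposes and a single aggregate score can hide.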
RESULTS
Fine-tuning substantially improved model performance, particularly for the
Evidence dimension (GPT-4o: 0.520 → 0.827, +59.0%; GPT-4.1: 0.321 → 0.848,
+164.2%). Suggestion and Connection dimensions were high at baseline
(>0.88 F1) with smaller gains after fine-tuning. More detailed per-class
accuracy analysis revealed errors mainly in intermediate Evidence categories.
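The percentage gains reported for the Evidence dimension are relative changes over the baseline score. A minimal sketch of that arithmetic, using only the four scores quoted above (the helper name `relative_gain` is hypothetical):

```python
def relative_gain(baseline, tuned):
    """Relative improvement of the fine-tuned score over the baseline, in percent."""
    return (tuned - baseline) / baseline * 100

# Evidence-dimension scores quoted in the results.
print(round(relative_gain(0.520, 0.827), 1))  # 59.0  (GPT-4o)
print(round(relative_gain(0.321, 0.848), 1))  # 164.2 (GPT-4.1)

# Absolute gains, for comparison.
print(round(0.827 - 0.520, 3))  # 0.307
print(round(0.848 - 0.321, 3))  # 0.527
```

Because the relative gain divides by the baseline, GPT-4.1's lower starting point inflates its percentage, even though the two fine-tuned scores (0.827 and 0.848) end up close together.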
CONCLUSIONS
This study demonstrates that GPT-4o and GPT-4.1 can effectively evaluate
narrative feedback when fine-tuned on domain-specific data, with the most
substantial improvements observed in the Evidence dimension. The dramatic
gains for GPT-4.1 (+164.2%) reflect its initially weaker baseline
performance, highlighting the effectiveness of even modest fine-tuning datasets
in calibrating LLMs for complex, structured assessment tasks. Per-class
analysis revealed that most misclassifications occurred in adjacent,
intermediate Evidence categories, mirroring the challenges human raters face
when interpreting nuanced feedback.
Presentation Tag(s)
International Presenter