Name
Optimizing Narrative Feedback Evaluation in Competency-Based Medical Education: A Comparative Study of GPT-3.5 and GPT-4
Date & Time
Tuesday, June 17, 2025, 10:00 AM - 10:15 AM
Authors

Benjamin Kwan, Queen's University
Zier Zhou, Queen's University
Christina Rogoza, Queen's University
Nikoo Aghaei, Queen's University
Ingrid de Vries, Queen's University
Tessa Hanmore, Queen's University
Boris Zevin, Queen's University

Presentation Topic(s)
Technology and Innovation
Description

Purpose
The transition to Competency-Based Medical Education has transformed the evaluation of postgraduate resident training. Although there is greater emphasis on qualitative feedback, the quality of these narrative comments is rarely assessed because manually annotating large datasets is labor-intensive. Recent advancements in Large Language Models (LLMs) offer potential for automating and enhancing the analysis of narrative comments. This study evaluated the performance of OpenAI's LLMs, GPT-3.5 and GPT-4, in analyzing the quality of narrative feedback comments.

Method
A dataset of 2,229 narrative feedback comments from assessments of residents in the Surgical Foundations program at Queen's University was used. The quality of the comments was assessed by two independent raters using the Quality of Assessment for Learning (QuAL) instrument. The performance of GPT-3.5-turbo-1106 and GPT-4-1106-preview in applying the QuAL score was evaluated using prompt-driven zero-shot learning techniques and fine-tuning. The F1 score was used to evaluate the models' accuracy in predicting QuAL ratings, and performance improvements for each LLM were measured by comparing F1 scores for the different prompting techniques and for fine-tuning against baseline performance. A minimal sketch of how such a zero-shot scoring and F1 evaluation pipeline might look is shown below.
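
The following sketch illustrates one way a zero-shot QuAL rating prompt and F1 comparison against human ratings could be set up with the OpenAI Python SDK and scikit-learn. The prompt wording, the score_comment helper, and the example labels are illustrative assumptions, not the study's actual materials or results.

```python
# Minimal sketch (assumed implementation, not the study's code): zero-shot
# QuAL scoring with the OpenAI Python SDK and F1 evaluation with scikit-learn.
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative zero-shot prompt; the study's actual prompt wording is not reproduced here.
SYSTEM_PROMPT = (
    "You are rating the quality of a narrative feedback comment given to a surgical "
    "resident using the QuAL instrument. Rate the Evidence dimension on a 0-3 scale "
    "and reply with the number only."
)

def score_comment(comment: str, model: str = "gpt-4-1106-preview") -> int:
    """Hypothetical helper: ask the model for a QuAL Evidence rating of one comment."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": comment},
        ],
    )
    return int(response.choices[0].message.content.strip())

# Toy example: compare model predictions against human ratings with a weighted F1.
comments = ["Tied the knots well but rushed the closure.", "Good job."]
human_ratings = [2, 0]  # made-up labels for illustration only
predictions = [score_comment(c) for c in comments]
print(f1_score(human_ratings, predictions, average="weighted"))
```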

Results
The effectiveness of prompting techniques varied between the two LLMs. With prompting alone, GPT-4 achieved the highest F1 scores in the Evidence (0.554), Suggestion (0.901), and Connection (0.882) dimensions of the QuAL score, although its performance in the Evidence dimension remained weak. Fine-tuning GPT-3.5 emerged as the most effective technique for improving performance across all QuAL score dimensions, with F1 scores of 0.827 for Evidence, 0.949 for Suggestion, and 0.933 for Connection.

Conclusions
This study highlights the potential of a fine-tuned GPT-3.5 model to efficiently analyze the quality of narrative comments within large datasets without the need for manual annotation. Automating the evaluation of narrative feedback could streamline the identification of areas needing improvement, provide faculty with insights into their feedback practices, and support ongoing faculty development.

Presentation Tag(s)
International Presenter