Name
The Future is Here: Utilizing ChatGPT as a Tool for Student Assessments
Date & Time
Monday, June 16, 2025, 1:49 PM - 2:04 PM
Authors

Kathy Keefe, Temple University Lewis Katz School of Medicine
Jill Allenbaugh, Temple University Lewis Katz School of Medicine

Presentation Topic(s)
Assessment
Description

Purpose
A challenge in preclinical undergraduate medical education is the creation of statistically reliable multiple-choice assessments. Teaching faculty often lack the training and time to create USMLE-style questions with appropriate structure and content level. We hypothesized that artificial intelligence (AI) could help create appropriate assessments.

Methods
The 8-week endocrine and reproductive course was chosen as a pilot. Twenty teaching-session slide sets from across the course were uploaded into ChatGPT with a standardized prompt asking the AI to review the content and create 3-6 USMLE-style questions at the level of a second-year medical student. The resulting questions were sent to clinical faculty for review, editing, and a recommendation on whether each should be used for formative or summative assessment.
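The workflow above can be sketched as a scripted API call. This is a minimal illustration only: the abstract does not report the study's exact prompt wording, model version, or tooling, so the prompt text, the `gpt-4o` model name, and the helper names below are all assumptions.

```python
import os

# Hypothetical reconstruction of the standardized prompt; the study's
# exact wording is not reported in the abstract.
PROMPT_TEMPLATE = (
    "Review the following lecture content and create {n} USMLE-style "
    "multiple-choice questions at the level of a second-year medical "
    "student. Give each question a clinical vignette stem, one correct "
    "answer, and plausible distractors.\n\nLecture content:\n{content}"
)

def build_prompt(slide_text: str, n_questions: str = "3-6") -> str:
    """Fill the standardized prompt with one teaching session's slide text."""
    return PROMPT_TEMPLATE.format(n=n_questions, content=slide_text)

def generate_questions(slide_text: str) -> str:
    """Send the prompt to ChatGPT and return the raw question text."""
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: the abstract does not name the model
        messages=[{"role": "user", "content": build_prompt(slide_text)}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    print(generate_questions("Slide text for one endocrine teaching session"))
```

In this sketch, each of the twenty slide sets would be passed through `generate_questions` once, and the raw output forwarded to faculty reviewers.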

Results
Sixty-eight questions were sent to 13 faculty for review. Fifty questions were deemed high quality, requiring only minimal edits, defined as "minor wording changes and/or a change to 1 distractor," while another 14 required moderate editing, defined as "multiple edits to the stem, changes to 2 distractors, or a notably incorrect answer." Only 4 questions were eliminated. Fifty questions were used on summative exams and evaluated for statistical appropriateness. The average score on AI-generated, faculty-edited questions was 83% correct, significantly lower than the 88% correct on pre-existing faculty-developed exam questions (p = 0.033). The point-biserial correlations of the two question sets were not significantly different (p = 0.20), with the AI-generated, faculty-edited questions slightly higher at 0.23 vs. 0.21.
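The point-biserial statistic reported above measures item discrimination: the correlation between each examinee's result on one item (correct/incorrect) and their total exam score. A sketch of the standard computation, using illustrative data rather than the study's:

```python
from math import sqrt

def point_biserial(item_scores: list[int], total_scores: list[float]) -> float:
    """Point-biserial correlation between a dichotomous item (1 = correct,
    0 = incorrect) and total exam scores, using the population-SD form."""
    n = len(item_scores)
    correct = [t for i, t in zip(item_scores, total_scores) if i == 1]
    incorrect = [t for i, t in zip(item_scores, total_scores) if i == 0]
    m1 = sum(correct) / len(correct)      # mean total score, item answered right
    m0 = sum(incorrect) / len(incorrect)  # mean total score, item answered wrong
    mean = sum(total_scores) / n
    sd = sqrt(sum((t - mean) ** 2 for t in total_scores) / n)
    p = len(correct) / n                  # proportion answering correctly
    return (m1 - m0) / sd * sqrt(p * (1 - p))
```

Values around 0.2, like the 0.21-0.23 reported here, indicate modest positive discrimination: students who answer the item correctly tend to score somewhat higher overall.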

Conclusions
This pilot study demonstrates that ChatGPT, combined with faculty review, can be a valuable tool for question development in UME. Statistical analysis shows the questions perform at least as well as faculty-written questions. With the emergence of AI in UME, this pilot supports ChatGPT and other large language models as a feasible, low-effort way to create new USMLE-style assessments.