Name
Can Generative Artificial Intelligence (AI) Reliably Score Open-Ended Questions (OEQs) in the Assessment of Medical Knowledge?
Description

Presented By: Marieke Kruidering, New York University Grossman Long Island School of Medicine
Co-Authors: Jeffrey Bird, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell
Judith Brenner, New York University Grossman Long Island School of Medicine
Kumiko Endo, Med2Lab
Tracy Fulton, University of California at San Francisco School of Medicine
Doreen Olvet, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell
Bao Truong, Med2Lab
Joanne Willey, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell

Purpose
The objective is to establish the accuracy of generative artificial intelligence (AI) in scoring medical students' responses to open-ended questions (OEQs) compared with faculty content experts. Background: Despite the numerous benefits of including OEQs in the assessment of medical knowledge1,2, only 39% of US allopathic medical schools use them3. Faculty report that the biggest barrier is the time it takes to grade responses1,2. Natural language processing has been explored to automate the scoring of clinical reasoning4, but no study has evaluated the use of generative AI to score OEQ responses in the pre-clerkship curriculum.

Methods
OEQ responses to two questions administered at the Zucker School of Medicine (ZSOM) and the University of California at San Francisco School of Medicine (UCSF) were used for the current study5. Responses from 54 students per site were analyzed. Content experts scored the responses using an analytic (ZSOM) or a holistic (UCSF) rubric. The questions, rubrics, and student responses were fed into the GPT-4 model via the Med2Lab platform. Once the inputs were finalized, GPT-4 generated a score for each student's response. Cohen's weighted kappa (kw) was used to evaluate inter-rater reliability (IRR) between the content expert and generative AI scores, with kw values between 0.60 and 0.80 considered substantial6. Prompt engineering was employed for question 1 (analytic rubric) to evaluate its impact on IRR.
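
The following is a minimal, illustrative sketch of how the agreement statistic described above could be computed. It assumes expert and AI scores have been collected into a CSV file; the file name, column names, and the quadratic weighting scheme are assumptions for illustration only, as the abstract does not specify how the weighted kappa was computed or which platform-specific tooling was used.

```python
# Illustrative sketch: Cohen's weighted kappa (kw) between a content expert's
# scores and generative AI scores for one OEQ.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical file: one row per student response, integer rubric scores
scores = pd.read_csv("question1_scores.csv")  # columns: expert_score, ai_score

kw = cohen_kappa_score(
    scores["expert_score"],
    scores["ai_score"],
    weights="quadratic",  # weighted kappa; "linear" is another common choice
)

# Benchmark cited in the abstract: kw between 0.60 and 0.80 is considered substantial
print(f"Weighted kappa (kw) = {kw:.2f}")
print("Substantial agreement" if 0.60 <= kw <= 0.80 else "Outside the substantial range")
```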

Results
IRR between the content expert and generative AI scores was substantial using the analytic rubric (question 1: kw=0.71; question 2: kw=0.63) and the holistic rubric (question 1: kw=0.66; question 2: kw=0.68). For question 1 (analytic rubric), IRR was initially kw=0.61 and increased to kw=0.71 after the prompt was adjusted through prompt engineering and the responses were re-scored with GPT-4.

Conclusions
Generative AI can score OEQs with substantial reliability. By alleviating the grading burden, AI scoring could allow medical schools to implement OEQs more broadly in the assessment of medical knowledge.

Date & Time
Sunday, June 16, 2024, 4:00 PM - 4:15 PM
Location Name
Marquette V