Kelly Zhao, M1 student, Alice Walton School of Medicine. Safwan Sarkar, M1 student, Alice Walton School of Medicine. Vahdeta Suljic, M1 student, Alice Walton School of Medicine. Benjamin Zhang, M1 student, Alice Walton School of Medicine. Amira Ibrahim, Private Practice. Yonaton Ghiwot, Kirk Kerkorian School of Medicine at UNLV. Trager Hinze, Dept. Medical Education, Alice Walton School of Medicine. Sarah Assem, Alice Walton School of Medicine. Patrick Brooks, Dept. Medical Education, Alice Walton School of Medicine. Ian Murray, Dept. Medical Education, Alice Walton School of Medicine.
PURPOSE: Script Concordance Tests (SCTs) assess clinical reasoning under
uncertainty but lack the feedback needed to turn them from evaluative into formative
learning tools. Building on cognitive apprenticeship theory, where expert
thinking is made visible to learners, this study evaluates whether
AI-generated feedback can explain reasoning divergences between learners and
experts, bridging assessment and learning.
METHODS: First-year medical students (M1; n=4) and
clinicians (n=5), using a custom AI app
(https://sct-evaluation-imurra1.replit.app), completed 10 published SCT items
spanning mixed specialties from the SCT-Bench dataset, with difficulty
ranging from easy-moderate (n=3), moderate (n=5), to high (n=3) based on
expert concordance distributions. Participants justified their clinical
reasoning in writing. AI-generated feedback used a structured, proprietary
rubric addressing acknowledgment of reasoning, key clinical features,
reasoning quality, and epistemic humility. AI feedback quality was rated
using a validated 18-item Feedback Perceptions Questionnaire across five
dimensions: fairness, usefulness, acceptance, encouragement, and
developmental value (M1: n=4; clinicians: n=5). Concordance scores and engagement times were
analyzed.
RESULTS: Both groups averaged
approximately 3 minutes of engagement per SCT question. AI-generated feedback
received positive survey ratings across all dimensions (means 3.53-4.68 on
a 5-point scale). Differences between M1 and clinician (C) ratings were minimal (all
<0.33). Acceptance received the lowest ratings (M1: 3.75, SD=1.06; C:
3.53, SD=1.46) and the highest variability. SCT performance analysis revealed
moderate concordance with expert scoring and non-significant differences
between groups (M1: M=0.54, SD=0.42; C: M=0.54, SD=0.41).
CONCLUSION: AI-generated feedback achieved high ratings for fairness,
usefulness, encouragement, and developmental value. Lower acceptance scores
suggest users appreciate feedback quality while maintaining appropriate
caution about AI-generated content. While AI surfaced valuable clinical
associations, it sometimes "thinks differently than clinicians,"
requiring expert oversight and SCT case refinement. This pilot establishes
feasibility with a small sample but identifies critical limitations for
AI-enhanced formative assessment in clinical reasoning education. Larger
validation studies with refined cases and thematic analysis are underway.