Number
806
Name
Development of a Residency Match Risk Calculator for UCF Medical Students Using Curriculum Vitae Data and Machine Learning
Date & Time
Monday, June 8, 2026, 6:00 PM - 7:30 PM
Location Name
Oglethorpe Ballroom
Speakers
Authors
Yu Zhang, University of Central Florida College of Medicine
Presentation Topic(s)
Technology and Innovation
Description
PURPOSE
The residency match process is competitive and high-stakes, yet advising
often relies on anecdotal evidence and national statistics. This project
aimed to develop and internally validate a machine learning–based Residency
Match Risk Calculator using curriculum vitae (CV) data from University of
Central Florida College of Medicine (UCF COM) students, incorporating
specialty-specific competitiveness.
METHODS
A retrospective analysis was performed on 911 UCF COM graduates (Classes of
2018–2025) with known match outcomes. Features included USMLE Step scores,
class rank percentile, and scholarly activity metrics. Predictive models
tested included logistic regression, random forest, XGBoost with SMOTE,
linear SVM, and an ensemble classifier. Specialty-specific weighting was
applied based on historical match rates. Model performance was evaluated
using AUC, accuracy, precision, recall, and F1-score. Thresholds were
optimized to detect unmatched outcomes.
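The threshold step can be illustrated with a minimal sketch. The abstract does not state which criterion was optimized, so this example assumes the F1-score of the unmatched class as the objective. The function names, the candidate grid, and the toy labels and scores are all hypothetical and are not study data.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN, treating label 1 (unmatched) as the positive class."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_unmatched(y_true, scores, threshold):
    """F1-score for the unmatched class at a given decision threshold."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates=None):
    """Return the candidate threshold with the highest unmatched-class F1.

    Assumption: F1 is the objective; ties go to the lowest threshold.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_unmatched(y_true, scores, t))


# Toy, made-up example: 1 = unmatched, 0 = matched; scores are the
# predicted probabilities of going unmatched from some fitted model.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.45, 0.35, 0.60, 0.80]

threshold = best_threshold(y_true, scores)
print(threshold, round(f1_unmatched(y_true, scores, threshold), 3))
```

With an imbalanced outcome such as unmatched status, the default 0.5 cutoff tends to miss positives, which is why the cutoff is tuned separately from model fitting. In practice the threshold would be chosen on validation data rather than on the data used for final evaluation.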
RESULTS
Baseline logistic regression showed limited discrimination (AUC = 0.635).
Specialty weighting improved performance (AUC = 0.943; accuracy = 85%;
unmatched recall = 83%). Threshold optimization increased accuracy to 94%.
The linear SVM had the highest unmatched sensitivity, while the ensemble
model balanced precision and recall. Feature importance analysis identified
class rank (37%) as most predictive, followed by Step 1 (23%), research experiences
(20%), publications (11%), and Step 2 CK (9%). Preliminary NLP models using
full CV text underperformed due to sample size and class imbalance.
CONCLUSIONS
A machine learning–based Match Risk Calculator using institutional CV data
showed strong predictive performance and interpretability. Specialty
weighting and threshold tuning enabled individualized risk estimation and
advising. Future work will explore multi-institutional validation and
improved NLP-based feature extraction.
Presentation Tag(s)
Student Presentation