Purpose
Differential Item Functioning (DIF) occurs when test takers from different subgroups (e.g., sex, ethnicity) have different probabilities of answering a test item correctly, despite having the same level of ability. The purpose of the current study was to determine the extent to which our faculty-developed summative examinations may exhibit DIF.
Methods
The final examination items from one of our second-year courses were the subject of the DIF analysis. A logistic regression approach was used to assess DIF for each item by sex, race, ethnicity, and native language proficiency. By convention, a Nagelkerke R² > 0.035 was used to define the presence of substantial DIF. Faculty reviewed all items showing DIF to discuss whether logical item flaws were present that might account for the DIF.
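For readers unfamiliar with the logistic regression approach, the sketch below illustrates one common way to quantify DIF for a single dichotomous item: fit an ability-only model, then a model adding group membership and its interaction with ability, and compare the Nagelkerke R² of the two. The column names, the use of statsmodels, and the simulated data are illustrative assumptions; the abstract does not specify the exact model form or matching variable used in the study.

```python
# Hedged sketch of a logistic-regression DIF check for one dichotomous item.
# Assumes a DataFrame with columns 'item' (0/1 response), 'ability'
# (e.g., total or rest score), and 'group' (0/1 subgroup code); these
# names and the simulated data are illustrative, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def nagelkerke_r2(fitted, n):
    """Nagelkerke R² from a fitted statsmodels Logit result."""
    ll_model, ll_null = fitted.llf, fitted.llnull
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    max_cs = 1 - np.exp((2 / n) * ll_null)
    return cox_snell / max_cs

def dif_delta_r2(df):
    """Change in Nagelkerke R² between the ability-only model and the
    model adding group membership and its interaction with ability."""
    n = len(df)
    m1 = smf.logit("item ~ ability", data=df).fit(disp=0)
    m3 = smf.logit("item ~ ability + group + ability:group", data=df).fit(disp=0)
    return nagelkerke_r2(m3, n) - nagelkerke_r2(m1, n)

# Illustrative usage with simulated responses (not study data):
rng = np.random.default_rng(0)
ability = rng.normal(size=500)
group = rng.integers(0, 2, size=500)
p = 1 / (1 + np.exp(-(ability + 0.5 * group)))  # DIF built in for the demo
df = pd.DataFrame({"item": rng.binomial(1, p), "ability": ability, "group": group})
print(f"Nagelkerke delta R²: {dif_delta_r2(df):.3f}")  # flag if > 0.035
```

In this formulation, the R² difference captures the additional variance in item responses explained by subgroup membership after conditioning on ability, which is the quantity compared against the 0.035 threshold.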
Results
Of 115 unique test items, 26 displayed DIF, as indicated by a Nagelkerke R² > 0.035: 8 by race, 6 by sex, 10 by ethnicity, and 2 by primary language. No consistent bias by sex, race, or ethnicity was noted across the items showing DIF. Item review by the research team, which included subject matter experts, resulted in proposed edits to 16 of the 26 items.
Conclusion
Significant DIF was detected in a larger proportion of test items than anticipated. Faculty were able to propose logical edits to several items where DIF was detected. Future work will study the outcomes of faculty item edits and will seek learner input on sources of DIF in affected items. Inclusion of DIF analysis provides a stimulus for faculty to improve test items and promotes discussion about fairness in high-stakes testing.