TY - GEN
T1 - Are We Fair? Quantifying Score Impacts of Computer Science Exams with Randomized Question Pools
AU - Fowler, Max
AU - Smith, David H.
AU - Emeka, Chinedu
AU - West, Matthew
AU - Zilles, Craig
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. DUE-1915257.
Publisher Copyright:
© 2022 ACM.
PY - 2022/2/22
Y1 - 2022/2/22
N2 - With the increase in large-enrollment courses and the growing need to offer online instruction, computer-based exams randomly generated from question pools have a clear benefit for computing courses. Such exams can be used at scale, scheduled asynchronously and/or online, and use versioning to make attempts at cheating less profitable. Despite these benefits, we want to ensure that the technique is not unfair to students, particularly when it comes to equivalent difficulty across exam versions. To investigate generated exam fairness, we use a Generalized Partial Credit Model (GPCM) Item-Response Theory (IRT) model to fit exams from a for-majors data structures course and a non-majors CS0 course, both of which used randomly generated exams. For all exams, students' estimated ability and exam score are strongly correlated (ρ ≥ 0.7), suggesting that the exams are reasonably fair. Through simulation, we find that most of the variance in any given student's simulated scores is due to chance, and that the worst of the score impacts from possibly unfair permutations is only around 5 percentage points on an exam. We discuss implications of this work and possible future steps.
AB - With the increase in large-enrollment courses and the growing need to offer online instruction, computer-based exams randomly generated from question pools have a clear benefit for computing courses. Such exams can be used at scale, scheduled asynchronously and/or online, and use versioning to make attempts at cheating less profitable. Despite these benefits, we want to ensure that the technique is not unfair to students, particularly when it comes to equivalent difficulty across exam versions. To investigate generated exam fairness, we use a Generalized Partial Credit Model (GPCM) Item-Response Theory (IRT) model to fit exams from a for-majors data structures course and a non-majors CS0 course, both of which used randomly generated exams. For all exams, students' estimated ability and exam score are strongly correlated (ρ ≥ 0.7), suggesting that the exams are reasonably fair. Through simulation, we find that most of the variance in any given student's simulated scores is due to chance, and that the worst of the score impacts from possibly unfair permutations is only around 5 percentage points on an exam. We discuss implications of this work and possible future steps.
KW - assessment
KW - exam generation
KW - fairness
KW - randomized exams
UR - http://www.scopus.com/inward/record.url?scp=85126114833&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126114833&partnerID=8YFLogxK
U2 - 10.1145/3478431.3499388
DO - 10.1145/3478431.3499388
M3 - Conference contribution
AN - SCOPUS:85126114833
T3 - SIGCSE 2022 - Proceedings of the 53rd ACM Technical Symposium on Computer Science Education
SP - 647
EP - 653
BT - SIGCSE 2022 - Proceedings of the 53rd ACM Technical Symposium on Computer Science Education
PB - Association for Computing Machinery, Inc
T2 - 53rd Annual ACM Technical Symposium on Computer Science Education, SIGCSE 2022
Y2 - 3 March 2022 through 5 March 2022
ER -