TY - GEN
T1 - Short answer scoring with GPT-4
AU - Jiang, Lan
AU - Bosch, Nigel
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/7/9
Y1 - 2024/7/9
N2 - Automatic short-answer scoring is a long-standing research problem in education. However, assessing short answers at human-level accuracy requires a deep understanding of natural language. Given the notable abilities of recent generative pre-trained transformer (GPT) models, we investigated gpt-4-1106-preview for automatically scoring student responses from the Automated Student Assessment Prize Short Answer Scoring dataset. We systematically varied the information given to the model, including possible correct answers and scoring examples, as well as the order of sub-tasks within short answer scoring (e.g., assigning a score vs. generating a rationale for an assigned score), to understand what affects scoring performance. With the best configuration, GPT-4 yielded a quadratic weighted kappa of .677 across 10 questions. However, we observed that performance differs across educational subjects (e.g., biology, English), that the quality of scoring rubrics may affect the predictions, and that the overall utility of the rationales generated to explain scores is uncertain.
AB - Automatic short-answer scoring is a long-standing research problem in education. However, assessing short answers at human-level accuracy requires a deep understanding of natural language. Given the notable abilities of recent generative pre-trained transformer (GPT) models, we investigated gpt-4-1106-preview for automatically scoring student responses from the Automated Student Assessment Prize Short Answer Scoring dataset. We systematically varied the information given to the model, including possible correct answers and scoring examples, as well as the order of sub-tasks within short answer scoring (e.g., assigning a score vs. generating a rationale for an assigned score), to understand what affects scoring performance. With the best configuration, GPT-4 yielded a quadratic weighted kappa of .677 across 10 questions. However, we observed that performance differs across educational subjects (e.g., biology, English), that the quality of scoring rubrics may affect the predictions, and that the overall utility of the rationales generated to explain scores is uncertain.
KW - gpt (generative pre-trained transformer)
KW - short answer scoring
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=85199888214&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85199888214&partnerID=8YFLogxK
U2 - 10.1145/3657604.3664685
DO - 10.1145/3657604.3664685
M3 - Conference contribution
AN - SCOPUS:85199888214
T3 - L@S 2024 - Proceedings of the 11th ACM Conference on Learning @ Scale
SP - 438
EP - 442
BT - L@S 2024 - Proceedings of the 11th ACM Conference on Learning @ Scale
PB - Association for Computing Machinery
T2 - 11th ACM Conference on Learning @ Scale, L@S 2024
Y2 - 18 July 2024 through 20 July 2024
ER -