TY - JOUR
T1 - Assessing writing quality using crowdsourced non-expert comparative judgement ratings
AU - Crossley, Scott A.
AU - Kim, Minkyung
AU - Wan, Qian
AU - Allen, Laura K.
AU - Tywoniw, Rurik
AU - McNamara, Danielle
N1 - The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education through Grant R305A180261. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
PY - 2025
Y1 - 2025
AB - This study examines the potential to use non-expert, crowd-sourced raters to score essays by comparing expert raters’ and crowd-sourced raters’ assessments of writing quality. Expert raters and crowd-sourced raters scored 400 essays using a standardised holistic rubric and comparative judgement (pairwise ratings) scoring techniques, respectively. The findings indicated that 92% of non-expert, pairwise ratings were sufficiently reliable and raters’ alignment with overall rankings was 67.9%. Additionally, the non-expert ratings were moderately correlated (r = .397) with expert ratings. Further, the linguistic features of the essays were computed to predict expert and non-expert pairwise ratings, revealing that the predictive models of essay quality for both expert and non-expert scores accounted for around 30–35% of the variance. The two models also shared similar linguistic features. The results collectively demonstrate similarities between non-expert pairwise raters and expert raters when assessing essay quality.
KW - Crowdsourcing
KW - corpus linguistics
KW - natural language processing
KW - pairwise comparisons
KW - writing assessment
UR - http://www.scopus.com/inward/record.url?scp=105002287910&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002287910&partnerID=8YFLogxK
U2 - 10.1080/0969594X.2025.2467664
DO - 10.1080/0969594X.2025.2467664
M3 - Article
AN - SCOPUS:105002287910
SN - 0969-594X
VL - 32
SP - 33
EP - 59
JO - Assessment in Education: Principles, Policy and Practice
JF - Assessment in Education: Principles, Policy and Practice
IS - 1
ER -