High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance

Steven Andrew Culpepper, Herman Aguinis, Justin L. Kern, Roger Millsap

Research output: Contribution to journal (Article)

Abstract

The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. Specifically, we examined differences across groups based on latent as opposed to observed scores with data for 176 colleges and universities from The College Board. Results showed that measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, results provided evidence that nearly a quarter of the statistically significant observed intercept differences were not statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist and to gauge their size and direction. Rather, the goal of this research agenda is to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.
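
The abstract's point about predictor measurement error can be illustrated with a small simulation. The sketch below is not the authors' procedure or data: it assumes a hypothetical setup in which a second, equally noisy indicator (`x2`) serves as the instrument for the observed predictor (`x1`), and every parameter value (group mean gap, error variances, slope) is invented for illustration. The simulated groups share the same true prediction intercept, so any group coefficient found by ordinary least squares on the observed score is produced by measurement error alone.

```python
import numpy as np


def ols(X, y):
    """Ordinary least squares coefficients for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(X, Z, y):
    """2SLS: regress X on instruments Z, then regress y on the fitted X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # stage 1
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]   # stage 2


def simulate(n=20000, seed=0):
    """Hypothetical data: two groups, one latent predictor, two noisy indicators."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n).astype(float)
    # Assumed latent mean gap of 0.5 SD between the groups.
    latent = rng.normal(-0.5 * group, 1.0)
    x1 = latent + rng.normal(0.0, 0.7, n)  # observed predictor
    x2 = latent + rng.normal(0.0, 0.7, n)  # second indicator, used as instrument
    # Same intercept and slope in both groups: no true prediction bias.
    y = 1.0 + 0.8 * latent + rng.normal(0.0, 0.5, n)
    return group, x1, x2, y


def design(score, group):
    """Design matrix with columns [intercept, score, group indicator]."""
    return np.column_stack([np.ones_like(score), score, group])


if __name__ == "__main__":
    group, x1, x2, y = simulate()
    X, Z = design(x1, group), design(x2, group)
    print("OLS  [intercept, slope, group]:", ols(X, y))
    print("2SLS [intercept, slope, group]:", two_stage_least_squares(X, Z, y))
```

Under these assumed parameters, the OLS slope should be attenuated and the OLS group coefficient should be clearly negative, while the 2SLS estimates should sit close to the generating values (slope 0.8, group coefficient 0). Validity of the instrument rests on the assumption that the two indicators' errors are independent of each other and of the outcome error.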

Original language: English (US)
Pages (from-to): 285-309
Number of pages: 25
Journal: Psychometrika
Volume: 84
Issue number: 1
DOI: 10.1007/s11336-018-9649-2
State: Published - Mar 15 2019

Fingerprint

Mathematics
Latent Variables
Invariance
Testing
Prediction
Intercept
Least-Squares Analysis
Measurement Invariance
Hispanic Americans
Research
Joints
Measurement errors
Demography
Students
Predictors
Two-stage Least Squares
Common factor
Score Test
Least Squares Estimator
Test System

Keywords

  • high-stakes testing
  • instrumental variables
  • measurement invariance
  • prediction invariance

ASJC Scopus subject areas

  • Psychology(all)
  • Applied Mathematics

Cite this

High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance. / Culpepper, Steven Andrew; Aguinis, Herman; Kern, Justin L.; Millsap, Roger.

In: Psychometrika, Vol. 84, No. 1, 15.03.2019, p. 285-309.

@article{ed71cb36f0eb4789bb5fcdf87b961054,
title = "High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance",
abstract = "The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. So, we examined differences across groups based on latent as opposed to observed scores with data for 176 colleges and universities from The College Board. Results showed that evidence regarding measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5{\%} and 29.9{\%} of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than for comparable Whites. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, results provided evidence that nearly a quarter of the statistically significant observed intercept differences were not statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is to not simply assess whether observed group-based differences exist and the size and direction of such differences. Rather, the goal of this research agenda is to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.",
keywords = "high-stakes testing, instrumental variables, measurement invariance, prediction invariance",
author = "Culpepper, {Steven Andrew} and Aguinis, Herman and Kern, {Justin L.} and Millsap, Roger",
year = "2019",
month = "3",
day = "15",
doi = "10.1007/s11336-018-9649-2",
language = "English (US)",
volume = "84",
pages = "285--309",
journal = "Psychometrika",
issn = "0033-3123",
publisher = "Springer New York",
number = "1",

}

TY - JOUR

T1 - High-Stakes Testing Case Study

T2 - A Latent Variable Approach for Assessing Measurement and Prediction Invariance

AU - Culpepper, Steven Andrew

AU - Aguinis, Herman

AU - Kern, Justin L.

AU - Millsap, Roger

PY - 2019/3/15

Y1 - 2019/3/15

N2 - The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. So, we examined differences across groups based on latent as opposed to observed scores with data for 176 colleges and universities from The College Board. Results showed that evidence regarding measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than for comparable Whites. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, results provided evidence that nearly a quarter of the statistically significant observed intercept differences were not statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is to not simply assess whether observed group-based differences exist and the size and direction of such differences. Rather, the goal of this research agenda is to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.

KW - high-stakes testing

KW - instrumental variables

KW - measurement invariance

KW - prediction invariance

UR - http://www.scopus.com/inward/record.url?scp=85060661511&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85060661511&partnerID=8YFLogxK

U2 - 10.1007/s11336-018-9649-2

DO - 10.1007/s11336-018-9649-2

M3 - Article

C2 - 30671788

AN - SCOPUS:85060661511

VL - 84

SP - 285

EP - 309

JO - Psychometrika

JF - Psychometrika

SN - 0033-3123

IS - 1

ER -