TY - JOUR
T1 - Variable selection in heterogeneous datasets
T2 - A truncated-rank sparse linear mixed model with applications to genome-wide association studies
AU - Wang, Haohan
AU - Aragam, Bryon
AU - Xing, Eric P.
N1 - Funding Information:
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This work is also supported by the National Institutes of Health grants R01-GM093156 and P30-DA035778.
Publisher Copyright:
© 2018 Elsevier Inc.
PY - 2018/8/1
Y1 - 2018/8/1
AB - A fundamental and important challenge in modern datasets of ever-increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our method.
KW - Confounding correction
KW - Genome-wide association study
KW - Heterogeneity
KW - Mixed model
KW - Variable selection
UR - http://www.scopus.com/inward/record.url?scp=85047262880&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047262880&partnerID=8YFLogxK
U2 - 10.1016/j.ymeth.2018.04.021
DO - 10.1016/j.ymeth.2018.04.021
M3 - Article
C2 - 29705212
AN - SCOPUS:85047262880
SN - 1046-2023
VL - 145
SP - 2
EP - 9
JO - Methods
JF - Methods
ER -