Variable selection in omics data: A practical evaluation of small sample sizes

Alexander Kirpich, Elizabeth Ainsworth, Jessica M. Wedow, Jeremy R.B. Newman, George Michailidis, Lauren M. McIntyre

Research output: Contribution to journal › Article

Abstract

In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one or a few features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often contain a correlation structure due to the biology of the experiment, making it difficult to distinguish causal compounds from merely correlated compounds. Additionally, there may be elements of the experimental design (blocks, batches) that introduce structure into the data. While this problem has been discussed in the literature and various strategies have been proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies, where the goal is to reduce the number of features carried into a second round of testing while limiting the Type II error. Using this perspective, the performance of LASSO, ridge regression, and the Elastic Net was compared with the performance of an ANOVA via a simulation study and two real-data comparisons. Interestingly, a dramatic increase in the number of features had no effect on Type I error for the ANOVA approach. ANOVA, even without multiple-test correction, has a low false positive rate in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for the ANOVA is comparable to or lower than that for the Elastic Net, leading us to conclude that an ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
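The screening perspective described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's code: it assumes NumPy and SciPy are available, and all sample sizes, effect sizes, and thresholds are illustrative choices. It runs a per-feature one-way ANOVA on simulated two-condition data with n < p and counts how many truly different features survive an uncorrected 0.05 screen.

```python
# Illustrative sketch (not the paper's code): per-feature ANOVA screening
# on simulated n < p omics-style data. All parameter values are assumptions.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n_per_group, p, n_true = 5, 200, 5      # n = 10 samples total, p = 200 features

# Two conditions; only the first n_true features carry a real mean shift.
a = rng.normal(0.0, 1.0, size=(n_per_group, p))
b = rng.normal(0.0, 1.0, size=(n_per_group, p))
b[:, :n_true] += 2.0                    # assumed effect size for true features

# Per-feature one-way ANOVA (with two groups this is a two-sample F-test).
pvals = np.array([f_oneway(a[:, j], b[:, j]).pvalue for j in range(p)])

kept = pvals < 0.05                     # screening threshold, no correction
true_pos = int(kept[:n_true].sum())
false_pos = int(kept[n_true:].sum())
print(f"kept {kept.sum()} of {p} features: "
      f"{true_pos}/{n_true} true, {false_pos} false positives")
```

Under this framing, the features passing the screen would move to a second, confirmatory round of testing rather than being reported as definitive biomarkers.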

Original language: English (US)
Article number: e0197910
Journal: PLoS ONE
Volume: 13
Issue number: 6
DOI: 10.1371/journal.pone.0197910
PubMed ID: 29927942
State: Published - Jun 2018

Fingerprint

Analysis of variance (ANOVA), biomarkers, experimental design, genes, metabolites, sample size, sampling, screening, testing, biological sciences

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (all)
  • Agricultural and Biological Sciences (all)

Cite this

Kirpich, A., Ainsworth, E., Wedow, J. M., Newman, J. R. B., Michailidis, G., & McIntyre, L. M. (2018). Variable selection in omics data: A practical evaluation of small sample sizes. PLoS ONE, 13(6), e0197910. https://doi.org/10.1371/journal.pone.0197910

