TY - CHAP
T1 - Bayesian model selection for high-dimensional data
AU - Narisetty, Naveen Naidu
N1 - The author is grateful to two reviewers for their extensive and helpful feedback, and to graduate students Teng Wu and Ke Li for proofreading an initial version of the article. The author gratefully acknowledges support from NSF Award DMS-1811768.
PY - 2020
Y1 - 2020
N2 - High-dimensional data, where the number of features or covariates can even be larger than the number of independent samples, are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. A majority of the classical research in statistics dealt with settings where there is a small number of covariates. Due to modern advancements in data storage and computational power, the high-dimensional data revolution has come to occupy a significant part of mainstream statistical research. In gene expression datasets, for instance, it is not uncommon to encounter observations on at most a few hundred independent samples (subjects), with information on tens or hundreds of thousands of genes for each sample. An important and common question that arises quickly is: “which of the available covariates are relevant to the outcome of interest?” This concerns the problem of variable selection (and more generally model selection) in statistics and data science. This chapter will provide an overview of some of the most well-known model selection methods along with some of the more recent methods. While frequentist methods will be discussed, Bayesian approaches will be given a more elaborate treatment. The frequentist framework for model selection is primarily based on penalization, whereas the Bayesian framework relies on prior distributions for inducing shrinkage and sparsity. The chapter treats the Bayesian framework in the light of objective and empirical Bayesian viewpoints, as the priors in the high-dimensional setting are typically not completely based on subjective prior beliefs. An important practical aspect of high-dimensional model selection methods is computational scalability, which will also be discussed.
AB - High-dimensional data, where the number of features or covariates can even be larger than the number of independent samples, are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. A majority of the classical research in statistics dealt with settings where there is a small number of covariates. Due to modern advancements in data storage and computational power, the high-dimensional data revolution has come to occupy a significant part of mainstream statistical research. In gene expression datasets, for instance, it is not uncommon to encounter observations on at most a few hundred independent samples (subjects), with information on tens or hundreds of thousands of genes for each sample. An important and common question that arises quickly is: “which of the available covariates are relevant to the outcome of interest?” This concerns the problem of variable selection (and more generally model selection) in statistics and data science. This chapter will provide an overview of some of the most well-known model selection methods along with some of the more recent methods. While frequentist methods will be discussed, Bayesian approaches will be given a more elaborate treatment. The frequentist framework for model selection is primarily based on penalization, whereas the Bayesian framework relies on prior distributions for inducing shrinkage and sparsity. The chapter treats the Bayesian framework in the light of objective and empirical Bayesian viewpoints, as the priors in the high-dimensional setting are typically not completely based on subjective prior beliefs. An important practical aspect of high-dimensional model selection methods is computational scalability, which will also be discussed.
KW - Bayesian computation
KW - Bayesian variable selection
KW - High-dimensional data
KW - Model comparison
UR - http://www.scopus.com/inward/record.url?scp=85140202704&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140202704&partnerID=8YFLogxK
U2 - 10.1016/bs.host.2019.08.001
DO - 10.1016/bs.host.2019.08.001
M3 - Chapter
AN - SCOPUS:85140202704
SN - 9780444642110
T3 - Handbook of Statistics
SP - 207
EP - 248
BT - Principles and Methods for Data Science
A2 - Srinivasa Rao, Arni S.R.
A2 - Rao, C.R.
PB - Elsevier B.V.
ER -