TY - JOUR
T1 - Optimal Estimation of Wasserstein Distance on a Tree With an Application to Microbiome Studies
AU - Wang, Shulei
AU - Cai, T. Tony
AU - Li, Hongzhe
N1 - Funding Information:
This research was supported by NIH grants R01GM123056 and R01GM129781. Shulei Wang is a Postdoctoral Fellow, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104. T. Tony Cai is Daniel H. Silberberg Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. Hongzhe Li is Professor of Biostatistics and Statistics, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.
Publisher Copyright:
© 2020 American Statistical Association.
PY - 2021
Y1 - 2021
N2 - The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation however shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in microbiome between the inactive Crohn’s disease patients and the normal controls. Supplementary materials for this article are available online.
KW - Estimation of nonsmooth functional
KW - Polynomial approximation
KW - Phylogenetic tree
UR - http://www.scopus.com/inward/record.url?scp=85078437054&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078437054&partnerID=8YFLogxK
U2 - 10.1080/01621459.2019.1699422
DO - 10.1080/01621459.2019.1699422
M3 - Article
AN - SCOPUS:85078437054
SN - 0162-1459
VL - 116
SP - 1237
EP - 1253
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 535
ER -