TY - GEN
T1 - Support Estimation with Sampling Artifacts and Errors
AU - Chien, Eli
AU - Milenkovic, Olgica
AU - Nedich, Angelia
N1 - Funding Information:
The work was supported by the NSF grant 2107344.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/7/12
Y1 - 2021/7/12
AB - The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, and molecular biology. Almost all existing work in this area assumes perfectly accurate sampling, which is seldom the case in practice. Here we introduce the first known theoretical approach to support estimation in the presence of sampling artifacts, where each sample is assumed to be observed through a Poisson channel that simultaneously captures repetitions and deletions. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated via discretized semi-infinite programming methods. The newly proposed estimation approach is tested on synthetic and textual data, as well as on GISAID data for the purpose of estimating the mutational diversity of genes in the SARS-CoV-2 viral genome. In all experiments performed, we observed significant improvements of our integrated method over suitably modified known noiseless support estimation methods.
UR - http://www.scopus.com/inward/record.url?scp=85115082172&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115082172&partnerID=8YFLogxK
U2 - 10.1109/ISIT45174.2021.9517892
DO - 10.1109/ISIT45174.2021.9517892
M3 - Conference contribution
AN - SCOPUS:85115082172
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 244
EP - 249
BT - 2021 IEEE International Symposium on Information Theory, ISIT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Symposium on Information Theory, ISIT 2021
Y2 - 12 July 2021 through 20 July 2021
ER -