TY - JOUR
T1 - Profile Hidden Markov Models Are Not Identifiable
AU - Pattabiraman, Srilakshmi
AU - Warnow, Tandy
N1 - Funding Information:
This research was supported by National Science Foundation grants ABI-1458652 and III:AF:1513629 to TW. This research began as a final project by the first author for the course Computer Science 581: Algorithmic Computational Genomics, taught by the second author at the University of Illinois at Urbana-Champaign in Spring 2018. The authors thank Sarah Christensen, Erin Molloy, Pranjal Vachaspati, and the other members of the Warnow Lab for their helpful suggestions throughout this research project. The authors also thank the anonymous reviewers whose comments were very helpful in improving the manuscript.
Funding Information:
Tandy Warnow received the PhD degree in mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavaré and Michael Waterman at the University of Southern California. She is the founder professor of computer science at the University of Illinois at Urbana-Champaign. Her research combines computer science, statistics, and discrete mathematics, focusing on developing improved models and algorithms for reconstructing complex and large-scale evolutionary histories in biology and historical linguistics. Her awards include the NSF Young Investigator Award (1994), the David and Lucile Packard Foundation Award (1996), a Radcliffe Institute Fellowship (2006), and the John Simon Guggenheim Foundation Fellowship (2011). She was elected a fellow of the Association for Computing Machinery (ACM) in 2015 and of the International Society for Computational Biology (ISCB) in 2017.
Publisher Copyright:
© 2004-2012 IEEE.
PY - 2021/1/1
Y1 - 2021/1/1
AB - Profile Hidden Markov Models (HMMs) are graphical models that can be used to produce finite-length sequences from a distribution. In fact, although they were only introduced for bioinformatics 25 years ago (by Haussler et al., Hawaii International Conference on Systems Science, 1993), they are arguably the most commonly used statistical model in bioinformatics, with multiple applications, including protein structure and function prediction, classification of novel proteins into existing protein families and superfamilies, metagenomics, and multiple sequence alignment. The standard use of profile HMMs in bioinformatics has two steps: first a profile HMM is built for a collection of molecular sequences (which may not be in a multiple sequence alignment), and then the profile HMM is used in some subsequent analysis of new molecular sequences. The construction of the profile is thus itself a statistical estimation problem, since any given set of sequences might potentially fit more than one model well. Hence, a basic question about profile HMMs is whether they are statistically identifiable, which means that no two profile HMMs can produce the same distribution on finite-length sequences. Indeed, statistical identifiability is a fundamental aspect of any statistical model, and yet it is not known whether profile HMMs are statistically identifiable. In this paper, we report on preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics.
KW - Profile hidden Markov models
KW - statistical consistency
KW - statistical identifiability
UR - http://www.scopus.com/inward/record.url?scp=85100540495&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100540495&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2019.2933821
DO - 10.1109/TCBB.2019.2933821
M3 - Article
C2 - 31425043
AN - SCOPUS:85100540495
SN - 1545-5963
VL - 18
SP - 162
EP - 172
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 1
M1 - 8798655
ER -