Abstract
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k > 1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
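The three steps named in the abstract (weight and rank the HMMs, keep several of them, combine their alignments by weight) can be illustrated with a small sketch. This is not the WITCH implementation: the weighting formula, the function names (`normalize_weights`, `top_k`, `weighted_consensus`), the score values, and the per-residue voting rule are all assumptions made for illustration. In particular, independent voting per residue does not guarantee that the chosen columns stay in order, which a real consensus algorithm would have to enforce.

```python
from collections import defaultdict


def normalize_weights(scores):
    """Map hypothetical per-HMM scores (log base 2) to weights that sum to 1.

    Assumed formula for illustration only; not taken from the paper.
    """
    top = max(scores.values())
    raw = {hmm: 2.0 ** (s - top) for hmm, s in scores.items()}
    total = sum(raw.values())
    return {hmm: value / total for hmm, value in raw.items()}


def top_k(weights, k):
    """Return the names of the k highest-weight HMMs, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def weighted_consensus(placements, weights):
    """Pick, for each query residue, the backbone column with the most weight.

    placements maps an HMM name to a list holding one backbone column
    index per query residue. Order consistency is NOT enforced here.
    """
    n_residues = len(next(iter(placements.values())))
    consensus = []
    for i in range(n_residues):
        votes = defaultdict(float)
        for hmm, columns in placements.items():
            votes[columns[i]] += weights[hmm]
        consensus.append(max(votes, key=votes.get))
    return consensus


# Hypothetical scores for one query sequence against three HMMs.
scores = {"hmm_a": 10.0, "hmm_b": 9.0, "hmm_c": 2.0}
weights = normalize_weights(scores)
selected = top_k(weights, k=2)

# Hypothetical column placements proposed by each selected HMM.
placements = {"hmm_a": [3, 4, 7], "hmm_b": [3, 5, 7]}
columns = weighted_consensus(
    {hmm: placements[hmm] for hmm in selected}, weights
)
print(selected, columns)
```

With k = 1 the sketch reduces to choosing a single best HMM per sequence, which is the behavior the abstract attributes to UPP.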
Original language | English (US) |
---|---|
Pages (from-to) | 782-801 |
Number of pages | 20 |
Journal | Journal of Computational Biology |
Volume | 29 |
Issue number | 8 |
Early online date | May 17 2022 |
State | Published - Aug 1 2022 |
Keywords
- divide and conquer
- multiple sequence alignment
- hidden Markov model
ASJC Scopus subject areas
- Computational Mathematics
- Genetics
- Molecular Biology
- Computational Theory and Mathematics
- Modeling and Simulation
Datasets
- The 16S.B.ALL dataset in 100-HF condition
  Shen, C. (Creator), Park, M. (Creator) & Warnow, T. (Creator), University of Illinois Urbana-Champaign, Mar 25 2022
  DOI: 10.13012/B2IDB-6604429_V1
  Dataset