Abstract

Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k > 1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
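
The three steps named above (weight the HMMs, keep the top k, combine their alignments with a weighted consensus) can be illustrated with a toy sketch. This is not WITCH's implementation: the abstract does not specify the weighting formula or the consensus algorithm, so the raw scores, the simple normalization, and the per-character weighted vote below are all assumptions made for illustration, and every function name is hypothetical. The vote is taken independently for each character, so this sketch does not guarantee that the chosen columns stay in increasing order, which a real consensus method would have to enforce.

```python
from collections import defaultdict


def normalize_weights(scores):
    """Scale hypothetical per-HMM scores so that the weights sum to 1."""
    total = sum(scores.values())
    return {hmm: score / total for hmm, score in scores.items()}


def top_k(weights, k):
    """Return the names of the k highest-weighted HMMs, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def weighted_consensus(placements, weights, k):
    """Combine per-HMM placements of one query sequence by weighted voting.

    placements[hmm][i] is the backbone column that `hmm` assigns to the
    i-th character of the query. Only the top-k HMMs get to vote, and each
    vote counts in proportion to the weight of the HMM casting it.
    """
    chosen = top_k(weights, k)
    query_length = len(placements[chosen[0]])
    consensus = []
    for i in range(query_length):
        votes = defaultdict(float)
        for hmm in chosen:
            votes[placements[hmm][i]] += weights[hmm]
        # Keep the column with the largest total weight for this character.
        consensus.append(max(votes, key=votes.get))
    return consensus


# Toy input: three HMMs place a three-character query into backbone columns.
scores = {"hmm_a": 4.0, "hmm_b": 3.5, "hmm_c": 2.5}
placements = {
    "hmm_a": [0, 1, 3],
    "hmm_b": [0, 2, 3],
    "hmm_c": [1, 2, 4],
}
weights = normalize_weights(scores)

single_hmm = weighted_consensus(placements, weights, k=1)
three_hmms = weighted_consensus(placements, weights, k=3)
print(single_hmm, three_hmms)
```

In this toy input the best single HMM places the second character in column 1, but the two lower-ranked HMMs together outweigh it and move that character to column 2. That is the kind of difference that using several weighted HMMs instead of one selected HMM can make.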

Original language: English (US)
Pages (from-to): 782-801
Number of pages: 20
Journal: Journal of Computational Biology
Volume: 29
Issue number: 8
Early online date: May 17 2022
State: Published - Aug 1 2022

Keywords

  • divide and conquer
  • multiple sequence alignment
  • hidden Markov model

ASJC Scopus subject areas

  • Computational Mathematics
  • Genetics
  • Molecular Biology
  • Computational Theory and Mathematics
  • Modeling and Simulation
