TY - JOUR
T1 - MAGUS+eHMMs
T2 - improved multiple sequence alignment accuracy for fragmentary sequences
AU - Shen, Chengze
AU - Zaharias, Paul
AU - Warnow, Tandy
N1 - Funding Information:
This work was supported in part by NSF [2006069, 1458652 to T.W.]. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the US National Science Foundation (awards OCI-0725070 and ACI-1238993), the State of Illinois, and, as of December 2019, the National Geospatial-Intelligence Agency. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
Publisher Copyright:
© 2022 Oxford University Press. All rights reserved.
PY - 2022/2/15
Y1 - 2022/2/15
N2 - Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, especially when the datasets contain fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that a two-stage approach, in which MAGUS aligns selected 'backbone sequences' and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models (eHMMs), further improves alignment accuracy. The combination of MAGUS with eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.
AB - Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, especially when the datasets contain fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that a two-stage approach, in which MAGUS aligns selected 'backbone sequences' and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models (eHMMs), further improves alignment accuracy. The combination of MAGUS with eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.
UR - http://www.scopus.com/inward/record.url?scp=85127605391&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127605391&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btab788
DO - 10.1093/bioinformatics/btab788
M3 - Article
C2 - 34791036
AN - SCOPUS:85127605391
SN - 1367-4803
VL - 38
SP - 918
EP - 924
JO - Bioinformatics
JF - Bioinformatics
IS - 4
ER -