TY - JOUR
T1 - WITCH-NG
T2 - efficient and accurate alignment of datasets with sequence length heterogeneity
AU - Liu, Baqiao
AU - Warnow, Tandy
N1 - Publisher Copyright: © Bioinformatics Advances. All Rights Reserved.
PY - 2023
Y1 - 2023
AB - Multiple sequence alignment is a basic part of many bioinformatics pipelines, including phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and because of the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) with a polynomial-time exact algorithm using Smith-Waterman. Our new method, WITCH-NG (i.e. 'next generation WITCH'), achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG. Availability and implementation: The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials.
UR - http://www.scopus.com/inward/record.url?scp=85159184360&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85159184360&partnerID=8YFLogxK
U2 - 10.1093/bioadv/vbad024
DO - 10.1093/bioadv/vbad024
M3 - Article
C2 - 36970502
AN - SCOPUS:85159184360
SN - 2635-0041
VL - 3
JO - Bioinformatics Advances
JF - Bioinformatics Advances
IS - 1
M1 - vbad024
ER -