Abstract

The estimation of very large multiple sequence alignments is a challenging problem that requires special techniques in order to achieve high accuracy. Here we describe two software packages—PASTA and UPP—for constructing alignments on large and ultra-large datasets. Both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when the input sequences are all full-length, but UPP provides improved accuracy compared to PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available in open source form on GitHub.

Original language: English (US)
Title of host publication: Methods in Molecular Biology
Publisher: Humana Press Inc.
Pages: 99-119
Number of pages: 21
State: Published - 2021

Publication series

Name: Methods in Molecular Biology
Volume: 2231
ISSN (Print): 1064-3745
ISSN (Electronic): 1940-6029

Keywords

  • Ensembles of Hidden Markov Models
  • Multiple sequence alignment
  • PASTA
  • SATé
  • UPP

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Title

Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP
