Estimating very large multiple sequence alignments is a challenging problem that requires special techniques to achieve high accuracy. Here we describe two software packages, PASTA and UPP, for constructing alignments on large and ultra-large datasets. Both methods have produced highly accurate alignments on datasets of 1,000,000 sequences, and trees computed from these alignments are also highly accurate. PASTA provides the best tree accuracy when all input sequences are full-length, whereas UPP is more accurate than PASTA and other methods when the input contains many fragmentary sequences. Both methods are available as open-source software on GitHub.

Original language: English (US)
Title of host publication: Methods in Molecular Biology
Publisher: Humana Press Inc.
Number of pages: 21
State: Published - 2021

Publication series

Name: Methods in Molecular Biology
ISSN (Print): 1064-3745
ISSN (Electronic): 1940-6029

Keywords

  • Ensembles of Hidden Markov Models
  • Multiple sequence alignment
  • SATé
  • UPP

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics


Title: Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP