SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

Kevin Liu, Tandy J. Warnow, Mark T. Holder, Serita M. Nelesen, Jiaye Yu, Alexandros P. Stamatakis, C. Randal Linder

Research output: Contribution to journal (Article)

Abstract

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller, more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
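The tree-independence described in the abstract can be illustrated on a toy case. The sketch below is not the paper's proof or any part of SATé; it is a minimal Jukes-Cantor likelihood calculation (Felsenstein pruning) in which a gap contributes a conditional likelihood of 1 for every state, which is the usual meaning of "gaps treated as missing data". The function names (`jc_prob`, `partial`, `alignment_likelihood`, `stagger`, `cherry`), the four toy sequences, and the branch lengths are all assumptions made for illustration. Under these assumptions, any column with a single observed nucleotide scores 1/4 on every tree, so a fully staggered alignment cannot distinguish between topologies, whereas a compact alignment of the same sequences can.

```python
import math

NUCS = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability of state i becoming j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, column):
    """Conditional likelihoods at a node, one per nucleotide state.

    A leaf is a taxon name (str); an internal node is a tuple of
    (child, branch_length) pairs. A gap is treated as missing data.
    """
    if isinstance(node, str):
        c = column[node]
        if c == "-":
            return [1.0] * 4  # missing data: compatible with every state
        return [1.0 if s == c else 0.0 for s in NUCS]
    out = [1.0] * 4
    for child, t in node:
        below = partial(child, column)
        for a, s in enumerate(NUCS):
            out[a] *= sum(jc_prob(s, x, t) * below[b] for b, x in enumerate(NUCS))
    return out

def alignment_likelihood(tree, alignment):
    """Product of per-column likelihoods with uniform root frequencies."""
    taxa = list(alignment)
    like = 1.0
    for k in range(len(alignment[taxa[0]])):
        column = {name: alignment[name][k] for name in taxa}
        like *= sum(0.25 * p for p in partial(tree, column))
    return like

def stagger(seqs):
    """Place every nucleotide in its own column (no two sequences overlap)."""
    total = sum(len(s) for s in seqs.values())
    rows, offset = {}, 0
    for name, s in seqs.items():
        rows[name] = "-" * offset + s + "-" * (total - offset - len(s))
        offset += len(s)
    return rows

def cherry(x, y, t=0.1):
    """Join two subtrees under a new node, each on a branch of length t."""
    return ((x, t), (y, t))

# Hypothetical toy data: two pairs of identical sequences.
seqs = {"A": "AC", "B": "AC", "C": "GT", "D": "GT"}
tree_ab_cd = cherry(cherry("A", "B"), cherry("C", "D"))
tree_ac_bd = cherry(cherry("A", "C"), cherry("B", "D"))
staggered = stagger(seqs)

for label, tree in [("((A,B),(C,D))", tree_ab_cd), ("((A,C),(B,D))", tree_ac_bd)]:
    print(label,
          "compact:", alignment_likelihood(tree, seqs),
          "staggered:", alignment_likelihood(tree, staggered))
```

With the staggered alignment, both topologies give the same score because every column reduces to the stationary frequency of one nucleotide. With the compact alignment, the two topologies score differently, which is the signal a tree search normally relies on.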

Original language: English (US)
Pages (from-to): 90-106
Number of pages: 17
Journal: Systematic Biology
Volume: 61
Issue number: 1
DOI: 10.1093/sysbio/syr095
State: Published - Jan 1 2012
Externally published: Yes

Keywords

  • Alignment
  • maximum likelihood
  • phylogenetics
  • SATé

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics

Cite this

@article{8073485b337440e19e3211b7b08971e5,
title = "SAT{\'e}-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees",
keywords = "Alignment, maximum likelihood, phylogenetics, SAT{\'e}",
author = "Kevin Liu and Warnow, {Tandy J.} and Holder, {Mark T.} and Nelesen, {Serita M.} and Jiaye Yu and Stamatakis, {Alexandros P.} and Linder, {C. Randal}",
year = "2012",
month = "1",
day = "1",
doi = "10.1093/sysbio/syr095",
language = "English (US)",
volume = "61",
pages = "90--106",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - SATé-II

T2 - Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

AU - Liu, Kevin

AU - Warnow, Tandy J.

AU - Holder, Mark T.

AU - Nelesen, Serita M.

AU - Yu, Jiaye

AU - Stamatakis, Alexandros P.

AU - Linder, C. Randal

PY - 2012/1/1

Y1 - 2012/1/1

KW - Alignment

KW - maximum likelihood

KW - phylogenetics

KW - SATé

UR - http://www.scopus.com/inward/record.url?scp=84555194934&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84555194934&partnerID=8YFLogxK

U2 - 10.1093/sysbio/syr095

DO - 10.1093/sysbio/syr095

M3 - Article

C2 - 22139466

AN - SCOPUS:84555194934

VL - 61

SP - 90

EP - 106

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 1

ER -