DACTAL: Divide-and-conquer trees (almost) without alignments

Serita Nelesen, Kevin Liu, Li San Wang, C. Randal Linder, Tandy Warnow

Research output: Contribution to journalArticle

Abstract

Motivation: While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale. Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000-27 643 taxa. Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.

Original languageEnglish (US)
Article numberbts218
Pages (from-to)i274-i282
JournalBioinformatics
Volume28
Issue number12
DOIs
StatePublished - Jun 1 2012
Externally publishedYes

Fingerprint

Divide and conquer
Alignment
Maximum likelihood
Phylogeny
Iteration
Large Data Sets
Subset
Datasets
Phylogenetics
Estimate
Maximum Likelihood
Overlapping
Entire
Decompose

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Nelesen, S., Liu, K., Wang, L. S., Randal Linder, C., & Warnow, T. (2012). DACTAL: Divide-and-conquer trees (almost) without alignments. Bioinformatics, 28(12), i274-i282. [bts218]. https://doi.org/10.1093/bioinformatics/bts218

DACTAL : Divide-and-conquer trees (almost) without alignments. / Nelesen, Serita; Liu, Kevin; Wang, Li San; Randal Linder, C.; Warnow, Tandy.

In: Bioinformatics, Vol. 28, No. 12, bts218, 01.06.2012, p. i274-i282.

Research output: Contribution to journalArticle

Nelesen, S, Liu, K, Wang, LS, Randal Linder, C & Warnow, T 2012, 'DACTAL: Divide-and-conquer trees (almost) without alignments', Bioinformatics, vol. 28, no. 12, bts218, pp. i274-i282. https://doi.org/10.1093/bioinformatics/bts218
Nelesen, Serita ; Liu, Kevin ; Wang, Li San ; Randal Linder, C. ; Warnow, Tandy. / DACTAL : Divide-and-conquer trees (almost) without alignments. In: Bioinformatics. 2012 ; Vol. 28, No. 12. pp. i274-i282.
@article{248fcfc64a474fbfa267bfcb0ba6ea42,
title = "DACTAL: Divide-and-conquer trees (almost) without alignments",
abstract = "Motivation: While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale. Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SAT{\'e} and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000-27 643 taxa. Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SAT{\'e} shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SAT{\'e}, including a dataset with almost 28 000 sequences.",
author = "Serita Nelesen and Kevin Liu and Wang, {Li San} and {Randal Linder}, C. and Tandy Warnow",
year = "2012",
month = "6",
day = "1",
doi = "10.1093/bioinformatics/bts218",
language = "English (US)",
volume = "28",
pages = "i274--i282",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "12",

}

TY - JOUR

T1 - DACTAL

T2 - Divide-and-conquer trees (almost) without alignments

AU - Nelesen, Serita

AU - Liu, Kevin

AU - Wang, Li San

AU - Randal Linder, C.

AU - Warnow, Tandy

PY - 2012/6/1

Y1 - 2012/6/1

N2 - Motivation: While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale. Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000-27 643 taxa. Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.

AB - Motivation: While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale. Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000-27 643 taxa. Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.

UR - http://www.scopus.com/inward/record.url?scp=84863536564&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863536564&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bts218

DO - 10.1093/bioinformatics/bts218

M3 - Article

C2 - 22689772

AN - SCOPUS:84863536564

VL - 28

SP - i274-i282

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 12

M1 - bts218

ER -