Obtaining highly accurate topology estimates of evolutionary trees from very short sequences

Daniel H. Huson, Scott Nettles, Tandy Warnow

Research output: Chapter in Book/Report/Conference proceedingChapter

Abstract

The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists, with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8% of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and on some trees achieving acceptable error rates (less than 5% error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10 K nucleotides to achieve the same level of accuracy.

Original languageEnglish
Title of host publicationProceedings of the Annual International Conference on Computational Molecular Biology, RECOMB
Place of PublicationNew York, NY, United States
PublisherACM
Pages198-207
Number of pages10
StatePublished - Jan 1 1999
Externally publishedYes
EventProceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99 - Lyon
Duration: Apr 11 1999Apr 14 1999

Other

OtherProceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99
CityLyon
Period4/11/994/14/99

Fingerprint

Joining
Topology
Polynomials
Nucleotides
Computer science
History
Experiments

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Computer Science(all)

Cite this

Huson, D. H., Nettles, S., & Warnow, T. (1999). Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. In Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB (pp. 198-207). New York, NY, United States: ACM.

Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. / Huson, Daniel H.; Nettles, Scott; Warnow, Tandy.

Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. New York, NY, United States : ACM, 1999. p. 198-207.

Research output: Chapter in Book/Report/Conference proceedingChapter

Huson, DH, Nettles, S & Warnow, T 1999, Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. in Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. ACM, New York, NY, United States, pp. 198-207, Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99, Lyon, 4/11/99.
Huson DH, Nettles S, Warnow T. Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. In Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. New York, NY, United States: ACM. 1999. p. 198-207
Huson, Daniel H. ; Nettles, Scott ; Warnow, Tandy. / Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB. New York, NY, United States : ACM, 1999. pp. 198-207
@inbook{f0232549b1064d86b5f0970db42e22e7,
title = "Obtaining highly accurate topology estimates of evolutionary trees from very short sequences",
abstract = "The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists, with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8{\%} of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and on some trees achieving acceptable error rates (less than 5{\%} error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10 K nucleotides to achieve the same level of accuracy.",
author = "Huson, {Daniel H.} and Scott Nettles and Tandy Warnow",
year = "1999",
month = "1",
day = "1",
language = "English",
pages = "198--207",
booktitle = "Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB",
publisher = "ACM",

}

TY - CHAP

T1 - Obtaining highly accurate topology estimates of evolutionary trees from very short sequences

AU - Huson, Daniel H.

AU - Nettles, Scott

AU - Warnow, Tandy

PY - 1999/1/1

Y1 - 1999/1/1

N2 - The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists, with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8% of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and on some trees achieving acceptable error rates (less than 5% error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10 K nucleotides to achieve the same level of accuracy.

AB - The evolutionary history of a set of species is represented by a phylogenetic tree, in other words, by a rooted, leaf-labelled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees has long been considered one of the major challenges in systematic biology. None of the polynomial time methods developed by the theoretical computer science community has been shown to outperform the popular Neighbor-Joining method used by systematic biologists, with respect to topology estimation. (However, preliminary experiments indicate that two new variants of Neighbor-Joining, Bio-NJ and Weighbor, do exhibit improved performance.) In this paper, we present a simple polynomial time method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods. We analyze the performance of DCM-boosted distance methods under the general Markov model of evolution, and prove that, by using the DCM-boosted Buneman method, for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. Our experimental study (based upon simulating sequence evolution on model trees, generating about 1000 datasets) confirms these substantial reductions in error rates and extremely fast convergence rates. In particular, we report that DCM-boosted Neighbor-Joining has only 8% of the error of Neighbor-Joining under conditions that are adverse to Neighbor-Joining, and on some trees achieving acceptable error rates (less than 5% error in the topology estimation) from sequences of a few hundred nucleotides, while Neighbor-Joining needs more than 10 K nucleotides to achieve the same level of accuracy.

UR - http://www.scopus.com/inward/record.url?scp=0032683036&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032683036&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:0032683036

SP - 198

EP - 207

BT - Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB

PB - ACM

CY - New York, NY, United States

ER -