Estimating optimal species trees from incomplete gene trees under deep coalescence

M. D. Shamsuzzoha Bayzid, Tandy Warnow

Research output: Contribution to journal (Article)

Abstract

The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called "deep coalescence") is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.
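To make the notion of incomplete gene trees concrete, here is a minimal sketch of how a supertree method such as MRP can accept them. It uses the standard matrix-representation coding for rooted trees: each internal clade of each gene tree becomes one binary character, and a species missing from a gene tree is coded "?" for that tree's characters. This is not the authors' implementation. The nested-tuple tree format and the names `leaf_set`, `clades`, and `mrp_matrix` are assumptions made for illustration, and the parsimony search that MRP then runs on the matrix is not shown.

```python
def leaf_set(tree):
    """Return the set of taxon names under a nested-tuple tree (hypothetical format)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaf_set(child)
    return result


def clades(tree):
    """Yield the leaf set of every internal node strictly below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield frozenset(leaf_set(child))
            yield from clades(child)


def mrp_matrix(gene_trees):
    """Encode rooted gene trees as an MRP character matrix.

    Returns a dict mapping each taxon to a string over {'0', '1', '?'},
    with one character per internal clade across all gene trees.
    """
    taxa = sorted(set().union(*(leaf_set(t) for t in gene_trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in gene_trees:
        present = leaf_set(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon].append("?")  # species absent from this gene tree
                elif taxon in clade:
                    rows[taxon].append("1")  # species inside the clade
                else:
                    rows[taxon].append("0")  # species in the tree, outside the clade
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two toy gene trees; the second lacks species B and D, the first lacks E.
gene_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), "E"),
]
for taxon, row in mrp_matrix(gene_trees).items():
    print(taxon, row)
```

The resulting matrix has one row per species over the union of all gene trees, so genes with different species coverage can be combined without discarding any of them.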

Original language: English (US)
Pages (from-to): 591-605
Number of pages: 15
Journal: Journal of Computational Biology
Volume: 19
Issue number: 6
DOI: 10.1089/cmb.2012.0037
State: Published - Jun 1 2012
Externally published: Yes

Keywords

  • algorithms

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Modeling and Simulation
  • Computational Theory and Mathematics

Cite this

Estimating optimal species trees from incomplete gene trees under deep coalescence. / Shamsuzzoha Bayzid, M. D.; Warnow, Tandy.

In: Journal of Computational Biology, Vol. 19, No. 6, 01.06.2012, p. 591-605.

@article{a24ca3a13f234bfba3fa3eeaf7661b8c,
title = "Estimating optimal species trees from incomplete gene trees under deep coalescence",
abstract = "The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called {"}deep coalescence{"}) is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.",
keywords = "algorithms",
author = "{Shamsuzzoha Bayzid}, {M. D.} and Tandy Warnow",
year = "2012",
month = "6",
day = "1",
doi = "10.1089/cmb.2012.0037",
language = "English (US)",
volume = "19",
pages = "591--605",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "6",
}

TY - JOUR

T1 - Estimating optimal species trees from incomplete gene trees under deep coalescence

AU - Shamsuzzoha Bayzid, M. D.

AU - Warnow, Tandy

PY - 2012/6/1

Y1 - 2012/6/1

N2 - The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called "deep coalescence") is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.

KW - algorithms

UR - http://www.scopus.com/inward/record.url?scp=84862568674&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862568674&partnerID=8YFLogxK

U2 - 10.1089/cmb.2012.0037

DO - 10.1089/cmb.2012.0037

M3 - Article

C2 - 22697236

AN - SCOPUS:84862568674

VL - 19

SP - 591

EP - 605

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 6

ER -