The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called "deep coalescence") is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.
ASJC Scopus subject areas
- Molecular Biology
- Computational Mathematics
- Modeling and Simulation
- Computational Theory and Mathematics