TY - JOUR
T1 - OCTAL
T2 - Optimal Completion of gene trees in polynomial time
AU - Christensen, Sarah
AU - Molloy, Erin K.
AU - Vachaspati, Pranjal
AU - Warnow, Tandy
N1 - SC and TW were supported by National Science Foundation (Grant CCF-1535977). SC was also supported by the Chirag Foundation Graduate Fellowship in Computer Science. EKM and PV were supported by the National Science Foundation Research Fellowship Program (Grant DGE-1144245). EKM was also supported by the Ira and Debra Cohen Graduate Fellowship in Computer Science. This research was part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (Grants OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This work made use of the Illinois Campus Cluster, a computing resource that is operated by the Illinois Campus Cluster Program in conjunction with the National Center for Supercomputing Applications and which is supported by funds from the University of Illinois at Urbana-Champaign. Publication costs for this paper are paid for by CCF-1535977 (to TW).
PY - 2018/3/15
Y1 - 2018/3/15
N2 - Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.
AB - Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.
KW - Gene trees
KW - Missing data
KW - Multispecies coalescent
KW - Phylogenomics
KW - Species trees
UR - https://www.scopus.com/pages/publications/85043780303
UR - https://www.scopus.com/pages/publications/85043780303#tab=citedBy
U2 - 10.1186/s13015-018-0124-5
DO - 10.1186/s13015-018-0124-5
M3 - Article
C2 - 29568323
AN - SCOPUS:85043780303
SN - 1748-7188
VL - 13
JO - Algorithms for Molecular Biology
JF - Algorithms for Molecular Biology
IS - 1
M1 - 6
ER -