Abstract

Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.

Original languageEnglish (US)
Article number6
JournalAlgorithms for Molecular Biology
Volume13
Issue number1
DOIs
StatePublished - Mar 15 2018

Fingerprint

Completion
Polynomial time
Genes
Polynomials
Gene
Sorting
Binary trees
Leaves
Completion Problem
Sampling
Binary Tree

Keywords

  • Gene trees
  • Missing data
  • Multispecies coalescent
  • Phylogenomics
  • Species trees

ASJC Scopus subject areas

  • Structural Biology
  • Molecular Biology
  • Computational Theory and Mathematics
  • Applied Mathematics

Cite this

OCTAL : Optimal Completion of gene trees in polynomial time. / Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy.

In: Algorithms for Molecular Biology, Vol. 13, No. 1, 6, 15.03.2018.

Research output: Contribution to journalArticle

Christensen, Sarah ; Molloy, Erin K. ; Vachaspati, Pranjal ; Warnow, Tandy. / OCTAL : Optimal Completion of gene trees in polynomial time. In: Algorithms for Molecular Biology. 2018 ; Vol. 13, No. 1.
@article{d0645e2825cb448596ee02de38aaa2a7,
title = "OCTAL: Optimal Completion of gene trees in polynomial time",
abstract = "Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.",
keywords = "Gene trees, Missing data, Multispecies coalescent, Phylogenomics, Species trees",
author = "Sarah Christensen and Molloy, {Erin K.} and Pranjal Vachaspati and Tandy Warnow",
year = "2018",
month = "3",
day = "15",
doi = "10.1186/s13015-018-0124-5",
language = "English (US)",
volume = "13",
journal = "Algorithms for Molecular Biology",
issn = "1748-7188",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - OCTAL

T2 - Optimal Completion of gene trees in polynomial time

AU - Christensen, Sarah

AU - Molloy, Erin K.

AU - Vachaspati, Pranjal

AU - Warnow, Tandy

PY - 2018/3/15

Y1 - 2018/3/15

N2 - Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.

AB - Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson-Foulds (RF) distance, and we prove that OCTAL runs in O(n2) O (n2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.

KW - Gene trees

KW - Missing data

KW - Multispecies coalescent

KW - Phylogenomics

KW - Species trees

UR - http://www.scopus.com/inward/record.url?scp=85043780303&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85043780303&partnerID=8YFLogxK

U2 - 10.1186/s13015-018-0124-5

DO - 10.1186/s13015-018-0124-5

M3 - Article

AN - SCOPUS:85043780303

VL - 13

JO - Algorithms for Molecular Biology

JF - Algorithms for Molecular Biology

SN - 1748-7188

IS - 1

M1 - 6

ER -