The performance of coalescent-based species tree estimation methods under models of missing data

Michael Nute, Jed Chou, Erin K. Molloy, Tandy Warnow

Research output: Contribution to journal › Article

Abstract

Background: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species.

Results: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data.

Conclusions: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.

Original language: English (US)
Article number: 286
Journal: BMC Genomics
Volume: 19
DOI: 10.1186/s12864-018-4619-8
State: Published - May 8, 2018

Fingerprint

Genes
Horizontal Gene Transfer
Gene Duplication
Expressed Sequence Tags
Gene Deletion
Phylogeny

Keywords

  • ASTRAL
  • ASTRID
  • Incomplete lineage sorting
  • MP-EST
  • Missing data
  • Multi-species coalescent
  • SVDquartets
  • Species tree

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

The performance of coalescent-based species tree estimation methods under models of missing data. / Nute, Michael; Chou, Jed; Molloy, Erin K.; Warnow, Tandy.

In: BMC Genomics, Vol. 19, 286, 08.05.2018.

@article{9f7673b6479d47d2b518470b0e867d38,
title = "The performance of coalescent-based species tree estimation methods under models of missing data",
abstract = "Background: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species. Results: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data. Conclusions: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.",
keywords = "ASTRAL, ASTRID, Incomplete lineage sorting, MP-EST, Missing data, Multi-species coalescent, SVDquartets, Species tree",
author = "Michael Nute and Jed Chou and Molloy, {Erin K.} and Tandy Warnow",
year = "2018",
month = "5",
day = "8",
doi = "10.1186/s12864-018-4619-8",
language = "English (US)",
volume = "19",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",

}

TY - JOUR

T1 - The performance of coalescent-based species tree estimation methods under models of missing data

AU - Nute, Michael

AU - Chou, Jed

AU - Molloy, Erin K.

AU - Warnow, Tandy

PY - 2018/5/8

Y1 - 2018/5/8

N2 - Background: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species. Results: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data. Conclusions: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.

AB - Background: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species. Results: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data. Conclusions: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.

KW - ASTRAL

KW - ASTRID

KW - Incomplete lineage sorting

KW - MP-EST

KW - Missing data

KW - Multi-species coalescent

KW - SVDquartets

KW - Species tree

UR - http://www.scopus.com/inward/record.url?scp=85046675009&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85046675009&partnerID=8YFLogxK

U2 - 10.1186/s12864-018-4619-8

DO - 10.1186/s12864-018-4619-8

M3 - Article

C2 - 29745854

AN - SCOPUS:85046675009

VL - 19

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

M1 - 286

ER -