Long-Branch Attraction in Species Tree Estimation

Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods

Sebastien Roch, Michael Nute, Tandy Warnow

Research output: Contribution to journal (Article)

Abstract

With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.

Original language: English (US)
Pages (from-to): 281-297
Number of pages: 17
Journal: Systematic Biology
Volume: 68
Issue number: 2
DOI: 10.1093/sysbio/syy061
State: Published - Mar 1 2019

Keywords

  • Incomplete lineage sorting
  • partitioned likelihood analysis
  • species tree estimation
  • statistical consistency

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics

Cite this

Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods. / Roch, Sebastien; Nute, Michael; Warnow, Tandy.

In: Systematic Biology, Vol. 68, No. 2, 01.03.2019, p. 281-297.

@article{1e31d7fd57ac428da57b214b3ff2a915,
title = "Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods",
abstract = "With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, {"}gene tree heterogeneity{"}, which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be {"}statistically consistent{"}). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.",
keywords = "Incomplete lineage sorting, partitioned likelihood analysis, species tree estimation, statistical consistency",
author = "Sebastien Roch and Michael Nute and Tandy Warnow",
year = "2019",
month = "3",
day = "1",
doi = "10.1093/sysbio/syy061",
language = "English (US)",
volume = "68",
pages = "281--297",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "2",
}

TY - JOUR

T1 - Long-Branch Attraction in Species Tree Estimation

T2 - Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods

AU - Roch, Sebastien

AU - Nute, Michael

AU - Warnow, Tandy

PY - 2019/3/1

Y1 - 2019/3/1

N2 - With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.

KW - Incomplete lineage sorting

KW - partitioned likelihood analysis

KW - species tree estimation

KW - statistical consistency

UR - http://www.scopus.com/inward/record.url?scp=85061486542&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061486542&partnerID=8YFLogxK

U2 - 10.1093/sysbio/syy061

DO - 10.1093/sysbio/syy061

M3 - Article

VL - 68

SP - 281

EP - 297

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 2

ER -