TY - JOUR
T1 - Long-Branch Attraction in Species Tree Estimation
T2 - Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods
AU - Roch, Sebastien
AU - Nute, Michael
AU - Warnow, Tandy
N1 - Publisher Copyright: © The Author(s) 2018.
PY - 2019/3/1
Y1 - 2019/3/1
N2 - With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.
KW - Incomplete lineage sorting
KW - partitioned likelihood analysis
KW - species tree estimation
KW - statistical consistency
UR - http://www.scopus.com/inward/record.url?scp=85061486542&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061486542&partnerID=8YFLogxK
U2 - 10.1093/sysbio/syy061
DO - 10.1093/sysbio/syy061
M3 - Article
C2 - 30247732
AN - SCOPUS:85061486542
SN - 1063-5157
VL - 68
SP - 281
EP - 297
JO - Systematic Biology
JF - Systematic Biology
IS - 2
ER -