Weighted statistical binning: Enabling statistically consistent genome-scale phylogenetic analyses

Md Shamsuzzoha Bayzid, Siavash Mirarab, Bastien Boussau, Tandy Warnow

Research output: Contribution to journalArticle

Abstract

Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.

Original languageEnglish (US)
Article numbere0129183
JournalPloS one
Volume10
Issue number6
DOIs
StatePublished - Jun 18 2015

Fingerprint

Genes
Genome
genome
phylogeny
genes
loci
Error analysis
methodology
Bins
Biological Phenomena
Dominant Genes
Sorting
Expressed Sequence Tags
dominant genes
Pipelines
sorting
Software
history

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)
  • General

Cite this

Weighted statistical binning : Enabling statistically consistent genome-scale phylogenetic analyses. / Bayzid, Md Shamsuzzoha; Mirarab, Siavash; Boussau, Bastien; Warnow, Tandy.

In: PloS one, Vol. 10, No. 6, e0129183, 18.06.2015.

Research output: Contribution to journalArticle

Bayzid, Md Shamsuzzoha ; Mirarab, Siavash ; Boussau, Bastien ; Warnow, Tandy. / Weighted statistical binning : Enabling statistically consistent genome-scale phylogenetic analyses. In: PloS one. 2015 ; Vol. 10, No. 6.
@article{b17b071cb46749f58c14b6667e786074,
title = "Weighted statistical binning: Enabling statistically consistent genome-scale phylogenetic analyses",
abstract = "Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called {"}summary methods{"}. Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the {"}statistical binning{"} technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate {"}combinability{"} and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, {"}weighted statistical binning{"} enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.",
author = "Bayzid, {Md Shamsuzzoha} and Siavash Mirarab and Bastien Boussau and Tandy Warnow",
year = "2015",
month = "6",
day = "18",
doi = "10.1371/journal.pone.0129183",
language = "English (US)",
volume = "10",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "6",

}

TY - JOUR

T1 - Weighted statistical binning

T2 - Enabling statistically consistent genome-scale phylogenetic analyses

AU - Bayzid, Md Shamsuzzoha

AU - Mirarab, Siavash

AU - Boussau, Bastien

AU - Warnow, Tandy

PY - 2015/6/18

Y1 - 2015/6/18

N2 - Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.

AB - Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.

UR - http://www.scopus.com/inward/record.url?scp=84939184648&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84939184648&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0129183

DO - 10.1371/journal.pone.0129183

M3 - Article

C2 - 26086579

AN - SCOPUS:84939184648

VL - 10

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 6

M1 - e0129183

ER -