TY - JOUR
T1 - Constrained incremental tree building
T2 - New absolute fast converging phylogeny estimation methods with improved scalability and accuracy
AU - Zhang, Qiuyi
AU - Rao, Satish
AU - Warnow, Tandy
N1 - Supported by U.S. National Science Foundation Grant 1535989 (to SR) and 1535977 (to TW).
PY - 2019/2/6
Y1 - 2019/2/6
N2 - Background: Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of number of leaves in the tree (once the shortest and longest branch weights are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was DCMNJ, D C M NJ, published in SODA 2001. The main empirical advantage of DCMNJ DCM NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCMNJ DCM NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. Results: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCMNJ DCM NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
AB - Background: Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of number of leaves in the tree (once the shortest and longest branch weights are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was DCMNJ, D C M NJ, published in SODA 2001. The main empirical advantage of DCMNJ DCM NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCMNJ DCM NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. Results: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCMNJ DCM NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
KW - Absolute fast converging methods
KW - Maximum likelihood
KW - Neighbor joining
KW - Phylogeny estimation
KW - Sample complexity
KW - Short quartets
UR - http://www.scopus.com/inward/record.url?scp=85061115601&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061115601&partnerID=8YFLogxK
U2 - 10.1186/s13015-019-0136-9
DO - 10.1186/s13015-019-0136-9
M3 - Article
C2 - 30839943
AN - SCOPUS:85061115601
SN - 1748-7188
VL - 14
JO - Algorithms for Molecular Biology
JF - Algorithms for Molecular Biology
IS - 1
M1 - 2
ER -