TY - JOUR
T1 - Multilocus phylogenetic analysis with gene tree clustering
AU - Yoshida, Ruriko
AU - Fukumizu, Kenji
AU - Vogiatzis, Chrysafis
N1 - Funding Information:
The authors would like to thank the editor and the anonymous referees for their useful comments for improving the manuscript. Funding: K. F. and R. Y. were supported by JSPS KAKENHI 26540016. C. V. would also like to acknowledge support from ND EPSCoR NSF #1355466.
Publisher Copyright:
© 2017, Springer Science+Business Media New York.
PY - 2019/5/1
Y1 - 2019/5/1
N2 - Both theoretical and empirical evidence point to the fact that phylogenetic trees of different genes (loci) do not display precisely matched topologies. Nonetheless, most genes do display related phylogenies; this implies they form cohesive subsets (clusters). In this work, we discuss gene tree clustering, focusing on the normalized cut (Ncut) framework as a suitable method for phylogenetics. We proceed to show that this framework is both efficient and statistically accurate when clustering gene trees using the geodesic distance between them over the Billera–Holmes–Vogtmann tree space. We also conduct a computational study on the performance of different clustering methods, with and without preprocessing, under different distance metrics, and using a series of dimensionality reduction techniques. Our results with simulated data reveal that Ncut accurately clusters the set of gene trees, given a species tree under the coalescent process. Other observations from our computational study include the similar performance displayed by Ncut and k-means under most dimensionality reduction schemes, the worse performance of hierarchical clustering, and the significantly better performance of the neighbor-joining method with the p-distance compared to the maximum-likelihood estimation method. Supplementary material, all codes, and the data used in this work are freely available at http://polytopes.net/research/cluster/ online.
KW - Clustering
KW - Normalized cut
KW - Phylogenetics
UR - http://www.scopus.com/inward/record.url?scp=85014220671&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85014220671&partnerID=8YFLogxK
U2 - 10.1007/s10479-017-2456-9
DO - 10.1007/s10479-017-2456-9
M3 - Article
AN - SCOPUS:85014220671
VL - 276
SP - 293
EP - 313
JO - Annals of Operations Research
JF - Annals of Operations Research
SN - 0254-5330
IS - 1-2
ER -