TY - JOUR
T1 - ImPhy
T2 - Imputing Phylogenetic Trees with Missing Information Using Mathematical Programming
AU - Yasui, Niko
AU - Vogiatzis, Chrysafis
AU - Yoshida, Ruriko
AU - Fukumizu, Kenji
N1 - Funding Information:
Part of this work was performed when Niko Yasui was funded as a summer intern in the Institute of Statistical Mathematics under the supervision of Kenji Fukumizu and Ruriko Yoshida. This work was supported by the Japan Society for the Promotion of Science [JSPS KAKENHI 26280009 to Kenji Fukumizu] and the National Science Foundation [ND EPSCoR NSF 1355466 to Chrysafis Vogiatzis, during his tenure at North Dakota State University] and [Division of Mathematical Sciences: CDS&E-MSS program 1622369 to Ruriko Yoshida].
Publisher Copyright:
© 2004-2012 IEEE.
PY - 2020/7/1
Y1 - 2020/7/1
N2 - Advances in modern genomics have allowed researchers to apply phylogenetic analyses on a genome-wide scale. While large volumes of genomic data can be generated cheaply and quickly, data missingness is a non-trivial and somewhat expected problem. Since the available information is often incomplete for a given set of genetic loci and individual organisms, a large proportion of trees that depict the evolutionary history of a single genetic locus, called gene trees, fail to contain all individuals. Data incompleteness causes difficulties in data collection, information extraction, and gene tree inference. Furthermore, identifying outlying gene trees, which can represent horizontal gene transfers, gene duplications, or hybridizations, is difficult when data is missing from the gene trees. The typical approach is to remove all individuals with missing data from the gene trees, and focus the analysis on individuals whose information is fully available - a huge loss of information. In this work, we propose and design an optimization-based imputation approach to infer the missing distances between leaves in a set of gene trees via a mixed integer non-linear programming model. We also present a new research pipeline, imPhy, that can (i) simulate a set of gene trees with leaves randomly missing in each tree, (ii) impute the missing pairwise distances in each gene tree, (iii) reconstruct the gene trees using the Neighbor Joining (NJ) and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) methods, and (iv) analyze and report the efficiency of the reconstruction. To impute the missing leaves, we employ our newly proposed non-linear programming framework, and demonstrate its capability in reconstructing gene trees with incomplete information in both simulated and empirical datasets. In the empirical datasets apicomplexa and lungfish, our imputation has very small normalized mean square errors, even in the extreme case where 50 percent of the individuals in each gene tree are missing. Data, software, and user manuals can be found at https://github.com/yasuiniko/imPhy.
KW - Gene trees
KW - missing information
KW - mixed integer non-linear programming
UR - http://www.scopus.com/inward/record.url?scp=85057817133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057817133&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2018.2884459
DO - 10.1109/TCBB.2018.2884459
M3 - Article
C2 - 30507538
AN - SCOPUS:85057817133
SN - 1545-5963
VL - 17
SP - 1222
EP - 1230
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 4
M1 - 8554124
ER -