TY - JOUR
T1 - Comparison of the Strengths and Weaknesses of Machine Learning Algorithms and Feature Selection on KEGG Database Microbial Gene Pathway Annotation and Its Effects on Reconstructed Network Topology
AU - Robben, Michael
AU - Nasr, Mohammad Sadegh
AU - Das, Avishek
AU - Veerla, Jai Prakash
AU - Huber, Manfred
AU - Jaworski, Justyn
AU - Weidanz, Jon
AU - Luber, Jacob
N1 - Funding Information:
The published work was made possible by the University of Texas System Rising STARs Award (J.M.L.) and the CPRIT First Time Faculty Award (J.M.L.). The authors would also like to express their thanks to Fiza Saeed for her assistance with the curation of data from the KEGG database.
Publisher Copyright:
© Copyright 2023, Mary Ann Liebert, Inc., publishers 2023.
PY - 2023/7/1
Y1 - 2023/7/1
N2 - The development of tools for the annotation of genes from newly sequenced species has not evolved much beyond homologous alignment to previously annotated species. While the quality of gene annotations continues to decline as we sequence and assemble more evolutionarily distant gut microbiome species, machine learning presents a high-quality alternative to traditional techniques. In this study, we investigate the relative performance of common classical and nonclassical machine learning algorithms on the problem of gene annotation, using genes of human microbiome-associated species from the KEGG database. The majority of the ensemble, clustering, and deep learning algorithms that we investigated showed higher prediction accuracy than CD-HIT in predicting partial KEGG function. Motif-based machine learning methods of annotation in new species were faster and had higher precision-recall than methods of homologous alignment or orthologous gene clustering. Gradient-boosted ensemble methods and neural networks also predicted higher connectivity in reconstructed KEGG pathways, finding twice as many new pathway interactions as BLAST alignment. The use of motif-based machine learning algorithms in annotation software will allow researchers to develop powerful tools to interact with bacterial microbiomes in ways previously unachievable through homologous sequence alignment alone.
AB - The development of tools for the annotation of genes from newly sequenced species has not evolved much beyond homologous alignment to previously annotated species. While the quality of gene annotations continues to decline as we sequence and assemble more evolutionarily distant gut microbiome species, machine learning presents a high-quality alternative to traditional techniques. In this study, we investigate the relative performance of common classical and nonclassical machine learning algorithms on the problem of gene annotation, using genes of human microbiome-associated species from the KEGG database. The majority of the ensemble, clustering, and deep learning algorithms that we investigated showed higher prediction accuracy than CD-HIT in predicting partial KEGG function. Motif-based machine learning methods of annotation in new species were faster and had higher precision-recall than methods of homologous alignment or orthologous gene clustering. Gradient-boosted ensemble methods and neural networks also predicted higher connectivity in reconstructed KEGG pathways, finding twice as many new pathway interactions as BLAST alignment. The use of motif-based machine learning algorithms in annotation software will allow researchers to develop powerful tools to interact with bacterial microbiomes in ways previously unachievable through homologous sequence alignment alone.
KW - biological databases
KW - functional annotation
KW - machine learning
KW - network biology
UR - http://www.scopus.com/inward/record.url?scp=85164541544&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85164541544&partnerID=8YFLogxK
U2 - 10.1089/cmb.2022.0370
DO - 10.1089/cmb.2022.0370
M3 - Article
C2 - 37437088
AN - SCOPUS:85164541544
SN - 1066-5277
VL - 30
SP - 766
EP - 782
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 7
ER -