TY - GEN
T1 - pplacerDC
T2 - 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
AU - Koning, Elizabeth
AU - Phillips, Malachi
AU - Warnow, Tandy
N1 - Funding Information:
This work was supported in part by the US National Science Foundation through grant ABI-1458652 to TW. This study was performed on the Illinois Campus Cluster, a resource operated and financially supported by UIUC in conjunction with the National Center for Supercomputing Applications.
Publisher Copyright:
© 2021 ACM.
PY - 2021/1/18
Y1 - 2021/1/18
N2 - Motivation: Phylogenetic placement (i.e., the insertion of a sequence into a phylogenetic tree) is a basic step in several bioinformatics pipelines, including taxon identification in metagenomic analysis and large-scale phylogeny estimation. The most accurate current method is pplacer, which attempts to optimize the placement using maximum likelihood, but it frequently fails on datasets where the phylogenetic tree has 5000 leaves. APPLES is currently the most scalable method, and EPA-ng, although more scalable than pplacer and more accurate than APPLES, also fails on many 50,000-taxon trees. Here we describe pplacerDC, a divide-and-conquer approach that enables pplacer to be used when the phylogenetic tree is very large. Results: Our study shows that pplacerDC has excellent accuracy and scalability, matching pplacer where pplacer can run, improving accuracy compared to APPLES and EPA-ng, and running on datasets with up to 100,000 sequences. Availability: The pplacerDC code is available on GitHub at https://github.com/kodingkoning/pplacerDC.
AB - Motivation: Phylogenetic placement (i.e., the insertion of a sequence into a phylogenetic tree) is a basic step in several bioinformatics pipelines, including taxon identification in metagenomic analysis and large-scale phylogeny estimation. The most accurate current method is pplacer, which attempts to optimize the placement using maximum likelihood, but it frequently fails on datasets where the phylogenetic tree has 5000 leaves. APPLES is currently the most scalable method, and EPA-ng, although more scalable than pplacer and more accurate than APPLES, also fails on many 50,000-taxon trees. Here we describe pplacerDC, a divide-and-conquer approach that enables pplacer to be used when the phylogenetic tree is very large. Results: Our study shows that pplacerDC has excellent accuracy and scalability, matching pplacer where pplacer can run, improving accuracy compared to APPLES and EPA-ng, and running on datasets with up to 100,000 sequences. Availability: The pplacerDC code is available on GitHub at https://github.com/kodingkoning/pplacerDC.
KW - phylogenetic placement
KW - pplacer
UR - http://www.scopus.com/inward/record.url?scp=85112374757&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112374757&partnerID=8YFLogxK
U2 - 10.1145/3459930.3469516
DO - 10.1145/3459930.3469516
M3 - Conference contribution
AN - SCOPUS:85112374757
T3 - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
BT - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
PB - Association for Computing Machinery
Y2 - 1 August 2021 through 4 August 2021
ER -