TY - JOUR
T1 - Phylogeny Estimation Given Sequence Length Heterogeneity
AU - Smirnov, Vladimir
AU - Warnow, Tandy
N1 - Funding Information:
The authors acknowledge the support of the US National Science Foundation under grants ABI-1458652 and 1513629. We thank Olivier Gascuel, Erick Matsen, Erin Molloy, and the anonymous reviewers for their helpful feedback and suggestions, which led to improvements in the manuscript; we also thank Metin Balaban and Siavash Mirarab for advice on how to run APPLES.
Publisher Copyright:
© 2020 The Author(s).
PY - 2021/3/1
Y1 - 2021/3/1
N2 - Phylogeny estimation is a major step in many biological studies, and has many well-known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given large datasets with a combination of full-length sequences and fragmentary sequences, which can arise for a variety of reasons, including sample collection, sequencing technologies, and analytical pipelines. We compare two basic approaches: (1) computing an alignment on the full dataset and then computing a maximum likelihood tree on the alignment, or (2) constructing an alignment and tree on the full-length sequences and then using phylogenetic placement to add the remaining sequences (which will generally be fragmentary) into the tree. We explore these two approaches on a range of simulated datasets, each with 1000 sequences and varying in rates of evolution, and two biological datasets. Our study shows some striking performance differences between methods, especially when there is substantial sequence length heterogeneity and high rates of evolution. We find in particular that using UPP to align sequences and RAxML to compute a tree on the alignment provides the best accuracy, substantially outperforming trees computed using phylogenetic placement methods. We also find that FastTree has poor accuracy on alignments containing fragmentary sequences. Overall, our study provides insights into the literature comparing different methods and pipelines for phylogenetic estimation, and suggests directions for future method development.
KW - Phylogeny estimation
KW - phylogenetic placement
KW - sequence length heterogeneity
UR - http://www.scopus.com/inward/record.url?scp=85098825829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098825829&partnerID=8YFLogxK
U2 - 10.1093/sysbio/syaa058
DO - 10.1093/sysbio/syaa058
M3 - Article
C2 - 32692823
AN - SCOPUS:85098825829
VL - 70
SP - 268
EP - 282
JO - Systematic Biology
JF - Systematic Biology
SN - 1063-5157
IS - 2
ER -