## Description

This repository includes scripts, datasets, and supplementary materials for the study "NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees", presented at RECOMB-CG 2018. The supplementary figures and tables referenced in the main paper can be found in njmerge-supplementary-materials.pdf. The latest version of NJMerge can be downloaded from GitHub: https://github.com/ekmolloy/njmerge.

***When downloading datasets, please note the following errors.***

In README.txt, lines 37 and 38 should read:

+ fasttree-exon.tre contains lines 1-25, 1-100, or 1-1000 of fasttree-total.tre

+ fasttree-intron.tre contains lines 26-50, 101-200, or 1001-2000 of fasttree-total.tre

Note that the file names (fasttree-exon.tre and fasttree-intron.tre) are swapped.
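
As a minimal sketch of how the corrected line ranges map onto fasttree-total.tre, the snippet below slices a list of lines using the 1-indexed, inclusive ranges given above. The helper name `slice_lines` and the 50-line stand-in list are hypothetical; in practice the list would come from reading fasttree-total.tre.

```python
def slice_lines(lines, first, last):
    """Return lines first..last (1-indexed, inclusive), as numbered in README.txt."""
    return lines[first - 1:last]


# Hypothetical stand-in for the lines of a 50-line fasttree-total.tre.
total = [f"tree_{i}" for i in range(1, 51)]

exon = slice_lines(total, 1, 25)     # corrected range for fasttree-exon.tre
intron = slice_lines(total, 26, 50)  # corrected range for fasttree-intron.tre

print(len(exon), len(intron))
```

The same helper applies to the larger datasets by substituting the ranges 1-100 and 101-200, or 1-1000 and 1001-2000.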

In tools.zip, the compare_trees.py and the compare_tree_lists.py scripts incorrectly refer to the "symmetric difference error rate" as the "Robinson-Foulds error rate". Because the normalized symmetric difference and the normalized Robinson-Foulds distance are equal for binary trees, this does not impact the species tree error rates reported in the study. This could impact the gene tree error rates reported in the study (see data-gene-trees.csv in data.zip), as FastTree-2 returns trees with polytomies whenever 3 or more sequences in the input alignment are identical. Note that the normalized symmetric difference is always greater than or equal to the normalized Robinson-Foulds distance, so the gene tree error rates reported in the study are more conservative.
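
To illustrate why the two measures agree on binary trees but can differ when polytomies are present, here is a minimal sketch. The description does not spell out the formulas, so the sketch assumes that both measures count the bipartitions found in exactly one of the two trees, that the normalized symmetric difference divides this count by the total number of internal bipartitions in the two trees, and that the normalized Robinson-Foulds distance divides it by 2n - 6 for n leaves. The function names and the toy five-leaf trees are hypothetical and are not taken from compare_trees.py.

```python
def normalized_symmetric_difference(biparts_a, biparts_b):
    """Assumed definition: differing bipartitions / total bipartitions in both trees."""
    total = len(biparts_a) + len(biparts_b)
    return len(biparts_a ^ biparts_b) / total if total else 0.0


def normalized_rf(biparts_a, biparts_b, n_leaves):
    """Assumed definition: differing bipartitions / (2n - 6)."""
    return len(biparts_a ^ biparts_b) / (2 * n_leaves - 6)


# Each internal bipartition is stored as the side that excludes leaf "A".
true_tree = {frozenset("CDE"), frozenset("DE")}    # ((A,B),C,(D,E))
est_binary = {frozenset("BDE"), frozenset("DE")}   # ((A,C),B,(D,E))
est_polytomy = {frozenset("DE")}                   # (A,B,C,(D,E))

sd_binary = normalized_symmetric_difference(true_tree, est_binary)
rf_binary = normalized_rf(true_tree, est_binary, 5)
sd_poly = normalized_symmetric_difference(true_tree, est_polytomy)
rf_poly = normalized_rf(true_tree, est_polytomy, 5)

print(sd_binary, rf_binary)  # identical when both trees are binary
print(sd_poly, rf_poly)      # differ when one tree has a polytomy
```

Under these assumed definitions, a tree with a polytomy has fewer than n - 3 internal bipartitions, so the symmetric difference denominator shrinks below 2n - 6 and the normalized symmetric difference can only be equal to or larger than the normalized Robinson-Foulds distance.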

In njmerge-supplementary-materials.pdf, the alpha parameter shown in Supplementary Table S2 is actually the divisor D, which is used to compute alpha for each gene as follows.

1. For each gene, a random value X between 0 and 1 is drawn from a uniform distribution.

2. Alpha is computed as -log(X) / D, where D is 4.2 for exons, 1.0 for UCEs, and 0.4 for introns (as stated in Table S2).

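Note that because the median of the uniform distribution (between 0 and 1) is 0.5, the median alpha value is -log(0.5) / 4.2 = 0.17 for exons, -log(0.5) / 1.0 = 0.69 for UCEs, and -log(0.5) / 0.4 = 1.73 for introns. Because -log(X) is exponentially distributed with mean 1, the mean alpha value is 1 / D, which is 0.24 for exons, 1.0 for UCEs, and 2.5 for introns.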
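
The two steps above can be sketched as follows. This is an illustration only, not the authors' simulation code: the function name `draw_alpha`, the seed, and the sample size are assumptions, and log is taken to be the natural logarithm (consistent with -log(0.5) = 0.69). For X uniform on (0, 1), -log(X) follows an exponential distribution with rate 1, so alpha has median -log(0.5) / D and mean 1 / D; the sketch reports both empirically.

```python
import math
import random
import statistics

# Divisor D per locus type, as listed in Supplementary Table S2.
DIVISORS = {"exon": 4.2, "UCE": 1.0, "intron": 0.4}


def draw_alpha(divisor, rng):
    """Draw X uniformly at random and return alpha = -log(X) / D."""
    # 1 - rng.random() lies in (0, 1], which avoids taking log(0).
    x = 1.0 - rng.random()
    return -math.log(x) / divisor


rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
summary = {}
for locus, divisor in DIVISORS.items():
    alphas = [draw_alpha(divisor, rng) for _ in range(100_000)]
    summary[locus] = (statistics.median(alphas), statistics.mean(alphas))
    print(locus, "median:", round(summary[locus][0], 2),
          "mean:", round(summary[locus][1], 2))
```

With a large sample, the empirical median should approach -log(0.5) / D and the empirical mean should approach 1 / D for each locus type.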

| Date made available | Jul 29 2018 |
|---|---|
| Publisher | University of Illinois Urbana-Champaign |

## Keywords

- phylogenomics
- species trees
- divide-and-conquer
- incomplete lineage sorting