Description
This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".
The zip file has the following structure (presented as an example):
salma_paper_datasets/
|_README.md
|_10aa/
|_crw/
|_homfam/
|  |_aat/
|  |  |_...
|  |_...
|_het/
|  |_5000M2-het/
|  |  |_...
|  |_5000M3-het/
|  |_...
|_rec_res/
Generally, the structure can be viewed as:
[category]/[dataset]/[replicate]/[alignment files]
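As a rough illustration of the layout above, the following Python sketch locates every directory of alignment files after the zip file has been extracted. It assumes that each such directory contains `all.unaln.fasta` and makes no assumption about how replicate directories are named; the function names `find_alignment_dirs` and `describe` are hypothetical and not part of the upload.

```python
from pathlib import Path


def find_alignment_dirs(root):
    """Return every directory under root that holds an all.unaln.fasta file."""
    root = Path(root)
    return sorted(p.parent for p in root.rglob("all.unaln.fasta"))


def describe(root, aln_dir):
    """Split a directory path (relative to root) into category, dataset, replicate.

    Assumption: the path follows [category]/[dataset]/[replicate]; any level
    that is absent is returned as None.
    """
    parts = Path(aln_dir).relative_to(root).parts
    category = parts[0] if len(parts) > 0 else None
    dataset = parts[1] if len(parts) > 1 else None
    replicate = "/".join(parts[2:]) or None
    return category, dataset, replicate


if __name__ == "__main__":
    root = Path("salma_paper_datasets")  # path to the extracted zip file
    for aln_dir in find_alignment_dirs(root):
        print(describe(root, aln_dir), aln_dir)
```

Searching for a marker file rather than hard-coding the directory depth keeps the sketch usable even if some datasets omit a level of the hierarchy.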
# Categories:
1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: There are the 10 largest Homfam datasets, each with one replicate.
4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec\_res: Contains the Rec and Res datasets. Details of how these datasets were generated can be found in the supplementary materials of the paper.
# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignment of all sequences. If not all sequences have reference alignments, only those that do are included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences whose lengths are not within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignment of the query sequences. If not all queries have reference alignments, only those that do are included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences whose lengths are within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignment of the backbone sequences. If not all backbone sequences have reference alignments, only those that do are included.
>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.
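Because some of the six files may be absent, code that reads a sub-directory should check for each file before opening it. The Python sketch below does this and also shows one reading of the 25% length rule. It is illustrative only: the helper names are hypothetical, and the inclusive bound and the use of `statistics.median` are assumptions, since the description does not state exactly how the threshold was applied.

```python
from pathlib import Path
from statistics import median

# The six file names listed above.
FILES = [
    "all.unaln.fasta",
    "all.aln.fasta",
    "all-queries.unaln.fasta",
    "all-queries.aln.fasta",
    "backbone.unaln.fasta",
    "backbone.aln.fasta",
]


def read_fasta(path):
    """Parse a FASTA file into a dict mapping header text to sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def load_available(directory):
    """Read whichever of the six files exist; missing files are skipped."""
    directory = Path(directory)
    return {
        fname: read_fasta(directory / fname)
        for fname in FILES
        if (directory / fname).is_file()
    }


def split_by_length(seqs, tolerance=0.25):
    """Split unaligned sequences into (backbone, query) name sets.

    Assumption: "within 25% of the median length" means
    abs(length - median) <= tolerance * median.
    """
    lengths = {n: len(s) for n, s in seqs.items()}
    med = median(lengths.values())
    backbone = {n for n, length in lengths.items()
                if abs(length - med) <= tolerance * med}
    return backbone, set(seqs) - backbone
```

For example, `split_by_length(load_available(some_dir)["all.unaln.fasta"])` gives a backbone/query split under these assumptions, which may differ at the boundary from the split shipped in `backbone.unaln.fasta` and `all-queries.unaln.fasta`.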
# Additional file(s)
1. `350378genomes.txt`: This file lists the names of all 350,378 bacterial and archaeal genomes that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
Date made available | Aug 8 2022 |
---|---|
Publisher | University of Illinois Urbana-Champaign |
Keywords
- sequence length heterogeneity
- SALMA
- alignment
- eHMM
- MAFFT