Description

This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".

The zip file has the following structure (presented as an example):
salma_paper_datasets/
|_README.md
|_10aa/
|_crw/
|_homfam/
| |_aat/
| | |_...
| |_...
|_het/
| |_5000M2-het/
| | |_...
| |_5000M3-het/
| |_...
|_rec_res/


Generally, the structure can be viewed as:
[category]/[dataset]/[replicate]/[alignment files]
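
As a rough illustration of this layout, the following Python sketch walks an extracted copy of the archive and groups the `.fasta` files by the directory that holds them. The function name `index_fasta_dirs` is hypothetical and not part of the dataset. The sketch makes no assumption about how many directory levels sit below each category.

```python
from pathlib import Path


def index_fasta_dirs(root):
    """Map each sub-directory (relative to root) to the .fasta files it holds.

    Hypothetical helper for illustration only; it simply searches
    recursively, so it works whether or not a replicate level is present.
    """
    root = Path(root)
    index = {}
    # sorted() gives a stable order for directories and for file names.
    for fasta in sorted(root.rglob("*.fasta")):
        rel_dir = fasta.parent.relative_to(root).as_posix()
        index.setdefault(rel_dir, []).append(fasta.name)
    return index
```

Calling `index_fasta_dirs("salma_paper_datasets")` on the extracted archive should return one entry per directory that contains alignment files, keyed by a path of the form `category/dataset/replicate`.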

# Categories:
1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: Contains the 10 largest Homfam datasets, each with one replicate.
4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec\_res: Contains the Rec and Res datasets. Details of how these datasets were generated can be found in the supplementary materials of the paper.

# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only those that do are included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are those whose lengths are not within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only those that do are included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are those whose lengths are within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only those that do are included.

>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.
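
The backbone/query distinction described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the datasets. It assumes that "within 25% of the median length" means `|length - median| <= 0.25 * median`, and the names `read_fasta` and `split_backbone_queries` are hypothetical.

```python
from statistics import median


def read_fasta(lines):
    """Parse FASTA lines into a dict of sequence name -> sequence string."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def split_backbone_queries(seqs, tolerance=0.25):
    """Split unaligned sequences into (backbone, queries) by length.

    Assumption: a sequence is full-length (backbone) when its length
    differs from the median length by at most tolerance * median.
    """
    med = median(len(s) for s in seqs.values())
    backbone, queries = {}, {}
    for name, seq in seqs.items():
        if abs(len(seq) - med) <= tolerance * med:
            backbone[name] = seq
        else:
            queries[name] = seq
    return backbone, queries
```

For example, passing the parsed contents of `all.unaln.fasta` to `split_backbone_queries` yields two dictionaries that, under the stated assumption, should correspond to `backbone.unaln.fasta` and `all-queries.unaln.fasta`. Boundary cases may differ if the original procedure used a different rounding or inequality.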

# Additional file(s)
1. `350378genomes.txt`: Contains the names of all 350,378 bacterial and archaeal genomes that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.

Date made available: Aug 8 2022
Publisher: University of Illinois Urbana-Champaign

Keywords

  • sequence length heterogeneity
  • SALMA
  • alignment
  • eHMM
  • MAFFT
