TY - JOUR
T1 - Removing the bad apples
T2 - A simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms
AU - Cerca, José
AU - Maurstad, Marius F.
AU - Rochette, Nicolas C.
AU - Rivera-Colón, Angel G.
AU - Rayamajhi, Niraj
AU - Catchen, Julian M.
AU - Struck, Torsten H.
N1 - Publisher Copyright: © 2021 The Authors. Methods in Ecology and Evolution published by John Wiley & Sons Ltd on behalf of British Ecological Society
PY - 2021/5
Y1 - 2021/5
AB - The restriction site-associated DNA (RADseq) family of protocols involves digesting DNA and sequencing the region flanking the cut site, thus providing a cost- and time-efficient way of obtaining thousands of genomic markers. However, when working with non-model taxa with few genomic resources, optimization of RADseq wet-lab and bioinformatic tools may be challenging, often resulting in allele dropout, that is, when a given RADseq locus is not sequenced in one or more individuals, resulting in missing data. Additionally, as datasets include divergent taxa, rates of dropout will increase since restriction sites may be lost due to mutation. Mitigating the impacts of allele dropout is crucial, as missing data may lead to incorrect inferences in population genetics and phylogenetics. Here, we demonstrate a simple pipeline for the optimization of RADseq datasets which involves partitioning datasets into subgroups, namely by reducing and analysing the dataset at a population or species level. By running the software Stacks at a subgroup level, we were able to reliably identify and remove individuals with high levels of missing data (bad apples), likely stemming from DNA quality or from artefacts in library preparation or sequencing. Removal of the bad apples generally led to an increase in loci and a decrease in missing data in the final datasets. The biological interpretability of the data, as measured by the number of retrieved loci and missing data, was considerably increased.
KW - RADseq
KW - ddRADseq
KW - genetics
KW - genome
KW - genomics
KW - library preparation
KW - methods
KW - pipeline
UR - http://www.scopus.com/inward/record.url?scp=85100964257&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100964257&partnerID=8YFLogxK
U2 - 10.1111/2041-210X.13562
DO - 10.1111/2041-210X.13562
M3 - Article
AN - SCOPUS:85100964257
SN - 2041-210X
VL - 12
SP - 805
EP - 817
JO - Methods in Ecology and Evolution
JF - Methods in Ecology and Evolution
IS - 5
ER -