TY - GEN
T1 - Measuring the Mappability Spectrum of Reference Genome Assemblies
AU - Stephens, Zachary D.
AU - Iyer, Ravishankar K.
N1 - Funding Information:
This material is based upon work partially supported by a CompGen Fellowship; an IBM Faculty Award; National Science Foundation (NSF) under grants CNS 13-37732, CNS 16-24790 and CNS 16-24615; and the Mayo Clinic Center for Individualized Medicine.
Publisher Copyright:
© 2018 ACM.
PY - 2018/8/15
Y1 - 2018/8/15
N2 - The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
KW - Mappability
KW - Repetitive DNA
KW - Sequence analysis
UR - http://www.scopus.com/inward/record.url?scp=85056117834&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056117834&partnerID=8YFLogxK
U2 - 10.1145/3233547.3233582
DO - 10.1145/3233547.3233582
M3 - Conference contribution
AN - SCOPUS:85056117834
T3 - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
SP - 47
EP - 52
BT - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery
T2 - 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018
Y2 - 29 August 2018 through 1 September 2018
ER -