Measuring the Mappability Spectrum of Reference Genome Assemblies

Zachary D. Stephens, Ravishankar K. Iyer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools, we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
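
To make the central quantity concrete, the following is a minimal Python sketch of a "minimum unique read length" computation. It is not NEAT-Repeat's implementation: it assumes a zero error rate (exact matches only), considers the forward strand only, treats a read as starting at the position in question, and uses brute-force substring counting that is only practical for toy sequences. The function names `min_unique_read_lengths` and `mappable_fraction` are hypothetical and chosen for illustration.

```python
from collections import Counter


def min_unique_read_lengths(reference, max_len=None):
    """For each start position i, return the smallest length L such that
    reference[i:i+L] occurs exactly once in the reference.

    Assumptions (illustrative only): exact matching (error rate 0),
    forward strand only, reads start at position i. Positions with no
    unique read within max_len (or before the sequence ends) get None.
    """
    n = len(reference)
    if max_len is None:
        max_len = n
    result = [None] * n
    unresolved = set(range(n))

    for length in range(1, max_len + 1):
        if not unresolved:
            break
        # Count every substring of this length in the reference.
        counts = Counter(
            reference[i:i + length] for i in range(n - length + 1)
        )
        for i in sorted(unresolved):
            if i + length > n:
                # A read of this length would run past the sequence end.
                unresolved.discard(i)
                continue
            if counts[reference[i:i + length]] == 1:
                result[i] = length
                unresolved.discard(i)
    return result


def mappable_fraction(min_lengths, read_length):
    """Fraction of positions uniquely mappable with reads of read_length."""
    hits = sum(1 for v in min_lengths if v is not None and v <= read_length)
    return hits / len(min_lengths)


if __name__ == "__main__":
    lengths = min_unique_read_lengths("ACGTACGA")
    print(lengths)                        # [4, 3, 2, 1, 4, 3, 2, None]
    print(mappable_fraction(lengths, 2))  # 0.375
```

Evaluating `mappable_fraction` over a range of read lengths gives a toy analogue of a mappability spectrum. Handling a nonzero error rate would require comparing candidate reads by edit distance rather than exact equality, which this sketch does not attempt.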

Original language: English (US)
Title of host publication: ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher: Association for Computing Machinery, Inc
Pages: 47-52
Number of pages: 6
ISBN (Electronic): 9781450357944
DOIs
State: Published - Aug 15 2018
Event: 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018 - Washington, United States
Duration: Aug 29 2018 to Sep 1 2018

Publication series

Name: ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Other

Other: 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018
Country/Territory: United States
City: Washington
Period: 8/29/18 to 9/1/18

Keywords

  • Mappability
  • Repetitive DNA
  • Sequence analysis

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Health Informatics
  • Biomedical Engineering
