Fundamental Limits of Genome Assembly under an Adversarial Erasure Model

Ilan Shomorony, Thomas A. Courtade, David Tse

Research output: Contribution to journalArticlepeer-review

Abstract

In contrast to second-generation DNA sequencing technologies, emerging third-generation technologies are capable of delivering reads that are long enough to enable perfect genome assembly. Unfortunately, the benefits of long reads are accompanied by higher rates of read errors. This motivates a question of fundamental import: what read-length and error-rate combinations allow for perfect assembly of the genome? Formal investigation of this tradeoff is complicated by the fact that tractable probabilistic models for the genome sequence and error process fail to capture key features of the problem: real genomes contain long repetitive patterns, and read errors are often bursty and sequence-dependent. In order to circumvent these modeling barriers and take a first step toward the study of this question, we consider a simple setting: the genome sequence is arbitrary, and the read errors are erasures that can occur at adversarially chosen positions, up to a limit in the number of erasures per read and per genome position. In this context, we show that a natural error-correction scheme is optimal in the sense that it recovers the error-free k-spectrum of the genome for the largest possible k. The worst-case nature of our analysis ensures that the proposed error-correction method is robust and allows us to study its performance under stochastic error models. As a result, we show that, for several real genomes, the impact of read errors on the information-theoretic requirements for perfect assembly is relatively mild.

Original languageEnglish (US)
Article number7790807
Pages (from-to)199-208
Number of pages10
JournalIEEE Transactions on Molecular, Biological, and Multi-Scale Communications
Volume2
Issue number2
DOIs
StatePublished - Dec 2016
Externally publishedYes

Keywords

  • DNA sequencing
  • Erasure model
  • Error correction
  • Genome assembly

ASJC Scopus subject areas

  • Biotechnology
  • Bioengineering
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'Fundamental Limits of Genome Assembly under an Adversarial Erasure Model'. Together they form a unique fingerprint.

Cite this