Abstract
In contrast to second-generation DNA sequencing technologies, emerging third-generation technologies are capable of delivering reads that are long enough to enable perfect genome assembly. Unfortunately, the benefits of long reads are accompanied by higher rates of read errors. This motivates a question of fundamental import: what read-length and error-rate combinations allow for perfect assembly of the genome? Formal investigation of this tradeoff is complicated by the fact that tractable probabilistic models for the genome sequence and error process fail to capture key features of the problem: real genomes contain long repetitive patterns, and read errors are often bursty and sequence-dependent. In order to circumvent these modeling barriers and take a first step toward the study of this question, we consider a simple setting: the genome sequence is arbitrary, and the read errors are erasures that can occur at adversarially chosen positions, up to a limit in the number of erasures per read and per genome position. In this context, we show that a natural error-correction scheme is optimal in the sense that it recovers the error-free k-spectrum of the genome for the largest possible k. The worst-case nature of our analysis ensures that the proposed error-correction method is robust and allows us to study its performance under stochastic error models. As a result, we show that, for several real genomes, the impact of read errors on the information-theoretic requirements for perfect assembly is relatively mild.
Original language | English (US) |
---|---|
Article number | 7790807 |
Pages (from-to) | 199-208 |
Number of pages | 10 |
Journal | IEEE Transactions on Molecular, Biological, and Multi-Scale Communications |
Volume | 2 |
Issue number | 2 |
DOIs | |
State | Published - Dec 2016 |
Externally published | Yes |
Keywords
- DNA sequencing
- Erasure model
- Error correction
- Genome assembly
ASJC Scopus subject areas
- Biotechnology
- Bioengineering
- Computer Networks and Communications
- Electrical and Electronic Engineering
- Modeling and Simulation