Abstract
Recent advancements in DNA-based storage prototypes focus on encoding information across multiple DNA molecules. This approach utilizes high-throughput sequencing technologies, leading to outputs that are out-of-order. We study the shuffling channel, where input codewords are split into fixed-size fragments. We show that achieving channel capacity uses index-based coding, which assigns unique indices to each fragment. We also introduce two more complex channels, which aim to model popular sequencing strategies in DNA sequencing. In the torn-paper channel, the input codeword is torn up into fragments of random sizes, while in the shotgun sequencing channel, fixed-length random substrings of the input codeword are observed at the output. In both of these channels, the lack of ordering cannot be circumvented by simply adding unique indices to the fragments. We show how the capacity of both of these channels can be achieved using random codes. We introduce and analyze code constructions based on index sequences. While these codes are computationally efficient, they are not capacity-achieving, and we leave the questions of finding efficient capacity-achieving codes for these settings as open problems.
Original language | English (US) |
---|---|
Pages (from-to) | 334-348 |
Number of pages | 15 |
Journal | IEEE Transactions on Molecular, Biological, and Multi-Scale Communications |
Volume | 10 |
Issue number | 2 |
DOIs | |
State | Published - Jun 1 2024 |
Keywords
- Biological information theory
- Channel coding
- DNA
- Decoding
ASJC Scopus subject areas
- Biotechnology
- Bioengineering
- Modeling and Simulation
- Computer Networks and Communications
- Electrical and Electronic Engineering