TY - JOUR
T1 - Unique Reconstruction of Coded Strings from Multiset Substring Spectra
AU - Gabrys, Ryan
AU - Milenkovic, Olgica
N1 - Funding Information:
Manuscript received April 12, 2018; revised April 22, 2019; accepted July 31, 2019. Date of publication August 19, 2019; date of current version November 20, 2019. This work was supported in part by NSF under Grant CCF 16-18366 and in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant Agreement CCF-0939370.
Publisher Copyright:
© 1963-2012 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - The problem of reconstructing strings from their substring spectra has a long history and in its most simple incarnation asks for determining under which conditions the spectrum uniquely determines the string. We study the problem of coded string reconstruction from multiset substring spectra, where the strings are restricted to lie in some codebook. In particular, we consider binary codebooks that allow for unique string reconstruction and propose a new method, termed repeat replacement, to create the codebook. Our contributions include algorithmic solutions for repeat replacement and constructive redundancy bounds for the underlying coding schemes. We also consider extensions of the problem to noisy settings in which substrings are compromised by burst and random errors. The study is motivated by applications in DNA-based data storage systems that use high throughput readout sequencers.
AB - The problem of reconstructing strings from their substring spectra has a long history and in its most simple incarnation asks for determining under which conditions the spectrum uniquely determines the string. We study the problem of coded string reconstruction from multiset substring spectra, where the strings are restricted to lie in some codebook. In particular, we consider binary codebooks that allow for unique string reconstruction and propose a new method, termed repeat replacement, to create the codebook. Our contributions include algorithmic solutions for repeat replacement and constructive redundancy bounds for the underlying coding schemes. We also consider extensions of the problem to noisy settings in which substrings are compromised by burst and random errors. The study is motivated by applications in DNA-based data storage systems that use high throughput readout sequencers.
KW - Data storage systems
KW - biological information theory
UR - http://www.scopus.com/inward/record.url?scp=85077394432&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077394432&partnerID=8YFLogxK
U2 - 10.1109/TIT.2019.2935973
DO - 10.1109/TIT.2019.2935973
M3 - Article
AN - SCOPUS:85077394432
SN - 0018-9448
VL - 65
SP - 7682
EP - 7696
JO - IRE Professional Group on Information Theory
JF - IRE Professional Group on Information Theory
IS - 12
M1 - 8805120
ER -