TY - GEN
T1 - The Metagenomic Binning Problem
T2 - 2019 IEEE Information Theory Workshop, ITW 2019
AU - Greenberg, Grant
AU - Shomorony, Ilan
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/8
Y1 - 2019/8
N2 - The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.
AB - The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.
UR - http://www.scopus.com/inward/record.url?scp=85081099457&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081099457&partnerID=8YFLogxK
U2 - 10.1109/ITW44776.2019.8988939
DO - 10.1109/ITW44776.2019.8988939
M3 - Conference contribution
AN - SCOPUS:85081099457
T3 - 2019 IEEE Information Theory Workshop, ITW 2019
BT - 2019 IEEE Information Theory Workshop, ITW 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 August 2019 through 28 August 2019
ER -