TY - GEN

T1 - The Metagenomic Binning Problem

T2 - 2019 IEEE Information Theory Workshop, ITW 2019

AU - Greenberg, Grant

AU - Shomorony, Ilan

N1 - Publisher Copyright:
© 2019 IEEE.

PY - 2019/8

Y1 - 2019/8

N2 - The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.

AB - The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.

UR - http://www.scopus.com/inward/record.url?scp=85081099457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85081099457&partnerID=8YFLogxK

U2 - 10.1109/ITW44776.2019.8988939

DO - 10.1109/ITW44776.2019.8988939

M3 - Conference contribution

AN - SCOPUS:85081099457

T3 - 2019 IEEE Information Theory Workshop, ITW 2019

BT - 2019 IEEE Information Theory Workshop, ITW 2019

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 25 August 2019 through 28 August 2019

ER -