The Metagenomic Binning Problem: Clustering Markov Sequences

Grant Greenberg, Ilan Shomorony

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M ≪ N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.
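As a rough illustration of the last point, the sketch below assigns a contig to the candidate Markov model that minimizes the empirical conditional relative entropy. It is not the authors' implementation. The order-3 chain (a 3-base context plus the next base, matching tetranucleotide counts), the additive smoothing, the assumption of known reference models, and all function names are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
ORDER = 3  # assumed: tetranucleotide = 3-base context + next base


def kmer_counts(seq, k=ORDER + 1):
    """Count overlapping k-mers; windows with non-ACGT symbols are skipped."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def transition_probs(counts, pseudocount=1.0):
    """Estimate Q(next | context), smoothed so that no probability is zero."""
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=ORDER)):
        total = sum(counts[ctx + b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        for b in ALPHABET:
            probs[ctx + b] = (counts[ctx + b] + pseudocount) / total
    return probs


def conditional_relative_entropy(contig_counts, model_probs):
    """Return D(P_hat || Q | pi_hat) in nats.

    P_hat (transitions) and pi_hat (context frequencies) are the empirical
    quantities of the contig; Q is the candidate model.
    """
    n = sum(contig_counts.values())
    ctx_totals = Counter()
    for kmer, c in contig_counts.items():
        ctx_totals[kmer[:-1]] += c
    d = 0.0
    for kmer, c in contig_counts.items():
        p_hat = c / ctx_totals[kmer[:-1]]
        d += (c / n) * math.log(p_hat / model_probs[kmer])
    return d


def assign_bin(contig, models):
    """Pick the model name with the smallest conditional relative entropy."""
    counts = kmer_counts(contig)
    return min(models, key=lambda name: conditional_relative_entropy(counts, models[name]))


# Toy usage with two made-up reference sequences (uppercase input assumed).
models = {
    "a": transition_probs(kmer_counts("AATAATTAAT" * 40)),
    "b": transition_probs(kmer_counts("GGCGCCGGCC" * 40)),
}
label = assign_bin("AATAATTAAT" * 5, models)
```

In the clustering setting described above the models are not known in advance, so a practical pipeline would have to estimate them from the contigs themselves; the sketch only shows the distance computation and the nearest-model assignment.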

Original language: English (US)
Title of host publication: 2019 IEEE Information Theory Workshop, ITW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538669006
State: Published - Aug 2019
Event: 2019 IEEE Information Theory Workshop, ITW 2019 - Visby, Sweden
Duration: Aug 25 2019 → Aug 28 2019

Publication series

Name: 2019 IEEE Information Theory Workshop, ITW 2019


Conference: 2019 IEEE Information Theory Workshop, ITW 2019

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Information Systems

