Classification of homogeneous data with large alphabets

Benjamin G. Kelly, Aaron B. Wagner, Thitidej Tularak, Pramod Viswanath

Research output: Contribution to journalArticle

Abstract

Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.

Original languageEnglish (US)
Article number6340343
Pages (from-to)782-795
Number of pages14
JournalIEEE Transactions on Information Theory
Volume59
Issue number2
DOIs
StatePublished - Jan 28 2013

Keywords

  • Chi-squared
  • classification
  • hypothesis testing
  • large alphabets
  • natural language

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Fingerprint Dive into the research topics of 'Classification of homogeneous data with large alphabets'. Together they form a unique fingerprint.

  • Cite this