Classification of homogeneous data with large alphabets

Benjamin G. Kelly, Aaron B. Wagner, Thitidej Tularak, Pramod Viswanath

Research output: Contribution to journalArticle

Abstract

Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.

Original languageEnglish (US)
Article number6340343
Pages (from-to)782-795
Number of pages14
JournalIEEE Transactions on Information Theory
Volume59
Issue number2
DOIs
StatePublished - Jan 28 2013

Fingerprint

Statistical tests
Probability distributions
statistical test
regime
language

Keywords

  • Chi-squared
  • classification
  • hypothesis testing
  • large alphabets
  • natural language

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Classification of homogeneous data with large alphabets. / Kelly, Benjamin G.; Wagner, Aaron B.; Tularak, Thitidej; Viswanath, Pramod.

In: IEEE Transactions on Information Theory, Vol. 59, No. 2, 6340343, 28.01.2013, p. 782-795.

Research output: Contribution to journalArticle

Kelly, Benjamin G. ; Wagner, Aaron B. ; Tularak, Thitidej ; Viswanath, Pramod. / Classification of homogeneous data with large alphabets. In: IEEE Transactions on Information Theory. 2013 ; Vol. 59, No. 2. pp. 782-795.
@article{f717a37a94d8476e99972c4312a3317d,
title = "Classification of homogeneous data with large alphabets",
abstract = "Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.",
keywords = "Chi-squared, classification, hypothesis testing, large alphabets, natural language",
author = "Kelly, {Benjamin G.} and Wagner, {Aaron B.} and Thitidej Tularak and Pramod Viswanath",
year = "2013",
month = "1",
day = "28",
doi = "10.1109/TIT.2012.2222343",
language = "English (US)",
volume = "59",
pages = "782--795",
journal = "IEEE Transactions on Information Theory",
issn = "0018-9448",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",

}

TY - JOUR

T1 - Classification of homogeneous data with large alphabets

AU - Kelly, Benjamin G.

AU - Wagner, Aaron B.

AU - Tularak, Thitidej

AU - Viswanath, Pramod

PY - 2013/1/28

Y1 - 2013/1/28

N2 - Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.

AB - Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.

KW - Chi-squared

KW - classification

KW - hypothesis testing

KW - large alphabets

KW - natural language

UR - http://www.scopus.com/inward/record.url?scp=84872713571&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84872713571&partnerID=8YFLogxK

U2 - 10.1109/TIT.2012.2222343

DO - 10.1109/TIT.2012.2222343

M3 - Article

AN - SCOPUS:84872713571

VL - 59

SP - 782

EP - 795

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

SN - 0018-9448

IS - 2

M1 - 6340343

ER -