Probability estimation in the rare-events regime

Aaron B. Wagner, Pramod Viswanath, Sanjeev R. Kulkarni

Research output: Contribution to journal › Article

Abstract

We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.
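The abstract refers to the classical Good-Turing estimator as the usual alternative to the maximum-likelihood (empirical-frequency) estimate. The following Python sketch is illustrative only: it implements the standard unsmoothed Good-Turing formula and the maximum-likelihood estimate and plugs each into a sequence (log-)probability, so the contrast discussed in the abstract can be seen concretely. It is not the consistent estimator proposed in the paper, and the function and variable names are assumptions made for this example.

```python
from collections import Counter
import math

def sequence_log_prob_estimates(seq):
    """Illustrative sketch: per-letter maximum-likelihood and (unsmoothed)
    Good-Turing probability estimates, plugged in to score the observed
    sequence. This is the classical Good-Turing formula, not the novel
    estimator introduced in the paper."""
    n = len(seq)
    counts = Counter(seq)                    # r_x: number of occurrences of each letter
    freq_of_freq = Counter(counts.values())  # N_r: number of distinct letters seen exactly r times

    def ml(r):
        # maximum-likelihood (empirical-frequency) estimate r / n
        return r / n

    def good_turing(r):
        # classical Good-Turing estimate (r + 1) * N_{r+1} / (n * N_r);
        # falls back to the ML estimate when N_{r+1} = 0. Practical
        # implementations smooth the N_r counts and renormalize, since the
        # raw estimates need not sum to one.
        if freq_of_freq.get(r + 1, 0) == 0:
            return ml(r)
        return (r + 1) * freq_of_freq[r + 1] / (n * freq_of_freq[r])

    log_p_ml = sum(math.log(ml(counts[x])) for x in seq)
    log_p_gt = sum(math.log(good_turing(counts[x])) for x in seq)
    return log_p_ml, log_p_gt

# Toy usage: a short sample with several letters seen only once or twice,
# loosely mimicking the regime where the alphabet is large relative to n.
sample = list("aabbcccdde")
print(sequence_log_prob_estimates(sample))
```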

Original language: English (US)
Article number: 5773059
Pages (from-to): 3207-3229
Number of pages: 23
Journal: IEEE Transactions on Information Theory
Volume: 57
Issue number: 6
DOIs: 10.1109/TIT.2011.2137210
State: Published - Jun 1 2011

Keywords

  • Classification
  • entropy estimation
  • large alphabets
  • large number of rare events (LNRE)
  • natural language
  • probability estimation

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Probability estimation in the rare-events regime. / Wagner, Aaron B.; Viswanath, Pramod; Kulkarni, Sanjeev R.

In: IEEE Transactions on Information Theory, Vol. 57, No. 6, 5773059, 01.06.2011, p. 3207-3229.

Research output: Contribution to journal › Article

Wagner, Aaron B.; Viswanath, Pramod; Kulkarni, Sanjeev R. / Probability estimation in the rare-events regime. In: IEEE Transactions on Information Theory. 2011; Vol. 57, No. 6. pp. 3207-3229.
@article{4fa91721cb5d433a8e88eb683a406026,
title = "Probability estimation in the rare-events regime",
abstract = "We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.",
keywords = "Classification, entropy estimation, large alphabets, large number of rare events (LNRE), natural language, probability estimation",
author = "Wagner, {Aaron B.} and Pramod Viswanath and Kulkarni, {Sanjeev R.}",
year = "2011",
month = "6",
day = "1",
doi = "10.1109/TIT.2011.2137210",
language = "English (US)",
volume = "57",
pages = "3207--3229",
journal = "IEEE Transactions on Information Theory",
issn = "0018-9448",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "6",
}
