Large-scale multiple hypothesis testing in information retrieval: Towards a new approach to document ranking

Miles Efron

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics, where plentiful data must be winnowed to name a small number of potentially "interesting" cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes, searching for genes implicated in a particular condition. This paper describes a novel approach to IR that is based on the notion of simultaneous hypothesis testing. In this case the test is performed on each document, and the null hypothesis is that the document is non-relevant. After a mathematical derivation of the proposed model, we compare its effectiveness on three standard data sets with that of two baseline IR systems: a vector space model and a language modeling-based system. These preliminary experiments show that the hypothesis testing approach to IR is not only philosophically appealing but also operates at the state of the art in effectiveness.
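
To make the framing concrete, here is a minimal sketch of retrieval as a simultaneous test in which each document carries the null hypothesis "non-relevant". The abstract does not say which test statistic or rejection rule the paper derives, so this sketch substitutes the Benjamini-Hochberg step-up procedure, a standard tool for large-scale multiple testing. The function name, the document IDs, and the p-values are all hypothetical, and the p-values are assumed to have been computed already by some per-document test.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(
    p_values: Dict[str, float], alpha: float = 0.05
) -> List[Tuple[str, float]]:
    """Return (doc_id, p_value) pairs whose null hypothesis is rejected.

    Assumption: each p-value tests the null "this document is non-relevant".
    The result is ordered from most to least significant, so it can also be
    read as a ranking of the retrieved documents.
    """
    # Sort documents by ascending p-value (strongest evidence first).
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)

    # Step-up rule: find the largest k with p_(k) <= alpha * k / m,
    # then reject the k hypotheses with the smallest p-values.
    cutoff = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * k / m:
            cutoff = k
    return ranked[:cutoff]


# Hypothetical per-document p-values for a single query.
doc_p_values = {"d1": 0.001, "d2": 0.20, "d3": 0.012, "d4": 0.74, "d5": 0.04}
retrieved = benjamini_hochberg(doc_p_values, alpha=0.05)
print([doc_id for doc_id, _ in retrieved])
```

With these made-up values, only the documents with enough evidence against the null are returned, which is the sense in which a large collection is winnowed to a few "interesting" cases.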

Original language: English (US)
Title of host publication: ASIST 2008
Subtitle of host publication: Proceedings of the 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People
Publisher: American Society for Information Science and Technology
Volume: 45
ISBN (Print): 0877155402, 9780877155409
State: Published - 2008
Externally published: Yes
Event: ASIST 2008: 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People - Columbus, OH, United States
Duration: Oct 24 2008 - Oct 29 2008

Other

Other: ASIST 2008: 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People
Country/Territory: United States
City: Columbus, OH
Period: 10/24/08 - 10/29/08

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
