Abstract
Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics, where plentiful data must be winnowed to name a small number of potentially "interesting" cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes, searching for those implicated in a particular condition. This paper describes a novel approach to IR based on the notion of simultaneous hypothesis testing. Here a test is performed on each document, and the null hypothesis is that the document is non-relevant. After a mathematical derivation of the proposed model, we compare its effectiveness on three standard data sets with that of two baseline IR systems: a vector space model and a language modeling-based system. These preliminary experiments show that the hypothesis testing approach to IR is not only philosophically appealing but also operates at the state of the art in effectiveness.
Original language | English (US) |
---|---|
Title of host publication | ASIST 2008 |
Subtitle of host publication | Proceedings of the 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People |
Publisher | American Society for Information Science and Technology |
Volume | 45 |
ISBN (Print) | 0877155402, 9780877155409 |
State | Published - 2008 |
Externally published | Yes |
Event | ASIST 2008: 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People, Columbus, OH, United States. Duration: Oct 24 2008 → Oct 29 2008 |
Other
Other | ASIST 2008: 71st ASIST Annual Meeting: People Transforming Information - Information Transforming People |
---|---|
Country/Territory | United States |
City | Columbus, OH |
Period | 10/24/08 → 10/29/08 |
ASJC Scopus subject areas
- Information Systems
- Library and Information Sciences