Improving retrieval of short texts through document expansion

Miles Efron, Peter Organisciak, Katrina Fenlon

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Collections containing a large number of short documents are becoming increasingly common. As these collections grow in number and size, providing effective retrieval of brief texts presents a significant research problem. We propose a novel approach to improving information retrieval (IR) for short texts based on aggressive document expansion. Starting from the hypothesis that short documents tend to be about a single topic, we submit documents as pseudo-queries and analyze the results to learn about the documents themselves. Document expansion helps in this context because short documents yield little in the way of term frequency information. However, as we show, the proposed technique helps us model not only lexical properties, but also temporal properties of documents. We present experimental results using a corpus of microblog (Twitter) data and a corpus of metadata records from a federated digital library. With respect to established baselines, results of these experiments show that applying our proposed document expansion method yields significant improvements in effectiveness. Specifically, our method improves the lexical representation of documents and the ability to let time influence retrieval.
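The pseudo-query idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the overlap-based ranker, the function names (`tokenize`, `overlap_score`, `expand_document`), and the parameters `k` and `weight` are all assumptions chosen for illustration. The sketch submits a short document as a query against the rest of a corpus, then mixes the term distribution of the top-ranked results with that of the original document.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def overlap_score(query_tokens, doc_tokens):
    """Toy ranker: number of shared term occurrences.

    A stand-in for a real retrieval model; not the ranker used in the paper.
    """
    q = Counter(query_tokens)
    d = Counter(doc_tokens)
    return sum(min(q[t], d[t]) for t in q)


def expand_document(doc, corpus, k=2, weight=0.5):
    """Return term probabilities for `doc` mixed with its top-k neighbours.

    `weight` is the share of probability mass kept by the original
    document; the rest goes to the retrieved neighbours (hypothetical
    parameter, chosen for illustration).
    """
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return {}

    # Submit the document itself as a pseudo-query against the other documents.
    scored = [(overlap_score(doc_tokens, tokenize(other)), other)
              for other in corpus if other != doc]
    scored = [(s, o) for s, o in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbours = scored[:k]

    original = Counter(doc_tokens)
    n_orig = sum(original.values())
    expansion = Counter()
    for _, other in neighbours:
        expansion.update(tokenize(other))
    n_exp = sum(expansion.values())

    # With no neighbours, fall back to the original distribution.
    lam = weight if n_exp else 1.0
    model = {}
    for term in set(original) | set(expansion):
        p_orig = original[term] / n_orig
        p_exp = expansion[term] / n_exp if n_exp else 0.0
        model[term] = lam * p_orig + (1 - lam) * p_exp
    return model


corpus = [
    "sigir conference portland",
    "sigir deadline extended",
    "cat photos",
]
model = expand_document(corpus[0], corpus)
```

Here the expanded model for the first document gains the terms "deadline" and "extended" from its one matching neighbour, while the unrelated "cat photos" contributes nothing. A real system would replace the overlap ranker with a proper retrieval model and could also collect the timestamps of the retrieved documents, which is the kind of temporal evidence the abstract alludes to.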

Original language: English (US)
Title of host publication: SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 911-920
Number of pages: 10
State: Published - 2012
Event: 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012 - Portland, OR, United States
Duration: Aug 12 2012 - Aug 16 2012

Other

Other: 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012
Country/Territory: United States
City: Portland, OR
Period: 8/12/12 - 8/16/12

Keywords

  • document expansion
  • dublin core
  • information retrieval
  • language models
  • microblogs
  • temporal IR
  • twitter

ASJC Scopus subject areas

  • Information Systems
