Two-stage language models for information retrieval

Cheng Xiang Zhai, John Lafferty

Research output: Contribution to journal › Conference article › peer-review


The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
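The two smoothing stages described in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so the code below assumes the usual forms: Dirichlet prior smoothing as (c(w; d) + mu * p(w|C)) / (|d| + mu), followed by linear interpolation with weight lam on the query background model. The function names, the toy documents, the fixed values of mu and lam, and the reuse of the collection model as the query background model are all illustrative assumptions; the paper's automatic parameter estimation (leave-one-out and document mixture models) is not implemented here.

```python
import math
from collections import Counter


def two_stage_prob(word, doc_counts, doc_len, collection_model,
                   query_background, mu, lam):
    """Probability of `word` under a two-stage smoothed document model."""
    # Stage 1: Dirichlet prior smoothing, collection model as reference.
    p_coll = collection_model.get(word, 0.0)
    stage1 = (doc_counts.get(word, 0) + mu * p_coll) / (doc_len + mu)
    # Stage 2: interpolate with the query background model.
    return (1.0 - lam) * stage1 + lam * query_background.get(word, 0.0)


def query_log_likelihood(query_terms, doc_tokens, collection_model,
                         query_background, mu, lam):
    """Sum of log-probabilities of the query terms under the document model."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(
        math.log(two_stage_prob(w, doc_counts, doc_len, collection_model,
                                query_background, mu, lam))
        for w in query_terms
    )


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "hardware architecture for database systems".split(),
}
all_tokens = [t for toks in docs.values() for t in toks]
collection_model = {w: c / len(all_tokens)
                    for w, c in Counter(all_tokens).items()}

# Assumption: the collection model doubles as the query background model.
query = ["language", "retrieval"]
scores = {
    name: query_log_likelihood(query, toks, collection_model,
                               collection_model, mu=10.0, lam=0.3)
    for name, toks in docs.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch mu and lam are set by hand; the point of the method in the abstract is that both can instead be estimated automatically, which would replace the hard-coded values above.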

Original language: English (US)
Pages (from-to): 49-56
Number of pages: 8
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
State: Published - 2002
Externally published: Yes
Event: Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Tampere, Finland
Duration: Aug 11 2002 → Aug 15 2002


Keywords

  • Dirichlet prior
  • Interpolation
  • Leave-one-out
  • Mixture model
  • Parameter estimation
  • Risk minimization
  • Two-stage language models
  • Two-stage smoothing

ASJC Scopus subject areas

  • Management Information Systems
  • Hardware and Architecture

