TY - GEN
T1 - Estimation of statistical translation models based on mutual information for ad hoc information retrieval
AU - Karimzadehgan, Maryam
AU - Zhai, Cheng Xiang
PY - 2010
Y1 - 2010
N2 - As a principled approach to capturing semantic relations between words in information retrieval, statistical translation models have been shown to outperform simple document language models, which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is estimating a translation model without training data. Existing work has relied on training on synthetic queries generated from a document collection. However, this method is computationally expensive and does not provide good coverage of query words. In this paper, we propose an alternative way to estimate a translation model based on normalized mutual information between words, which is less computationally expensive and has better coverage of query words than the synthetic query method of estimation. We also propose to regularize estimated translation probabilities to ensure sufficient probability mass for self-translation. Experimental results show that the proposed mutual information-based estimation method is not only more efficient but also more effective than the synthetic query-based method, and that it can be combined with pseudo-relevance feedback to further improve retrieval accuracy. The results also show that the proposed regularization strategy is effective and can improve retrieval accuracy for both synthetic query-based estimation and mutual information-based estimation.
KW - Estimation
KW - Feedback
KW - Language models
KW - Smoothing
KW - Statistical machine translation
UR - http://www.scopus.com/inward/record.url?scp=77956032016&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77956032016&partnerID=8YFLogxK
U2 - 10.1145/1835449.1835505
DO - 10.1145/1835449.1835505
M3 - Conference contribution
AN - SCOPUS:77956032016
SN - 9781605588964
T3 - SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 323
EP - 330
BT - SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010
Y2 - 19 July 2010 through 23 July 2010
ER -