TY - GEN
T1 - Lower-bounding term frequency normalization
AU - Lv, Yuanhua
AU - Zhai, Chengxiang
PY - 2011
Y1 - 2011
N2 - In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
AB - In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
KW - BM25+
KW - DIR+
KW - Pl2+
KW - data analysis
KW - document length
KW - formal constraints
KW - lower bound
KW - term frequency
UR - http://www.scopus.com/inward/record.url?scp=83055187814&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83055187814&partnerID=8YFLogxK
U2 - 10.1145/2063576.2063584
DO - 10.1145/2063576.2063584
M3 - Conference contribution
AN - SCOPUS:83055187814
SN - 9781450307178
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 7
EP - 16
BT - CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management
T2 - 20th ACM Conference on Information and Knowledge Management, CIKM'11
Y2 - 24 October 2011 through 28 October 2011
ER -