TY - GEN
T1 - A log-logistic model-based interpretation of TF normalization of BM25
AU - Lv, Yuanhua
AU - Zhai, Chengxiang
PY - 2012
Y1 - 2012
N2 - The effectiveness of BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived based on the classic probabilistic retrieval model, it has been so far unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1 based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimation of parameter k1 based solely on the current collection without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experiment results show that the proposed approaches can accurately predict the optimal k1 without requiring training data and achieve better or comparable retrieval performance to a well-tuned BM25 where k1 is optimized based on training data.
AB - The effectiveness of BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived based on the classic probabilistic retrieval model, it has been so far unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1 based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimation of parameter k1 based solely on the current collection without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experiment results show that the proposed approaches can accurately predict the optimal k1 without requiring training data and achieve better or comparable retrieval performance to a well-tuned BM25 where k1 is optimized based on training data.
KW - BM25
KW - automatic parameter tuning
KW - log-logistic model
KW - percentile term frequency normalization
KW - term frequency
UR - http://www.scopus.com/inward/record.url?scp=84860204505&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84860204505&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-28997-2_21
DO - 10.1007/978-3-642-28997-2_21
M3 - Conference contribution
AN - SCOPUS:84860204505
SN - 9783642289965
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 244
EP - 255
BT - Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Proceedings
T2 - 34th European Conference on Information Retrieval, ECIR 2012
Y2 - 1 April 2012 through 5 April 2012
ER -