TY - GEN
T1 - Generative model-based metasearch for data fusion in information retrieval
AU - Efron, Miles
PY - 2009
Y1 - 2009
N2 - "Data fusion" refers to the problem in information retrieval (IR) where several lists of documents ranked against a query are to be merged into a single ranked list for presentation to a user. Data fusion is also known as "metasearch." In a digital library setting data fusion may support operations such as federated search based on multiple repository representations. This paper presents a novel approach to the fusion problem: generative model-based Metasearch (GeM). We suggest viewing the appearance of documents in a return set as the outcome of a probabilistic process; some documents are likely to occur in the model, while others are unlikely. Using Bayesian parameter estimation to fit a multinomial distribution based on the return sets to be merged, GeM achieves a final ranking by listing documents in decreasing probability of generation under the induced model. We also introduce what we call "the impatient reader" approach to normalizing document ranks in service to the fusion operation. We report results from several experiments on TREC data suggesting that GeM, informed with impatient reader document scores, operates at state-of-the-art levels of effectiveness.
KW - Data fusion
KW - Digital libraries
KW - Generative models
KW - Information retrieval
KW - Metasearch
KW - Probabilistic models
UR - http://www.scopus.com/inward/record.url?scp=70450245153&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70450245153&partnerID=8YFLogxK
U2 - 10.1145/1555400.1555426
DO - 10.1145/1555400.1555426
M3 - Conference contribution
AN - SCOPUS:70450245153
SN - 9781605586977
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 153
EP - 162
BT - JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries
T2 - 2009 ACM/IEEE Joint Conference on Digital Libraries, JCDL'09
Y2 - 15 June 2009 through 19 June 2009
ER -