TY - JOUR
T1 - An empirical study of tokenization strategies for biomedical information retrieval
AU - Jiang, Jing
AU - Zhai, Chengxiang
N1 - Funding Information:
Acknowledgments: This work was supported in part by the National Science Foundation under award numbers 0425852 and 0428472. We thank the anonymous reviewers for their useful comments.
PY - 2007/10
Y1 - 2007/10
N2 - Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
AB - Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
KW - Biomedical information retrieval
KW - Stemming
KW - Stop word
KW - Tokenization
UR - http://www.scopus.com/inward/record.url?scp=34848845892&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34848845892&partnerID=8YFLogxK
U2 - 10.1007/s10791-007-9027-7
DO - 10.1007/s10791-007-9027-7
M3 - Article
AN - SCOPUS:34848845892
SN - 1386-4564
VL - 10
SP - 341
EP - 363
JO - Information Retrieval
JF - Information Retrieval
IS - 4-5
ER -