A concept-based framework for passage retrieval in genomics

Wei Zhou, Clement T. Yu, Vetle I. Torvik, Neil R. Smalheiser

Research output: Contribution to journal > Conference article

Abstract

The task of the TREC 2006 Genomics Track is to retrieve passages (from part of a paragraph up to a whole paragraph) from full-text HTML biomedical journal papers in answer to structured questions from real biologists. A system for this task needs to parse the HTML free text (convert it into plain text) and pinpoint the most relevant passage(s) within documents for a given question. Our system accomplishes the task in three steps. The first step parses the HTML articles and partitions them into paragraphs. The second step retrieves the relevant paragraphs. The third step identifies the most relevant passages within those paragraphs and ranks them. We are interested in three questions: 1. How does a concept-based IR model perform on structured queries compared to Okapi? 2. Does query expansion based on domain knowledge increase retrieval effectiveness? 3. Does our abbreviation database from MEDLINE help improve query expansion, and does abbreviation disambiguation help improve the ranking? The experimental results show that our concept-based IR model works better than Okapi; that query expansion based on domain knowledge is important, especially for queries with very few relevant documents; and that an abbreviation database for query expansion and disambiguation is helpful for passage retrieval.
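
The first two steps and the abbreviation-based expansion can be illustrated with a minimal sketch. It is not the authors' system: it assumes paragraphs are marked with <p> tags, it uses a plain Okapi BM25 scorer (the baseline named in the abstract) rather than the concept-based model, and the abbreviation table, function names, and parameter values (k1, b) are hypothetical. The third step, locating passages within paragraphs, is not covered.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Step 1 (assumed form): collect the plain text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the collected text and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(terms, abbreviations):
    """Append the tokens of every long form listed for a query term.

    `abbreviations` is a hypothetical dict mapping a short form to a list
    of long forms; it stands in for a MEDLINE-derived abbreviation database.
    """
    expanded = list(terms)
    for term in terms:
        for long_form in abbreviations.get(term, []):
            expanded.extend(tokenize(long_form))
    return expanded


def bm25_rank(query_terms, paragraphs, k1=1.2, b=0.75):
    """Step 2 (baseline only): rank paragraphs by Okapi BM25 score."""
    docs = [tokenize(p) for p in paragraphs]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of paragraphs containing each term.
    df = Counter(t for d in docs for t in set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties keep the original paragraph order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(paragraphs[i], s) for s, i in scored]


if __name__ == "__main__":
    html = ("<html><body><p>APC mutations drive colon cancer.</p>"
            "<p>Methods: cells were cultured.</p></body></html>")
    extractor = ParagraphExtractor()
    extractor.feed(html)
    query = expand_query(tokenize("APC colon cancer"),
                         {"apc": ["adenomatous polyposis coli"]})
    for paragraph, score in bm25_rank(query, extractor.paragraphs):
        print(f"{score:.3f}  {paragraph}")
```

Keeping expansion separate from scoring means the same expanded query could be passed to a different ranker, which is how a comparison between a concept-based model and Okapi would typically be set up.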

Original language: English (US)
Journal: NIST Special Publication
State: Published - Dec 1 2006
Externally published: Yes
Event: 15th Text REtrieval Conference, TREC 2006 - Gaithersburg, MD, United States
Duration: Nov 14 2006 - Nov 17 2006

ASJC Scopus subject areas

  • Engineering(all)

@article{eedbd15892b441f99f32b4a931d058b8,
title = "A concept-based framework for passage retrieval in genomics",
abstract = "The task of the TREC 2006 Genomics Track is to retrieve passages (from part of a paragraph up to a whole paragraph) from full-text HTML biomedical journal papers in answer to structured questions from real biologists. A system for this task needs to parse the HTML free text (convert it into plain text) and pinpoint the most relevant passage(s) within documents for a given question. Our system accomplishes the task in three steps. The first step parses the HTML articles and partitions them into paragraphs. The second step retrieves the relevant paragraphs. The third step identifies the most relevant passages within those paragraphs and ranks them. We are interested in three questions: 1. How does a concept-based IR model perform on structured queries compared to Okapi? 2. Does query expansion based on domain knowledge increase retrieval effectiveness? 3. Does our abbreviation database from MEDLINE help improve query expansion, and does abbreviation disambiguation help improve the ranking? The experimental results show that our concept-based IR model works better than Okapi; that query expansion based on domain knowledge is important, especially for queries with very few relevant documents; and that an abbreviation database for query expansion and disambiguation is helpful for passage retrieval.",
author = "Zhou, Wei and Yu, {Clement T.} and Torvik, {Vetle I.} and Smalheiser, {Neil R.}",
year = "2006",
month = "12",
day = "1",
language = "English (US)",
journal = "NIST Special Publication",
issn = "1048-776X",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - A concept-based framework for passage retrieval in genomics

AU - Zhou, Wei

AU - Yu, Clement T.

AU - Torvik, Vetle I.

AU - Smalheiser, Neil R.

PY - 2006/12/1

Y1 - 2006/12/1

AB - The task of the TREC 2006 Genomics Track is to retrieve passages (from part of a paragraph up to a whole paragraph) from full-text HTML biomedical journal papers in answer to structured questions from real biologists. A system for this task needs to parse the HTML free text (convert it into plain text) and pinpoint the most relevant passage(s) within documents for a given question. Our system accomplishes the task in three steps. The first step parses the HTML articles and partitions them into paragraphs. The second step retrieves the relevant paragraphs. The third step identifies the most relevant passages within those paragraphs and ranks them. We are interested in three questions: 1. How does a concept-based IR model perform on structured queries compared to Okapi? 2. Does query expansion based on domain knowledge increase retrieval effectiveness? 3. Does our abbreviation database from MEDLINE help improve query expansion, and does abbreviation disambiguation help improve the ranking? The experimental results show that our concept-based IR model works better than Okapi; that query expansion based on domain knowledge is important, especially for queries with very few relevant documents; and that an abbreviation database for query expansion and disambiguation is helpful for passage retrieval.

UR - http://www.scopus.com/inward/record.url?scp=84873555440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84873555440&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84873555440

JO - NIST Special Publication

JF - NIST Special Publication

SN - 1048-776X

ER -