TY - GEN
T1 - Statistical sentence extraction for information distillation
AU - Hakkani-Tür, Dilek
AU - Tur, Gokhan
PY - 2007
Y1 - 2007
N2 - Information distillation aims to extract the most useful pieces of information related to a given query from massive, possibly multilingual, audio and textual document sources. One critical component in a distillation engine is detecting sentences to be extracted from each relevant document. In this paper, we present a statistical sentence extraction approach for distillation. Basically, we frame this task as a classification problem, where each candidate sentence in documents is classified as relevant to the query or not. These documents may be in textual or audio format and in a number of languages. For audio documents, we use both manual and automatic transcriptions; for non-English documents, we use automatic translations. In this work, we use AdaBoost, a discriminative classification method with both lexical and semantic features. The results indicate 11%-13% relative improvement over a baseline keyword-spotting-based approach. We also show the robustness of our method on the audio subset of the document sources using manual and automatic transcriptions.
AB - Information distillation aims to extract the most useful pieces of information related to a given query from massive, possibly multilingual, audio and textual document sources. One critical component in a distillation engine is detecting sentences to be extracted from each relevant document. In this paper, we present a statistical sentence extraction approach for distillation. Basically, we frame this task as a classification problem, where each candidate sentence in documents is classified as relevant to the query or not. These documents may be in textual or audio format and in a number of languages. For audio documents, we use both manual and automatic transcriptions; for non-English documents, we use automatic translations. In this work, we use AdaBoost, a discriminative classification method with both lexical and semantic features. The results indicate 11%-13% relative improvement over a baseline keyword-spotting-based approach. We also show the robustness of our method on the audio subset of the document sources using manual and automatic transcriptions.
KW - Information distillation
KW - Information extraction
KW - Language understanding
KW - Natural language processing
KW - Speech processing
UR - http://www.scopus.com/inward/record.url?scp=34547528176&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34547528176&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2007.367148
DO - 10.1109/ICASSP.2007.367148
M3 - Conference contribution
AN - SCOPUS:34547528176
SN - 1424407281
SN - 9781424407286
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - IV1-IV4
BT - 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07
T2 - 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07
Y2 - 15 April 2007 through 20 April 2007
ER -