Updating an NLP System to Fit New Domains: an empirical study on the sentence segmentation problem

Tong Zhang, Fred Damerau, David Johnson

Research output: Contribution to conference; Paper; peer-review

Abstract

Statistical machine learning algorithms have been successfully applied to many natural language processing (NLP) problems. Compared to manually constructed systems, statistical NLP systems are often easier to develop and maintain since only annotated training text is required. From annotated data, the underlying statistical algorithm can build a model so that annotations for future data can be predicted. However, the performance of a statistical system can also depend heavily on the characteristics of the training data. If we apply such a system to text with characteristics different from those of the training data, then performance degradation will occur. In this paper, we examine this issue empirically using the sentence boundary detection problem. We propose and compare several methods that can be used to update a statistical NLP system when moving to a different domain.

Original language: English (US)
Pages: 56-62
Number of pages: 7
State: Published - 2003
Externally published: Yes
Event: 7th Conference on Natural Language Learning, CoNLL 2003 at HLT-NAACL 2003 - Edmonton, Canada
Duration: May 31 2003 - Jun 1 2003

Conference

Conference: 7th Conference on Natural Language Learning, CoNLL 2003 at HLT-NAACL 2003
Country/Territory: Canada
City: Edmonton
Period: 5/31/03 - 6/1/03

ASJC Scopus subject areas

  • Management Science and Operations Research
  • Computer Graphics and Computer-Aided Design
  • Computer Vision and Pattern Recognition
  • Modeling and Simulation
