A Stochastic Finite-State Word-Segmentation Algorithm for Chinese

Richard Sproat, William Gale, Chilin Shih, Nancy Chang

Research output: Contribution to journal › Article › peer-review

Abstract

The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and - since the primary intended application of this model is to text-to-speech synthesis - provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgments of a pool of human segmenters, and the system is shown to perform quite well.
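
The abstract describes the model only at a high level. As a rough illustration of the underlying idea (a simplified stand-in, not the paper's actual system), the sketch below treats segmentation as a least-cost tiling of the input with dictionary entries, each entry carrying a cost such as a negative log probability, and reads the pronunciation off the winning path; a weighted finite-state transducer composed with the input would compute the same best path. The toy dictionary, costs, and pinyin pronunciations are invented purely for illustration.

```python
import math

# A minimal sketch of stochastic, dictionary-driven segmentation as a
# least-cost path search. Each entry carries a cost (e.g. its negative log
# probability) plus a pronunciation; the best segmentation is the cheapest
# way to tile the input with entries. This mirrors the best-path computation
# a weighted finite-state transducer composition would perform. The toy
# dictionary, costs, and pinyin below are invented for illustration only.
DICTIONARY = {
    "日":   (8.0, "ri4"),
    "文":   (7.5, "wen2"),
    "日文": (4.0, "ri4 wen2"),        # "Japanese (language)"
    "章":   (9.0, "zhang1"),
    "文章": (3.5, "wen2 zhang1"),     # "essay, article"
    "魚":   (6.0, "yu2"),
    "章魚": (4.5, "zhang1 yu2"),      # "octopus"
    "怎":   (9.5, "zen3"),
    "麼":   (9.0, "me5"),
    "怎麼": (3.0, "zen3 me5"),        # "how"
    "說":   (5.0, "shuo1"),
}

def segment(text):
    """Return the least-cost segmentation of `text` with pronunciations."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = cheapest cost of segmenting text[:i]
    back = [None] * (n + 1)       # back[i] = (start, word, pron) of the last word
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            entry = DICTIONARY.get(text[i:j])
            if entry is not None and best[i] + entry[0] < best[j]:
                best[j] = best[i] + entry[0]
                back[j] = (i, text[i:j], entry[1])
    if best[n] == math.inf:
        raise ValueError("input cannot be segmented with this dictionary")
    words, prons, i = [], [], n
    while i > 0:                  # trace the winning path back to the start
        start, word, pron = back[i]
        words.append(word)
        prons.append(pron)
        i = start
    return words[::-1], prons[::-1]

if __name__ == "__main__":
    words, prons = segment("日文章魚怎麼說")
    print(" | ".join(words))      # 日文 | 章魚 | 怎麼 | 說
    print(" ".join(prons))        # ri4 wen2 zhang1 yu2 zen3 me5 shuo1
```

In this toy lattice the multi-character entries are cheaper than spelling the same span out character by character, so the dynamic program prefers them, which is the effect the word probabilities have in the stochastic finite-state model.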

Original language: English (US)
Pages (from-to): 377-404
Number of pages: 28
Journal: Computational Linguistics
Volume: 22
Issue number: 3
State: Published - Sep 1996
Externally published: Yes

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Linguistics and Language
