Abstract
The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and - since the primary intended application of this model is to text-to-speech synthesis - provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgments of a pool of human segmenters, and the system is shown to perform quite well.
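The core computation the paper describes is finding the best path through a weighted lattice of candidate segmentations. Below is a minimal sketch of that idea, assuming a hypothetical toy lexicon with made-up unigram probabilities; the paper's actual system builds weighted finite-state transducers from a large dictionary plus productive lexical processes and selects the cheapest path by shortest-path search, whereas this illustration uses a simple dynamic program over substrings.

```python
import math

# Hypothetical toy lexicon with invented probabilities, for illustration only;
# the paper's model derives arc weights (negative log probabilities) from
# dictionary frequencies and productive processes such as name formation
# and affixation.
LEXICON = {
    "日": 0.03, "文": 0.02, "章": 0.01, "魚": 0.02,
    "日文": 0.02, "文章": 0.03, "章魚": 0.02,
}

def cost(word):
    """Arc weight for a lexicon entry: its negative log probability."""
    return -math.log(LEXICON[word])

def segment(text, max_len=4):
    """Return the cheapest segmentation of `text` into lexicon entries.

    The cheapest path through the segmentation lattice corresponds to the
    most probable word sequence, which is the same criterion a weighted
    finite-state model applies via best-path search.
    """
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i]: cost of the best split of text[:i]
    back = [None] * (n + 1)       # back[i]: (start_index, word) of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in LEXICON and best[j] + cost(word) < best[i]:
                best[i] = best[j] + cost(word)
                back[i] = (j, word)
    if back[n] is None:
        raise ValueError("text cannot be segmented with this lexicon")
    words, i = [], n
    while i > 0:
        j, word = back[i]
        words.append(word)
        i = j
    return list(reversed(words))

if __name__ == "__main__":
    # "日文章魚" can be read as 日文+章魚 or 日+文章+魚;
    # the weights decide which segmentation wins.
    print(segment("日文章魚"))   # -> ['日文', '章魚'] under the toy weights
```

With these invented weights the two-word reading is cheaper than the three-word one; in the full model the same comparison falls out of composing the input with the weighted transducer and taking the lowest-cost path.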
| Original language | English (US) |
|---|---|
| Pages (from-to) | 377-404 |
| Number of pages | 28 |
| Journal | Computational Linguistics |
| Volume | 22 |
| Issue number | 3 |
| State | Published - Sep 1996 |
| Externally published | Yes |
ASJC Scopus subject areas
- Language and Linguistics
- Computational Theory and Mathematics
- Computer Science Applications
- Linguistics and Language