Abstract
A significant problem in Chinese text analysis is detecting where word boundaries lie. We employ a statistical method that groups Chinese characters into two-character words, using a measure of character association based on mutual information. The statistics were derived from a corpus of approximately 2.6 million characters of Chinese newspaper text. The method was tested on randomly selected texts from a corpus of Chinese newspaper text, with quite favorable results.
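The abstract does not spell out the procedure, so the following is only a minimal sketch of how mutual information between adjacent characters could drive two-character grouping. It assumes pointwise mutual information estimated from raw character and adjacent-pair counts, a greedy strategy that joins the highest-scoring pairs first, and a cutoff `threshold`. The function names, the threshold value, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a corpus string."""
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    return chars, pairs


def mutual_information(a, b, chars, pairs):
    """Pointwise mutual information log2(P(ab) / (P(a) * P(b))).

    Returns -inf when the pair or either character was never observed.
    """
    if pairs[a + b] == 0 or chars[a] == 0 or chars[b] == 0:
        return float("-inf")
    p_ab = pairs[a + b] / sum(pairs.values())
    p_a = chars[a] / sum(chars.values())
    p_b = chars[b] / sum(chars.values())
    return math.log2(p_ab / (p_a * p_b))


def segment(text, chars, pairs, threshold=2.0):
    """Greedily join adjacent characters into two-character words.

    Assumption (not from the paper): pairs are considered from highest to
    lowest score, and a pair is joined only if its score reaches `threshold`
    and neither character already belongs to a joined pair.
    """
    scores = [
        (mutual_information(text[i], text[i + 1], chars, pairs), i)
        for i in range(len(text) - 1)
    ]
    used = [False] * len(text)
    starts = set()
    for score, i in sorted(scores, key=lambda s: (-s[0], s[1])):
        if score < threshold:
            break
        if not used[i] and not used[i + 1]:
            used[i] = used[i + 1] = True
            starts.add(i)

    # Emit joined pairs as two-character tokens, everything else as single characters.
    tokens = []
    i = 0
    while i < len(text):
        if i in starts:
            tokens.append(text[i:i + 2])
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# Toy corpus for illustration only; the paper used about 2.6 million characters.
toy_corpus = "中国的人民的中国的经济的人民"
chars, pairs = train_counts(toy_corpus)
print(segment("中国的经济", chars, pairs))  # ['中国', '的', '经济']
```

On a corpus this small, a pair seen only once (such as 经济 here) receives a very high score, which is a known weakness of mutual information on sparse counts. A realistic setup would need a much larger corpus and possibly a minimum-frequency cutoff.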
Original language | English (US) |
---|---|
Pages (from-to) | 336-351 |
Number of pages | 16 |
Journal | Computer Processing of Chinese & Oriental Languages |
Volume | 4 |
Issue number | 4 |
State | Published - Mar 1990 |
Externally published | Yes |
Keywords
- Chinese text processing
- character-to-word assignment
- natural language processing
- parsing
- statistical models of language
- mutual information