Abstract
A significant problem in Chinese text analysis is detecting where word boundaries lie. We employ a statistical method that groups Chinese characters into two-character words, using a measure of character association based on mutual information. The statistics were derived from a corpus of approximately 2.6 million characters of Chinese newspaper text. The method was tested on randomly selected texts from a corpus of Chinese newspaper text, with quite favorable results.
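As a rough illustration of the idea in the abstract, the sketch below scores adjacent character pairs by pointwise mutual information, log2(P(ab) / (P(a) P(b))), and joins a pair into a two-character word when the score exceeds a threshold. It is a minimal sketch under stated assumptions, not the paper's procedure: the abstract does not give the exact association measure, the grouping strategy, or any threshold. The function names, the greedy left-to-right pass, the `threshold=1.0` default, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a string."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return unigrams, bigrams


def mutual_information(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b), in bits.

    Returns -inf when the pair never occurs in the training counts.
    """
    pair_count = bigrams.get((a, b), 0)
    if pair_count == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = pair_count / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=1.0):
    """Greedy left-to-right grouping (an assumed strategy, not the paper's).

    Joins two adjacent characters when their mutual information exceeds
    the (hypothetical) threshold; otherwise emits a single character.
    """
    words = []
    i = 0
    while i < len(text):
        if (i + 1 < len(text)
                and mutual_information(text[i], text[i + 1], unigrams, bigrams) > threshold):
            words.append(text[i:i + 2])
            i += 2
        else:
            words.append(text[i])
            i += 1
    return words


# Toy corpus for illustration only; the paper used about 2.6 million characters.
unigrams, bigrams = train_counts("北京很好北京很大")
print(segment("北京很好", unigrams, bigrams))  # ['北京', '很好']
```

With counts this small the scores are not meaningful; in practice the statistics would come from a large corpus, and the threshold would need to be tuned on held-out text.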
| Field | Value |
|---|---|
| Original language | English (US) |
| Pages (from-to) | 336-351 |
| Number of pages | 16 |
| Journal | Computer Processing of Chinese & Oriental Languages |
| Volume | 4 |
| Issue number | 4 |
| State | Published - Mar 1990 |
| Externally published | Yes |
Keywords
- Chinese text processing
- character-to-word assignment
- natural language processing
- parsing
- statistical models of language
- mutual information
Title
A statistical method for finding word boundaries in Chinese text