A statistical method for finding word boundaries in Chinese text

Richard Sproat, Chilin Shih

Research output: Contribution to journal (article, peer-reviewed)

Abstract

A significant problem in Chinese text analysis is detecting where word boundaries lie. We have employed a statistical method to group Chinese characters into two-character words making use of a measure of character association based on mutual information. The statistics were derived from a corpus of approximately 2.6 million characters of Chinese newspaper text. The method has been tested on randomly selected texts from a corpus of Chinese newspaper text with quite favorable results.
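The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' exact procedure: the abstract only states that characters are grouped into two-character words by a mutual-information-based association measure. The sketch assumes pointwise mutual information, log2(P(ab) / (P(a) P(b))), estimated from raw corpus counts, and a greedy strategy that pairs the most strongly associated adjacent characters first. The function names, the `threshold` parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    return unigrams, bigrams


def association(a, b, unigrams, bigrams):
    """Pointwise mutual information of the pair ab; -inf if the pair was never seen."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pair = bigrams[a + b]
    if pair == 0 or n_bi == 0:
        return float("-inf")
    p_ab = pair / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Greedily group adjacent characters into two-character words.

    Assumption: pairs are considered from highest to lowest association,
    and a pair is grouped only if its score exceeds `threshold` and
    neither character already belongs to a group.
    """
    n = len(text)
    scored = sorted(
        ((association(text[i], text[i + 1], unigrams, bigrams), i)
         for i in range(n - 1)),
        key=lambda s: (-s[0], s[1]),
    )
    used = [False] * n
    starts = set()
    for score, i in scored:
        if score <= threshold:
            break  # all remaining pairs are too weakly associated
        if not used[i] and not used[i + 1]:
            used[i] = used[i + 1] = True
            starts.add(i)

    # Emit grouped pairs as words; leftover characters stand alone.
    words, i = [], 0
    while i < n:
        if i in starts:
            words.append(text[i:i + 2])
            i += 2
        else:
            words.append(text[i])
            i += 1
    return words


# Toy corpus (hypothetical); the study itself used about 2.6 million characters.
unigrams, bigrams = train_counts(["北京", "北京", "上海", "上海"])
print(segment("北京上海", unigrams, bigrams))  # ['北京', '上海']
```

With realistic data the threshold would need tuning, and characters left ungrouped are simply emitted as single-character units.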
Original language: English (US)
Pages (from-to): 336-351
Number of pages: 16
Journal: Computer Processing of Chinese & Oriental Languages
Volume: 4
Issue number: 4
State: Published - Mar 1990
Externally published: Yes

Keywords

  • Chinese text processing
  • character-to-word assignment
  • natural language processing
  • parsing
  • statistical models of language
  • mutual information
