Learning better transliterations

Jeff Pasternack, Dan Roth

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic (requiring no knowledge of the source or target languages), and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state of the art by more than 10% in Chinese, Hebrew, and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMMs or the Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability (e.g., P(wash ⇒ ваш) = 0.95). To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m^2 k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
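The production-based scoring described in the abstract can be illustrated with a small dynamic-programming sketch: the score of a candidate transliteration is the sum, over all ways of co-segmenting the source and target words into aligned substring pairs, of the product of the corresponding production probabilities. The production table, its values, and the function names below are hypothetical illustrations only, not the learned parameters or the paper's implementation.

```python
from functools import lru_cache

# Toy production table (hypothetical values): each entry gives the
# probability that a source substring transliterates to a target substring.
PRODUCTIONS = {
    ("wa", "va"): 0.9,
    ("sh", "sh"): 0.8,
    ("w", "v"): 0.6,
    ("a", "a"): 0.7,
    ("s", "s"): 0.5,
    ("h", "h"): 0.4,
    ("sh", "s"): 0.1,
}

MAX_SUB = 2  # maximum substring length considered on either side


def alignment_score(source: str, target: str) -> float:
    """Sum, over all co-segmentations of source and target into aligned
    substring pairs, of the product of production probabilities.
    Memoizing on (i, j) makes this a polynomial-time DP rather than an
    enumeration of all exponentially many segmentations."""

    @lru_cache(maxsize=None)
    def f(i: int, j: int) -> float:
        # Both words fully consumed: one complete co-segmentation.
        if i == len(source) and j == len(target):
            return 1.0
        total = 0.0
        # Try every (source substring, target substring) pair starting here.
        for a in range(1, MAX_SUB + 1):
            for b in range(1, MAX_SUB + 1):
                if i + a <= len(source) and j + b <= len(target):
                    p = PRODUCTIONS.get((source[i:i + a], target[j:j + b]), 0.0)
                    if p > 0.0:
                        total += p * f(i + a, j + b)
        return total

    return f(0, 0)
```

For example, `alignment_score("wash", "vash")` sums the coarse segmentation ("wa" ⇒ "va", "sh" ⇒ "sh") together with the finer character-level ones, while a candidate with no matching productions scores zero; with a toy table like this the unnormalized sum need not be a probability. This recurrence over positions (i, j) is the kind of dynamic program that keeps each EM iteration polynomial despite the 2^(|w|-1) possible segmentations per word.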

Original language: English (US)
Title of host publication: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Pages: 177-184
Number of pages: 8
DOIs
State: Published - Dec 1, 2009
Event: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: Nov 2, 2009 – Nov 6, 2009

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country/Territory: China
City: Hong Kong
Period: 11/2/09 – 11/6/09

Keywords

  • Multi-lingual information retrieval
  • Probabilistic models
  • Translation
  • Transliteration

ASJC Scopus subject areas

  • Decision Sciences (all)
  • Business, Management and Accounting (all)
