TY - GEN
T1 - Learning better transliterations
AU - Pasternack, Jeff
AU - Roth, Dan
PY - 2009
Y1 - 2009
N2 - We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic (requiring no knowledge of the source or target languages), and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability (e.g. P(wash ⇒ ваш) = 0.95). To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m^2 k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
KW - Multi-lingual information retrieval
KW - Probabilistic models
KW - Translation
KW - Transliteration
UR - http://www.scopus.com/inward/record.url?scp=74549186548&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74549186548&partnerID=8YFLogxK
U2 - 10.1145/1645953.1645978
DO - 10.1145/1645953.1645978
M3 - Conference contribution
AN - SCOPUS:74549186548
SN - 9781605585123
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 177
EP - 184
BT - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
T2 - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Y2 - 2 November 2009 through 6 November 2009
ER -