Named Entity transliteration and discovery from multilingual comparable corpora

Alexandre Klementiev, Dan Roth

Research output: Contribution to conference › Paper

Abstract

Named Entity recognition (NER) is an important part of many natural language processing tasks. Most current approaches employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an algorithm to automatically discover Named Entities (NEs) in a resource-free language, given a bilingual corpus in which it is weakly temporally aligned with a resource-rich language. We observe that NEs have similar time distributions across such corpora, and that they are often transliterated, and develop an algorithm that exploits both iteratively. The algorithm makes use of a new, frequency-based metric for time distributions and a resource-free discriminative approach to transliteration. We evaluate the algorithm on an English-Russian corpus, and show a high level of NE discovery in Russian.
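
The abstract does not spell out the time-distribution metric, so the following is only an illustrative sketch, not the authors' implementation. It assumes that "frequency-based" means comparing two words' occurrence-count series in the frequency domain: each series is normalized, passed through a discrete Fourier transform, and the Euclidean distance between the coefficients is used to rank candidates. The function names (`frequency_distance`, `rank_candidates`), the normalization step, and the `num_coeffs` parameter are all hypothetical.

```python
import cmath
import math


def dft(series):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def normalize(series):
    """Scale counts to sum to 1 so overall word frequency does not dominate."""
    total = sum(series)
    return [x / total for x in series] if total else list(series)


def frequency_distance(a, b, num_coeffs=None):
    """Euclidean distance between DFT coefficients of two count series.

    Assumption: both series cover the same time bins. `num_coeffs`
    (hypothetical) keeps only the first few coefficients.
    """
    if len(a) != len(b):
        raise ValueError("series must cover the same number of time bins")
    fa, fb = dft(normalize(a)), dft(normalize(b))
    k = num_coeffs or len(fa)
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(fa[:k], fb[:k])))


def rank_candidates(source_counts, candidates):
    """Sort candidate words by increasing distance to the source series."""
    return sorted(
        candidates,
        key=lambda w: frequency_distance(source_counts, candidates[w]),
    )


# Toy data: per-period mention counts for one source-language NE and two
# candidate words in the other language (all values invented).
source = [0, 5, 1, 0, 0, 8, 0, 0]
candidates = {
    "cand_a": [0, 10, 2, 0, 0, 16, 0, 0],  # same bursts, different scale
    "cand_b": [3, 3, 3, 3, 3, 3, 3, 3],    # flat usage, unlike an NE burst
}
ranking = rank_candidates(source, candidates)
```

In this toy setup the bursty candidate ranks ahead of the flat one. In the iterative scheme the abstract describes, such a ranking would be combined with a transliteration score, which this sketch leaves out.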

Original language: English (US)
Pages: 82-88
Number of pages: 7
State: Published - Dec 1 2006
Event: 2006 Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting, HLT-NAACL 2006 - New York, NY, United States
Duration: Jun 4 2006 → Jun 9 2006

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Klementiev, A., & Roth, D. (2006). Named Entity transliteration and discovery from multilingual comparable corpora. 82-88. Paper presented at 2006 Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting, HLT-NAACL 2006, New York, NY, United States.