Cross-lingual joint entity and word embedding to improve entity linking and parallel sentence mining

Xiaoman Pan, Thamme Gowda, Heng Ji, Jonathan May, Scott Miller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Entities, which refer to distinct objects in the real world, can be viewed as language universals and used as effective signals to generate less ambiguous semantic representations and align multiple languages. We propose a novel method, CLEW, to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. We replace each anchor link in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise. A cross-lingual joint entity and word embedding learned from this kind of data can not only disambiguate linkable entities but also effectively represent unlinkable entities. Because this multilingual common space directly relates the semantics of contextual words in the source language to that of entities in the target language, we leverage it for unsupervised cross-lingual entity linking. Experimental results show that CLEW significantly advances the state of the art: up to 3.1% absolute F-score gain for unsupervised cross-lingual entity linking. Moreover, it provides reliable alignment on both the word/entity level and the sentence level, and thus we use it to mine parallel sentences for all (302 choose 2) language pairs in Wikipedia.
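
The anchor-replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the wiki-style anchor syntax `[[Title|surface]]`, the `langlinks` dictionary (source title to target title), the `lang/Title` token format, and the function names are all assumptions made for illustration.

```python
import re

# Assumed anchor syntax: [[Title]] or [[Title|surface text]]
ANCHOR = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def entity_token(lang, title):
    # Hypothetical token format: language prefix plus underscore-joined title
    return f"{lang}/{title.strip().replace(' ', '_')}"


def replace_anchors(text, src_lang, tgt_lang, langlinks):
    """Replace each anchor with a single entity token.

    Uses the target-language title when a cross-language link exists,
    and falls back to the source-language title otherwise.
    """
    def swap(match):
        title = match.group(1).strip()
        tgt_title = langlinks.get(title)
        if tgt_title is not None:
            return entity_token(tgt_lang, tgt_title)
        return entity_token(src_lang, title)

    return ANCHOR.sub(swap, text)


# Toy Italian sentence; "Borgo Inventato" is a made-up page with no English link.
sentence = "[[Roma]] si trova in [[Italia|Repubblica Italiana]] vicino a [[Borgo Inventato]]."
links = {"Roma": "Rome", "Italia": "Italy"}
mixed = replace_anchors(sentence, "it", "en", links)
print(mixed)
```

The resulting text mixes source-language context words with target-language entity tokens where a link exists, which is the kind of data a standard word-embedding trainer could then consume to place both in one space.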

Original language: English (US)
Title of host publication: DeepLo@EMNLP-IJCNLP 2019 - Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing - Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 56-66
Number of pages: 11
ISBN (Electronic): 9781950737789
State: Published - 2021
Event: 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing, DeepLo@EMNLP-IJCNLP 2019 - Hong Kong, China
Duration: Nov 3 2019 → …

Publication series

Name: DeepLo@EMNLP-IJCNLP 2019 - Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing - Proceedings

Conference

Conference: 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language Processing, DeepLo@EMNLP-IJCNLP 2019
Country/Territory: China
City: Hong Kong
Period: 11/3/19 → …

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
