Cross-lingual dataless classification for many languages

Song Yangqiu, Shyam Upadhyay, Peng Haoruo, Dan Roth

Research output: Contribution to journal › Conference article

Abstract

Dataless text classification [Chang et al., 2008] is a classification paradigm that maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign-language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA performs better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, shows that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.
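The label-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy indexes `EN_INDEX` and `ES_INDEX`, the concept identifiers, the weights, and the function names are all hypothetical. The sketch assumes each language has a word-to-concept index over a concept inventory shared across languages, and it uses cosine similarity as one plausible choice of similarity measure.

```python
import math
from collections import defaultdict

# Hypothetical toy indexes: word -> {shared concept id: weight}.
# A real system would derive these from a large multilingual corpus;
# the concepts and weights here are invented for illustration only.
EN_INDEX = {
    "sports": {"C_FOOTBALL": 1.0, "C_OLYMPICS": 0.8},
    "politics": {"C_ELECTION": 1.0, "C_PARLIAMENT": 0.7},
}
ES_INDEX = {
    "fútbol": {"C_FOOTBALL": 1.0},
    "gol": {"C_FOOTBALL": 0.6, "C_OLYMPICS": 0.1},
    "elecciones": {"C_ELECTION": 1.0},
}


def embed(text, index):
    """Sum the concept vectors of all known tokens in `text`."""
    vec = defaultdict(float)
    for token in text.lower().split():
        for concept, weight in index.get(token, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(doc, doc_index, labels, label_index=EN_INDEX):
    """Return the English label most similar to `doc`, plus all scores."""
    doc_vec = embed(doc, doc_index)
    scores = {
        label: cosine(doc_vec, embed(label, label_index)) for label in labels
    }
    return max(scores, key=scores.get), scores


best, scores = classify("El gol de fútbol", ES_INDEX, ["sports", "politics"])
print(best, scores)
```

No labeled Spanish documents are used anywhere in this sketch: the only supervision is the label names themselves, which is the sense in which the approach is dataless.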

Original language: English (US)
Pages (from-to): 2901-2907
Number of pages: 7
Journal: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2016-January
State: Published - Jan 1 2016
Event: 25th International Joint Conference on Artificial Intelligence, IJCAI 2016 - New York, United States
Duration: Jul 9 2016 to Jul 15 2016

ASJC Scopus subject areas

  • Artificial Intelligence

  • Cite this

    Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-lingual dataless classification for many languages. IJCAI International Joint Conference on Artificial Intelligence, 2016-January, 2901-2907.