Cross-lingual dataless classification for many languages

Yangqiu Song, Shyam Upadhyay, Haoruo Peng, Dan Roth

Research output: Contribution to journal > Conference article

Abstract

Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.
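
The label selection step described in the abstract (compare a document's semantic representation with each label's representation in the shared space and keep the most similar label(s)) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sparse concept vectors, the concept names, the use of cosine similarity, and the helper names `cosine` and `best_labels` are all assumptions made for the example, and building real CLESA vectors from Wikipedia is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def best_labels(doc_vec, label_vecs, top_k=1):
    """Return the top_k label names ranked by similarity to the document."""
    ranked = sorted(
        label_vecs,
        key=lambda name: cosine(doc_vec, label_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical concept weights in a shared (language-independent) concept space.
# In the paper's setting the document would be in a foreign language and the
# labels in English; here both are simply given as toy vectors.
label_vecs = {
    "sports": {"Football": 0.9, "Olympic_Games": 0.7},
    "politics": {"Election": 0.8, "Parliament": 0.6},
}
doc_vec = {"Football": 0.5, "Olympic_Games": 0.2, "Election": 0.1}

print(best_labels(doc_vec, label_vecs))
```

No annotated training documents are involved in this sketch: the only inputs are the document vector and the label vectors, which is what makes the setup "dataless".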

Original language: English (US)
Pages (from-to): 2901-2907
Number of pages: 7
Journal: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2016-January
State: Published - Jan 1 2016
Event: 25th International Joint Conference on Artificial Intelligence, IJCAI 2016 - New York, United States
Duration: Jul 9 2016 to Jul 15 2016

Fingerprint

Labels
Semantics
Supervised learning

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-lingual dataless classification for many languages. IJCAI International Joint Conference on Artificial Intelligence, 2016-January, 2901-2907.

Cross-lingual dataless classification for many languages. / Song, Yangqiu; Upadhyay, Shyam; Peng, Haoruo; Roth, Dan. In: IJCAI International Joint Conference on Artificial Intelligence, Vol. 2016-January, 01.01.2016, p. 2901-2907.

Song, Y, Upadhyay, S, Peng, H & Roth, D 2016, 'Cross-lingual dataless classification for many languages', IJCAI International Joint Conference on Artificial Intelligence, vol. 2016-January, pp. 2901-2907.
Song, Yangqiu ; Upadhyay, Shyam ; Peng, Haoruo ; Roth, Dan. / Cross-lingual dataless classification for many languages. In: IJCAI International Joint Conference on Artificial Intelligence. 2016 ; Vol. 2016-January. pp. 2901-2907.
@article{7acd9120c8ed4c1a817ba1400f91b30c,
title = "Cross-lingual dataless classification for many languages",
abstract = "Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.",
author = "Yangqiu Song and Shyam Upadhyay and Haoruo Peng and Dan Roth",
year = "2016",
month = "1",
day = "1",
language = "English (US)",
volume = "2016-January",
pages = "2901--2907",
journal = "IJCAI International Joint Conference on Artificial Intelligence",
issn = "1045-0823",
}

TY - JOUR

T1 - Cross-lingual dataless classification for many languages

AU - Song, Yangqiu

AU - Upadhyay, Shyam

AU - Peng, Haoruo

AU - Roth, Dan

PY - 2016/1/1

Y1 - 2016/1/1

N2 - Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.

AB - Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.

UR - http://www.scopus.com/inward/record.url?scp=85006132687&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85006132687&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85006132687

VL - 2016-January

SP - 2901

EP - 2907

JO - IJCAI International Joint Conference on Artificial Intelligence

JF - IJCAI International Joint Conference on Artificial Intelligence

SN - 1045-0823

ER -