Cross-lingual latent topic extraction

Duo Zhang, Qiaozhu Mei, Cheng Xiang Zhai

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

Original languageEnglish (US)
Title of host publicationACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Conference Proceedings
EditorsJan Hajic, Sandra Carberry, Stephen Clark
PublisherAssociation for Computational Linguistics (ACL)
Number of pages10
ISBN (Electronic)1932432663, 9781932432664
StatePublished - 2010
Event48th Annual Meeting of the Association for Computational Linguistics, ACL 2010 - Uppsala, Sweden
Duration: Jul 11 2010Jul 16 2010

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)0736-587X


Other48th Annual Meeting of the Association for Computational Linguistics, ACL 2010

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics


Dive into the research topics of 'Cross-lingual latent topic extraction'. Together they form a unique fingerprint.

Cite this