TY - GEN
T1 - Cross-lingual latent topic extraction
AU - Zhang, Duo
AU - Mei, Qiaozhu
AU - Zhai, Cheng Xiang
N1 - Funding Information:
We sincerely thank the anonymous reviewers for their comprehensive and constructive comments. The work was supported in part by NASA grant NNX08AC35A, by the National Science Foundation under Grant Numbers IIS-0713581, IIS-0713571, and CNS-0834709, and by a Sloan Research Fellowship.
Publisher Copyright:
© 2010 Association for Computational Linguistics.
PY - 2010
Y1 - 2010
N2 - Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
UR - http://www.scopus.com/inward/record.url?scp=85118444601&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85118444601&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85118444601
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1128
EP - 1137
BT - ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Conference Proceedings
A2 - Hajic, Jan
A2 - Carberry, Sandra
A2 - Clark, Stephen
PB - Association for Computational Linguistics (ACL)
T2 - 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010
Y2 - 11 July 2010 through 16 July 2010
ER -
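
Note: The abstract in the record above describes PCLSA as PLSA whose likelihood is regularized with soft constraints derived from a bilingual dictionary. As a rough illustrative sketch only (not quoted from the paper), an objective of that kind could combine the standard PLSA log-likelihood with a penalty that pushes a word and its dictionary translation toward similar probabilities in each shared topic; the trade-off weight lambda, the pair weights w(u,v), and the exact form of the penalty below are assumptions for illustration.

% Hedged sketch of a dictionary-regularized PLSA objective.
% First term: standard PLSA log-likelihood over documents d and words w.
% Second term: soft constraint over translation pairs (u,v) in dictionary D;
% lambda and w(u,v) are illustrative assumptions, not the paper's exact form.
\begin{align*}
\mathcal{O} ={}& (1-\lambda)\sum_{d}\sum_{w} c(w,d)\,
    \log \sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j) \\
  &{}- \lambda \sum_{(u,v)\in \mathcal{D}} w(u,v)
    \sum_{j=1}^{k} \bigl( p(u \mid \theta_j) - p(v \mid \theta_j) \bigr)^{2}
\end{align*}

Here c(w,d) is the count of word w in document d, the theta_j are the k shared cross-lingual topics, and D is the set of translation pairs from the bilingual dictionary; the penalty term encodes the "soft constraints" mentioned in the abstract by discouraging a word and its translation from receiving very different probabilities under the same topic.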