TY - GEN
T1 - Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations
AU - Meng, Yu
AU - Zhang, Yunyi
AU - Huang, Jiaxin
AU - Zhang, Yu
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/4/25
Y1 - 2022/4/25
AB - Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations, including their inability to model word ordering in documents, the difficulty of incorporating external linguistic knowledge, and the lack of accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought striking performance improvements to a wide variety of tasks thanks to their superior text representations. Yet no standard approach has emerged for deploying PLMs for topic discovery as a better alternative to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted through coherent and distinctive terms while also serving as meaningful summaries of the documents. Our model effectively leverages the strong representation power and rich linguistic features of PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets from different domains, our model generates significantly more coherent and diverse topics than strong topic model baselines and yields better topic-wise document representations, according to both automatic and human evaluations.
KW - Clustering
KW - Pretrained Language Models
KW - Topic Discovery
UR - http://www.scopus.com/inward/record.url?scp=85129835929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129835929&partnerID=8YFLogxK
U2 - 10.1145/3485447.3512034
DO - 10.1145/3485447.3512034
M3 - Conference contribution
AN - SCOPUS:85129835929
T3 - WWW 2022 - Proceedings of the ACM Web Conference 2022
SP - 3143
EP - 3152
BT - WWW 2022 - Proceedings of the ACM Web Conference 2022
PB - Association for Computing Machinery
T2 - 31st ACM World Wide Web Conference, WWW 2022
Y2 - 25 April 2022 through 29 April 2022
ER -