TY - GEN
T1 - TaxoCom
T2 - 31st ACM World Wide Web Conference, WWW 2022
AU - Lee, Dongha
AU - Shen, Jiaming
AU - Kang, Seongku
AU - Yoon, Susik
AU - Han, Jiawei
AU - Yu, Hwanjo
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/4/25
Y1 - 2022/4/25
N2 - Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of content in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct a topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance for discovering the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom closely links its embedding and clustering techniques: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns each term to either one of the known sub-topics or a novel sub-topic. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherence and topic coverage but also outperforms all other baselines on a downstream task.
KW - Hierarchical topic discovery
KW - Novelty detection
KW - Text clustering
KW - Text embedding
KW - Topic taxonomy completion
UR - http://www.scopus.com/inward/record.url?scp=85129844084&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129844084&partnerID=8YFLogxK
U2 - 10.1145/3485447.3512002
DO - 10.1145/3485447.3512002
M3 - Conference contribution
AN - SCOPUS:85129844084
T3 - WWW 2022 - Proceedings of the ACM Web Conference 2022
SP - 2819
EP - 2829
BT - WWW 2022 - Proceedings of the ACM Web Conference 2022
PB - Association for Computing Machinery
Y2 - 25 April 2022 through 29 April 2022
ER -