TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters

Dongha Lee, Jiaming Shen, Seongku Kang, Susik Yoon, Jiawei Han, Hwanjo Yu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance to discover the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns each term to either one of the known sub-topics or a novel sub-topic. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines on a downstream task.
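
As a rough illustration of the idea behind step (ii), the sketch below assigns each term either to its closest known sub-topic or to a "novel" bucket. It is not the authors' algorithm, whose details are not given in this record: the cosine-similarity rule, the fixed `threshold` of 0.7, the function names, and the toy 2-D vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_terms(term_vecs, known_centers, threshold=0.7):
    """Map each term to its most similar known sub-topic, or to "novel".

    Hypothetical rule: a term whose best similarity to every known
    center falls below `threshold` is treated as belonging to a novel
    sub-topic.
    """
    assignments = {}
    for term, vec in term_vecs.items():
        best_topic, best_sim = max(
            ((topic, cosine(vec, center))
             for topic, center in known_centers.items()),
            key=lambda pair: pair[1],
        )
        assignments[term] = best_topic if best_sim >= threshold else "novel"
    return assignments

# Toy data (made up for illustration): two known sub-topics, three terms.
known_centers = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
term_vecs = {
    "soccer": [0.9, 0.1],
    "election": [0.1, 0.95],
    "vaccine": [-0.7, -0.7],
}
result = assign_terms(term_vecs, known_centers)
print(result)
```

In this toy setup, "soccer" and "election" sit close to a known center, while "vaccine" is far from both and ends up in the novel bucket. A real system would still need to group the novel terms into one or more new sub-topic clusters, which this sketch does not attempt.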

Original language: English (US)
Title of host publication: WWW 2022 - Proceedings of the ACM Web Conference 2022
Publisher: Association for Computing Machinery
Pages: 2819-2829
Number of pages: 11
ISBN (Electronic): 9781450390965
State: Published - Apr 25 2022
Event: 31st ACM World Wide Web Conference, WWW 2022 - Virtual, Online, France
Duration: Apr 25 2022 to Apr 29 2022

Publication series

Name: WWW 2022 - Proceedings of the ACM Web Conference 2022

Conference

Conference: 31st ACM World Wide Web Conference, WWW 2022
Country/Territory: France
City: Virtual, Online
Period: 4/25/22 to 4/29/22

Keywords

  • Hierarchical topic discovery
  • Novelty detection
  • Text clustering
  • Text embedding
  • Topic taxonomy completion

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
