Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis

Yu Meng, Jiaxin Huang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

People nowadays are immersed in a wealth of text data, ranging from news articles, to social media, academic publications, advertisements, and economic reports. A grand challenge of data mining is to develop effective, scalable and weakly-supervised methods for extracting actionable structures and knowledge from massive text data. Without requiring extensive and corpus-specific human annotations, these methods will satisfy people's diverse applications and needs for comprehending and making good use of large-scale corpora. In this tutorial, we will introduce recent advances in text embeddings and their applications to a wide range of text mining tasks that facilitate multi-dimensional analysis of massive text corpora. Specifically, we first overview a set of recently developed unsupervised and weakly-supervised text embedding methods including state-of-the-art context-free embeddings and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several embedding-driven text mining techniques that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge, in the form of multi-dimensional topics and multi-faceted taxonomies, from large-scale text corpora. We finally show that the topics and taxonomies so discovered will naturally form a multi-dimensional TextCube structure, which greatly enhances text exploration and analysis for various important applications, including text classification, retrieval and summarization. We will demonstrate on the most recent real-world datasets (including political news articles as well as scientific publications related to the coronavirus) how multi-dimensional analysis of massive text corpora can be conducted with the introduced embedding-driven text mining techniques.

Original languageEnglish (US)
Title of host publicationKDD 2020 - Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages3573-3574
Number of pages2
ISBN (Electronic)9781450379984
DOIs
StatePublished - Aug 23 2020
Event26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020 - Virtual, Online, United States
Duration: Aug 23 2020Aug 27 2020

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020
CountryUnited States
CityVirtual, Online
Period8/23/208/27/20

Keywords

  • massive text corpora
  • multi-dimensional analysis
  • multi-faceted taxonomy
  • text cube
  • text embedding
  • topic mining

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint Dive into the research topics of 'Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis'. Together they form a unique fingerprint.

Cite this