Minimally-supervised structure-rich text categorization via learning on text-rich networks

Xinyang Zhang, Chenwei Zhang, Xin Luna Dong, Jingbo Shang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Text categorization is an essential task in Web content analysis. Because Web data evolve constantly and new categories keep emerging, this paper focuses not on the laborious supervised setting but on the minimally-supervised setting, which aims to categorize documents effectively with only a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, with raw text documents, document attributes, high-quality phrases, and label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus' heterogeneous data sources and enables joint optimization of network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases: a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and the two modules mutually enhance each other by co-training on the pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that, given only three seed documents per category, our framework achieves an accuracy of about 92%, significantly outperforming all compared methods; this accuracy is within 2% of the supervised BERT model trained on about 50K labeled documents.
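
The co-training loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the real modules are learned models, whereas here a word-overlap scorer stands in for the text analysis module and a vote over a shared metadata attribute stands in for the network learning module. The toy corpus, the `brand` attribute, the function names, and the pooling rule (keep a pseudo label unless the two modules disagree) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: raw text plus one metadata attribute per document.
DOCS = {
    "d1": {"text": "wireless noise cancelling headphones", "brand": "acme"},
    "d2": {"text": "bluetooth headphones with microphone", "brand": "acme"},
    "d3": {"text": "stainless steel chef knife", "brand": "bladeco"},
    "d4": {"text": "kitchen knife sharpening set", "brand": "bladeco"},
    "d5": {"text": "portable speaker", "brand": "acme"},
    "d6": {"text": "steel cutting board knife block", "brand": "woodworks"},
}
SEEDS = {"d1": "audio", "d3": "kitchen"}  # one seed document per category


def text_module(labeled, docs):
    """Stand-in text module: label a document by word overlap with labeled text."""
    vocab = defaultdict(Counter)
    for doc_id, label in labeled.items():
        vocab[label].update(docs[doc_id]["text"].split())
    preds = {}
    for doc_id, doc in docs.items():
        if doc_id in labeled:
            continue
        words = doc["text"].split()
        scores = {c: sum(counts[w] for w in words) for c, counts in vocab.items()}
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        # Emit a pseudo label only for a strict, non-zero winner.
        if ranked[0][1] > 0 and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            preds[doc_id] = ranked[0][0]
    return preds


def network_module(labeled, docs):
    """Stand-in network module: labels propagate through shared attribute nodes."""
    votes = defaultdict(Counter)
    for doc_id, label in labeled.items():
        votes[docs[doc_id]["brand"]][label] += 1
    preds = {}
    for doc_id, doc in docs.items():
        if doc_id in labeled:
            continue
        counts = votes.get(doc["brand"])
        if not counts:
            continue  # no labeled neighbor reachable through this attribute
        ranked = counts.most_common(2)
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            preds[doc_id] = ranked[0][0]
    return preds


def co_train(seeds, docs, rounds=3):
    """Alternate both modules and pool their pseudo labels each round."""
    labeled = dict(seeds)
    for _ in range(rounds):
        from_text = text_module(labeled, docs)
        from_net = network_module(labeled, docs)
        pooled = {}
        for doc_id in set(from_text) | set(from_net):
            t, n = from_text.get(doc_id), from_net.get(doc_id)
            # Assumed pooling rule: drop a pseudo label only on disagreement.
            if t is None or n is None or t == n:
                pooled[doc_id] = t if t is not None else n
        if not pooled:
            break  # neither module produced a new confident label
        labeled.update(pooled)
    return labeled


if __name__ == "__main__":
    for doc_id, label in sorted(co_train(SEEDS, DOCS).items()):
        print(doc_id, label)
```

In this toy corpus, `d5` shares no words with any seed and is reachable only through its attribute, while `d6` has an unseen attribute value and is reachable only through its text, so neither stand-in module labels every document alone but the pooled labels cover all of them.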

Original language: English (US)
Title of host publication: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
Publisher: Association for Computing Machinery
Pages: 3258-3268
Number of pages: 11
ISBN (Electronic): 9781450383127
State: Published - Apr 19 2021
Event: 2021 World Wide Web Conference, WWW 2021 - Ljubljana, Slovenia
Duration: Apr 19 2021 to Apr 23 2021

Publication series

Name: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

Conference

Conference: 2021 World Wide Web Conference, WWW 2021
Country/Territory: Slovenia
City: Ljubljana
Period: 4/19/21 to 4/23/21

Keywords

  • Minimal supervision
  • Text categorization
  • Text-rich networks

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
