Minimally Supervised Categorization of Text with Metadata

Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1)the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2)label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines.

Original languageEnglish (US)
Title of host publicationSIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
PublisherAssociation for Computing Machinery, Inc
Pages1231-1240
Number of pages10
ISBN (Electronic)9781450380164
DOIs
StatePublished - Jul 25 2020
Event43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020 - Virtual, Online, China
Duration: Jul 25 2020Jul 30 2020

Publication series

NameSIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020
CountryChina
CityVirtual, Online
Period7/25/207/30/20

Keywords

  • metadata
  • text classification
  • weak supervision

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Information Systems
  • Software

Fingerprint Dive into the research topics of 'Minimally Supervised Categorization of Text with Metadata'. Together they form a unique fingerprint.

Cite this