Out-of-Category Document Identification Using Target-Category Names as Weak Supervision

Dongha Lee, Dongmin Hyun, Jiawei Han, Hwanjo Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing a large text collection. However, due to the absence of explicit information about the inlier (or target) distribution, existing unsupervised outlier detectors are likely to produce unreliable results depending on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task referred to as out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories by using the category names as weak supervision. In practice, this task is widely applicable in that the scope of target categories can be flexibly designated according to users' interests while requiring only the target-category names as minimum guidance. In this paper, we present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories. Our framework adopts a two-step approach to take advantage of both (i) a discriminative text embedding and (ii) a neural text classifier. Experiments on real-world datasets demonstrate that our framework achieves better detection performance than all baseline methods in various scenarios specifying different target categories.
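As a rough illustration of the task described in the abstract, the sketch below scores each document by its highest cosine similarity to any target-category vector and flags low-scoring documents as out-of-category. This is not the authors' framework, which combines a discriminative text embedding with a neural text classifier. The function names, the toy 2-dimensional vectors, and the 0.5 threshold are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def target_confidence(doc_vec, category_vecs):
    """Hypothetical confidence score: similarity to the closest target category."""
    return max(cosine(doc_vec, c) for c in category_vecs)

def out_of_category(doc_vecs, category_vecs, threshold=0.5):
    """Flag documents whose confidence falls below an assumed threshold."""
    return [target_confidence(d, category_vecs) < threshold for d in doc_vecs]

# Toy embeddings (assumed): two target categories and three documents.
categories = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.2, 0.8], [-0.7, -0.7]]

flags = out_of_category(docs, categories)
print(flags)
```

In this toy setup the first two documents lie close to one of the target-category vectors, while the third points away from both and is the only one flagged. In a realistic setting the vectors would come from a text embedding model, and the threshold would need to be chosen per corpus.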

Original language: English (US)
Title of host publication: Proceedings - 21st IEEE International Conference on Data Mining, ICDM 2021
Editors: James Bailey, Pauli Miettinen, Yun Sing Koh, Dacheng Tao, Xindong Wu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1186-1191
Number of pages: 6
ISBN (Electronic): 9781665423984
State: Published - 2021
Event: 21st IEEE International Conference on Data Mining, ICDM 2021 - Virtual, Online, New Zealand
Duration: Dec 7 2021 to Dec 10 2021

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Volume: 2021-December
ISSN (Print): 1550-4786

Conference

Conference: 21st IEEE International Conference on Data Mining, ICDM 2021
Country/Territory: New Zealand
City: Virtual, Online
Period: 12/7/21 to 12/10/21

Keywords

  • Discriminative text embedding
  • Out-of-category detection
  • Text outlier detection
  • Weakly supervised classification

ASJC Scopus subject areas

  • General Engineering
