TY - GEN
T1 - Out-of-Category Document Identification Using Target-Category Names as Weak Supervision
AU - Lee, Dongha
AU - Hyun, Dongmin
AU - Han, Jiawei
AU - Yu, Hwanjo
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing large text collections. However, because explicit information about the inlier (or target) distribution is absent, existing unsupervised outlier detectors tend to produce unreliable results that depend on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories, using the category names as weak supervision. In practice, this task is widely applicable because the scope of the target categories can be flexibly designated according to users' interests while requiring only the target-category names as minimal guidance. In this paper, we present an out-of-category detection framework that effectively measures how confidently each document belongs to one of the target categories. Our framework adopts a two-step approach to take advantage of both (i) a discriminative text embedding and (ii) a neural text classifier. Experiments on real-world datasets demonstrate that our framework achieves better detection performance than all baseline methods in various scenarios specifying different target categories.
KW - Discriminative text embedding
KW - Out-of-category detection
KW - Text outlier detection
KW - Weakly supervised classification
UR - http://www.scopus.com/inward/record.url?scp=85123333926&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123333926&partnerID=8YFLogxK
U2 - 10.1109/ICDM51629.2021.00041
DO - 10.1109/ICDM51629.2021.00041
M3 - Conference contribution
AN - SCOPUS:85123333926
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1186
EP - 1191
BT - Proceedings - 21st IEEE International Conference on Data Mining, ICDM 2021
A2 - Bailey, James
A2 - Miettinen, Pauli
A2 - Koh, Yun Sing
A2 - Tao, Dacheng
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 21st IEEE International Conference on Data Mining, ICDM 2021
Y2 - 7 December 2021 through 10 December 2021
ER -