Mining text outliers in document directories

Edouard Fouche, Yu Meng, Fang Guo, Honglei Zhuang, Klemens Bohm, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Nowadays, it is common to classify collections of documents into (human-generated, domain-specific) directory structures, such as email or document folders. But documents may be classified wrongly, for a multitude of reasons. Then they are outlying w.r.t. the folder they end up in. Orthogonally to this, and more specifically, two kinds of errors can occur: (O) Out-of-distribution: the document does not belong to any existing folder in the directory; and (M) Misclassification: the document belongs to another folder. It is this specific combination of issues that we address in this article, i.e., we mine text outliers from massive document directories, considering both error types. We propose a new proximity-based algorithm, which we dub kj-Nearest Neighbours (kj-NN). Our algorithm detects text outliers by exploiting semantic similarities and introduces a self-supervision mechanism that estimates the relevance of the original labels. Our approach is efficient and robust to large proportions of outliers. kj-NN also promotes the interpretability of the results by proposing alternative label names and by finding the most similar documents for each outlier. Our real-world experiments demonstrate that our approach outperforms the competitors by a large margin.
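The two error types named in the abstract can be illustrated with a plain nearest-neighbour check. The sketch below is not the authors' kj-NN algorithm and omits its self-supervision mechanism. It is a generic baseline under stated assumptions: documents are already embedded as numeric vectors, similarity is cosine similarity, and the function name `flag_outliers`, the parameters `k` and `min_sim`, and the toy data are all hypothetical. A document whose nearest neighbours are all dissimilar is treated as type (O), and a document whose neighbours mostly carry a different folder label is treated as type (M), with that majority label offered as an alternative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def flag_outliers(vectors, labels, k=3, min_sim=0.5):
    """Return one (verdict, suggested_label) pair per document.

    Hypothetical baseline, assuming at least two documents:
    - mean similarity to the k nearest neighbours below min_sim
      -> ("out-of-distribution", None)
    - majority label among the k nearest neighbours differs from the
      document's own label -> ("misclassified", majority_label)
    - otherwise -> ("ok", own_label)
    """
    verdicts = []
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Similarity to every other document, most similar first, top k kept.
        neighbours = sorted(
            ((cosine(vec, other), labels[j])
             for j, other in enumerate(vectors) if j != i),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        mean_sim = sum(sim for sim, _ in neighbours) / len(neighbours)
        majority = Counter(lab for _, lab in neighbours).most_common(1)[0][0]
        if mean_sim < min_sim:
            verdicts.append(("out-of-distribution", None))
        elif majority != label:
            verdicts.append(("misclassified", majority))
        else:
            verdicts.append(("ok", label))
    return verdicts


# Toy embeddings (made up for illustration): two folders, one misfiled
# document, and one document unlike anything in the directory.
vectors = [
    (1.0, 0.1, 0.0), (0.9, 0.2, 0.0), (1.0, 0.0, 0.0), (0.95, 0.15, 0.0),
    (0.1, 1.0, 0.0), (0.2, 0.9, 0.0), (0.0, 1.0, 0.0), (0.15, 0.95, 0.0),
    (0.1, 0.95, 0.0),  # looks like "finance" but filed under "sports"
    (0.0, 0.0, 1.0),   # orthogonal to every other document
]
labels = ["sports"] * 4 + ["finance"] * 4 + ["sports", "finance"]

verdicts = flag_outliers(vectors, labels)
for label, verdict in zip(labels, verdicts):
    print(label, verdict)
```

The brute-force neighbour search is quadratic in the number of documents, so this sketch only suits small collections. It also trusts the neighbours' labels, which breaks down when a large share of them is wrong; the abstract states that kj-NN addresses this by estimating the relevance of the original labels.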

Original language: English (US)
Title of host publication: Proceedings - 20th IEEE International Conference on Data Mining, ICDM 2020
Editors: Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, Xindong Wu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 152-161
Number of pages: 10
ISBN (Electronic): 9781728183169
State: Published - Nov 2020
Externally published: Yes
Event: 20th IEEE International Conference on Data Mining, ICDM 2020 - Virtual, Sorrento, Italy
Duration: Nov 17 2020 to Nov 20 2020

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Volume: 2020-November
ISSN (Print): 1550-4786

Conference

Conference: 20th IEEE International Conference on Data Mining, ICDM 2020
Country: Italy
City: Virtual, Sorrento
Period: 11/17/20 to 11/20/20

Keywords

  • Anomaly Detection
  • Data Cleaning
  • Document Filtering
  • Nearest-Neighbour Search
  • Text Mining

ASJC Scopus subject areas

  • Engineering(all)
