Text classification from positive and unlabeled documents

Research output: Contribution to conference › Paper › peer-review

Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems are more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data are not too under-sampled, SVMC significantly outperforms other methods, because SVMC exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because when there are too few positive training examples, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
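
The convergence behavior described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes one-dimensional data with positives lying to the left of negatives, and it replaces the SVM with its 1-D hard-margin equivalent: a threshold at the midpoint between the closest opposing points. The seeding step (taking the unlabeled points farthest from the positive mean as initial negatives) is a hypothetical stand-in for the mapping stage, and the names `fit_threshold`, `mapping_convergence`, and `n_seed` are illustrative only.

```python
def fit_threshold(positives, negatives):
    """1-D max-margin separator: midpoint between the closest opposing points.

    Assumes every positive lies to the left of every negative.
    """
    return (max(positives) + min(negatives)) / 2.0


def mapping_convergence(positives, unlabeled, n_seed=2):
    """Iteratively grow the negative set from unlabeled data (toy sketch)."""
    # Seeding (hypothetical stand-in for the mapping stage): treat the
    # unlabeled points farthest from the positive mean as initial negatives.
    center = sum(positives) / len(positives)
    ranked = sorted(unlabeled, key=lambda x: abs(x - center), reverse=True)
    negatives, remaining = ranked[:n_seed], ranked[n_seed:]

    while True:
        threshold = fit_threshold(positives, negatives)
        # Unlabeled points on the negative side join the negative set.
        newly_negative = [x for x in remaining if x >= threshold]
        if not newly_negative:
            return threshold  # boundary has stopped moving
        negatives += newly_negative
        remaining = [x for x in remaining if x < threshold]


# Positives occupy [0, 2], negatives occupy [3, 6]; the gap is (2, 3).
unlabeled_a = [0.3, 0.8, 1.2, 1.9, 3.0, 3.5, 4.0, 5.0, 6.0]
well_sampled = mapping_convergence([0.0, 0.5, 1.0, 1.5, 2.0], unlabeled_a)
print(well_sampled)   # 2.5, the boundary settles inside the gap

unlabeled_b = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0, 6.0]
under_sampled = mapping_convergence([0.0], unlabeled_b)
print(under_sampled)  # 0.25, the boundary collapses around the lone positive
```

In the well-sampled run the labeled positives reach the edge of the gap, so no unlabeled point falls on the negative side once the boundary enters the gap, and the iteration stops there. With a single labeled positive, the midpoint keeps landing among unlabeled positives, so the boundary crosses the gap and shrinks toward the labeled point.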

Original language: English (US)
Pages: 232-239
Number of pages: 8
State: Published - 2003
Event: CIKM 2003: Proceedings of the Twelfth ACM International Conference on Information and Knowledge Management - New Orleans, LA, United States
Duration: Nov 3 2003 to Nov 8 2003

Other

Other: CIKM 2003: Proceedings of the Twelfth ACM International Conference on Information and Knowledge Management
Country/Territory: United States
City: New Orleans, LA
Period: 11/3/03 to 11/8/03

Keywords

  • Machine Learning
  • SVM
  • Text Classification
  • Text Filtering

ASJC Scopus subject areas

  • General Decision Sciences
  • General Business, Management and Accounting
