Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems are more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of performing Text Classification WithOut labeled Negative data (TC-WON) and explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data are not too under-sampled, SVMC significantly outperforms other methods because it exploits the natural "gaps" between positive and negative documents in the feature space, which in turn improves generalization performance. Such gaps are likely to be plentiful in the text domain because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up producing very poor results: with too few positive examples, the boundary over-iterates, trespassing the natural gaps between the positive and negative classes, and ends up fitting tightly around the few positive training examples.
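
To make the setting concrete, below is a minimal Python sketch (using numpy and scikit-learn) of the iterative Mapping-Convergence scheme that SVMC builds on: a weak initial classifier extracts "strong negatives" from the unlabeled set, and an SVM is then retrained repeatedly, each round absorbing the unlabeled documents it classifies as negative so that the boundary converges toward the positive class. This is an illustration under stated assumptions, not the authors' exact SVMC procedure; the function name, the Rocchio-style initializer, and the 10% strong-negative threshold are all hypothetical.

# Minimal sketch of the Mapping-Convergence scheme underlying SVMC.
# Illustrative only: names, the centroid heuristic, the threshold, and
# the stopping rule are assumptions, not the authors' exact algorithm.
import numpy as np
from sklearn.svm import LinearSVC

def mapping_convergence(X_pos, X_unl, strong_neg_pct=10, max_iter=20):
    """Learn a binary classifier from positive and unlabeled documents.

    X_pos : (n_pos, d) array of positive document vectors
    X_unl : (n_unl, d) array of unlabeled document vectors
    """
    # Step 1: a weak initial classifier picks out "strong negatives" --
    # here, the unlabeled documents least cosine-similar to the
    # positive centroid (a Rocchio-style heuristic).
    centroid = X_pos.mean(axis=0)
    sims = X_unl @ centroid / (
        np.linalg.norm(X_unl, axis=1) * np.linalg.norm(centroid) + 1e-12)
    is_neg = sims <= np.percentile(sims, strong_neg_pct)

    svm = None
    for _ in range(max_iter):
        # Step 2: train an SVM on positives vs. the current negatives.
        X = np.vstack([X_pos, X_unl[is_neg]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(is_neg.sum())]
        svm = LinearSVC().fit(X, y)

        # Step 3: absorb any still-unlabeled documents that the new
        # boundary classifies as negative; converge when none move.
        remaining = ~is_neg
        if not remaining.any():
            break
        newly_neg = svm.predict(X_unl[remaining]) == 0
        if not newly_neg.any():
            break
        is_neg[np.where(remaining)[0][newly_neg]] = True
    return svm

The failure mode described in the abstract corresponds to step 3 of this loop: with too few positive examples, the boundary can keep contracting past the natural gaps between the classes, so the crux of SVMC lies in controlling when this convergence stops.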

Original language: English (US)
Pages: 232-239
Number of pages: 8
State: Published - Dec 1 2003
Event: CIKM 2003: Proceedings of the Twelfth ACM International Conference on Information and Knowledge Management - New Orleans, LA, United States
Duration: Nov 3 2003 - Nov 8 2003

Keywords

  • Machine Learning
  • SVM
  • Text Classification
  • Text Filtering

ASJC Scopus subject areas

  • Decision Sciences (all)
  • Business, Management and Accounting (all)

Cite this

Yu, H., Zhai, C., & Han, J. (2003). Text classification from positive and unlabeled documents. Paper presented at CIKM 2003: Proceedings of the Twelfth ACM International Conference on Information and Knowledge Management, New Orleans, LA, United States, pp. 232-239.
