Abstract
Web page classification is one of the essential techniques for Web mining. Specifically, classifying Web pages that belong to a class of interest to the user is the first step in mining interesting information from the Web. However, constructing a classifier for such a class requires laborious pre-processing, such as collecting positive and negative training examples. For instance, to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of non-homepages (negative examples). Collecting negative training examples is especially arduous and requires special care to avoid bias. In this paper we introduce the Positive Example Based Learning (PEBL) framework for Web page classification, which eliminates the need to collect negative training examples manually during pre-processing. We present an algorithm called Mapping-Convergence (M-C) that achieves classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). Our experiments show that when the M-C algorithm uses the same number of positive examples as a traditional SVM, it performs as well as the traditional SVM.
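To make the idea of learning from positive and unlabeled data concrete, the following is a minimal sketch of one generic iterative scheme. It is not the paper's M-C algorithm: the abstract does not spell out its steps, so the seeding heuristic (treat the unlabeled points farthest from the positive centroid as initial negatives), the nearest-centroid classifier used here in place of an SVM, the toy 2-D data, and the names `pu_learn`, `is_positive`, and `n_seed` are all assumptions made for illustration.

```python
from math import dist  # available in Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pu_learn(positives, unlabeled, n_seed=2, max_iter=100):
    """Illustrative positive-unlabeled loop (hypothetical, not the paper's M-C).

    Returns (positive_centroid, negative_centroid, inferred_negatives).
    """
    pos_c = centroid(positives)

    # Seed step (assumed heuristic): the n_seed unlabeled points farthest
    # from the positive centroid become the initial negatives.
    ranked = sorted(unlabeled, key=lambda x: dist(x, pos_c), reverse=True)
    negatives, remaining = ranked[:n_seed], ranked[n_seed:]

    for _ in range(max_iter):
        neg_c = centroid(negatives)
        # Stand-in classifier (nearest centroid, not an SVM): a remaining
        # point is labeled negative if it is closer to the negative centroid.
        newly_negative = [x for x in remaining
                          if dist(x, neg_c) < dist(x, pos_c)]
        if not newly_negative:
            break  # no unlabeled point changed sides, so stop iterating
        negatives += newly_negative
        remaining = [x for x in remaining if x not in newly_negative]

    return pos_c, centroid(negatives), negatives


def is_positive(x, pos_c, neg_c):
    """Classify x with the final pair of centroids."""
    return dist(x, pos_c) <= dist(x, neg_c)


# Toy data: labeled positives, plus an unlabeled mix of both classes.
positives = [(8.0, 8.0), (9.0, 9.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
unlabeled = [(9.5, 10.0), (10.5, 9.5), (8.5, 9.0),
             (0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 2.5), (4.0, 3.0)]

pos_c, neg_c, negatives = pu_learn(positives, unlabeled)
```

On this toy data the loop starts from two seed negatives, absorbs the other three far-away unlabeled points in the first pass, and stops on the second pass because no remaining point is closer to the negative centroid. No negative example is ever labeled by hand, which is the property the abstract emphasizes.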
Original language | English (US) |
---|---|
Pages | 239-248 |
Number of pages | 10 |
State | Published - 2002 |
Event | KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Edmonton, Alta, Canada. Duration: Jul 23 2002 → Jul 26 2002 |
Other
Other | KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
---|---|
Country/Territory | Canada |
City | Edmonton, Alta |
Period | 7/23/02 → 7/26/02 |
Keywords
- Labeled data
- Mapping-convergence (M-C) algorithm
- SVM (support vector machine)
- Unlabeled data
ASJC Scopus subject areas
- Software
- Information Systems