TY - JOUR

T1 - PEBL

T2 - Web Page Classification without Negative Examples

AU - Yu, Hwanjo

AU - Man, Jiawei

AU - Chang, Kevin Chen Chuan

N1 - Funding Information:
This paper is based on the paper by H. Yu, J. Han, and K.C. Chang, “PEBL: Positive Example Based Learning for Web Page Classification Using SVM,” Proc. 2002 Int’l Conf. Knowledge Discovery in Databases (KDD ’02), pp. 229-238, July 2002. However, this submission is substantially enhanced in technical contents and contains new and major-value added technical contribution in comparison with our conference publication. This work was supported in part by the US National Science Foundation under Grant No. IIS-02-09199 and IIS-01-33199, the University of Illinois, and an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

PY - 2004/1

Y1 - 2004/1

N2 - Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. This paper presents a framework, called Positive Example Based Learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called Mapping-Convergence (M-C), to achieve high classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.

AB - Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. This paper presents a framework, called Positive Example Based Learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called Mapping-Convergence (M-C), to achieve high classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.

KW - Document classification

KW - Mapping-Convergence (M-C) algorithm

KW - SVM (Support Vector Machine)

KW - Single-class classification

KW - Web mining

KW - Web page classification

UR - http://www.scopus.com/inward/record.url?scp=0742268826&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0742268826&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2004.1264823

DO - 10.1109/TKDE.2004.1264823

M3 - Article

AN - SCOPUS:0742268826

SN - 1041-4347

VL - 16

SP - 70

EP - 81

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 1

ER -