Making SVMs scalable to large data sets using hierarchical cluster indexing

Hwanjo Yu, Jiong Yang, Jiawei Han, Xiaolei Li

Research output: Contribution to journal › Review article

Abstract

Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans over the entire data set, and it is also infeasible to hold the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes SVM performance for very large data sets given a limited amount of resources, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much smaller than the size of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.
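
As a concrete illustration of the clustering-then-training idea described in the abstract, the sketch below summarizes a large data set with a hierarchical micro-clustering pass and trains an SVM on the resulting cluster centroids. It uses scikit-learn's Birch (a CF-tree-based micro-clustering algorithm), SVC, and a synthetic data set as stand-ins; it is not the authors' CB-SVM, which additionally declusters entries near the decision boundary and retrains.

# Illustrative sketch only: cluster first, then train the SVM on the cluster
# summaries. Birch and SVC stand in for the paper's components; CB-SVM's
# decluster-and-retrain refinement near the boundary is omitted here.
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A synthetic stand-in for a data set too large to feed to the SVM directly.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

def micro_cluster_centroids(X_class, threshold=2.0):
    """Build a CF tree in a single scan and return the micro-cluster centroids.
    A larger threshold yields fewer, coarser summaries."""
    birch = Birch(threshold=threshold, n_clusters=None)
    birch.fit(X_class)
    return birch.subcluster_centers_

# Summarize each class separately so every centroid keeps an unambiguous label.
pos = micro_cluster_centroids(X[y == 1])
neg = micro_cluster_centroids(X[y == 0])

X_train = np.vstack([pos, neg])
y_train = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

# Train the SVM on the much smaller set of cluster summaries.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(f"{len(X):,} points summarized into {len(X_train):,} training samples")
print("accuracy on the full data set:", svm.score(X, y))

Clustering each class separately keeps the label of every summary unambiguous, and the clustering threshold is the knob that trades the size of the reduced training set against how faithfully it represents the original data.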

Original language: English (US)
Pages (from-to): 295-321
Number of pages: 27
Journal: Data Mining and Knowledge Discovery
Volume: 11
Issue number: 3
DOIs: 10.1007/s10618-005-0005-7
State: Published - Nov 1 2005

Fingerprint

  • Support vector machines
  • Data mining
  • Data storage equipment
  • Clustering algorithms
  • Regression analysis
  • Pattern recognition
  • Learning systems
  • Experiments

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

Making SVMs scalable to large data sets using hierarchical cluster indexing. / Yu, Hwanjo; Yang, Jiong; Han, Jiawei; Li, Xiaolei.

In: Data Mining and Knowledge Discovery, Vol. 11, No. 3, 01.11.2005, p. 295-321.

@article{1b2a511ca79e420b8456222e28ed0b0f,
title = "Making SVMs scalable to large data sets using hierarchical cluster indexing",
abstract = "Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans on the entire data set, and it is also infeasible to put the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes the SVM performance for very large data sets given a limited amount of resource, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much less than that of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.",
author = "Hwanjo Yu and Jiong Yang and Jiawei Han and Xiaolei Li",
year = "2005",
month = "11",
day = "1",
doi = "10.1007/s10618-005-0005-7",
language = "English (US)",
volume = "11",
pages = "295--321",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "3",

}
