Classifying large data sets using SVMs with hierarchical clusters

Hwanjo Yu, Jiong Yang, Jiawei Han

Research output: Contribution to conference › Paper

Abstract

Support vector machines (SVMs) have been promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods hardly provide. However, despite these prominent properties, SVMs are not as favored for large-scale data mining as for pattern recognition or machine learning, because the training complexity of SVMs is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data set are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high-quality samples that carry statistical summaries of the data, such that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also generating high classification accuracy.
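The single-scan summarization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' CB-SVM implementation: it builds a flat (non-hierarchical) set of micro-clusters per class in one pass and keeps only a count, linear sum, and squared sum for each. The names `MicroCluster`, `micro_cluster`, `summarize`, and the `threshold` parameter are hypothetical and chosen for illustration only. The resulting centroids (with their counts as optional weights) would then be handed to any SVM trainer in place of the raw records.

```python
import math


class MicroCluster:
    """Statistical summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                 # per-dimension linear sum
        self.ss = sum(x * x for x in point)   # sum of squared norms

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square distance of the members from the centroid.
        c = self.centroid()
        var = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(var, 0.0))

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
        self.ss += sum(x * x for x in point)


def micro_cluster(points, threshold):
    """Single pass: absorb each point into the nearest cluster or open a new one."""
    clusters = []
    for p in points:
        best, best_d = None, float("inf")
        for mc in clusters:
            d = math.dist(p, mc.centroid())
            if d < best_d:
                best, best_d = mc, d
        if best is not None and best_d <= threshold:
            best.add(p)
        else:
            clusters.append(MicroCluster(p))
    return clusters


def summarize(positives, negatives, threshold):
    """Cluster each class separately; return (centroid, label, count) samples."""
    samples = []
    for label, pts in ((+1, positives), (-1, negatives)):
        for mc in micro_cluster(pts, threshold):
            samples.append((mc.centroid(), label, mc.n))
    return samples


if __name__ == "__main__":
    pos = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    neg = [(10.0, 10.0), (10.2, 10.0)]
    for centroid, label, count in summarize(pos, neg, threshold=1.0):
        print(label, count, centroid)
```

Clustering each class separately keeps every summary point unambiguous in its label. The hierarchical variant described in the abstract would additionally organize such summaries into a tree, which this sketch omits.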

Original language: English (US)
Pages: 306-315
Number of pages: 10
DOI: https://doi.org/10.1145/956750.956786
State: Published - Dec 1 2003
Event: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration: Aug 24 2003 → Aug 27 2003

Fingerprint

Support vector machines
Data mining
Clustering algorithms
Regression analysis
Pattern recognition
Learning systems
Experiments

Keywords

  • Hierarchical cluster
  • Support vector machines

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Yu, H., Yang, J., & Han, J. (2003). Classifying large data sets using SVMs with hierarchical clusters. 306-315. Paper presented at 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, Washington, DC, United States. https://doi.org/10.1145/956750.956786
