An objective evaluation criterion for clustering

Arindam Banerjee, John Langford

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We propose and test an objective criterion for evaluation of clustering performance: How well does a clustering algorithm run on unlabeled data aid a classification algorithm? The accuracy is quantified using the PAC-MDL bound [3] in a semisupervised setting. Clustering algorithms which naturally separate the data according to (hidden) labels with a small number of clusters perform well. A simple extension of the argument leads to an objective model selection method. Experimental results on text analysis datasets demonstrate that this approach empirically results in very competitive bounds on test set performance on natural datasets.

Original languageEnglish (US)
Title of host publicationKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages515-520
Number of pages6
ISBN (Print)1581138881, 9781581138887
DOIs
StatePublished - 2004
Externally publishedYes
EventKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Seattle, WA, United States
Duration: Aug 22 2004Aug 25 2004

Publication series

NameKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

OtherKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/TerritoryUnited States
CitySeattle, WA
Period8/22/048/25/04

Keywords

  • Clustering
  • Evaluation
  • MDL
  • PAC bounds

ASJC Scopus subject areas

  • General Engineering

Fingerprint

Dive into the research topics of 'An objective evaluation criterion for clustering'. Together they form a unique fingerprint.

Cite this