A sampling-based framework for parallel data mining

Shengnan Cong, Jiawei Han, Jay Hoeflinger, David Padua

Research output: Contribution to conference › Paper › peer-review

Abstract

The goal of data mining algorithms is to discover useful information embedded in large databases. Frequent itemset mining and sequential pattern mining are two important data mining problems with broad applications. Perhaps the most efficient way to solve these problems sequentially is to apply a pattern-growth algorithm, which is a divide-and-conquer algorithm [9, 10]. In this paper, we present a framework for parallel mining of frequent itemsets and sequential patterns based on the divide-and-conquer strategy of pattern growth. We then discuss the load balancing problem and introduce a sampling technique, called selective sampling, to address it. We implemented parallel versions of both frequent itemset and sequential pattern mining algorithms following our framework. The experimental results show that our parallel algorithms usually achieve excellent speedups.
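The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea, under stated assumptions: pattern growth splits the search space into one independent subtask per frequent item, a sample of the database is used to estimate each subtask's cost, and subtasks are then spread over workers. The function names (`project`, `mine`, `assign_tasks`, `mine_partitioned`), the cost proxy (projected sample size), and the greedy assignment are all hypothetical illustrations. The sample here is simply whatever subset the caller passes in, which is a stand-in and not the paper's selective sampling.

```python
from collections import Counter
import heapq


def project(db, item):
    """Projected database for `item`: the transactions that contain it,
    restricted to items that sort after it (so each itemset is found once)."""
    return [[x for x in t if x > item] for t in db if item in t]


def mine(db, min_support, prefix=()):
    """Sequential pattern growth (divide and conquer).
    Returns {itemset_tuple: support} for all frequent itemsets."""
    result = {}
    counts = Counter(i for t in db for i in set(t))
    for item in sorted(counts):
        if counts[item] >= min_support:
            pattern = prefix + (item,)
            result[pattern] = counts[item]
            # Recurse on the smaller, independent projected database.
            result.update(mine(project(db, item), min_support, pattern))
    return result


def assign_tasks(costs, n_workers):
    """Greedy balancing (an assumption, not taken from the paper):
    the most expensive remaining task goes to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]
    buckets = [[] for _ in range(n_workers)]
    for task, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        buckets[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return buckets


def mine_partitioned(db, min_support, n_workers, sample):
    """One subtask per frequent item; costs are estimated on `sample`.
    The outer loop runs sequentially here; in a parallel setting each
    bucket would be handed to its own processor."""
    counts = Counter(i for t in db for i in set(t))
    items = [i for i in sorted(counts) if counts[i] >= min_support]
    # Cost proxy (hypothetical): total size of the projected sample.
    costs = {i: sum(len(t) for t in project(sample, i)) for i in items}
    merged = {}
    for bucket in assign_tasks(costs, n_workers):
        for item in bucket:
            merged[(item,)] = counts[item]
            merged.update(mine(project(db, item), min_support, (item,)))
    return merged
```

Because the subtasks are independent, `mine_partitioned` should return the same itemsets as `mine` regardless of how tasks are assigned; only the balance of work across workers depends on the quality of the cost estimates.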

Original language: English (US)
Pages: 255-265
Number of pages: 11
State: Published - 2005
Event: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05 - Chicago, IL, United States
Duration: Jun 15 2005 → Jun 17 2005

Conference

Conference: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05
Country/Territory: United States
City: Chicago, IL
Period: 6/15/05 → 6/17/05

Keywords

  • Data mining
  • Load balancing
  • Parallel algorithms
  • Sampling

ASJC Scopus subject areas

  • Software
