Extracting redundancy-aware top-k patterns

Dong Xin, Hong Cheng, Xifeng Yan, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Observed in many applications, there is a potential need of extracting a small set of frequent patterns having not only high significance but also low redundancy. The significance is usually defined by the context of applications. Previous studies have been concentrating on how to compute top-k significant patterns or how to remove redundancy among patterns separately. There is limited work on finding those top-k patterns which demonstrate high significance and low redundancy simultaneously. In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known to be NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional databases.
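
The greedy idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact MMS definition, so everything below is an assumption made for illustration: the gain of a candidate is taken to be its significance minus its largest redundancy with any already-selected pattern, redundancy is taken to be Jaccard overlap scaled by the smaller of the two significances, and the names greedy_top_k, jaccard_redundancy, and the toy patterns are hypothetical. It is not the authors' implementation and carries no claim about the O(log k) bound.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

Pattern = FrozenSet[int]


def greedy_top_k(
    patterns: Sequence[Pattern],
    significance: Callable[[Pattern], float],
    redundancy: Callable[[Pattern, Pattern], float],
    k: int,
) -> List[Pattern]:
    """Greedily pick up to k patterns by marginal gain (assumed definition)."""
    selected: List[Pattern] = []
    remaining = list(patterns)

    def gain(p: Pattern) -> float:
        # Assumption: marginal gain = significance minus the worst-case
        # redundancy against the patterns chosen so far.
        if not selected:
            return significance(p)
        return significance(p) - max(redundancy(p, q) for q in selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): itemsets with hand-assigned significance scores.
a: Pattern = frozenset({1, 2, 3})
b: Pattern = frozenset({1, 2, 3, 4})  # nearly a duplicate of a
c: Pattern = frozenset({7, 8})        # unrelated to a and b
scores: Dict[Pattern, float] = {a: 1.0, b: 0.9, c: 0.5}


def jaccard_redundancy(p: Pattern, q: Pattern) -> float:
    # Assumption: redundancy = Jaccard overlap scaled by the smaller score.
    overlap = len(p & q) / len(p | q)
    return overlap * min(scores[p], scores[q])


result = greedy_top_k([a, b, c], scores.__getitem__, jaccard_redundancy, k=2)
plain_top_2 = sorted(scores, key=scores.__getitem__, reverse=True)[:2]
```

On this toy input, ranking by significance alone keeps the two near-duplicate patterns a and b, while the redundancy-aware selection keeps a and the unrelated pattern c, because b's gain after a is chosen drops below c's.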

Original language: English (US)
Title of host publication: KDD 2006
Subtitle of host publication: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 444-453
Number of pages: 10
ISBN (Print): 1595933395, 9781595933393
State: Published - 2006
Event: KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Philadelphia, PA, United States
Duration: Aug 20 2006 to Aug 23 2006

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume: 2006

Other

Other: KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/Territory: United States
City: Philadelphia, PA
Period: 8/20/06 to 8/23/06

Keywords

  • Pattern Extraction
  • Redundancy
  • Significance

ASJC Scopus subject areas

  • Software
  • Information Systems
