A general framework for mining concept-drifting data streams with skewed distributions

Jing Gao, Wei Fan, Jiawei Han, Philip S. Yu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams and cannot handle highly skewed (e.g., few positives but many negatives) and stochastic distributions well, although these are typical in many data stream applications. In this paper, we propose a new approach to mining data streams that estimates reliable posterior probabilities using an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives. We formally show some interesting and important properties of the proposed framework, e.g., reliability of estimated probabilities on the skewed positive class, accuracy of estimated probabilities, efficiency, and scalability. Experiments are performed on several synthetic as well as real-world datasets with skewed distributions, and they demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and prediction accuracy.
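The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian naive Bayes base learner, the disjoint partition of negatives, the plain averaging of member outputs, and all names (`GaussianNB`, `train_ensemble`, `posterior`) are assumptions chosen for illustration. Each ensemble member sees every positive example plus a different under-sample of the negatives, and the posterior estimate is the mean of the members' outputs.

```python
import math
import random
from statistics import mean, pvariance


class GaussianNB:
    """Tiny two-class Gaussian naive Bayes (stand-in base learner, an assumption)."""

    def fit(self, X, y):
        self.stats = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            cols = list(zip(*rows))
            self.stats[label] = (
                len(rows) / len(X),                      # class prior
                [mean(c) for c in cols],                 # per-feature mean
                [pvariance(c) + 1e-6 for c in cols],     # per-feature variance, smoothed
            )
        return self

    def predict_proba(self, x):
        """Return the estimated probability that x belongs to the positive class."""
        log_joint = {}
        for label, (prior, mus, variances) in self.stats.items():
            ll = math.log(prior)
            for v, mu, var in zip(x, mus, variances):
                ll += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            log_joint[label] = ll
        top = max(log_joint.values())                    # subtract max for stability
        weights = {k: math.exp(v - top) for k, v in log_joint.items()}
        return weights[1] / (weights[0] + weights[1])


def train_ensemble(positives, negatives, n_models=5, seed=0):
    """Train n_models learners; each reuses all positives and one slice of negatives."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    models = []
    for i in range(n_models):
        neg_part = shuffled[i::n_models]                 # disjoint under-sample (assumption)
        X = positives + neg_part
        y = [1] * len(positives) + [0] * len(neg_part)
        models.append(GaussianNB().fit(X, y))
    return models


def posterior(models, x):
    """Average the members' positive-class probabilities."""
    return sum(m.predict_proba(x) for m in models) / len(models)
```

Because each member is trained on a rebalanced sample, the averaged estimate reflects the rebalanced class ratio rather than the original skew. In a streaming setting, `train_ensemble` would presumably be rerun per incoming chunk with the positives retained from earlier chunks.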

Original language: English (US)
Title of host publication: Proceedings of the 7th SIAM International Conference on Data Mining
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 3-14
Number of pages: 12
ISBN (Print): 9780898716306
State: Published - 2007
Event: 7th SIAM International Conference on Data Mining - Minneapolis, MN, United States
Duration: Apr 26 2007 to Apr 28 2007

Publication series

Name: Proceedings of the 7th SIAM International Conference on Data Mining

Other

Other: 7th SIAM International Conference on Data Mining
Country/Territory: United States
City: Minneapolis, MN
Period: 4/26/07 to 4/28/07

ASJC Scopus subject areas

  • General Engineering
