TY - GEN
T1 - A general framework for mining concept-drifting data streams with skewed distributions
AU - Gao, Jing
AU - Fan, Wei
AU - Han, Jiawei
AU - Yu, Philip S.
PY - 2007
Y1 - 2007
N2 - In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well rather skewed (e.g., few positives but lots of negatives) and stochastic distributions, which are typical in many data stream applications. In this paper, we propose a new approach to mine data streams by estimating reliable posterior probabilities using an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives. We formally show some interesting and important properties of the proposed framework, e.g., reliability of estimated probabilities on skewed positive class, accuracy of estimated probabilities, efficiency and scalability. Experiments are performed on several synthetic as well as real-world datasets with skewed distributions, and they demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and prediction accuracy.
AB - In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well rather skewed (e.g., few positives but lots of negatives) and stochastic distributions, which are typical in many data stream applications. In this paper, we propose a new approach to mine data streams by estimating reliable posterior probabilities using an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives. We formally show some interesting and important properties of the proposed framework, e.g., reliability of estimated probabilities on skewed positive class, accuracy of estimated probabilities, efficiency and scalability. Experiments are performed on several synthetic as well as real-world datasets with skewed distributions, and they demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and prediction accuracy.
UR - http://www.scopus.com/inward/record.url?scp=70449102582&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70449102582&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972771.1
DO - 10.1137/1.9781611972771.1
M3 - Conference contribution
AN - SCOPUS:70449102582
SN - 9780898716306
T3 - Proceedings of the 7th SIAM International Conference on Data Mining
SP - 3
EP - 14
BT - Proceedings of the 7th SIAM International Conference on Data Mining
PB - Society for Industrial and Applied Mathematics Publications
T2 - 7th SIAM International Conference on Data Mining
Y2 - 26 April 2007 through 28 April 2007
ER -