TY - GEN
T1 - On appropriate assumptions to mine data streams
T2 - 7th IEEE International Conference on Data Mining, ICDM 2007
AU - Gao, Jing
AU - Fan, Wei
AU - Han, Jiawei
PY - 2007
Y1 - 2007
N2 - Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Most existing work makes the implicit assumption that the training data and the yet-to-come testing data are always sampled from the "same distribution", and yet this "same distribution" evolves over time. We demonstrate that this may not be true, and that one may never know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. In practice, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting the same class label. Importantly, we formally and experimentally demonstrate the robustness of a model averaging and simple voting-based framework for data streams, particularly when incoming data "continuously follows significantly different" distributions. On a real streaming data set, this framework reduces the expected error of baseline models by 60% and remains the most accurate compared to those baseline models.
UR - http://www.scopus.com/inward/record.url?scp=49749130418&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49749130418&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2007.96
DO - 10.1109/ICDM.2007.96
M3 - Conference contribution
AN - SCOPUS:49749130418
SN - 0769530184
SN - 9780769530185
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 143
EP - 152
BT - Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007
Y2 - 28 October 2007 through 31 October 2007
ER -