Abstract
Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment where large volumes of data appear at a high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation of both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
Original language | English (US) |
---|---|
Pages (from-to) | 213-244 |
Number of pages | 32 |
Journal | Knowledge and Information Systems |
Volume | 33 |
Issue number | 1 |
DOIs | |
State | Published - Oct 2012 |
Keywords
- Concept drift
- Data stream classification
- Ensemble classification
- Semi-supervised clustering
ASJC Scopus subject areas
- Software
- Information Systems
- Human-Computer Interaction
- Hardware and Architecture
- Artificial Intelligence