Facing the reality of data stream classification: Coping with scarcity of labeled data

Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza

Research output: Contribution to journal › Article › peer-review

Abstract

Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data appear at high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
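
The idea of building each model as a set of labeled micro-clusters from a partially labeled chunk, then voting across an ensemble of such models, can be illustrated with a minimal sketch. This is not the authors' algorithm: plain k-means stands in for the semi-supervised clustering step, each micro-cluster is reduced to a centroid plus the majority label of its few labeled members, and all names (build_model, classify, the toy chunks) are hypothetical.

```python
import math
from collections import Counter


def _assign(chunk, centroids):
    """Group (features, label) pairs by their nearest centroid."""
    groups = [[] for _ in centroids]
    for x, y in chunk:
        j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))
        groups[j].append((x, y))
    return groups


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def build_model(chunk, k, iters=10):
    """Build one model from a chunk of (features, label_or_None) pairs.

    Assumption: plain k-means replaces the semi-supervised clustering
    described in the abstract. Returns a list of (centroid, label)
    micro-cluster summaries; clusters without any labeled point are dropped.
    """
    # Deterministic initialisation from the first k points (illustration only).
    centroids = [tuple(x) for x, _ in chunk[:k]]
    for _ in range(iters):
        groups = _assign(chunk, centroids)
        centroids = [
            _mean([x for x, _ in g]) if g else c
            for g, c in zip(groups, centroids)
        ]
    model = []
    for g, c in zip(_assign(chunk, centroids), centroids):
        labels = Counter(y for _, y in g if y is not None)
        if labels:  # label the micro-cluster by its labeled members' majority
            model.append((c, labels.most_common(1)[0][0]))
    return model


def classify(ensemble, x):
    """Majority vote over models; each model votes with its nearest micro-cluster."""
    votes = Counter()
    for model in ensemble:
        if model:
            _, label = min(model, key=lambda mc: math.dist(x, mc[0]))
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Two toy chunks in which only one instance per class carries a label.
chunk1 = [((0.0, 0.1), "a"), ((5.0, 5.1), "b"), ((0.2, 0.0), None),
          ((4.9, 5.0), None), ((0.1, 0.2), None), ((5.1, 4.8), None)]
chunk2 = [((0.3, 0.1), None), ((5.2, 5.0), "b"),
          ((0.0, 0.3), "a"), ((4.8, 5.2), None)]

ensemble = [build_model(chunk1, k=2), build_model(chunk2, k=2)]
print(classify(ensemble, (0.1, 0.1)), classify(ensemble, (5.0, 5.0)))
```

In this toy setting the unlabeled points still shape the centroids, so a handful of labels is enough to name each micro-cluster. A streaming version would also need to replace old models as new chunks arrive, which the sketch omits.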

Original language: English (US)
Pages (from-to): 213-244
Number of pages: 32
Journal: Knowledge and Information Systems
Volume: 33
Issue number: 1
State: Published - Oct 2012

Keywords

  • Concept drift
  • Data stream classification
  • Ensemble classification
  • Semi-supervised clustering

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence
