Abstract
This chapter discusses a framework for clustering evolving data streams. Clustering is a difficult problem in the data stream domain because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. The chapter discusses a fundamentally different philosophy for data stream clustering, which is guided by application-centered requirements. It divides the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. The framework uses the concept of a pyramidal time frame in conjunction with a micro-clustering approach.
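The online/offline split described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the chapter: the names `Summary`, `OnlineComponent`, and `offline_window`, the choice of count, linear sum, and squared sum as the stored statistics, and the fixed `snapshot_every` interval, which stands in for the pyramidal time frame. To stay short, the sketch keeps a single summary rather than a set of micro-clusters.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Summary:
    """Additive statistics for a group of points (hypothetical layout)."""
    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    square_sum: List[float] = field(default_factory=list)

    def absorb(self, point: List[float]) -> None:
        # Allocate the per-dimension sums on the first point.
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
            self.square_sum = [0.0] * len(point)
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x

    def minus(self, older: "Summary") -> "Summary":
        # Because the statistics are sums, subtracting an older snapshot
        # leaves the statistics of the points that arrived after it.
        return Summary(
            self.n - older.n,
            [a - b for a, b in zip(self.linear_sum, older.linear_sum)],
            [a - b for a, b in zip(self.square_sum, older.square_sum)],
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]


class OnlineComponent:
    """Consumes the stream once and stores a snapshot at fixed intervals."""

    def __init__(self, snapshot_every: int) -> None:
        self.snapshot_every = snapshot_every  # assumed fixed interval
        self.current = Summary()
        self.snapshots: Dict[int, Summary] = {}
        self.clock = 0

    def process(self, point: List[float]) -> None:
        self.current.absorb(point)
        self.clock += 1
        if self.clock % self.snapshot_every == 0:
            self.snapshots[self.clock] = copy.deepcopy(self.current)


def offline_window(snapshots: Dict[int, Summary], start: int, end: int) -> Summary:
    """Offline step: uses only stored snapshots, never the raw stream."""
    return snapshots[end].minus(snapshots[start])


if __name__ == "__main__":
    online = OnlineComponent(snapshot_every=4)
    for x in [0.0] * 4 + [10.0] * 4:  # the stream drifts from 0 to 10
        online.process([x])
    print(online.snapshots[8].centroid())                   # whole stream
    print(offline_window(online.snapshots, 4, 8).centroid())  # recent part only
```

The offline query over the recent portion of the stream recovers the drifted centroid, while the whole-stream summary blurs the two regimes together. This is the kind of exploration over different portions of the stream that a purely one-pass view does not offer.
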
Original language | English (US) |
---|---|
Title of host publication | Proceedings 2003 VLDB Conference |
Subtitle of host publication | 29th International Conference on Very Large Databases (VLDB) |
Publisher | Elsevier |
Pages | 81-92 |
Number of pages | 12 |
ISBN (Electronic) | 9780127224428 |
DOIs | |
State | Published - Jan 1 2003 |
Externally published | Yes |
ASJC Scopus subject areas
- General Computer Science