An approximate L1-difference algorithm for massive data streams

Joan Feigenbaum, Sampath Kannan, Martin J. Strauss, Mahesh Viswanathan

Research output: Contribution to journalArticle

Abstract

Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that makes one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for further processing. Moreover, network-generated massive data sets are often distributed. Several different, physically separated network elements may receive or generate data streams that, together, comprise one logical data set; to be of use in operations, the streams must be analyzed locally and their synopses sent to a central operations facility. The enormous scale, distributed nature, and one-pass processing requirement on the data sets of interest must be addressed with new algorithmic techniques. We present one fundamental new technique here: a space-efficient, one-pass algorithm for approximating the L1-difference ∑i|ai-bi| between two functions, when the function values, ai and bi are given as data streams, and their order is chosen by an adversary. Our main technical innovation, which may be of interest outside the realm of massive data stream algorithmics, is a method of constructing families {Vj(s)} of limited-independence random variables that are range-summable, by which we mean that ∑j=0c-1 Vj(s) is computable in time polylog(c) for all seeds s. Our L-1-difference algorithm can be viewed as a "sketching" algorithm, in the sense of [Broder et al., J. Comput. System Sci., 60 (2000), pp. 630-659], and our technique performs better than that of Broder et al. when used to approximate the symmetric difference of two sets with small symmetric difference.

Original languageEnglish (US)
Pages (from-to)131-151
Number of pages21
JournalSIAM Journal on Computing
Volume32
Issue number1
DOIs
StatePublished - Jan 1 2003

Keywords

  • Distance approximation
  • Streaming algorithms

ASJC Scopus subject areas

  • Computer Science(all)
  • Mathematics(all)

Fingerprint Dive into the research topics of 'An approximate L<sup>1</sup>-difference algorithm for massive data streams'. Together they form a unique fingerprint.

  • Cite this