### Abstract

Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that makes one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for further processing. Moreover, network-generated massive data sets are often distributed. Several different, physically separated network elements may receive or generate data streams that, together, comprise one logical data set; to be of use in operations, the streams must be analyzed locally and their synopses sent to a central operations facility. The enormous scale, distributed nature, and one-pass processing requirement on the data sets of interest must be addressed with new algorithmic techniques. We present one fundamental new technique here: a space-efficient, one-pass algorithm for approximating the L^{1}-difference ∑_{i}|a_{i}-b_{i}| between two functions, when the function values, a_{i} and b_{i} are given as data streams, and their order is chosen by an adversary. Our main technical innovation, which may be of interest outside the realm of massive data stream algorithmics, is a method of constructing families {V_{j}(s)} of limited-independence random variables that are range-summable, by which we mean that ∑_{j=0}^{c-1} V_{j}(s) is computable in time polylog(c) for all seeds s. Our L^{-1}-difference algorithm can be viewed as a "sketching" algorithm, in the sense of [Broder et al., J. Comput. System Sci., 60 (2000), pp. 630-659], and our technique performs better than that of Broder et al. when used to approximate the symmetric difference of two sets with small symmetric difference.

Original language | English (US) |
---|---|

Pages (from-to) | 131-151 |

Number of pages | 21 |

Journal | SIAM Journal on Computing |

Volume | 32 |

Issue number | 1 |

DOIs | |

State | Published - Jan 1 2003 |

### Keywords

- Distance approximation
- Streaming algorithms

### ASJC Scopus subject areas

- Computer Science(all)
- Mathematics(all)

## Fingerprint Dive into the research topics of 'An approximate L<sup>1</sup>-difference algorithm for massive data streams'. Together they form a unique fingerprint.

## Cite this

^{1}-difference algorithm for massive data streams.

*SIAM Journal on Computing*,

*32*(1), 131-151. https://doi.org/10.1137/S0097539799361701