TY - JOUR
T1 - An approximate L1-difference algorithm for massive data streams
AU - Feigenbaum, Joan
AU - Kannan, Sampath
AU - Strauss, Martin J.
AU - Viswanathan, Mahesh
N1 - Copyright:
Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2003/1
Y1 - 2003/1
N2 - Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that makes one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for further processing. Moreover, network-generated massive data sets are often distributed. Several different, physically separated network elements may receive or generate data streams that, together, comprise one logical data set; to be of use in operations, the streams must be analyzed locally and their synopses sent to a central operations facility. The enormous scale, distributed nature, and one-pass processing requirement on the data sets of interest must be addressed with new algorithmic techniques. We present one fundamental new technique here: a space-efficient, one-pass algorithm for approximating the L1-difference ∑i|ai-bi| between two functions, when the function values, ai and bi are given as data streams, and their order is chosen by an adversary. Our main technical innovation, which may be of interest outside the realm of massive data stream algorithmics, is a method of constructing families {Vj(s)} of limited-independence random variables that are range-summable, by which we mean that ∑j=0c-1 Vj(s) is computable in time polylog(c) for all seeds s. Our L-1-difference algorithm can be viewed as a "sketching" algorithm, in the sense of [Broder et al., J. Comput. System Sci., 60 (2000), pp. 630-659], and our technique performs better than that of Broder et al. when used to approximate the symmetric difference of two sets with small symmetric difference.
AB - Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that makes one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for further processing. Moreover, network-generated massive data sets are often distributed. Several different, physically separated network elements may receive or generate data streams that, together, comprise one logical data set; to be of use in operations, the streams must be analyzed locally and their synopses sent to a central operations facility. The enormous scale, distributed nature, and one-pass processing requirement on the data sets of interest must be addressed with new algorithmic techniques. We present one fundamental new technique here: a space-efficient, one-pass algorithm for approximating the L1-difference ∑i|ai-bi| between two functions, when the function values, ai and bi are given as data streams, and their order is chosen by an adversary. Our main technical innovation, which may be of interest outside the realm of massive data stream algorithmics, is a method of constructing families {Vj(s)} of limited-independence random variables that are range-summable, by which we mean that ∑j=0c-1 Vj(s) is computable in time polylog(c) for all seeds s. Our L-1-difference algorithm can be viewed as a "sketching" algorithm, in the sense of [Broder et al., J. Comput. System Sci., 60 (2000), pp. 630-659], and our technique performs better than that of Broder et al. when used to approximate the symmetric difference of two sets with small symmetric difference.
KW - Distance approximation
KW - Streaming algorithms
UR - http://www.scopus.com/inward/record.url?scp=0037943836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0037943836&partnerID=8YFLogxK
U2 - 10.1137/S0097539799361701
DO - 10.1137/S0097539799361701
M3 - Article
AN - SCOPUS:0037943836
SN - 0097-5397
VL - 32
SP - 131
EP - 151
JO - SIAM Journal on Computing
JF - SIAM Journal on Computing
IS - 1
ER -