TY - JOUR
T1 - Samza
T2 - 43rd International Conference on Very Large Data Bases, VLDB 2017
AU - Noghabi, Shadi A.
AU - Paramasivamy, Kartik
AU - Pany, Yi
AU - Rameshy, Navina
AU - Bringhursty, Jon
AU - Gupta, Indranil
AU - Campbell, Roy H.
N1 - Funding Information:
We wish to thank the following people for their invaluable input towards this paper: Hassan Eslami, Wei Song, Xinyu Liu, Jagadish Venkatraman, and Jacob Maes. We would like to thank all contributors to Apache Samza with special mention to Chris Riccomini. Their ideas and hard work have been critical to the success of Samza. In addition, we would like to thank Swee Lim and Igor Perisic from LinkedIn for their support. The UIUC part of this work was supported in part by the following grants: NSF CNS 1319527, and AFOSR/AFRL FA8750-11-2-0084.
PY - 2017/8/1
Y1 - 2017/8/1
N2 - Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.
AB - Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.
UR - http://www.scopus.com/inward/record.url?scp=85036620015&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85036620015&partnerID=8YFLogxK
U2 - 10.14778/3137765.3137770
DO - 10.14778/3137765.3137770
M3 - Conference article
AN - SCOPUS:85036620015
VL - 10
SP - 1634
EP - 1645
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
SN - 2150-8097
IS - 12
Y2 - 28 August 2017 through 1 September 2017
ER -