Abstract
Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.
Original language | English (US) |
---|---|
Pages (from-to) | 1634-1645 |
Number of pages | 12 |
Journal | Proceedings of the VLDB Endowment |
Volume | 10 |
Issue number | 12 |
DOIs | |
State | Published - Aug 1 2017 |
Event | 43rd International Conference on Very Large Data Bases, VLDB 2017 - Munich, Germany Duration: Aug 28 2017 → Sep 1 2017 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- General Computer Science