Samza: Stateful scalable stream processing at linkedin

Shadi A. Noghabi, Kartik Paramasivamy, Yi Pany, Navina Rameshy, Jon Bringhursty, Indranil Gupta, Roy H. Campbell

Research output: Contribution to journalConference articlepeer-review

Abstract

Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.

Original languageEnglish (US)
Pages (from-to)1634-1645
Number of pages12
JournalProceedings of the VLDB Endowment
Volume10
Issue number12
DOIs
StatePublished - Aug 1 2017
Event43rd International Conference on Very Large Data Bases, VLDB 2017 - Munich, Germany
Duration: Aug 28 2017Sep 1 2017

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science

Fingerprint

Dive into the research topics of 'Samza: Stateful scalable stream processing at linkedin'. Together they form a unique fingerprint.

Cite this