Samza: Stateful scalable stream processing at linkedin

Shadi A. Noghabi, Kartik Paramasivamy, Yi Pany, Navina Rameshy, Jon Bringhursty, Indranil Gupta, R H Campbell

Research output: Contribution to journalConference article

Abstract

Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.

Original languageEnglish (US)
Pages (from-to)1634-1645
Number of pages12
JournalProceedings of the VLDB Endowment
Volume10
Issue number12
StatePublished - Aug 1 2017
Event43rd International Conference on Very Large Data Bases, VLDB 2017 - Munich, Germany
Duration: Aug 28 2017Sep 1 2017

Fingerprint

Processing
Containers
Recovery
Scheduling
Throughput
Industry
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science(all)

Cite this

Noghabi, S. A., Paramasivamy, K., Pany, Y., Rameshy, N., Bringhursty, J., Gupta, I., & Campbell, R. H. (2017). Samza: Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment, 10(12), 1634-1645.

Samza : Stateful scalable stream processing at linkedin. / Noghabi, Shadi A.; Paramasivamy, Kartik; Pany, Yi; Rameshy, Navina; Bringhursty, Jon; Gupta, Indranil; Campbell, R H.

In: Proceedings of the VLDB Endowment, Vol. 10, No. 12, 01.08.2017, p. 1634-1645.

Research output: Contribution to journalConference article

Noghabi, SA, Paramasivamy, K, Pany, Y, Rameshy, N, Bringhursty, J, Gupta, I & Campbell, RH 2017, 'Samza: Stateful scalable stream processing at linkedin', Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1634-1645.
Noghabi SA, Paramasivamy K, Pany Y, Rameshy N, Bringhursty J, Gupta I et al. Samza: Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment. 2017 Aug 1;10(12):1634-1645.
Noghabi, Shadi A. ; Paramasivamy, Kartik ; Pany, Yi ; Rameshy, Navina ; Bringhursty, Jon ; Gupta, Indranil ; Campbell, R H. / Samza : Stateful scalable stream processing at linkedin. In: Proceedings of the VLDB Endowment. 2017 ; Vol. 10, No. 12. pp. 1634-1645.
@article{553df54ff6ec45cb9e5586f50ec92f1d,
title = "Samza: Stateful scalable stream processing at linkedin",
abstract = "Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.",
author = "Noghabi, {Shadi A.} and Kartik Paramasivamy and Yi Pany and Navina Rameshy and Jon Bringhursty and Indranil Gupta and Campbell, {R H}",
year = "2017",
month = "8",
day = "1",
language = "English (US)",
volume = "10",
pages = "1634--1645",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "12",

}

TY - JOUR

T1 - Samza

T2 - Stateful scalable stream processing at linkedin

AU - Noghabi, Shadi A.

AU - Paramasivamy, Kartik

AU - Pany, Yi

AU - Rameshy, Navina

AU - Bringhursty, Jon

AU - Gupta, Indranil

AU - Campbell, R H

PY - 2017/8/1

Y1 - 2017/8/1

N2 - Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.

AB - Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affnity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g. HDFS), without having to change the application code (unlike the popular Lambda- based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10; 000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Net ix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state effciently, improving latency and throughput by more than 100 × compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffc.

UR - http://www.scopus.com/inward/record.url?scp=85036620015&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85036620015&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85036620015

VL - 10

SP - 1634

EP - 1645

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 12

ER -