TY - GEN
T1 - Mutiny! How Does Kubernetes Fail, and What Can We Do about It?
AU - Barletta, Marco
AU - Cinque, Marcello
AU - Di Martino, Catello
AU - Kalbarczyk, Zbigniew T.
AU - Iyer, Ravishankar K.
N1 - We thank the reviewers, S. Cui, H. Qiu, H. Sreejith, A. Patke, P. Cao, J. Applequist, and K. Atchley for the insightful comments on the early drafts. We acknowledge the early participation of Larisa Shwartz (IBM) and Saurabh Jha (IBM) in the conceptualization of fault injection methods for Kubernetes; and Chandra Narayanaswami (IBM) for his continued insights and support on related system issues. This work is partially supported by the National Science Foundation (NSF) under grant No. 2029049; by the IBM-ILLINOIS Discovery Accelerator Institute (IIDAI); a gift from Nokia Bell Labs Core Research; and by the Italian Ministry of Enterprises and Made in Italy (MIMIT) under the GENIO Project (CUP B69J23005770005). In memory of Fabio Barletta.
PY - 2024
Y1 - 2024
N2 - In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
AB - In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
KW - cloud
KW - container orchestration
KW - failure
KW - failure analysis
KW - fault injection
KW - Kubernetes
KW - mission-critical
KW - resiliency
UR - http://www.scopus.com/inward/record.url?scp=85203820273&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203820273&partnerID=8YFLogxK
U2 - 10.1109/DSN58291.2024.00016
DO - 10.1109/DSN58291.2024.00016
M3 - Conference contribution
AN - SCOPUS:85203820273
T3 - Proceedings - 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024
SP - 1
EP - 14
BT - Proceedings - 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024
Y2 - 24 June 2024 through 27 June 2024
ER -