Abstract

In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.

Original language: English (US)
Title of host publication: Proceedings - 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-14
Number of pages: 14
ISBN (Electronic): 9798350341058
State: Published - 2024
Event: 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024 - Brisbane, Australia
Duration: Jun 24 2024 to Jun 27 2024

Conference

Conference: 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024
Country/Territory: Australia
City: Brisbane
Period: 6/24/24 to 6/27/24

Keywords

  • cloud
  • container orchestration
  • failure
  • failure analysis
  • fault injection
  • Kubernetes
  • mission-critical
  • resiliency

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'Mutiny! How Does Kubernetes Fail, and What Can We Do about It?'. Together they form a unique fingerprint.