Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems

Lilia Tang, Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, Tianyin Xu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Modern cloud systems are orchestrations of independent and interacting (sub-)systems, each specializing in important services (e.g., data processing, storage, and resource management). Hence, cloud system reliability is affected not only by the reliability of each individual system, but also by the interplay between these systems. We observe that many recent production incidents of cloud systems manifest through interactions across system boundaries. However, there is a lack of systematic understanding of this emerging failure mode, which we term cross-system interaction failures (or CSI failures). This hinders the development of better design, integration practices, and new tooling. In this paper, we discuss cross-system interaction failures based on analyses of (1) 11 CSI-failure-induced cloud incidents of Google, Azure, and AWS, and (2) 120 CSI failure cases of seven widely co-deployed open-source systems. We focus on understanding discrepancies between interacting systems as the root causes of CSI failures, because CSI failures cannot be understood by analyzing a single system in isolation. This paper draws attention to this emerging failure mode, provides a comprehensive understanding of CSI failure patterns, and discusses potential approaches for mitigation. We advocate for cross-system testing and verification and demonstrate its potential by cross-testing the Spark-Hive data plane and exposing 15 new discrepancies.

Original language: English (US)
Title of host publication: Proceedings of the 18th European Conference on Computer Systems, EuroSys 2023
Publisher: Association for Computing Machinery
Pages: 433-451
Number of pages: 19
ISBN (Electronic): 9781450394871
State: Published - May 8 2023
Event: 18th European Conference on Computer Systems, EuroSys 2023 - Rome, Italy
Duration: May 8 2023 to May 12 2023

Publication series

Name: Proceedings of the 18th European Conference on Computer Systems, EuroSys 2023

Conference

Conference: 18th European Conference on Computer Systems, EuroSys 2023
Country/Territory: Italy
City: Rome
Period: 5/8/23 to 5/12/23

Keywords

  • Cross-system interaction
  • cloud system
  • failure study
  • root cause analysis

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture
