TY - GEN
T1 - Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems
T2 - 18th European Conference on Computer Systems, EuroSys 2023
AU - Tang, Lilia
AU - Bhandari, Chaitanya
AU - Zhang, Yongle
AU - Karanika, Anna
AU - Ji, Shuyang
AU - Gupta, Indranil
AU - Xu, Tianyin
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/5/8
Y1 - 2023/5/8
N2 - Modern cloud systems are orchestrations of independent and interacting (sub-)systems, each specializing in an important service (e.g., data processing, storage, or resource management). Hence, cloud system reliability is affected not only by the reliability of each individual system, but also by the interplay between these systems. We observe that many recent production incidents in cloud systems manifest through interactions across system boundaries. However, there is no systematic understanding of this emerging failure mode, which we term cross-system interaction failures (or CSI failures). This gap hinders the development of better design and integration practices and of new tooling. In this paper, we discuss cross-system interaction failures based on analyses of (1) 11 CSI-failure-induced cloud incidents at Google, Azure, and AWS, and (2) 120 CSI failure cases in seven widely co-deployed open-source systems. We focus on understanding discrepancies between interacting systems as the root causes of CSI failures; such failures cannot be understood by analyzing a single system in isolation. This paper draws attention to this emerging failure mode, provides a comprehensive understanding of CSI failure patterns, and discusses potential approaches for mitigation. We advocate for cross-system testing and verification and demonstrate its potential by cross-testing the Spark-Hive data plane, which exposed 15 new discrepancies.
KW - Cross-system interaction
KW - cloud system
KW - failure study
KW - root cause analysis
UR - http://www.scopus.com/inward/record.url?scp=85160205311&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85160205311&partnerID=8YFLogxK
U2 - 10.1145/3552326.3587448
DO - 10.1145/3552326.3587448
M3 - Conference contribution
AN - SCOPUS:85160205311
T3 - Proceedings of the 18th European Conference on Computer Systems, EuroSys 2023
SP - 433
EP - 451
BT - Proceedings of the 18th European Conference on Computer Systems, EuroSys 2023
PB - Association for Computing Machinery
Y2 - 8 May 2023 through 12 May 2023
ER -