TY - GEN
T1 - Analysis and diagnosis of SLA violations in a production SaaS cloud
AU - Di Martino, Catello
AU - Chen, Daniel
AU - Goel, Geetika
AU - Ganesan, Rajeshwari
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/12/11
Y1 - 2014/12/11
AB - This paper investigates SLA violations of a production SaaS platform by means of joint use of field failure data analysis (FFDA) and fault injection. The objective of this study is to diagnose the causes of SLA violations, pinpoint critical failure modes under realistic error assumptions, and identify potential means to increase the user-perceived availability of the platform and assurance of SLA requirements. We base our study on 283 days of logs obtained during the production time of the platform, while it was employed to process business data received by 42 customers in 22 countries. In this paper, we develop a set of tools that include i) an FFDA toolset used to analyze the data extracted from the platform and from the operating system event logs and ii) a .NET/C++ injector able to automate the injection of specific runtime errors in the production code and the collection of results. Major findings include i) 93% of all service level agreement (SLA) violations were due to system failures, ii) there were a few cases of bursts of SLA violations that could not be diagnosed from the logs and were revealed by the performed injections, and iii) the error injection revealed several error propagation paths leading to data corruptions that could not be detected from the analysis of failure data.
KW - SLA violations
KW - SaaS
KW - empirical reliability
KW - fault injection
KW - hazard analysis
KW - log analysis
UR - http://www.scopus.com/inward/record.url?scp=84928673968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84928673968&partnerID=8YFLogxK
U2 - 10.1109/ISSRE.2014.26
DO - 10.1109/ISSRE.2014.26
M3 - Conference contribution
AN - SCOPUS:84928673968
T3 - Proceedings - International Symposium on Software Reliability Engineering, ISSRE
SP - 178
EP - 188
BT - Proceedings - IEEE 25th International Symposium on Software Reliability Engineering, ISSRE 2014
PB - IEEE Computer Society
T2 - 25th IEEE International Symposium on Software Reliability Engineering, ISSRE 2014
Y2 - 3 November 2014 through 6 November 2014
ER -