Analysis and diagnosis of SLA violations in a production SaaS cloud

Catello Di Martino, Daniel Chen, Geetika Goel, Rajeshwari Ganesan, Zbigniew T Kalbarczyk, Ravishankar K Iyer

Research output: Contribution to journal (Conference article)

Abstract

This paper investigates SLA violations of a production SaaS platform by means of joint use of field failure data analysis (FFDA) and fault injection. The objective of this study is to diagnose the causes of SLA violations, pinpoint critical failure modes under realistic error assumptions and identify potential means to increase the user-perceived availability of the platform and assurance of SLA requirements. We base our study on 283 days of logs obtained during the production time of the platform, while it was employed to process business data received by 42 customers in 22 countries. In this paper, we develop a set of tools that include i) an FFDA toolset used to analyze the data extracted from the platform and by the operating system event logs and ii) a .NET/C++ injector able to automate the injection of specific runtime errors in the production code and the collection of results. Major findings include i) 93% of all service level agreement (SLA) violations were due to system failures, ii) there were a few cases of bursts of SLA violations that could not be diagnosed from the logs and were revealed from the performed injections, and iii) the error injection revealed several error propagation paths leading to data corruptions that could not be detected from the analysis of failure data.

Original language: English (US)
Article number: 6982625
Pages (from-to): 178-188
Number of pages: 11
Journal: Proceedings - International Symposium on Software Reliability Engineering, ISSRE
DOI: 10.1109/ISSRE.2014.26
State: Published - Dec 11 2014
Event: 25th IEEE International Symposium on Software Reliability Engineering, ISSRE 2014 - Naples, Italy
Duration: Nov 3 2014 - Nov 6 2014

Fingerprint

Production platforms
Failure modes
Availability
Industry

Keywords

  • SLA violations
  • SaaS
  • empirical reliability
  • fault injection
  • hazard analysis
  • log analysis

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality

Cite this

Analysis and diagnosis of SLA violations in a production SaaS cloud. / Di Martino, Catello; Chen, Daniel; Goel, Geetika; Ganesan, Rajeshwari; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K.

In: Proceedings - International Symposium on Software Reliability Engineering, ISSRE, 11.12.2014, p. 178-188.

@article{6685269847d14d4b8118c1d3efe1c41d,
title = "Analysis and diagnosis of {SLA} violations in a production {SaaS} cloud",
abstract = "This paper investigates SLA violations of a production SaaS platform by means of joint use of field failure data analysis (FFDA) and fault injection. The objective of this study is to diagnose the causes of SLA violations, pinpoint critical failure modes under realistic error assumptions and identify potential means to increase the user-perceived availability of the platform and assurance of SLA requirements. We base our study on 283 days of logs obtained during the production time of the platform, while it was employed to process business data received by 42 customers in 22 countries. In this paper, we develop a set of tools that include i) an FFDA toolset used to analyze the data extracted from the platform and by the operating system event logs and ii) a .NET/C++ injector able to automate the injection of specific runtime errors in the production code and the collection of results. Major findings include i) 93{\%} of all service level agreement (SLA) violations were due to system failures, ii) there were a few cases of bursts of SLA violations that could not be diagnosed from the logs and were revealed from the performed injections, and iii) the error injection revealed several error propagation paths leading to data corruptions that could not be detected from the analysis of failure data.",
keywords = "SLA violations, SaaS, empirical reliability, fault injection, hazard analysis, log analysis",
author = "{Di Martino}, Catello and Daniel Chen and Geetika Goel and Rajeshwari Ganesan and Kalbarczyk, {Zbigniew T} and Iyer, {Ravishankar K}",
year = "2014",
month = "12",
day = "11",
doi = "10.1109/ISSRE.2014.26",
language = "English (US)",
pages = "178--188",
journal = "Proceedings of the International Symposium on Software Reliability Engineering, ISSRE",
issn = "1071-9458",

}

TY - JOUR

T1 - Analysis and diagnosis of SLA violations in a production SaaS cloud

AU - Di Martino, Catello

AU - Chen, Daniel

AU - Goel, Geetika

AU - Ganesan, Rajeshwari

AU - Kalbarczyk, Zbigniew T

AU - Iyer, Ravishankar K

PY - 2014/12/11

Y1 - 2014/12/11

N2 - This paper investigates SLA violations of a production SaaS platform by means of joint use of field failure data analysis (FFDA) and fault injection. The objective of this study is to diagnose the causes of SLA violations, pinpoint critical failure modes under realistic error assumptions and identify potential means to increase the user-perceived availability of the platform and assurance of SLA requirements. We base our study on 283 days of logs obtained during the production time of the platform, while it was employed to process business data received by 42 customers in 22 countries. In this paper, we develop a set of tools that include i) an FFDA toolset used to analyze the data extracted from the platform and by the operating system event logs and ii) a .NET/C++ injector able to automate the injection of specific runtime errors in the production code and the collection of results. Major findings include i) 93% of all service level agreement (SLA) violations were due to system failures, ii) there were a few cases of bursts of SLA violations that could not be diagnosed from the logs and were revealed from the performed injections, and iii) the error injection revealed several error propagation paths leading to data corruptions that could not be detected from the analysis of failure data.

KW - SLA violations

KW - SaaS

KW - empirical reliability

KW - fault injection

KW - hazard analysis

KW - log analysis

UR - http://www.scopus.com/inward/record.url?scp=84928673968&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84928673968&partnerID=8YFLogxK

U2 - 10.1109/ISSRE.2014.26

DO - 10.1109/ISSRE.2014.26

M3 - Conference article

AN - SCOPUS:84928673968

SP - 178

EP - 188

JO - Proceedings of the International Symposium on Software Reliability Engineering, ISSRE

JF - Proceedings of the International Symposium on Software Reliability Engineering, ISSRE

SN - 1071-9458

M1 - 6982625

ER -