Analysis and diagnosis of SLA violations in a production SaaS cloud

Catello Di Martino, Santonu Sarkar, Rajeshwari Ganesan, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

Research output: Contribution to journal › Article › peer-review

Abstract

A software-as-a-service (SaaS) platform needs to provide its intended service in accordance with its stated service-level agreements (SLAs). While SLA violations in SaaS platforms have been reported, little work has been done to empirically characterize SaaS failures. In this paper, we study SLA violations of a production SaaS platform, diagnose their causes, uncover several critical failure modes, and suggest approaches to increase the availability of the platform as perceived by the end user. Our approach combines field failure data analysis (FFDA) and fault injection. Our study is based on 283 days of operational logs of the platform. During this time, the platform received business workload from 42 customers spread over 22 countries. We first developed a set of in-house FFDA tools to analyze the logs, and second implemented a fault injector that automatically injects several runtime errors into the application code, written in .NET/C#, and then collates the injection results. We summarize our findings as follows: first, system failures caused 93% of all SLA violations; second, our fault injector was able to recreate a few cases of bursts of SLA violations that could not be diagnosed from the logs; and third, the fault injection mechanism could recreate several error propagation paths leading to data corruptions that the failure data analysis could not reveal. Finally, the paper presents some system-level implications of this study and shows how the joint use of fault injection and log analysis may help improve the reliability of the measured platform.

Original language: English (US)
Article number: 7835304
Pages (from-to): 54-75
Number of pages: 22
Journal: IEEE Transactions on Reliability
Volume: 66
Issue number: 1
State: Published - Mar 2017

Keywords

  • Cloud computing
  • failure analysis
  • service-level agreement (SLA) violation

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Electrical and Electronic Engineering
