TY - GEN
T1 - Characterization of operational failures from a business data processing SaaS platform
AU - Di Martino, Catello
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar K.
AU - Goel, Geetika
AU - Sarkar, Santonu
AU - Ganesan, Rajeshwari
N1 - Copyright 2014 Elsevier B.V., All rights reserved.
PY - 2014
Y1 - 2014
N2 - This paper characterizes operational failures of a production Custom Package Good Software-as-a-Service (SaaS) platform. Event logs collected over 283 days of in-field operation are used to characterize platform failures. The characterization is performed by estimating (i) common failure types of the platform, (ii) key factors impacting platform failures, (iii) the failure rate, and (iv) how user workload (files submitted for processing) affects the failure rate. The major findings are: (i) 34.1% of failures are caused by unexpected values in customers' data, (ii) nearly 33% of the failures are due to timeouts, and (iii) the failure rate increases as the workload intensity (transactions/second) increases, while there is no statistical evidence that it is influenced by the workload volume (size of users' data). Finally, the paper presents the lessons learned and how the findings and the implemented analysis tool allow platform developers to improve platform code, system settings, and customer management.
AB - This paper characterizes operational failures of a production Custom Package Good Software-as-a-Service (SaaS) platform. Event logs collected over 283 days of in-field operation are used to characterize platform failures. The characterization is performed by estimating (i) common failure types of the platform, (ii) key factors impacting platform failures, (iii) the failure rate, and (iv) how user workload (files submitted for processing) affects the failure rate. The major findings are: (i) 34.1% of failures are caused by unexpected values in customers' data, (ii) nearly 33% of the failures are due to timeouts, and (iii) the failure rate increases as the workload intensity (transactions/second) increases, while there is no statistical evidence that it is influenced by the workload volume (size of users' data). Finally, the paper presents the lessons learned and how the findings and the implemented analysis tool allow platform developers to improve platform code, system settings, and customer management.
KW - Cloud computing
KW - Failure analysis
KW - Logs
KW - Robustness
KW - SaaS
UR - http://www.scopus.com/inward/record.url?scp=84903636862&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84903636862&partnerID=8YFLogxK
U2 - 10.1145/2591062.2591172
DO - 10.1145/2591062.2591172
M3 - Conference contribution
AN - SCOPUS:84903636862
SN - 9781450327688
T3 - 36th International Conference on Software Engineering, ICSE Companion 2014 - Proceedings
SP - 195
EP - 204
BT - 36th International Conference on Software Engineering, ICSE Companion 2014 - Proceedings
PB - Association for Computing Machinery
T2 - 36th International Conference on Software Engineering, ICSE 2014
Y2 - 31 May 2014 through 7 June 2014
ER -