TY - GEN
T1 - Software analytics for incident management of online services
T2 - 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013
AU - Lou, Jian Guang
AU - Lin, Qingwei
AU - Ding, Rui
AU - Fu, Qiang
AU - Zhang, Dongmei
AU - Xie, Tao
N1 - Copyright 2014 Elsevier B.V., All rights reserved.
PY - 2013
Y1 - 2013
AB - As online services grow in popularity, incident management has become a critical task that aims to minimize service downtime and to ensure high quality of the provided services. In practice, incident management is conducted by analyzing a huge amount of monitoring data collected at runtime of a service. Such data-driven incident management faces several significant challenges, such as large data scale, a complex problem space, and incomplete knowledge. To address these challenges, we carried out a two-year software-analytics research project in which we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS), targeting real scenarios in a large-scale online service of Microsoft. SAS has been deployed to worldwide product datacenters and is widely used by on-call engineers for incident management. This paper shares our experience in using software analytics to solve engineers' pain points in incident management, the data-analysis techniques developed, and the lessons learned from the process of research development and technology transfer.
KW - Online service
KW - incident management
KW - service incident diagnosis
UR - http://www.scopus.com/inward/record.url?scp=84893540464&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84893540464&partnerID=8YFLogxK
U2 - 10.1109/ASE.2013.6693105
DO - 10.1109/ASE.2013.6693105
M3 - Conference contribution
AN - SCOPUS:84893540464
SN - 9781479902156
T3 - 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings
SP - 475
EP - 485
BT - 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings
Y2 - 11 November 2013 through 15 November 2013
ER -