Software analytics for incident management of online services: An experience report

Jian Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, Tao Xie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

As online services become more and more popular, incident management has become a critical task that aims to minimize service downtime and to ensure high quality of the provided services. In practice, incident management is conducted by analyzing a huge amount of monitoring data collected at service runtime. Such data-driven incident management faces several significant challenges, such as large data scale, a complex problem space, and incomplete knowledge. To address these challenges, we carried out a two-year software-analytics research effort in which we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS), targeting real scenarios in a large-scale online service of Microsoft. SAS has been deployed to worldwide product datacenters and is widely used by on-call engineers for incident management. This paper shares our experience in using software analytics to solve engineers' pain points in incident management, the data-analysis techniques we developed, and the lessons learned from the process of research development and technology transfer.

Original language: English (US)
Title of host publication: 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings
Pages: 475-485
Number of pages: 11
DOI: https://doi.org/10.1109/ASE.2013.6693105
State: Published - Dec 1 2013
Event: 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Palo Alto, CA, United States
Duration: Nov 11 2013 → Nov 15 2013

Publication series

Name: 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings

Other

Other: 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013
Country: United States
City: Palo Alto, CA
Period: 11/11/13 → 11/15/13

Keywords

  • Online service
  • incident management
  • service incident diagnosis

ASJC Scopus subject areas

  • Software

Cite this

Lou, J. G., Lin, Q., Ding, R., Fu, Q., Zhang, D., & Xie, T. (2013). Software analytics for incident management of online services: An experience report. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings (pp. 475-485). [6693105] (2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings). https://doi.org/10.1109/ASE.2013.6693105