TY - GEN
T1 - Healing online service systems via mining historical issue repositories
AU - Ding, Rui
AU - Fu, Qiang
AU - Lou, Jian Guang
AU - Lin, Qingwei
AU - Zhang, Dongmei
AU - Shen, Jiajun
AU - Xie, Tao
PY - 2012
Y1 - 2012
N2 - Online service systems have been increasingly popular and important nowadays, with an increasing demand on the availability of services provided by these systems, while significant efforts have been made to strive for keeping services up continuously. To assure the user-perceived availability of a service, reducing the Mean Time To Restore (MTTR) of the service remains a very important step. To reduce the MTTR, a common practice is to restore the service by identifying and applying an appropriate healing action (i.e., a temporary workaround action such as rebooting a SQL machine). However, manually identifying an appropriate healing action for a given new issue (such as service down) is typically time consuming and error prone. To address this challenge, in this paper, we present an automated mining-based approach for suggesting an appropriate healing action for a given new issue. Our approach generates signatures of an issue from its corresponding transaction logs and then retrieves historical issues from a historical issue repository. Finally, our approach suggests an appropriate healing action by adapting healing actions for the retrieved historical issues. We have implemented a healing-suggestion system for our approach and applied it to a real-world online service system that serves millions of online customers globally. The studies on 77 incidents (severe issues) over three months showed that our approach can effectively provide appropriate healing actions to reduce the MTTR of the service.
AB - Online service systems have been increasingly popular and important nowadays, with an increasing demand on the availability of services provided by these systems, while significant efforts have been made to strive for keeping services up continuously. To assure the user-perceived availability of a service, reducing the Mean Time To Restore (MTTR) of the service remains a very important step. To reduce the MTTR, a common practice is to restore the service by identifying and applying an appropriate healing action (i.e., a temporary workaround action such as rebooting a SQL machine). However, manually identifying an appropriate healing action for a given new issue (such as service down) is typically time consuming and error prone. To address this challenge, in this paper, we present an automated mining-based approach for suggesting an appropriate healing action for a given new issue. Our approach generates signatures of an issue from its corresponding transaction logs and then retrieves historical issues from a historical issue repository. Finally, our approach suggests an appropriate healing action by adapting healing actions for the retrieved historical issues. We have implemented a healing-suggestion system for our approach and applied it to a real-world online service system that serves millions of online customers globally. The studies on 77 incidents (severe issues) over three months showed that our approach can effectively provide appropriate healing actions to reduce the MTTR of the service.
KW - Healing action
KW - Online service system
UR - http://www.scopus.com/inward/record.url?scp=84866948358&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866948358&partnerID=8YFLogxK
U2 - 10.1145/2351676.2351735
DO - 10.1145/2351676.2351735
M3 - Conference contribution
AN - SCOPUS:84866948358
SN - 9781450312042
T3 - 2012 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012 - Proceedings
SP - 318
EP - 321
BT - 2012 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012 - Proceedings
T2 - 2012 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012
Y2 - 3 September 2012 through 7 September 2012
ER -