TY - GEN
T1 - Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
AU - Chen, Yinfang
AU - Xie, Huaibing
AU - Ma, Minghua
AU - Kang, Yu
AU - Gao, Xin
AU - Shi, Liu
AU - Cao, Yunjie
AU - Gao, Xuedong
AU - Fan, Hao
AU - Wen, Ming
AU - Zeng, Jun
AU - Ghosh, Supriyo
AU - Zhang, Xuchao
AU - Zhang, Chaoyun
AU - Lin, Qingwei
AU - Rajmohan, Saravan
AU - Zhang, Dongmei
AU - Xu, Tianyin
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/4/22
Y1 - 2024/4/22
AB - Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative on-call system empowered by the large language model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft. Our evaluation demonstrates that RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been successfully in use at Microsoft for over four years.
KW - Cloud Systems
KW - Large Language Models
KW - Root Cause Analysis
UR - http://www.scopus.com/inward/record.url?scp=85191980033&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191980033&partnerID=8YFLogxK
U2 - 10.1145/3627703.3629553
DO - 10.1145/3627703.3629553
M3 - Conference contribution
AN - SCOPUS:85191980033
T3 - EuroSys 2024 - Proceedings of the 2024 European Conference on Computer Systems
SP - 674
EP - 688
BT - EuroSys 2024 - Proceedings of the 2024 European Conference on Computer Systems
PB - Association for Computing Machinery
T2 - 19th European Conference on Computer Systems, EuroSys 2024
Y2 - 22 April 2024 through 25 April 2024
ER -