Latent error prediction and fault localization for microservice applications by learning from system trace logs

Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Dewei Liu, Qilin Xiang, Chuan He

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults can be located at runtime of the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach of latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two opensource microservice applications (one of them being the largest open-source microservice application to our best knowledge). The results indicate that MEPFL can achieve high accuracy in intraapplication prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.

Original languageEnglish (US)
Title of host publicationESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
EditorsSven Apel, Marlon Dumas, Alessandra Russo, Dietmar Pfahl
PublisherAssociation for Computing Machinery, Inc
Pages683-694
Number of pages12
ISBN (Electronic)9781450355728
DOIs
StatePublished - Aug 12 2019
Event27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019 - Tallinn, Estonia
Duration: Aug 26 2019Aug 30 2019

Publication series

NameESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Conference

Conference27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019
CountryEstonia
CityTallinn
Period8/26/198/30/19

Fingerprint

Containers
Testing

Keywords

  • Debugging
  • Error prediction
  • Fault localization
  • Machine learning
  • Microservices
  • Tracing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software

Cite this

Zhou, X., Peng, X., Xie, T., Sun, J., Ji, C., Liu, D., ... He, C. (2019). Latent error prediction and fault localization for microservice applications by learning from system trace logs. In S. Apel, M. Dumas, A. Russo, & D. Pfahl (Eds.), ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 683-694). (ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering). Association for Computing Machinery, Inc. https://doi.org/10.1145/3338906.3338961

Latent error prediction and fault localization for microservice applications by learning from system trace logs. / Zhou, Xiang; Peng, Xin; Xie, Tao; Sun, Jun; Ji, Chao; Liu, Dewei; Xiang, Qilin; He, Chuan.

ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ed. / Sven Apel; Marlon Dumas; Alessandra Russo; Dietmar Pfahl. Association for Computing Machinery, Inc, 2019. p. 683-694 (ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Zhou, X, Peng, X, Xie, T, Sun, J, Ji, C, Liu, D, Xiang, Q & He, C 2019, Latent error prediction and fault localization for microservice applications by learning from system trace logs. in S Apel, M Dumas, A Russo & D Pfahl (eds), ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Association for Computing Machinery, Inc, pp. 683-694, 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, Tallinn, Estonia, 8/26/19. https://doi.org/10.1145/3338906.3338961
Zhou X, Peng X, Xie T, Sun J, Ji C, Liu D et al. Latent error prediction and fault localization for microservice applications by learning from system trace logs. In Apel S, Dumas M, Russo A, Pfahl D, editors, ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Association for Computing Machinery, Inc. 2019. p. 683-694. (ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering). https://doi.org/10.1145/3338906.3338961
Zhou, Xiang ; Peng, Xin ; Xie, Tao ; Sun, Jun ; Ji, Chao ; Liu, Dewei ; Xiang, Qilin ; He, Chuan. / Latent error prediction and fault localization for microservice applications by learning from system trace logs. ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. editor / Sven Apel ; Marlon Dumas ; Alessandra Russo ; Dietmar Pfahl. Association for Computing Machinery, Inc, 2019. pp. 683-694 (ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering).
@inproceedings{fc67a9c6bf05446c88adbf8c52063f91,
title = "Latent error prediction and fault localization for microservice applications by learning from system trace logs",
abstract = "In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults can be located at runtime of the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach of latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two opensource microservice applications (one of them being the largest open-source microservice application to our best knowledge). The results indicate that MEPFL can achieve high accuracy in intraapplication prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.",
keywords = "Debugging, Error prediction, Fault localization, Machine learning, Microservices, Tracing",
author = "Xiang Zhou and Xin Peng and Tao Xie and Jun Sun and Chao Ji and Dewei Liu and Qilin Xiang and Chuan He",
year = "2019",
month = "8",
day = "12",
doi = "10.1145/3338906.3338961",
language = "English (US)",
series = "ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering",
publisher = "Association for Computing Machinery, Inc",
pages = "683--694",
editor = "Sven Apel and Marlon Dumas and Alessandra Russo and Dietmar Pfahl",
booktitle = "ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering",

}

TY - GEN

T1 - Latent error prediction and fault localization for microservice applications by learning from system trace logs

AU - Zhou, Xiang

AU - Peng, Xin

AU - Xie, Tao

AU - Sun, Jun

AU - Ji, Chao

AU - Liu, Dewei

AU - Xiang, Qilin

AU - He, Chuan

PY - 2019/8/12

Y1 - 2019/8/12

N2 - In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults can be located at runtime of the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach of latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two opensource microservice applications (one of them being the largest open-source microservice application to our best knowledge). The results indicate that MEPFL can achieve high accuracy in intraapplication prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.

AB - In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults can be located at runtime of the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach of latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two opensource microservice applications (one of them being the largest open-source microservice application to our best knowledge). The results indicate that MEPFL can achieve high accuracy in intraapplication prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.

KW - Debugging

KW - Error prediction

KW - Fault localization

KW - Machine learning

KW - Microservices

KW - Tracing

UR - http://www.scopus.com/inward/record.url?scp=85071904016&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85071904016&partnerID=8YFLogxK

U2 - 10.1145/3338906.3338961

DO - 10.1145/3338906.3338961

M3 - Conference contribution

AN - SCOPUS:85071904016

T3 - ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

SP - 683

EP - 694

BT - ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

A2 - Apel, Sven

A2 - Dumas, Marlon

A2 - Russo, Alessandra

A2 - Pfahl, Dietmar

PB - Association for Computing Machinery, Inc

ER -