TY - CONF
T1 - TraceWeaver
T2 - 2024 ACM SIGCOMM Conference, ACM SIGCOMM 2024
AU - Ashok, Sachin
AU - Harsh, Vipul
AU - Godfrey, Brighten
AU - Mittal, Radhika
AU - Parthasarathy, Srinivasan
AU - Shwartz, Larisa
N1 - We thank Omid Azizi, Aurojit Panda, Sambhav Satija, Sangeetha Abdu Jyothi, Devikrishna Radhakrishnan, and the Systems and Networking students and faculty at UIUC for their helpful discussions. We also thank our shepherd, Xiaowei Yang, and the anonymous reviewers for their valuable feedback. This work was supported by IBM and by the NSF under award number 2312714.
PY - 2024/8/4
Y1 - 2024/8/4
N2 - Monitoring and debugging modern cloud-based applications is challenging since even a single API call can involve many interdependent distributed microservices. To provide observability for such complex systems, distributed tracing frameworks track request flow across the microservice call tree. However, such solutions require instrumenting every component of the distributed application to add and propagate tracing headers, which has slowed adoption. This paper explores whether we can trace requests without any application instrumentation, which we refer to as request trace reconstruction. To that end, we develop TraceWeaver, a system that incorporates readily available information from production settings (e.g., timestamps) and test environments (e.g., call graphs) to reconstruct request traces with usefully high accuracy. At the heart of TraceWeaver is a reconstruction algorithm that uses request-response timestamps to effectively prune the search space for mapping requests and applies statistical timing analysis techniques to reconstruct traces. Evaluation with (1) benchmark microservice applications and (2) a production microservice dataset demonstrates that TraceWeaver can achieve a high accuracy of ∼90% and can be meaningfully applied towards multiple use cases (e.g., finding slow services and A/B testing).
KW - distributed tracing
KW - graph analysis
KW - microservices
KW - non-intrusive
UR - http://www.scopus.com/inward/record.url?scp=85202300587&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85202300587&partnerID=8YFLogxK
U2 - 10.1145/3651890.3672254
DO - 10.1145/3651890.3672254
M3 - Conference contribution
AN - SCOPUS:85202300587
T3 - ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference
SP - 828
EP - 842
BT - ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference
PB - Association for Computing Machinery
Y2 - 4 August 2024 through 8 August 2024
ER -