TY - GEN
T1 - Enhancing Resilience in Distributed ML Inference Pipelines for Edge Computing
AU - Wu, Li
AU - Hanafy, Walid A.
AU - Souza, Abel
AU - Abdelzaher, Tarek
AU - Verma, Gunjan
AU - Shenoy, Prashant
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - As edge computing and sensing devices continue to proliferate, distributed machine learning (ML) inference pipelines are becoming popular for enabling low-latency, real-time decision-making at scale. However, the geographically dispersed and often resource-constrained nature of edge devices makes them susceptible to various failures, such as hardware malfunctions, network disruptions, and device overloading. These edge failures can significantly affect the performance and availability of inference pipelines and the sensing-to-decision-making loops they enable. In addition, the complexity of task dependencies amplifies the difficulty of maintaining performant and reliable ML operations. To address these challenges and minimize the impact of edge failures on inference pipelines, this paper presents several fault-tolerant approaches, including sensing redundancy, structural resilience, failover replication, and pipeline reconfiguration. For each approach, we explain the key techniques and highlight their effectiveness and tradeoffs. Finally, we discuss the challenges associated with these approaches and outline future directions.
UR - http://www.scopus.com/inward/record.url?scp=85214571896&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85214571896&partnerID=8YFLogxK
U2 - 10.1109/MILCOM61039.2024.10773652
DO - 10.1109/MILCOM61039.2024.10773652
M3 - Conference contribution
AN - SCOPUS:85214571896
T3 - Proceedings - IEEE Military Communications Conference MILCOM
BT - 2024 IEEE Military Communications Conference, MILCOM 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Military Communications Conference, MILCOM 2024
Y2 - 28 October 2024 through 1 November 2024
ER -