Enhancing Resilience in Distributed ML Inference Pipelines for Edge Computing

Li Wu, Walid A. Hanafy, Abel Souza, Tarek Abdelzaher, Gunjan Verma, Prashant Shenoy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As edge computing and sensing devices continue to proliferate, distributed machine learning (ML) inference pipelines are becoming popular for enabling low-latency, real-time decision-making at scale. However, the geographically dispersed and often resource-constrained nature of edge devices makes them susceptible to various failures, such as hardware malfunctions, network disruptions, and device overloading. These edge failures can significantly affect the performance and availability of inference pipelines and the sensing-to-decision-making loops they enable. In addition, the complexity of task dependencies amplifies the difficulty of maintaining performant and reliable ML operations. To address these challenges and minimize the impact of edge failures on inference pipelines, this paper presents several fault-tolerant approaches, including sensing redundancy, structural resilience, failover replication, and pipeline reconfiguration. For each approach, we explain the key techniques and highlight their effectiveness and tradeoffs. Finally, we discuss the challenges associated with these approaches and outline future directions.

Original language: English (US)
Title of host publication: 2024 IEEE Military Communications Conference, MILCOM 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350374230
State: Published - 2024
Event: 2024 IEEE Military Communications Conference, MILCOM 2024 - Washington, United States
Duration: Oct 28 2024 – Nov 1 2024

Publication series

Name: Proceedings - IEEE Military Communications Conference MILCOM
ISSN (Print): 2155-7578
ISSN (Electronic): 2155-7586

Conference

Conference: 2024 IEEE Military Communications Conference, MILCOM 2024
Country/Territory: United States
City: Washington
Period: 10/28/24 – 11/1/24

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
