
Practical Considerations for Failure Resilient ML Systems at the Edge

  • Krishna Praneet Gudipaty
  • Walid A. Hanafy
  • Li Wu
  • Jeffrey Twigg
  • Benjamin M. Marlin
  • Jesse Milzman
  • Suhas Diggavi
  • Tarek Abdelzaher
  • Prashant Shenoy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Machine learning at the edge is increasingly embedded in our daily lives, supporting applications running on smartphones, wearables, and industrial IoT. Prior work has mainly focused on resource efficiency and latency optimization through innovations in compact model design and resource-management techniques to meet stringent performance targets. However, edge devices and networks are inherently subject to failures and performance fluctuations, requiring an emphasis on failure resilience, especially in resource-constrained edge environments. Although recent studies have proposed resource-aware mechanisms and failure-aware models to improve the resilience of machine learning systems at the edge, many overlook deployment overheads that impede the adoption of these approaches. In this paper, we highlight practical considerations that affect failure-detection and recovery times and analyze how these considerations shape system design. We outline future research directions to enable practical, failure-resilient machine-learning systems at the edge.

Original language: English (US)
Title of host publication: 2025 IEEE Military Communications Conference, MILCOM 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331502928
State: Published - 2025
Event: 2025 IEEE Military Communications Conference, MILCOM 2025 - Los Angeles, United States
Duration: Oct 6 2025 → Oct 10 2025

Publication series

Name: Proceedings - IEEE Military Communications Conference MILCOM
ISSN (Print): 2155-7578
ISSN (Electronic): 2155-7586

Conference

Conference: 2025 IEEE Military Communications Conference, MILCOM 2025
Country/Territory: United States
City: Los Angeles
Period: 10/6/25 → 10/10/25

Keywords

  • Edge Computing
  • Failure Resilience
  • ML Inference

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
