TY - GEN
T1 - Practical Considerations for Failure Resilient ML Systems at the Edge
AU - Gudipaty, Krishna Praneet
AU - Hanafy, Walid A.
AU - Wu, Li
AU - Twigg, Jeffrey
AU - Marlin, Benjamin M.
AU - Milzman, Jesse
AU - Diggavi, Suhas
AU - Abdelzaher, Tarek
AU - Shenoy, Prashant
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Machine learning at the edge is increasingly embedded in our daily lives, supporting applications running on smartphones, wearables, and industrial IoT. Prior work has mainly focused on resource efficiency and latency optimization through innovations in compact model design and resource-management techniques to meet stringent performance targets. However, edge devices and networks are inherently subject to failures and performance fluctuations, requiring an emphasis on failure resilience, especially in resource-constrained edge environments. Although recent studies have proposed resource-aware mechanisms and failure-aware models to improve the resilience of machine learning systems at the edge, many overlook deployment overheads that impede the adoption of these approaches. In this paper, we highlight practical considerations that affect failure-detection and recovery times and analyze how these considerations shape system design. We outline future research directions to enable practical, failure-resilient machine-learning systems at the edge.
KW - Edge Computing
KW - Failure Resilience
KW - ML Inference
UR - https://www.scopus.com/pages/publications/105031779039
UR - https://www.scopus.com/pages/publications/105031779039#tab=citedBy
U2 - 10.1109/MILCOM64451.2025.11309995
DO - 10.1109/MILCOM64451.2025.11309995
M3 - Conference contribution
AN - SCOPUS:105031779039
T3 - Proceedings - IEEE Military Communications Conference MILCOM
BT - 2025 IEEE Military Communications Conference, MILCOM 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE Military Communications Conference, MILCOM 2025
Y2 - 6 October 2025 through 10 October 2025
ER -