Application and system-level software fault tolerance through full system restarts

Fardin Abdi, Rohan Tabish, Matthias Rungger, Majid Zamani, Marco Caccamo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Because of growing performance requirements, embedded systems are becoming increasingly complex, yet they are also expected to be reliable. Guaranteeing reliability in complex systems is very challenging. Consequently, there is a substantial need for designs that enable the use of unverified components, such as a real-time operating system (RTOS), without requiring their correctness to guarantee safety. In this work, we propose a novel approach to designing a controller that enables the system to restart and remain safe during and after the restart. Complementing this controller with a switching logic allows the system to use a complex, unverified controller to drive the system as long as it does not jeopardize safety. Such a design also tolerates faults that occur in the underlying software layers, such as the RTOS and middleware, and recovers from them through system-level restarts that reinitialize the software (middleware, RTOS, and applications) from read-only storage. Our approach is implementable using one commercial off-the-shelf (COTS) processing unit. To demonstrate the efficacy of our solution, we fully implement a controller for a 3-degree-of-freedom (3DOF) helicopter. We test the system by injecting various types of faults into the applications and RTOS and verify that the system remains safe.
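
The switching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the one-dimensional plant, the constants `X_LIMIT`, `DT`, and `U_MAX`, the 0.8 margin, and the function names are all hypothetical, and the sketch does not model the restart interval that the paper's controller is designed to tolerate. It only shows the general pattern of accepting an unverified controller's command when a simple check passes and falling back to a simple controller otherwise.

```python
from typing import Callable

# All constants below are illustrative assumptions, not values from the paper.
X_LIMIT = 1.0   # safety bound: the state must satisfy |x| <= X_LIMIT
DT = 0.01       # control period in seconds
U_MAX = 5.0     # actuator saturation


def safety_controller(x: float) -> float:
    """Simple fallback controller: push the state back toward zero."""
    return max(-U_MAX, min(U_MAX, -2.0 * x))


def is_recoverable(x: float, margin: float = 0.8) -> bool:
    """Conservative inner region from which the fallback can keep |x| bounded."""
    return abs(x) <= margin * X_LIMIT


def select_command(x: float, complex_controller: Callable[[float], float]) -> float:
    """Use the unverified controller's command only if it passes the checks."""
    try:
        u = complex_controller(x)
    except Exception:
        # The unverified controller crashed: fall back.
        return safety_controller(x)
    # Reject non-numeric, NaN, or out-of-range commands.
    if not isinstance(u, (int, float)) or u != u or abs(u) > U_MAX:
        return safety_controller(x)
    # Toy plant model x' = x + u * DT: accept the command only if the
    # predicted next state stays inside the recoverable region.
    x_next = x + u * DT
    if is_recoverable(x_next):
        return u
    return safety_controller(x)
```

In this toy setting, a faulty controller that always commands `U_MAX` is allowed to drive the state only up to the recoverable margin, after which the fallback takes over. A real design would replace the margin check with a verified safety condition for the actual plant.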

Original language: English (US)
Title of host publication: Proceedings - 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems, ICCPS 2017 (part of CPS Week)
Publisher: Association for Computing Machinery
Pages: 197-206
Number of pages: 10
ISBN (Electronic): 9781450349659
State: Published - Apr 18 2017
Event: 8th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS 2017 - Pittsburgh, United States
Duration: Apr 18 2017 to Apr 20 2017

Keywords

  • Cyber-physical systems
  • Embedded systems
  • Fault-recovery
  • Fault-tolerance
  • Reliability
  • Runtime restart

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
