TY - JOUR
T1 - Reliability MicroKernel
T2 - Providing application-aware reliability in the OS
AU - Wang, Long
AU - Kalbarczyk, Zbigniew
AU - Gu, Weining
AU - Iyer, Ravishankar K.
N1 - Funding Information:
Manuscript received January 15, 2007; revised May 1, 2007; accepted June 3, 2007. This work was supported in part by NSF grants CNS-0406351 (Next-generation Software), CNS-05-24695, CNS-05-51665, and ACI-0121658 ITR/AP, the Gigascale Systems Research Center (GSRC/MARCO), and Motorola Corporation as part of Motorola Center. Associate Editor: Y. Dai.
PY - 2007/12
Y1 - 2007/12
N2 - This paper describes the Reliability MicroKernel (RMK) framework, a loadable kernel module (or a device driver) for providing application-aware reliability and dynamically configuring reliability mechanisms. Characteristics of application/system execution are exploited transparently through application-aware reliability techniques to achieve low-latency detection and low-overhead checkpointing. The RMK prototype is implemented in both Linux and Windows, and it supports detection of application/OS failures and transparent application checkpointing. Experimental results show that system hang detection and application hang detection, which exploit characteristics of application and system behavior, can achieve high coverage (100% observed in our experiments) with a low false positive rate. Moreover, the performance overhead of RMK and its detection/checkpointing mechanisms is small: 0.6% for application hang detection and 0.1% for transparent application checkpointing in the experiments.
AB - This paper describes the Reliability MicroKernel (RMK) framework, a loadable kernel module (or a device driver) for providing application-aware reliability and dynamically configuring reliability mechanisms. Characteristics of application/system execution are exploited transparently through application-aware reliability techniques to achieve low-latency detection and low-overhead checkpointing. The RMK prototype is implemented in both Linux and Windows, and it supports detection of application/OS failures and transparent application checkpointing. Experimental results show that system hang detection and application hang detection, which exploit characteristics of application and system behavior, can achieve high coverage (100% observed in our experiments) with a low false positive rate. Moreover, the performance overhead of RMK and its detection/checkpointing mechanisms is small: 0.6% for application hang detection and 0.1% for transparent application checkpointing in the experiments.
KW - Application aware reliability
KW - Checkpointing
KW - Error detection
KW - OS-level error detection
KW - System crash/hang detection
KW - Transparent application checkpointing
UR - http://www.scopus.com/inward/record.url?scp=36949008996&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36949008996&partnerID=8YFLogxK
U2 - 10.1109/TR.2007.909758
DO - 10.1109/TR.2007.909758
M3 - Article
AN - SCOPUS:36949008996
SN - 0018-9529
VL - 56
SP - 597
EP - 614
JO - IEEE Transactions on Reliability
JF - IEEE Transactions on Reliability
IS - 4
ER -