TY - GEN
T1 - Pinpointing crash-consistency bugs in the HPC I/O Stack
T2 - 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021
AU - Sun, Jinghan
AU - Huang, Jian
AU - Snir, Marc
N1 - Funding Information:
We thank the anonymous reviewers for their helpful comments and feedback. We also thank Weiwei Jia for his help with the setup of the Lustre file system. This work was partially supported by NSF grants CCF-1763540, CNS-1850317, and CCF-1919044.
Publisher Copyright:
© 2021 IEEE Computer Society. All rights reserved.
PY - 2021/11/14
Y1 - 2021/11/14
AB - We present ParaCrash, a testing framework for studying crash recovery in a typical HPC I/O stack, and demonstrate its use by identifying 15 new crash-consistency bugs in various parallel file systems (PFS) and I/O libraries. ParaCrash uses a "golden version" approach to test the entire HPC I/O stack: storage state after recovery from a crash is correct if it matches the state that can be achieved by a partial execution with no crashes. It supports systematic testing of a multilayered I/O stack while properly identifying the layer responsible for the bugs.
KW - Crash consistency
KW - I/O library
KW - Parallel file systems
UR - http://www.scopus.com/inward/record.url?scp=85119998829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119998829&partnerID=8YFLogxK
U2 - 10.1145/3458817.3476144
DO - 10.1145/3458817.3476144
M3 - Conference contribution
AN - SCOPUS:85119998829
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2021
PB - IEEE Computer Society
Y2 - 14 November 2021 through 19 November 2021
ER -