Understanding and finding crash-consistency bugs in parallel file systems

Jinghan Sun, Chen Wang, Jian Huang, Marc Snir

Research output: Contribution to conference (paper, peer-reviewed)

Abstract

Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash-consistency bugs have not been extensively studied, and corresponding bug-finding and testing tools are lacking. In this paper, we first conduct a thorough bug study of popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and interactions with local file systems. The study results drive our design of a scalable testing framework named PFSCHECK. PFSCHECK is easy to use and incurs low performance overhead: it automatically generates test cases for triggering potential crash-consistency bugs and traces essential file operations. PFSCHECK also scales to large HPC clusters, as it exploits parallelism to facilitate the verification of persistent storage states.

Original language: English (US)
State: Published - 2020
Event: 12th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2020, co-located with USENIX ATC 2020 - Virtual, Online
Duration: Jul 13 2020 → Jul 14 2020

Conference

Conference: 12th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2020, co-located with USENIX ATC 2020
City: Virtual, Online
Period: 7/13/20 → 7/14/20

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
