Analyzing the performance and accuracy of lossy checkpointing on sub-iteration of NWChem

Tasmia Reza, Kristopher Keipert, Sheng Di, Xin Liang, Jon Calhoun, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Future exascale systems are expected to be characterized by more frequent failures than current petascale systems. This places increased importance on the application minimizing the amount of time wasted due to recomputation when recovering from a checkpoint. Typically, HPC applications checkpoint at iteration boundaries. However, for applications that have a high per-iteration cost, checkpointing inside the iteration limits the amount of recomputation. This paper analyzes the performance and accuracy of using lossy compressed checkpointing in the computational chemistry application NWChem. Our results indicate that lossy compression is an effective tool for reducing the sub-iteration checkpoint size. Moreover, compression error tolerances that yield acceptable deviation in accuracy and iteration count are quantified.

Original language: English (US)
Title of host publication: Proceedings of DRBSD-5 2019
Subtitle of host publication: 5th International Workshop on Data Analysis and Reduction for Big Scientific Data - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 23-27
Number of pages: 5
ISBN (Electronic): 9781728160177
State: Published - Nov 2019
Externally published: Yes
Event: 5th IEEE/ACM International Workshop on Data Analysis and Reduction for Big Scientific Data, DRBSD-5 2019 - Denver, United States
Duration: Nov 17 2019 → …

Publication series

Name: Proceedings of DRBSD-5 2019: 5th International Workshop on Data Analysis and Reduction for Big Scientific Data - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 5th IEEE/ACM International Workshop on Data Analysis and Reduction for Big Scientific Data, DRBSD-5 2019
Country/Territory: United States
City: Denver
Period: 11/17/19 → …

Keywords

  • Checkpoint-restart
  • Coupled-cluster singles and doubles
  • Lossy data compression
  • NWChem

ASJC Scopus subject areas

  • Media Technology
  • Artificial Intelligence
  • Information Systems
  • Information Systems and Management
  • Statistics, Probability and Uncertainty
