TY - GEN
T1 - Analysis of the tradeoffs between energy and run time for multilevel checkpointing
AU - Balaprakash, Prasanna
AU - Gomez, Leonardo A. Bautista
AU - Bouguerra, Mohamed Slim
AU - Wild, Stefan M.
AU - Cappello, Franck
AU - Hovland, Paul D.
N1 - Funding Information:
This work was supported by the SciDAC and X-Stack activities within the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under contract number DE-AC02-06CH11357.
Publisher Copyright:
© Springer International Publishing Switzerland 2015.
PY - 2015
Y1 - 2015
AB - In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger, offering improved computational science throughput. Nevertheless, with an increase in the number of systems’ components and their interactions, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme-scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that energy consumed by FTI is low and the tradeoff between the run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.
UR - http://www.scopus.com/inward/record.url?scp=84942521141&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84942521141&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-17248-4_13
DO - 10.1007/978-3-319-17248-4_13
M3 - Conference contribution
AN - SCOPUS:84942521141
SN - 9783319172477
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 249
EP - 263
BT - High Performance Computing Systems
A2 - Hammond, Simon D.
A2 - Jarvis, Stephen A.
A2 - Wright, Steven A.
PB - Springer
T2 - 5th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, PMBS 2014
Y2 - 16 November 2014
ER -