TY - GEN
T1 - A 'cool' way of improving the reliability of HPC machines
AU - Sarood, Osman
AU - Meneses, Esteban
AU - Kale, Laxmikant V.
N1 - Copyright:
Copyright 2014 Elsevier B.V., All rights reserved.
PY - 2013
Y1 - 2013
N2 - Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
AB - Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
KW - Actionable modeling
KW - Checkpointing restart
KW - Energy minimization
KW - Fault tolerance
KW - Load balancing
KW - Temperature capping
KW - Temperature thresholds
KW - Thermal control
UR - http://www.scopus.com/inward/record.url?scp=84899668006&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899668006&partnerID=8YFLogxK
U2 - 10.1145/2503210.2503228
DO - 10.1145/2503210.2503228
M3 - Conference contribution
AN - SCOPUS:84899668006
SN - 9781450323789
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2013
PB - IEEE Computer Society
T2 - 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Y2 - 17 November 2013 through 22 November 2013
ER -