TY - GEN
T1 - FT-iSort
T2 - 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
AU - Li, Sihuan
AU - Li, Hongbo
AU - Liang, Xin
AU - Chen, Jieyang
AU - Giem, Elisabeth
AU - Ouyang, Kaiming
AU - Zhao, Kai
AU - Di, Sheng
AU - Cappello, Franck
AU - Chen, Zizhong
N1 - Funding Information:
The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 1619253. This work was also supported by National Science Foundation CCF 1513201 and National Key Research and Development Programs No. 2017YFB0202100. We acknowledge the computing resources provided on Bebop, which is operated by the Laboratory Computing Resource Center at Argonne National Laboratory. We thank the anonymous reviewers for their insightful suggestions.
Publisher Copyright:
© 2019 ACM.
PY - 2019/11/17
Y1 - 2019/11/17
N2 - Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than nave TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.
AB - Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than nave TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.
KW - Algorithm based fault tolerance
KW - Comparison errors
KW - Fault tolerant sorting
KW - Introsort
KW - Soft errors
UR - http://www.scopus.com/inward/record.url?scp=85076143760&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076143760&partnerID=8YFLogxK
U2 - 10.1145/3295500.3356195
DO - 10.1145/3295500.3356195
M3 - Conference contribution
AN - SCOPUS:85076143760
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2019
PB - IEEE Computer Society
Y2 - 17 November 2019 through 22 November 2019
ER -