TY - GEN
T1 - Robust non-intrusive record-replay with processor extraction
AU - Gioachin, Filippo
AU - Zheng, Gengbin
AU - Kalé, Laxmikant V.
PY - 2010
Y1 - 2010
N2 - With the advent of increasingly large parallel machines, debugging is becoming more and more challenging. In particular, applications at this scale tend to behave non-deterministically, leading to race condition bugs. Furthermore, gaining access to these large machines for long debugging sessions is generally infeasible. In this paper, we present a three-step algorithm to perform what we call "processor extraction": a procedure to record the execution of a set of processors from a parallel application and replay any of them in a controlled environment. Our technique generates very low interference in the recorded program thanks to the separation between non-determinism elimination and detailed processor recording. To improve robustness and accuracy, we further augmented our algorithm with a self-correction mechanism.
AB - With the advent of increasingly large parallel machines, debugging is becoming more and more challenging. In particular, applications at this scale tend to behave non-deterministically, leading to race condition bugs. Furthermore, gaining access to these large machines for long debugging sessions is generally infeasible. In this paper, we present a three-step algorithm to perform what we call "processor extraction": a procedure to record the execution of a set of processors from a parallel application and replay any of them in a controlled environment. Our technique generates very low interference in the recorded program thanks to the separation between non-determinism elimination and detailed processor recording. To improve robustness and accuracy, we further augmented our algorithm with a self-correction mechanism.
UR - http://www.scopus.com/inward/record.url?scp=78650052508&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650052508&partnerID=8YFLogxK
U2 - 10.1145/1866210.1866211
DO - 10.1145/1866210.1866211
M3 - Conference contribution
AN - SCOPUS:78650052508
SN - 9781450301367
T3 - PADTAD 2010 - International Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging
SP - 9
EP - 19
BT - PADTAD 2010 - International Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging
T2 - 8th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging, PADTAD'10
Y2 - 13 July 2010
ER -