TY - CONF
T1 - Acceleration of an asynchronous message driven programming paradigm on IBM Blue Gene/Q
AU - Kumar, Sameer
AU - Sun, Yanhua
AU - Kalé, Laximant V.
N1 - Funding Information:
Acknowledgments. This research is supported in part by U.S. Department of Energy grants #DE-FC02-06ER25749 and #DE-PFC02-06ER25755; National Science Foundation grants #CCF-0833169, #CCF-0916302, #OCI-0926691 and #CCF-0937842; grants from Intel, Mellanox, Cisco, QLogic, and Sun Microsystems; Equipment donations from Intel, Mellanox, AMD, Appro, Chelsio, Dell, Microway, QLogic, and Sun Microsystems.
PY - 2013
Y1 - 2013
N2 - IBM Blue Gene/Q is the next-generation Blue Gene machine that can scale to tens of petaflops with 16 cores and 64 hardware threads per node. However, significant effort is required to fully exploit its capacity across various applications, spanning multiple programming models. In this paper, we focus on the asynchronous message-driven parallel programming model Charm++. Since its asynchronous behavior differs substantially from MPI's, porting it efficiently to BG/Q presents a challenge. On the other hand, the significant synergy between BG/Q software and Charm++ creates opportunities for effective utilization of BG/Q resources. We describe various novel fine-grained threading techniques in Charm++ that exploit the hardware features of the BG/Q compute chip. These include the use of L2 atomics to implement lockless producer-consumer queues that accelerate communication between threads, fast memory allocation, and hardware communication threads that are awakened via low-overhead interrupts from the BG/Q wakeup unit. Bursts of short messages are processed using the Many-to-Many interface to reduce runtime overhead. We also present techniques to optimize NAMD computation via Quad Processing Unit (QPX) vector instructions, and to accelerate the message rate via communication threads to optimize the Particle Mesh Ewald (PME) computation. We demonstrate the benefits of our techniques via two benchmarks: 3D Fast Fourier Transform and the molecular dynamics application NAMD. For the 92,000-atom ApoA1 molecule, we achieved 683μs/step with PME every 4 steps and 782μs/step with PME every step.
AB - IBM Blue Gene/Q is the next-generation Blue Gene machine that can scale to tens of petaflops with 16 cores and 64 hardware threads per node. However, significant effort is required to fully exploit its capacity across various applications, spanning multiple programming models. In this paper, we focus on the asynchronous message-driven parallel programming model Charm++. Since its asynchronous behavior differs substantially from MPI's, porting it efficiently to BG/Q presents a challenge. On the other hand, the significant synergy between BG/Q software and Charm++ creates opportunities for effective utilization of BG/Q resources. We describe various novel fine-grained threading techniques in Charm++ that exploit the hardware features of the BG/Q compute chip. These include the use of L2 atomics to implement lockless producer-consumer queues that accelerate communication between threads, fast memory allocation, and hardware communication threads that are awakened via low-overhead interrupts from the BG/Q wakeup unit. Bursts of short messages are processed using the Many-to-Many interface to reduce runtime overhead. We also present techniques to optimize NAMD computation via Quad Processing Unit (QPX) vector instructions, and to accelerate the message rate via communication threads to optimize the Particle Mesh Ewald (PME) computation. We demonstrate the benefits of our techniques via two benchmarks: 3D Fast Fourier Transform and the molecular dynamics application NAMD. For the 92,000-atom ApoA1 molecule, we achieved 683μs/step with PME every 4 steps and 782μs/step with PME every step.
KW - Blue Gene/Q
KW - Charm++
KW - L2 Atomic Queue
KW - communication thread
KW - many to many
KW - message-driven
UR - http://www.scopus.com/inward/record.url?scp=84884826356&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84884826356&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2013.83
DO - 10.1109/IPDPS.2013.83
M3 - Paper
AN - SCOPUS:84884826356
SP - 689
EP - 699
T2 - 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013
Y2 - 20 May 2013 through 24 May 2013
ER -