TY - GEN
T1 - A case study of communication optimizations on 3D mesh interconnects
AU - Bhatelé, Abhinav
AU - Bohm, Eric
AU - Kalé, Laxmikant V.
PY - 2009
Y1 - 2009
N2 - Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence, task mapping strategies on large machines should take the machine topology into account. In this paper, we present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Our methodology is facilitated by the idea of object-based decomposition used in Charm++, which separates decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios involving interactions among multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P, and 2,048 processors of Cray XT3.
AB - Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence, task mapping strategies on large machines should take the machine topology into account. In this paper, we present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Our methodology is facilitated by the idea of object-based decomposition used in Charm++, which separates decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios involving interactions among multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P, and 2,048 processors of Cray XT3.
UR - http://www.scopus.com/inward/record.url?scp=70350625179&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350625179&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-03869-3_94
DO - 10.1007/978-3-642-03869-3_94
M3 - Conference contribution
AN - SCOPUS:70350625179
SN - 3642038689
SN - 9783642038686
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 1015
EP - 1028
BT - Euro-Par 2009 Parallel Processing - 15th International Euro-Par Conference, Proceedings
T2 - Euro-Par 2009 Parallel Processing - 15th International Euro-Par Conference, Proceedings
Y2 - 25 August 2009 through 28 August 2009
ER -