TY - JOUR
T1 - Exploiting hierarchy in parallel computer networks to optimize collective operation performance
AU - Karonis, Nicholas T.
AU - de Supinski, Bronis R.
AU - Foster, Ian
AU - Gropp, William D.
AU - Lusk, Ewing
AU - Bresnahan, John
PY - 2000
Y1 - 2000
N2 - The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced 'optimal' trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area 'computational grids', and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH implementation of the MPI standard. Using information about topology discovered by Globus, we construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. We present results demonstrating the advantages of our multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
AB - The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced 'optimal' trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area 'computational grids', and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH implementation of the MPI standard. Using information about topology discovered by Globus, we construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. We present results demonstrating the advantages of our multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
UR - http://www.scopus.com/inward/record.url?scp=0033880619&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033880619&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:0033880619
SN - 1063-7133
SP - 377
EP - 384
JO - Proceedings of the International Parallel Processing Symposium, IPPS
JF - Proceedings of the International Parallel Processing Symposium, IPPS
ER -