TY - GEN
T1 - Improving the performance of MPI derived datatypes by optimizing memory-access cost
AU - Byna, Surendra
AU - Gropp, William
AU - Sun, Xian He
AU - Thakur, Rajeev
N1 - Publisher Copyright: © 2003 IEEE.
PY - 2003
Y1 - 2003
AB - The MPI Standard supports derived datatypes, which allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. This feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, few MPI implementations implement derived datatypes in a way that performs better than what the user can achieve by manually packing data into a contiguous buffer and then calling an MPI function. In this paper, we present a technique for improving the performance of derived datatypes by automatically using packing algorithms that are optimized for memory-access cost. The packing algorithms use memory-optimization techniques that the user cannot apply easily without advanced knowledge of the memory architecture. We present performance results for a matrix-transpose example that demonstrate that our implementation of derived datatypes significantly outperforms both manual packing by the user and the existing derived-datatype code in the MPI implementation (MPICH).
UR - http://www.scopus.com/inward/record.url?scp=84944883574&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84944883574&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2003.1253341
DO - 10.1109/CLUSTR.2003.1253341
M3 - Conference contribution
AN - SCOPUS:84944883574
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 412
EP - 419
BT - Proceedings - IEEE International Conference on Cluster Computing, CLUSTER 2003
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Cluster Computing, CLUSTER 2003
Y2 - 1 December 2003 through 4 December 2003
ER -