TY - GEN
T1 - MOPED
T2 - 17th International Symposium on High-Performance Computer Architecture, HPCA 2011
AU - Gu, Junli
AU - Lumetta, Steven S.
AU - Kumar, Rakesh
AU - Sun, Yihe
PY - 2011
Y1 - 2011
N2 - Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17-45% and of communication microbenchmarks (Intel's IMB) by 76-94%. Off-chip memory misses are reduced by 43-88% for applications and by 75-100% for microbenchmarks.
UR - http://www.scopus.com/inward/record.url?scp=79955898153&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79955898153&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2011.5749721
DO - 10.1109/HPCA.2011.5749721
M3 - Conference contribution
AN - SCOPUS:79955898153
SN - 9781424494323
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 111
EP - 120
BT - Proceedings - 17th International Symposium on High-Performance Computer Architecture, HPCA 2011
Y2 - 12 February 2011 through 16 February 2011
ER -