Abstract
With processor speeds no longer doubling every 18-24 months owing to the exponential increase in power consumption and heat dissipation, modern high-end computing systems rely less on the performance of individual processing units and instead achieve high performance through the parallelism of a massive number of low-frequency, low-power processing cores. Using such low-frequency cores, however, increases the cost of the end-host pre- and post-communication processing required within communication stacks, such as the Message Passing Interface (MPI) implementation. Similarly, small amounts of serialization within the communication stack that were acceptable on small and medium-sized systems can be brutal on massively parallel systems. Thus, in this paper, we study the different non-data-communication overheads within the MPI implementation on the IBM Blue Gene/P system. Specifically, we analyze several aspects of MPI, including the basic MPI stack overhead, the overhead of allocating and queueing requests, queue searches within the MPI stack, multi-request operations, and others. Our experiments, which scale up to 131,072 cores of the largest Blue Gene/P system in the world (80% of the total system size), reveal several insights into overheads in the MPI stack that were not previously considered significant but can have a substantial impact on such massive systems.
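To illustrate one class of overhead the abstract refers to, the C sketch below exaggerates posted-receive queue searches by pre-posting many receives with non-matching tags before a simple ping-pong exchange. This is a minimal illustration under assumed conditions, not the paper's actual benchmark; the queue depth, tag values, message size, and iteration count are arbitrary placeholders.

```c
/*
 * Minimal sketch (not the paper's benchmark): expose posted-receive
 * queue-search overhead by making every incoming ping-pong message walk
 * past QUEUE_DEPTH non-matching receive requests before it matches.
 */
#include <mpi.h>
#include <stdio.h>

#define QUEUE_DEPTH 1000   /* illustrative number of non-matching receives */
#define ITERATIONS  100    /* illustrative ping-pong repetitions */

int main(int argc, char **argv)
{
    int rank, size, i, it;
    char buf = 0, dummy[QUEUE_DEPTH];
    MPI_Request reqs[QUEUE_DEPTH];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 1) {
        /* Post receives with tags 1..QUEUE_DEPTH that will never match tag 0. */
        for (i = 0; i < QUEUE_DEPTH; i++)
            MPI_Irecv(&dummy[i], 1, MPI_CHAR, 0, i + 1, MPI_COMM_WORLD, &reqs[i]);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (it = 0; it < ITERATIONS; it++) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* This matching receive sits behind QUEUE_DEPTH queued entries. */
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg ping-pong time with %d queued receives: %.2f us\n",
               QUEUE_DEPTH, (t1 - t0) * 1e6 / ITERATIONS);

    if (rank == 1) {
        /* Cancel and complete the never-matched requests before finalizing. */
        for (i = 0; i < QUEUE_DEPTH; i++) {
            MPI_Cancel(&reqs[i]);
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```

On low-frequency cores, the per-entry cost of traversing such a matching queue is higher, which is one reason overheads of this kind, negligible on small systems, become visible at the scales studied in the paper.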
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 5-15 |
| Number of pages | 11 |
| Journal | International Journal of High Performance Computing Applications |
| Volume | 24 |
| Issue number | 1 |
| DOIs | |
| State | Published - 2010 |
Keywords
- Blue Gene/P
- MPI
- Non-data-communication overheads
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture