A scalable MPI-Comm-split algorithm for exascale computing

Paul Sack, William D Gropp

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Existing algorithms for creating communicators in MPI programs will not scale well to future exascale supercomputers containing millions of cores. In this work, we present a novel communicator-creation algorithm that does scale well into millions of processes using three techniques: replacing the sorting at the end of MPI-Comm-split with merging as the color and key table is built, sorting the color and key table in parallel, and using a distributed table to store the output communicator data rather than a replicated table. This reduces the time cost of MPI-Comm-split in the worst case we consider from 22 seconds to 0.37 second. Existing algorithms build a table with as many entries as processes, using vast amounts of memory. Our algorithm uses a small, fixed amount of memory per communicator after MPI-Comm-split has finished and uses a fraction of the memory used by the conventional algorithm for temporary storage during the execution of MPI-Comm-split.

Original languageEnglish (US)
Title of host publicationRecent Advances in the Message Passing Interface - 17th European MPI Users' Group Meeting, EuroMPI 2010, Proceedings
Number of pages10
StatePublished - Nov 12 2010
Event17th European MPI Users' Group Meeting, EuroMPI 2010 - Stuttgart, Germany
Duration: Sep 12 2010Sep 15 2010

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume6305 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349


Other17th European MPI Users' Group Meeting, EuroMPI 2010

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Fingerprint Dive into the research topics of 'A scalable MPI-Comm-split algorithm for exascale computing'. Together they form a unique fingerprint.

Cite this