TY - GEN
T1 - Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
AU - Block, Charles
AU - Gerogiannis, Gerasimos
AU - Mendis, Charith
AU - Azad, Ariful
AU - Torrellas, Josep
N1 - We thank Fredrik Kjolstad and the reviewers for helping to improve this paper. This research was funded in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; by a grant from the IBM-Illinois Discovery Accelerator Institute; by NSF grants PPoSS CCF 2316233, CNS 1956007 and CCF 2107470; by DOE grant DE-SC0022098; and by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 21-46756. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS230044 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.
PY - 2024/4/27
Y1 - 2024/4/27
N2 - Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead. We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11x over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.
AB - Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead. We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11x over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.
KW - distributed algorithms
KW - high-performance computing
KW - sparse matrices
KW - SpMM
UR - http://www.scopus.com/inward/record.url?scp=85192178212&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192178212&partnerID=8YFLogxK
U2 - 10.1145/3620665.3640427
DO - 10.1145/3620665.3640427
M3 - Conference contribution
AN - SCOPUS:85192178212
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 1200
EP - 1217
BT - Summer Cycle
PB - Association for Computing Machinery
Y2 - 27 April 2024 through 1 May 2024
ER -