Abstract
Sparse matrix-matrix multiplication (SpGEMM) is a key operation in numerous areas, from the information sciences to the physical sciences. Implementing SpGEMM efficiently on throughput-oriented processors, such as the graphics processing unit (GPU), requires the programmer to expose substantial fine-grained parallelism while conserving the limited off-chip memory bandwidth. Balancing these concerns, we decompose the SpGEMM operation into three highly parallel phases: expansion, sorting, and contraction, and introduce a set of complementary bandwidth-saving performance optimizations. Our implementation is fully general, and our optimization strategy adaptively processes the SpGEMM workload row-wise to substantially improve performance by decreasing the work complexity and utilizing the memory hierarchy more effectively.
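The three-phase decomposition described in the abstract can be illustrated with a minimal sketch. The following Python code (illustrative only; the paper's actual implementation targets the GPU, and the function name and coordinate-dictionary format here are assumptions, not the authors' API) shows the expansion, sorting, and contraction steps on tiny sparse inputs:

```python
def spgemm_esc(A, B):
    """Sketch of expansion-sort-contraction SpGEMM.

    A and B are sparse matrices given as {(row, col): value} dicts.
    """
    # Expansion: for every nonzero A[i,k] and B[k,j], emit the
    # partial product ((i, j), A[i,k] * B[k,j]).
    expanded = [((i, j), a_ik * b_kj)
                for (i, k), a_ik in A.items()
                for (k2, j), b_kj in B.items()
                if k == k2]
    # Sorting: order partial products by output coordinate so that
    # contributions to the same C[i,j] become contiguous.
    expanded.sort(key=lambda pair: pair[0])
    # Contraction: sum each contiguous run of partial products that
    # share an output coordinate.
    C = {}
    for coord, val in expanded:
        C[coord] = C.get(coord, 0) + val
    return C
```

On a GPU, each phase maps to a data-parallel primitive (e.g., a segmented expansion, a key-value sort, and a reduction by key), which is what makes the decomposition amenable to throughput-oriented hardware.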
| Original language | English (US) |
|---|---|
| Article number | 25 |
| Journal | ACM Transactions on Mathematical Software |
| Volume | 41 |
| Issue number | 4 |
| DOIs | |
| State | Published - Oct 2015 |
Keywords
- GPU
- Matrix-matrix
- Parallel
- Sparse
ASJC Scopus subject areas
- Software
- Applied Mathematics