Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds-up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6.
|Original language||English (US)|
|Number of pages||12|
|Journal||Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT|
|State||Published - 2001|
ASJC Scopus subject areas
- Theoretical Computer Science
- Hardware and Architecture