An efficient GPU implementation technique for higher-order 3D stencils

Omer Anjum, Garcia De Gonzalo Simon, Mert Hidayetoglu, Wen Mei Hwu

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Stencils are a family of widely used computational patterns that play a critical role in many scientific and engineering applications. Stencil computations are known to be memory-bandwidth bound, so a number of techniques and algorithms that optimize memory bandwidth usage have been proposed. However, existing techniques fall short for large stencils, particularly more advanced stencil patterns involving non-axis-aligned grid points. To handle such points, existing methods use either 3D caching or 2D caching schemes with more than one pass over the stencil per iteration, which suffer from a high intensity of memory accesses. The large number of memory accesses in these methods limits the achievable performance. In this work, we present a new GPU-based implementation technique called 'SWiC' that uses 2D caching to efficiently implement advanced 3D stencil patterns involving non-axis-aligned grid points, reducing global memory transactions through increased data reuse while requiring only a single pass per iteration. In contrast to current approaches that maintain input register queues, the proposed approach maintains and updates an output register queue instead. Our analysis shows that SWiC achieves a significant reduction in memory transactions, which translates to an application speedup of 1.6x to 5.76x over the current state-of-the-art GPU stencil implementation. SWiC was evaluated across the three latest Nvidia GPU architectures as of the writing of this paper, as well as various stencil patterns and sizes. We also show that SWiC does not suffer performance penalties when applied to simpler 3D stencils without non-axis-aligned grid points, and thus covers a wide range of applications. Finally, we study the scaling efficiency of SWiC in a multi-node setting and show that it achieves a weak scaling efficiency of about 96%.
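
The contrast between input and output register queues can be illustrated with a small CPU-side sketch. This is not the authors' SWiC kernel. It is a minimal Python illustration under stated assumptions: the grid is reduced to two dimensions (z, y), the row `plane` stands in for the 2D cache, and `queue` stands in for the per-thread output register queue. The names `stencil_reference`, `stencil_output_queue`, `taps`, and `radius` are hypothetical. Each input plane is read exactly once and its contributions are scattered into the partial sums of every output plane it affects, so off-axis taps such as `(1, 1)` need no second pass.

```python
from collections import deque


def stencil_reference(grid, taps, radius):
    """Direct gather: out[z][y] = sum of w * grid[z + dz][y + dy] over taps."""
    nz, ny = len(grid), len(grid[0])
    out = [[0.0] * ny for _ in range(nz)]
    for z in range(radius, nz - radius):
        for y in range(radius, ny - radius):
            out[z][y] = sum(w * grid[z + dz][y + dy] for dz, dy, w in taps)
    return out


def stencil_output_queue(grid, taps, radius):
    """Single sweep along z that keeps a queue of partial OUTPUT planes."""
    nz, ny = len(grid), len(grid[0])
    out = [[0.0] * ny for _ in range(nz)]
    depth = 2 * radius + 1
    # While input plane z_in is being processed, queue[k] holds the partial
    # sums for output plane z_in - radius + k.
    queue = deque([0.0] * ny for _ in range(depth))
    for z_in in range(nz):
        plane = grid[z_in]  # the only input plane touched in this step
        for y in range(radius, ny - radius):
            for dz, dy, w in taps:
                # Input plane z_in sits at offset dz from output plane z_in - dz.
                queue[radius - dz][y] += w * plane[y + dy]
        # Output plane z_in - radius has now received every contribution.
        finished = queue.popleft()
        queue.append([0.0] * ny)
        z_out = z_in - radius
        if radius <= z_out < nz - radius:
            for y in range(radius, ny - radius):
                out[z_out][y] = finished[y]
    return out


# Integer test data, so both accumulation orders give identical results.
grid = [[(3 * z + 5 * y * y + z * y) % 7 for y in range(6)] for z in range(5)]
# (dz, dy, weight): axis-aligned taps plus three diagonal (off-axis) taps.
taps = [(0, 0, 4), (1, 0, -1), (-1, 0, -1), (0, 1, -1), (0, -1, -1),
        (1, 1, 2), (-1, -1, 2), (1, -1, 3)]
same = stencil_output_queue(grid, taps, 1) == stencil_reference(grid, taps, 1)
print(same)
```

In a GPU setting one would expect the plane to live in shared memory and the queue in registers, but that mapping is an assumption of this sketch and not a description of the paper's implementation.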

Original language: English (US)
Title of host publication: Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019
Editors: Zheng Xiao, Laurence T. Yang, Pavan Balaji, Tao Li, Keqin Li, Albert Zomaya
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 552-561
Number of pages: 10
ISBN (Electronic): 9781728120584
State: Published - Aug 2019
Event: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 - Zhangjiajie, China
Duration: Aug 10 2019 to Aug 12 2019

Publication series

Name: Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019

Conference

Conference: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019
Country/Territory: China
City: Zhangjiajie
Period: 8/10/19 to 8/12/19

Keywords

  • 3D stencil
  • CUDA
  • GPU
  • High order stencil
  • MHD
  • Stencil

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Energy Engineering and Power Technology
