An efficient GPU implementation technique for higher-order 3D stencils

Omer Anjum, Garcia De Gonzalo Simon, Mert Hidayetoglu, Wen Mei Hwu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Stencils are a family of widely used computational patterns that play a critical role in various scientific and engineering applications. Stencil computations are known to be memory-bandwidth bound, so a number of techniques and algorithms that optimize memory bandwidth usage have been proposed. However, existing techniques fall short in addressing the needs of large stencils, particularly more advanced stencil patterns involving non-axis-aligned grid points. To handle non-axis-aligned grid points, existing methods use either 3D caching or 2D caching schemes with more than one pass over the stencil per iteration, which suffer from a high intensity of memory accesses. The large number of memory accesses in these methods limits the achievable performance. In this work, we present a new GPU-based implementation technique called 'SWiC' that uses 2D caching to efficiently implement advanced 3D stencil patterns involving non-axis-aligned grid points, reducing global memory transactions through increased data reuse while requiring only a single pass per iteration. In contrast to current approaches that maintain input register queues, the proposed approach maintains and updates an output register queue instead. The analysis shows that SWiC achieves a significant reduction in memory transactions, which translates to an application speedup of 1.6x to 5.76x compared to the current state-of-the-art GPU stencil implementation. 'SWiC' was evaluated across the three latest Nvidia GPU architectures as of the writing of this paper, as well as various stencil patterns and sizes. We also show that 'SWiC' does not suffer performance penalties when applied to simpler 3D stencils without non-axis-aligned grid points, so it covers a wide range of applications. Finally, we study the scaling efficiency of SWiC in a multi-node setting and show that it achieves a weak scaling efficiency of about 96%.
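
The contrast between input and output queues can be illustrated with a small CPU-side sketch. It is not the authors' SWiC kernel and makes no claim about their CUDA implementation; it only shows one plausible reading of the idea. A single sweep along z reads each input plane once, and every stencil tap (including non-axis-aligned ones) scatters its contribution from that one plane into a short queue of pending output planes. All names here (`Tap`, `Grid`, `stencil_gather`, `stencil_output_queue`, `check_equivalence`) are hypothetical, and the queue lives in ordinary memory rather than GPU registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stencil tap: neighbor offset (dx, dy, dz) and its weight c.
struct Tap { int dx, dy, dz; double c; };

// Dense 3D grid stored x-fastest; purely illustrative.
struct Grid {
    int nx, ny, nz;
    std::vector<double> v;
    Grid(int nx_, int ny_, int nz_)
        : nx(nx_), ny(ny_), nz(nz_),
          v(static_cast<std::size_t>(nx_) * ny_ * nz_, 0.0) {}
    std::size_t idx(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * ny + y) * nx + x;
    }
    double& at(int x, int y, int z) { return v[idx(x, y, z)]; }
    double at(int x, int y, int z) const { return v[idx(x, y, z)]; }
};

// Reference gather form: each output point reads taps from several z-planes.
Grid stencil_gather(const Grid& in, const std::vector<Tap>& taps, int R) {
    Grid out(in.nx, in.ny, in.nz);
    for (int z = R; z < in.nz - R; ++z)
        for (int y = R; y < in.ny - R; ++y)
            for (int x = R; x < in.nx - R; ++x) {
                double s = 0.0;
                for (const Tap& t : taps)
                    s += t.c * in.at(x + t.dx, y + t.dy, z + t.dz);
                out.at(x, y, z) = s;
            }
    return out;
}

// Output-queue form: one sweep over z; input plane z is the only plane read
// at step z, and it updates the partial sums of output planes z-R .. z+R.
Grid stencil_output_queue(const Grid& in, const std::vector<Tap>& taps, int R) {
    assert(R >= 0);
    for (const Tap& t : taps)  // every tap must lie within radius R
        assert(std::abs(t.dx) <= R && std::abs(t.dy) <= R && std::abs(t.dz) <= R);
    Grid out(in.nx, in.ny, in.nz);
    const int Q = 2 * R + 1;  // number of output planes in flight
    const std::size_t plane = static_cast<std::size_t>(in.nx) * in.ny;
    std::vector<std::vector<double>> queue(Q, std::vector<double>(plane, 0.0));
    for (int z = 0; z < in.nz; ++z) {
        // The slot for output plane z+R is reused: clear its old partial sums.
        std::vector<double>& fresh = queue[(z + R) % Q];
        std::fill(fresh.begin(), fresh.end(), 0.0);
        for (const Tap& t : taps) {
            const int zo = z - t.dz;  // output plane this tap feeds
            if (zo < R || zo >= in.nz - R) continue;
            std::vector<double>& acc = queue[zo % Q];
            for (int y = R; y < in.ny - R; ++y)
                for (int x = R; x < in.nx - R; ++x)
                    acc[static_cast<std::size_t>(y) * in.nx + x] +=
                        t.c * in.at(x + t.dx, y + t.dy, z);
        }
        // Output plane z-R has now received every contribution: retire it.
        const int done = z - R;
        if (done >= R && done < in.nz - R) {
            const std::vector<double>& fin = queue[done % Q];
            for (int y = R; y < in.ny - R; ++y)
                for (int x = R; x < in.nx - R; ++x)
                    out.at(x, y, done) = fin[static_cast<std::size_t>(y) * in.nx + x];
        }
    }
    return out;
}

// Compares both forms on a small grid with axis-aligned and diagonal taps.
bool check_equivalence() {
    const int R = 1;
    const std::vector<Tap> taps = {
        {0, 0, 0, -4.0}, {1, 0, 0, 1.0}, {-1, 0, 0, 1.0}, {0, 1, 0, 1.0},
        {0, -1, 0, 1.0}, {0, 0, 1, 1.0}, {0, 0, -1, 1.0},
        {1, 0, 1, 0.5}, {-1, 1, -1, 0.25}, {1, 1, 0, 0.5}};  // non-axis-aligned
    Grid in(6, 7, 8);
    for (int z = 0; z < in.nz; ++z)
        for (int y = 0; y < in.ny; ++y)
            for (int x = 0; x < in.nx; ++x)
                in.at(x, y, z) = x + 2 * y * y + 3 * z * x;
    const Grid a = stencil_gather(in, taps, R);
    const Grid b = stencil_output_queue(in, taps, R);
    double worst = 0.0;
    for (std::size_t i = 0; i < a.v.size(); ++i)
        worst = std::max(worst, std::fabs(a.v[i] - b.v[i]));
    return worst < 1e-12;
}
```

In this sketch the gather form touches up to 2R+1 input planes per output point, while the output-queue form reads only the current plane at each step, which is the property a 2D cache can exploit. How the actual technique maps the queue onto registers and thread blocks is not described in the abstract and is not modeled here.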

Original language: English (US)
Title of host publication: Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019
Editors: Zheng Xiao, Laurence T. Yang, Pavan Balaji, Tao Li, Keqin Li, Albert Zomaya
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 552-561
Number of pages: 10
ISBN (Electronic): 9781728120584
DOIs: https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00086
State: Published - Aug 2019
Event: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 - Zhangjiajie, China
Duration: Aug 10 2019 to Aug 12 2019

Publication series

Name: Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019

Conference

Conference: 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019
Country: China
City: Zhangjiajie
Period: 8/10/19 to 8/12/19

Fingerprint

Data storage equipment
Bandwidth
Graphics processing unit
Grid
Scaling
Queue

Keywords

  • 3D stencil
  • CUDA
  • GPU
  • High order stencil
  • MHD
  • Stencil

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Energy Engineering and Power Technology

Cite this

Anjum, O., Simon, G. D. G., Hidayetoglu, M., & Hwu, W. M. (2019). An efficient GPU implementation technique for higher-order 3D stencils. In Z. Xiao, L. T. Yang, P. Balaji, T. Li, K. Li, & A. Zomaya (Eds.), Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019 (pp. 552-561). [8855722] (Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00086

@inproceedings{07bfbc54ea064d70b193bbe9ad028abc,
title = "An efficient GPU implementation technique for higher-order 3D stencils",
abstract = "Stencils are a family of widely used computational patterns that play a critical role in various scientific and engineering applications. Stencil computations are known to be memory-bandwidth bound, so a number of techniques and algorithms that optimize memory bandwidth usage have been proposed. However, existing techniques fall short in addressing the needs of large stencils, particularly more advanced stencil patterns involving non-axis-aligned grid points. To handle non-axis-aligned grid points, existing methods use either 3D caching or 2D caching schemes with more than one pass over the stencil per iteration, which suffer from a high intensity of memory accesses. The large number of memory accesses in these methods limits the achievable performance. In this work, we present a new GPU-based implementation technique called 'SWiC' that uses 2D caching to efficiently implement advanced 3D stencil patterns involving non-axis-aligned grid points, reducing global memory transactions through increased data reuse while requiring only a single pass per iteration. In contrast to current approaches that maintain input register queues, the proposed approach maintains and updates an output register queue instead. The analysis shows that SWiC achieves a significant reduction in memory transactions, which translates to an application speedup of 1.6x to 5.76x compared to the current state-of-the-art GPU stencil implementation. 'SWiC' was evaluated across the three latest Nvidia GPU architectures as of the writing of this paper, as well as various stencil patterns and sizes. We also show that 'SWiC' does not suffer performance penalties when applied to simpler 3D stencils without non-axis-aligned grid points, so it covers a wide range of applications. Finally, we study the scaling efficiency of SWiC in a multi-node setting and show that it achieves a weak scaling efficiency of about 96{\%}.",
keywords = "3D stencil, CUDA, GPU, High order stencil, MHD, Stencil",
author = "Omer Anjum and Simon, {Garcia De Gonzalo} and Mert Hidayetoglu and Hwu, {Wen Mei}",
year = "2019",
month = "8",
doi = "10.1109/HPCC/SmartCity/DSS.2019.00086",
language = "English (US)",
series = "Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "552--561",
editor = "Zheng Xiao and Yang, {Laurence T.} and Pavan Balaji and Tao Li and Keqin Li and Albert Zomaya",
booktitle = "Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019",
address = "United States",

}

TY - GEN

T1 - An efficient GPU implementation technique for higher-order 3D stencils

AU - Anjum, Omer

AU - Simon, Garcia De Gonzalo

AU - Hidayetoglu, Mert

AU - Hwu, Wen Mei

PY - 2019/8

Y1 - 2019/8

AB - Stencils are a family of widely used computational patterns that play a critical role in various scientific and engineering applications. Stencil computations are known to be memory-bandwidth bound, so a number of techniques and algorithms that optimize memory bandwidth usage have been proposed. However, existing techniques fall short in addressing the needs of large stencils, particularly more advanced stencil patterns involving non-axis-aligned grid points. To handle non-axis-aligned grid points, existing methods use either 3D caching or 2D caching schemes with more than one pass over the stencil per iteration, which suffer from a high intensity of memory accesses. The large number of memory accesses in these methods limits the achievable performance. In this work, we present a new GPU-based implementation technique called 'SWiC' that uses 2D caching to efficiently implement advanced 3D stencil patterns involving non-axis-aligned grid points, reducing global memory transactions through increased data reuse while requiring only a single pass per iteration. In contrast to current approaches that maintain input register queues, the proposed approach maintains and updates an output register queue instead. The analysis shows that SWiC achieves a significant reduction in memory transactions, which translates to an application speedup of 1.6x to 5.76x compared to the current state-of-the-art GPU stencil implementation. 'SWiC' was evaluated across the three latest Nvidia GPU architectures as of the writing of this paper, as well as various stencil patterns and sizes. We also show that 'SWiC' does not suffer performance penalties when applied to simpler 3D stencils without non-axis-aligned grid points, so it covers a wide range of applications. Finally, we study the scaling efficiency of SWiC in a multi-node setting and show that it achieves a weak scaling efficiency of about 96%.

KW - 3D stencil

KW - CUDA

KW - GPU

KW - High order stencil

KW - MHD

KW - Stencil

UR - http://www.scopus.com/inward/record.url?scp=85073523663&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073523663&partnerID=8YFLogxK

U2 - 10.1109/HPCC/SmartCity/DSS.2019.00086

DO - 10.1109/HPCC/SmartCity/DSS.2019.00086

M3 - Conference contribution

AN - SCOPUS:85073523663

T3 - Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019

SP - 552

EP - 561

BT - Proceedings - 21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019

A2 - Xiao, Zheng

A2 - Yang, Laurence T.

A2 - Balaji, Pavan

A2 - Li, Tao

A2 - Li, Keqin

A2 - Zomaya, Albert

PB - Institute of Electrical and Electronics Engineers Inc.

ER -