Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications

John Sartori, Rakesh Kumar

Research output: Contribution to journal › Conference article

Abstract

Control and memory divergence between threads in the same execution bundle, or warp, can significantly throttle the performance of GPU applications. We exploit the observation that many GPU applications exhibit error tolerance to propose branch and data herding. Branch herding eliminates control divergence by forcing all threads in a warp to take the same control path. Data herding eliminates memory divergence by forcing each thread in a warp to load from the same memory block. To safely and efficiently support branch and data herding, we propose a static analysis and compiler framework to prevent exceptions when control and data errors are introduced, a profiling framework that aims to maximize performance while maintaining acceptable output quality, and hardware optimizations to improve the performance benefits of exploiting error tolerance through branch and data herding. Our software implementation of branch herding on NVIDIA GeForce GTX 480 improves performance by up to 34% (13%, on average) for a suite of NVIDIA CUDA SDK and Parboil [7] benchmarks. Our hardware implementation of branch herding improves performance by up to 55% (30%, on average). Data herding improves performance by up to 32% (25%, on average). Observed output quality degradation is minimal for several applications that exhibit error tolerance, especially for visual computing applications. For a more detailed exposition of this work, see [6].
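
The two herding mechanisms described in the abstract can be illustrated with a small host-side simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the common control path or the common memory block is chosen, so the majority-vote policy, the 128-byte block size, and all function names below are hypothetical choices made for illustration.

```python
from collections import Counter

WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
BLOCK_BYTES = 128  # assumed memory block size (not stated in the abstract)


def herd_branch(predicates):
    """Pick one branch outcome for the whole warp.

    Assumed policy: majority vote over the per-thread predicates,
    with ties resolved to "taken".
    """
    taken = sum(1 for p in predicates if p)
    return taken * 2 >= len(predicates)


def flipped_threads(predicates):
    """Count threads forced onto a path they would not have taken.

    Each such thread is a potential control error that the
    application must tolerate.
    """
    outcome = herd_branch(predicates)
    return sum(1 for p in predicates if bool(p) != outcome)


def count_blocks(addresses, block_bytes=BLOCK_BYTES):
    """Number of distinct memory blocks touched by one warp-wide load."""
    return len({a // block_bytes for a in addresses})


def herd_addresses(addresses, block_bytes=BLOCK_BYTES):
    """Redirect every load to the most-requested block.

    Assumed policy: keep each thread's offset within the block and
    replace only the block index, so the herded address stays inside
    a block that some thread legitimately requested.
    """
    blocks = [a // block_bytes for a in addresses]
    target = Counter(blocks).most_common(1)[0][0]
    return [target * block_bytes + a % block_bytes for a in addresses]
```

For example, a warp in which 20 threads evaluate a branch condition as true and 12 as false would be herded onto the taken path, with `flipped_threads` reporting 12 potentially erroneous threads. Likewise, a warp whose loads touch three blocks touches exactly one after `herd_addresses`, at the cost of the minority threads reading data they did not ask for.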

Original language: English (US)
Pages (from-to): 427-428
Number of pages: 2
Journal: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
DOIs: 10.1145/2370816.2370879
State: Published - Oct 22 2012
Event: 21st International Conference on Parallel Architectures and Compilation Techniques, PACT 2012 - Minneapolis, MN, United States
Duration: Sep 19 2012 to Sep 23 2012

Keywords

  • Control divergence
  • Error tolerance
  • GPGPU
  • High performance
  • Memory divergence

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Cite this

@article{e0c73d8cf15b489aac89ad874394a7f7,
title = "Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications",
abstract = "Control and memory divergence between threads in the same execution bundle, or warp, can significantly throttle the performance of GPU applications. We exploit the observation that many GPU applications exhibit error tolerance to propose branch and data herding. Branch herding eliminates control divergence by forcing all threads in a warp to take the same control path. Data herding eliminates memory divergence by forcing each thread in a warp to load from the same memory block. To safely and efficiently support branch and data herding, we propose a static analysis and compiler framework to prevent exceptions when control and data errors are introduced, a profiling framework that aims to maximize performance while maintaining acceptable output quality, and hardware optimizations to improve the performance benefits of exploiting error tolerance through branch and data herding. Our software implementation of branch herding on NVIDIA GeForce GTX 480 improves performance by up to 34{\%} (13{\%}, on average) for a suite of NVIDIA CUDA SDK and Parboil [7] benchmarks. Our hardware implementation of branch herding improves performance by up to 55{\%} (30{\%}, on average). Data herding improves performance by up to 32{\%} (25{\%}, on average). Observed output quality degradation is minimal for several applications that exhibit error tolerance, especially for visual computing applications. For a more detailed exposition of this work, see [6].",
keywords = "Control divergence, Error tolerance, GPGPU, High performance, Memory divergence",
author = "John Sartori and Rakesh Kumar",
year = "2012",
month = "10",
day = "22",
doi = "10.1145/2370816.2370879",
language = "English (US)",
pages = "427--428",
journal = "Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT",
issn = "1089-795X",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Branch and data herding

T2 - Reducing control and memory divergence for error-tolerant GPU applications

AU - Sartori, John

AU - Kumar, Rakesh

PY - 2012/10/22

Y1 - 2012/10/22

N2 - Control and memory divergence between threads in the same execution bundle, or warp, can significantly throttle the performance of GPU applications. We exploit the observation that many GPU applications exhibit error tolerance to propose branch and data herding. Branch herding eliminates control divergence by forcing all threads in a warp to take the same control path. Data herding eliminates memory divergence by forcing each thread in a warp to load from the same memory block. To safely and efficiently support branch and data herding, we propose a static analysis and compiler framework to prevent exceptions when control and data errors are introduced, a profiling framework that aims to maximize performance while maintaining acceptable output quality, and hardware optimizations to improve the performance benefits of exploiting error tolerance through branch and data herding. Our software implementation of branch herding on NVIDIA GeForce GTX 480 improves performance by up to 34% (13%, on average) for a suite of NVIDIA CUDA SDK and Parboil [7] benchmarks. Our hardware implementation of branch herding improves performance by up to 55% (30%, on average). Data herding improves performance by up to 32% (25%, on average). Observed output quality degradation is minimal for several applications that exhibit error tolerance, especially for visual computing applications. For a more detailed exposition of this work, see [6].

KW - Control divergence

KW - Error tolerance

KW - GPGPU

KW - High performance

KW - Memory divergence

UR - http://www.scopus.com/inward/record.url?scp=84867496543&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867496543&partnerID=8YFLogxK

U2 - 10.1145/2370816.2370879

DO - 10.1145/2370816.2370879

M3 - Conference article

AN - SCOPUS:84867496543

SP - 427

EP - 428

JO - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

JF - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

SN - 1089-795X

ER -