A scalable tridiagonal solver for GPUs

Hee Seok Kim, Shengzhao Wu, Li Wen Chang, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present the design and evaluation of a scalable tridiagonal solver targeted for GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking down a problem into multiple subproblems each of which is independent of other, and 2) solving the subproblems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction(tiled PCR) and thread-level parallel Thomas algorithm(p-Thomas). Algorithm transition from tiled PCR to p-Thomas is determined by input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable as it can cope with various input system sizes by properly adjusting algorithm trasition point. Our method on a NVidia GTX480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33GHz Intel i7 975 in double precision, respectively.

Original languageEnglish (US)
Title of host publicationProceedings - 2011 International Conference on Parallel Processing, ICPP 2011
Pages444-453
Number of pages10
DOIs
StatePublished - Nov 7 2011
Event40th International Conference on Parallel Processing, ICPP 2011 - Taipei City, Taiwan, Province of China
Duration: Sep 13 2011Sep 16 2011

Publication series

NameProceedings of the International Conference on Parallel Processing
ISSN (Print)0190-3918

Other

Other40th International Conference on Parallel Processing, ICPP 2011
CountryTaiwan, Province of China
CityTaipei City
Period9/13/119/16/11

Fingerprint

Tridiagonal matrix
Cyclic Reduction
Parallel algorithms
Parallel Algorithms
Tridiagonal Systems
Hybrid Method
Thread
Efficient Algorithms
Hardware
Distinct
Evaluation
Graphics processing unit

Keywords

  • GPGPU
  • GPU computing
  • Tridiagonal solver
  • Tridiagonal systems

ASJC Scopus subject areas

  • Software
  • Mathematics(all)
  • Hardware and Architecture

Cite this

Kim, H. S., Wu, S., Chang, L. W., & Hwu, W-M. W. (2011). A scalable tridiagonal solver for GPUs. In Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011 (pp. 444-453). [6047212] (Proceedings of the International Conference on Parallel Processing). https://doi.org/10.1109/ICPP.2011.41

A scalable tridiagonal solver for GPUs. / Kim, Hee Seok; Wu, Shengzhao; Chang, Li Wen; Hwu, Wen-Mei W.

Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011. 2011. p. 444-453 6047212 (Proceedings of the International Conference on Parallel Processing).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Kim, HS, Wu, S, Chang, LW & Hwu, W-MW 2011, A scalable tridiagonal solver for GPUs. in Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011., 6047212, Proceedings of the International Conference on Parallel Processing, pp. 444-453, 40th International Conference on Parallel Processing, ICPP 2011, Taipei City, Taiwan, Province of China, 9/13/11. https://doi.org/10.1109/ICPP.2011.41
Kim HS, Wu S, Chang LW, Hwu W-MW. A scalable tridiagonal solver for GPUs. In Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011. 2011. p. 444-453. 6047212. (Proceedings of the International Conference on Parallel Processing). https://doi.org/10.1109/ICPP.2011.41
Kim, Hee Seok ; Wu, Shengzhao ; Chang, Li Wen ; Hwu, Wen-Mei W. / A scalable tridiagonal solver for GPUs. Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011. 2011. pp. 444-453 (Proceedings of the International Conference on Parallel Processing).
@inproceedings{634b6baf6810490db92de1d5af216aa9,
title = "A scalable tridiagonal solver for GPUs",
abstract = "We present the design and evaluation of a scalable tridiagonal solver targeted for GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking down a problem into multiple subproblems each of which is independent of other, and 2) solving the subproblems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction(tiled PCR) and thread-level parallel Thomas algorithm(p-Thomas). Algorithm transition from tiled PCR to p-Thomas is determined by input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable as it can cope with various input system sizes by properly adjusting algorithm trasition point. Our method on a NVidia GTX480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33GHz Intel i7 975 in double precision, respectively.",
keywords = "GPGPU, GPU computing, Tridiagonal solver, Tridiagonal systems",
author = "Kim, {Hee Seok} and Shengzhao Wu and Chang, {Li Wen} and Hwu, {Wen-Mei W}",
year = "2011",
month = "11",
day = "7",
doi = "10.1109/ICPP.2011.41",
language = "English (US)",
isbn = "9780769545103",
series = "Proceedings of the International Conference on Parallel Processing",
pages = "444--453",
booktitle = "Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011",

}

TY - GEN

T1 - A scalable tridiagonal solver for GPUs

AU - Kim, Hee Seok

AU - Wu, Shengzhao

AU - Chang, Li Wen

AU - Hwu, Wen-Mei W

PY - 2011/11/7

Y1 - 2011/11/7

N2 - We present the design and evaluation of a scalable tridiagonal solver targeted for GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking down a problem into multiple subproblems each of which is independent of other, and 2) solving the subproblems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction(tiled PCR) and thread-level parallel Thomas algorithm(p-Thomas). Algorithm transition from tiled PCR to p-Thomas is determined by input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable as it can cope with various input system sizes by properly adjusting algorithm trasition point. Our method on a NVidia GTX480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33GHz Intel i7 975 in double precision, respectively.

AB - We present the design and evaluation of a scalable tridiagonal solver targeted for GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking down a problem into multiple subproblems each of which is independent of other, and 2) solving the subproblems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction(tiled PCR) and thread-level parallel Thomas algorithm(p-Thomas). Algorithm transition from tiled PCR to p-Thomas is determined by input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable as it can cope with various input system sizes by properly adjusting algorithm trasition point. Our method on a NVidia GTX480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33GHz Intel i7 975 in double precision, respectively.

KW - GPGPU

KW - GPU computing

KW - Tridiagonal solver

KW - Tridiagonal systems

UR - http://www.scopus.com/inward/record.url?scp=80155140292&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80155140292&partnerID=8YFLogxK

U2 - 10.1109/ICPP.2011.41

DO - 10.1109/ICPP.2011.41

M3 - Conference contribution

AN - SCOPUS:80155140292

SN - 9780769545103

T3 - Proceedings of the International Conference on Parallel Processing

SP - 444

EP - 453

BT - Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011

ER -