In-Place Matrix Transposition on GPUs

Juan Gomez-Luna, I-Jui Sung, Li-Wen Chang, Jose Maria Gonzalez-Linares, Nicolas Guil, Wen-Mei W. Hwu

Research output: Contribution to journal › Article

Abstract

Matrix transposition is an important algorithmic building block for many numerical algorithms such as the FFT. With more and more algebra libraries offloading work to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures because of their limited on-board memory capacity and high throughput. However, directly applying CPU in-place transposition algorithms lacks the parallelism and locality required for good performance on GPUs. In this paper, we present an in-place matrix transposition approach for GPUs built from elementary tile-wise transpositions. We propose low-level optimizations for the elementary transpositions and find their best-performing configurations. We then compare all sequences of elementary transpositions that achieve a full transposition and determine which is most favorable for each matrix. We present a heuristic to guide the selection of tile sizes and compare it with brute-force search. We diagnose the main drawback of our approach and propose a solution using minimal padding. With fast padding and unpadding kernels, the overall throughput is significantly increased. Finally, we compare our method with another recent implementation.
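To give a concrete picture of the tile-wise building block the abstract describes, below is a minimal CUDA sketch of an elementary shared-memory tile transposition applied in place to a square matrix. It is an illustrative assumption, not the authors' implementation: the kernel name, the 32x32 tile size, and the restriction to square matrices whose dimension is a multiple of the tile size are all assumptions, and the paper's actual contributions (sequences of elementary transpositions, tile-size selection, and minimal padding for general rectangular matrices) are not captured here.

#include <cuda_runtime.h>

#define TILE 32

// Each thread block handles one pair of mirrored tiles (bx, by) and (by, bx)
// of a square n x n matrix stored in row-major order, transposing both
// through shared memory so that global-memory accesses stay coalesced.
__global__ void swap_transpose_tiles(float *m, int n)
{
    __shared__ float a[TILE][TILE + 1];  // +1 column avoids bank conflicts
    __shared__ float b[TILE][TILE + 1];

    int bx = blockIdx.x, by = blockIdx.y;
    if (bx > by) return;                 // handle each mirrored pair only once

    int x0 = bx * TILE, y0 = by * TILE;
    int tx = threadIdx.x, ty = threadIdx.y;

    // Load the two mirrored tiles into shared memory.
    a[ty][tx] = m[(y0 + ty) * n + (x0 + tx)];
    b[ty][tx] = m[(x0 + ty) * n + (y0 + tx)];
    __syncthreads();

    // Write each tile back, transposed, into the other tile's position.
    m[(y0 + ty) * n + (x0 + tx)] = b[tx][ty];
    m[(x0 + ty) * n + (y0 + tx)] = a[tx][ty];
}

// Host-side launch for a device matrix d_m of size n x n,
// with n assumed to be a multiple of TILE.
void transpose_inplace_square(float *d_m, int n)
{
    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    swap_transpose_tiles<<<grid, block>>>(d_m, n);
    cudaDeviceSynchronize();
}

For rectangular matrices this simple pairwise swap no longer applies, because a tile's destination generally overlaps tiles laid out differently; that is precisely the case the paper addresses with sequences of elementary transpositions and minimal padding.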

Original language: English (US)
Article number: 7059219
Pages (from-to): 776-788
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 27
Issue number: 3
DOI: https://doi.org/10.1109/TPDS.2015.2412549
State: Published - Mar 1 2016

Fingerprint

  • Tile
  • Throughput
  • Fast Fourier transforms
  • Algebra
  • Program processors
  • Data storage equipment
  • Graphics processing unit

Keywords

  • GPU
  • In-Place
  • Transposition

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Gomez-Luna, J., Sung, I. J., Chang, L. W., Gonzalez-Linares, J. M., Guil, N., & Hwu, W-M. W. (2016). In-Place Matrix Transposition on GPUs. IEEE Transactions on Parallel and Distributed Systems, 27(3), 776-788. [7059219]. https://doi.org/10.1109/TPDS.2015.2412549
