In-Place Matrix Transposition on GPUs

Juan Gomez-Luna, I. Jui Sung, Li Wen Chang, Jose Maria Gonzalez-Linares, Nicolas Guil, Wen Mei W. Hwu

Research output: Contribution to journalArticlepeer-review


Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, a high performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPU to achieve good performance. In this paper we present our in-place matrix transposition approach for GPUs that is performed using elementary tile-wise transpositions. We propose low-level optimizations for the elementary transpositions, and find the best performing configurations for them. Then, we compare all sequences of transpositions that achieve full transposition, and detect which is the most favorable for each matrix. We present an heuristic to guide the selection of tile sizes, and compare them to brute-force search. We diagnose the drawback of our approach, and propose a solution using minimal padding. With fast padding and unpadding kernels, the overall throughput is significantly increased. Finally, we compare our method to another recent implementation.

Original languageEnglish (US)
Article number7059219
Pages (from-to)776-788
Number of pages13
JournalIEEE Transactions on Parallel and Distributed Systems
Issue number3
StatePublished - Mar 1 2016


  • GPU
  • In-Place
  • Transposition

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics


Dive into the research topics of 'In-Place Matrix Transposition on GPUs'. Together they form a unique fingerprint.

Cite this