TY - GEN
T1 - Accelerating Fourier and Number Theoretic Transforms using Tensor Cores and Warp Shuffles
AU - Durrani, Sultan
AU - Chughtai, Muhammad Saad
AU - Hidayetoglu, Mert
AU - Tahir, Rashid
AU - Dakkak, Abdul
AU - Rauchwerger, Lawrence
AU - Zaffar, Fareed
AU - Hwu, Wen Mei
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - The discrete Fourier transform (DFT) and its specialized case, the number theoretic transform (NTT), are two important mathematical tools having applications in several areas of science and engineering. However, despite their usefulness and utility, their adoption continues to be a challenge as computing the DFT of a signal can be a time-consuming and expensive operation. To speed things up, fast Fourier transform (FFT) algorithms, which are reduced-complexity formulations for computing the DFT of a sequence, have been proposed and implemented for traditional processors and their corresponding instruction sets. With the rise of GPUs, NVIDIA introduced its own FFT computation library called cuFFT, which leverages the power of GPUs to compute the DFT. However, as this paper demonstrates, there is a lot of room for improvement to accelerate the FFT and NTT algorithms on modern GPUs by utilizing specialized operations and architectural advancements. In particular, we present four major types of optimizations that leverage tensor cores and the warp-shuffle instruction. Through extensive evaluations, we show that our approach consistently outperforms existing GPU-based implementations with a speedup of up to 4× for NTT and a speedup of up to 1.5× for FFT.
AB - The discrete Fourier transform (DFT) and its specialized case, the number theoretic transform (NTT), are two important mathematical tools having applications in several areas of science and engineering. However, despite their usefulness and utility, their adoption continues to be a challenge as computing the DFT of a signal can be a time-consuming and expensive operation. To speed things up, fast Fourier transform (FFT) algorithms, which are reduced-complexity formulations for computing the DFT of a sequence, have been proposed and implemented for traditional processors and their corresponding instruction sets. With the rise of GPUs, NVIDIA introduced its own FFT computation library called cuFFT, which leverages the power of GPUs to compute the DFT. However, as this paper demonstrates, there is a lot of room for improvement to accelerate the FFT and NTT algorithms on modern GPUs by utilizing specialized operations and architectural advancements. In particular, we present four major types of optimizations that leverage tensor cores and the warp-shuffle instruction. Through extensive evaluations, we show that our approach consistently outperforms existing GPU-based implementations with a speedup of up to 4× for NTT and a speedup of up to 1.5× for FFT.
KW - Ciphertext
KW - DFT
KW - FFT
KW - GPU
KW - Homomorphic encryption
KW - NTT
KW - Tensor cores
UR - http://www.scopus.com/inward/record.url?scp=85125729888&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125729888&partnerID=8YFLogxK
U2 - 10.1109/PACT52795.2021.00032
DO - 10.1109/PACT52795.2021.00032
M3 - Conference contribution
AN - SCOPUS:85125729888
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 345
EP - 355
BT - Proceedings - 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021
A2 - Lee, Jaejin
A2 - Cohen, Albert
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021
Y2 - 26 September 2021 through 29 September 2021
ER -