TY - GEN
T1 - Comparison based sorting for systems with multiple GPUs
AU - Tanasic, Ivan
AU - Vilanova, Lluís
AU - Jordà, Marc
AU - Cabezas, Javier
AU - Gelado, Isaac
AU - Navarro, Nacho
AU - Hwu, Wen Mei
N1 - Copyright:
Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2013
Y1 - 2013
N2 - As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are de- signed to use all available GPUs in the system. In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single- GPU sorting algorithm. Then, a series of merge steps pro- duce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
AB - As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are de- signed to use all available GPUs in the system. In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single- GPU sorting algorithm. Then, a series of merge steps pro- duce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
KW - CUDA
KW - GPU
KW - Parallel
KW - Sorting
UR - http://www.scopus.com/inward/record.url?scp=84875986099&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84875986099&partnerID=8YFLogxK
U2 - 10.1145/2458523.2458524
DO - 10.1145/2458523.2458524
M3 - Conference contribution
AN - SCOPUS:84875986099
SN - 9781450320177
T3 - ACM International Conference Proceeding Series
SP - 1
EP - 11
BT - Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU 2013
T2 - 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU 2013
Y2 - 16 March 2013 through 16 March 2013
ER -