TY - GEN
T1 - Optimal Round and Sample-Size Complexity for Partitioning in Parallel Sorting
AU - Yang, Wentao
AU - Harsh, Vipul
AU - Solomonik, Edgar
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/6/17
Y1 - 2023/6/17
N2 - State-of-the-art parallel sorting algorithms for distributed-memory architectures are based on computing a balanced partitioning via sampling and histogramming. By finding samples that partition the sorted keys into evenly sized chunks, these algorithms minimize the number of communication rounds required. Histogramming (computing the positions of samples) guides sampling, enabling a decrease in the overall number of samples collected. We derive lower and upper bounds on the number of sampling/histogramming rounds required to compute a balanced partitioning. We improve on prior results to demonstrate that when using p processors, O(log∗ p) rounds with O(p/log∗ p) samples per round suffice. We match that with a lower bound showing that any algorithm with O(p) samples per round requires at least Ω(log∗ p) rounds. Additionally, we prove an Ω(p log p) lower bound on the sample size for one round, thus proving that the existing one-round algorithms sample sort, AMS sort [2], and HSS [16] have optimal sample-size complexity. To derive the lower bound, we propose a hard randomized input distribution and apply classical results from the distribution theory of runs.
AB - State-of-the-art parallel sorting algorithms for distributed-memory architectures are based on computing a balanced partitioning via sampling and histogramming. By finding samples that partition the sorted keys into evenly sized chunks, these algorithms minimize the number of communication rounds required. Histogramming (computing the positions of samples) guides sampling, enabling a decrease in the overall number of samples collected. We derive lower and upper bounds on the number of sampling/histogramming rounds required to compute a balanced partitioning. We improve on prior results to demonstrate that when using p processors, O(log∗ p) rounds with O(p/log∗ p) samples per round suffice. We match that with a lower bound showing that any algorithm with O(p) samples per round requires at least Ω(log∗ p) rounds. Additionally, we prove an Ω(p log p) lower bound on the sample size for one round, thus proving that the existing one-round algorithms sample sort, AMS sort [2], and HSS [16] have optimal sample-size complexity. To derive the lower bound, we propose a hard randomized input distribution and apply classical results from the distribution theory of runs.
KW - communication cost
KW - communication lower bounds
KW - parallel algorithms
KW - parallel sorting
UR - http://www.scopus.com/inward/record.url?scp=85164300064&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85164300064&partnerID=8YFLogxK
U2 - 10.1145/3558481.3591076
DO - 10.1145/3558481.3591076
M3 - Conference contribution
AN - SCOPUS:85164300064
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 467
EP - 478
BT - SPAA 2023 - Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures
PB - Association for Computing Machinery
T2 - 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023
Y2 - 17 June 2023 through 19 June 2023
ER -