TY - GEN
T1 - Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs
AU - Tian, Jiannan
AU - Di, Sheng
AU - Yu, Xiaodong
AU - Rivera, Cody
AU - Zhao, Kai
AU - Jin, Sian
AU - Feng, Yunhe
AU - Liang, Xin
AU - Tao, Dingwen
AU - Cappello, Franck
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous high-performance computing (HPC) architectures, GPU-accelerated error-bounded compressors (such as cuSZ and cuZFP) have been developed. However, they suffer from either low performance or low compression ratios. To this end, we propose cuSZ+ to target both high compression ratios and high throughputs. We identify that data sparsity and data smoothness are key factors for high compression throughputs. Our key contributions in this work are fourfold: (1) We propose an efficient compression workflow to adaptively perform run-length encoding and/or variable-length encoding. (2) We derive Lorenzo reconstruction in decompression as multidimensional partial-sum computation and propose a fine-grained Lorenzo reconstruction algorithm for GPU architectures. (3) We carefully optimize each of the cuSZ kernels by leveraging state-of-the-art CUDA parallel primitives. (4) We evaluate cuSZ+ using seven real-world HPC application datasets on V100 and A100 GPUs. Experiments show that cuSZ+ improves the compression throughputs and ratios by up to 18.4× and 5.3×, respectively, over cuSZ on the tested datasets.
UR - http://www.scopus.com/inward/record.url?scp=85123728843&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123728843&partnerID=8YFLogxK
U2 - 10.1109/Cluster48925.2021.00047
DO - 10.1109/Cluster48925.2021.00047
M3 - Conference contribution
AN - SCOPUS:85123728843
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 283
EP - 293
BT - Proceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
Y2 - 7 September 2021 through 10 September 2021
ER -