TY - GEN
T1 - Efficient Error-Bounded Lossy Compression for CPU Architectures
AU - Dube, Griffin
AU - Tian, Jiannan
AU - Di, Sheng
AU - Tao, Dingwen
AU - Calhoun, Jon C.
AU - Cappello, Franck
N1 - Funding Information:
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR), under contract DE-AC02-06CH11357. This material is based upon work supported by the National Science Foundation under Grant Nos. SHF-1910197, SHF-1943114, OAC-2003709, OAC-2042084, OAC-2104023, and OAC-2104024.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Modern HPC applications produce increasingly large amounts of data, which limits the performance of current extreme-scale systems. Lossy compression helps to mitigate this issue by decreasing the size of data generated by these applications. SZ, a current state-of-the-art lossy compressor, is able to achieve high compression ratios, but its prediction/quantization methods contain read-after-write (RAW) dependencies that prevent parallelizing this step of the compression. Recent work proposes a parallel dual prediction/quantization algorithm for GPUs that removes these dependencies. However, some HPC systems and applications do not use GPUs, and could still benefit from the fine-grained parallelism of this method. Using the dual-quantization technique, we implement and optimize a SIMD-vectorized CPU version of SZ (vecSZ), and create a heuristic for selecting the optimal block size and vector length. We propose a novel block padding algorithm to decrease the number of unpredictable values along compression block borders and find it reduces the number of prediction outliers by up to 100%. We measure the performance of vecSZ against a CPU version of SZ using dual-quantization, pSZ, as well as SZ-1.4. Using real-world scientific datasets, we evaluate vecSZ on the Intel Skylake and AMD Rome architectures. vecSZ results in up to 32% improvement in rate-distortion and up to 15× speedup over SZ-1.4, achieving a prediction and quantization bandwidth in excess of 3.4 GB/s.
KW - big data
KW - compression
KW - lossy compression
KW - program optimization
KW - vectorization
UR - http://www.scopus.com/inward/record.url?scp=85149917789&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85149917789&partnerID=8YFLogxK
U2 - 10.1109/MASCOTS56607.2022.00020
DO - 10.1109/MASCOTS56607.2022.00020
M3 - Conference contribution
AN - SCOPUS:85149917789
T3 - Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
SP - 89
EP - 96
BT - Proceedings - 2022 30th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2022
PB - IEEE Computer Society
T2 - 30th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2022
Y2 - 18 October 2022 through 20 October 2022
ER -