Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for post-analysis. Because supercomputers and HPC applications are becoming heterogeneous through accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present cuSZ, an optimized GPU version of one of the best error-bounded lossy compressors, SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme that entirely removes the data dependency in the prediction step of SZ, so that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that cuSZ improves SZ's compression throughput by up to 370.1× and 13.1× over the production version running on single and multiple CPU cores, respectively, while delivering the same quality of reconstructed data. It also improves the compression ratio by up to 3.48× on the tested data compared with another state-of-the-art GPU-supported lossy compressor.
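The core idea behind the dual-quantization scheme mentioned above can be illustrated with a minimal 1-D sketch. The key point is that quantizing each value to an integer grid *before* prediction (rather than quantizing prediction errors computed from reconstructed data) makes the residual computation exact integer arithmetic with no loop-carried dependency, so it parallelizes trivially on a GPU. Function names and the 1-D Lorenzo predictor used here are illustrative simplifications, not the paper's full CUDA implementation:

```python
import numpy as np

def dual_quantize_1d(data, eb):
    # Step 1 (pre-quantization): snap every value to a uniform grid of
    # spacing 2*eb; this alone bounds the reconstruction error by eb.
    q = np.round(data / (2 * eb)).astype(np.int64)
    # Step 2 (post-quantization): 1-D Lorenzo prediction on the integers;
    # residuals are exact, so no element depends on a reconstructed neighbor.
    pred = np.concatenate(([0], q[:-1]))
    return q - pred  # integer codes, to be entropy-coded (e.g., Huffman)

def reconstruct_1d(codes, eb):
    # A prefix sum inverts the Lorenzo predictor; rescaling restores the data.
    return np.cumsum(codes) * (2 * eb)

x = np.array([1.00, 1.02, 1.05, 1.50, 2.00])
codes = dual_quantize_1d(x, eb=0.01)
xr = reconstruct_1d(codes, eb=0.01)
assert np.all(np.abs(x - xr) <= 0.01 + 1e-12)  # error bound holds
```

Because both steps are elementwise (the prefix sum in decompression is itself a well-parallelized GPU primitive), this formulation removes the serial dependency that limits the original CPU predictor.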