Data parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can be a significant factor in system performance. However, although each core produces outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves the average memory access latency, allowing the system to improve performance at the same number of cores. In this paper, we develop a network-on-chip coupled with computation cores synthesized from CUDA for FPGAs that enables on-chip data sharing. We demonstrate reduced external bandwidth demand by up to 60% (average 56%) and total application latency in cycles by up to 43% (average 27%).