Throughput-oriented high-level synthesis (HLS) allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, enabling effective design space exploration to select efficient single-core implementations, and easy scaling of parallelism through multiple core instantiations. However, the study of high-level synthesis for parallel languages has concentrated on optimization of cores and on-chip communication, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDA-to-RTL throughput-oriented HLS, and we open-source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS mode. We study implementation optimization for platform integration, compare to an embedded GPU (Tegra TK1), and verify designs on a Zedboard Zynq 7020 (bare-metal) and an Omnitek Zynq 7045 (OS).