Graph processing has attracted much attention recently due to its popularity in many big data analytic applications. With high performance and energy efficiency, FPGAs can be an attractive architecture for graph processing. A number of techniques such as caching using block RAMs (BRAMs) to reduce random accesses of global memory and multiple processing element (PE) instances for high throughput have been explored. OpenCL-based FPGAs natively support a high-level programming paradigm, providing good programmability to developers. However, challenges remain because the run-time dependency introduced by multiple PEs usually cannot be handled efficiently by OpenCL's high-level control granularity. In this paper, we propose a novel on-the-fly parallel data shuffling technique that can be implemented in OpenCL to solve this problem. We have integrated our shuffling technique to an edge-centric graph processing framework which achieves a throughput of more than 1,000 million traversed edges per second (MTEPS) on PageRank, SpMV, BFS and SSSP applications and is even better than existing RTL-based designs.