As machine learning algorithms have become more practical, there has been much effort to implement them on edge devices that can be used in our daily lives. However, unlike server-scale devices, edge devices are relatively small and thus have far more limited resources. Controlling resource usage and optimizing for the hardware therefore play an important role when implementing machine learning algorithms on an edge device. In this paper, we target convolutional neural networks (CNNs) and explore various optimization and design techniques to realize them on FPGA devices. The key idea explored in this paper is Backward Pipeline Scheduling together with Latency Balancing, which optimize the pipeline between CNN layers in order to significantly reduce the overall latency for processing a single image. We also develop a batch processing design to improve the throughput of the FPGA solution. We achieve a latency of 175.7 µs for classifying one image in the MNIST data set using LeNet and 653.4 µs for classifying one image in the CIFAR-10 data set using CifarNet. Without retraining, we still maintain a high accuracy of 97.6% on the MNIST data set and 83.6% on the CIFAR-10 data set. Our single-image latency is 5.2x faster for LeNet and 1.95x faster for CifarNet than that of the NVIDIA Jetson TX1 solution.