Deep Neural Networks (DNNs) have become promising solutions for data analysis especially for raw data processing from sensors. However, using DNN-based approaches can easily introduce huge demands of computation and memory consumption, which may not be feasible for direct deployment onto the Internet of Thing (IoT) devices, since they have strict constraints on hardware resources, power budgets, response latency, and manufacturing cost. To bring DNNs into IoT devices, embedded FPGA can be one of the most suitable candidates by providing better energy efficiency than GPU and CPU based solutions, and higher flexibility than ASICs. In this paper, we propose a systematic solution to deploy DNNs on embedded FPGAs, which includes a ternarized hardware Deep Learning Accelerator (T-DLA), and a framework for ternary neural network (TNN) training. T-DLA is a highly optimized hardware unit in FPGA specializing in accelerating the TNNs, while the proposed framework can significantly compress the DNN parameters down to two bits with little accuracy drop. Results show that our training framework can compress the DNN up to 14.14x while maintaining nearly the same accuracy compared to the floating point version. By illustrating our proposed design techniques, the T-DLA can deliver up to 0.4TOPS with 2.576W power consumption, showing 873.6x and 5.1x higher energy efficiency (fps/W) on ImageNet with Resnet-18 model comparing to Xeon E5-2630 CPU and Nvidia 1080 Ti GPU. To the best of our knowledge, this is the first instruction-based highly efficient ternary DLA design reported from the literature.