GPUs and data parallel processors currently rely on latency tolerance techniques such as multithreading and prefetching to maximize performance per watt. However, choosing a technique that is energy-efficient across a wide variety of workloads is difficult, because the type of latency to tolerate, the required hardware complexity, and the energy consumption are all directly related to application behavior. After qualitatively evaluating five commonly used latency tolerance techniques, we develop a hybrid technique that combines multithreading and decoupled execution to maximize performance while minimizing hardware complexity and energy consumption across a wide variety of workloads. We compare our hybrid technique with the five commonly used techniques on a 1024-core data parallel processor through a comprehensive design space exploration that uses detailed performance and physical design models. By intelligently combining decoupled execution and multithreading, our hybrid latency tolerance technique improves energy efficiency by 28% to 89% over any single technique on data parallel benchmarks. Compared to other combinations of latency tolerance techniques, our hybrid technique provides the highest energy efficiency, by a margin of over 26%.