GPUs are an increasingly popular implementation platform for a variety of general purpose applications from mobile and embedded devices to high performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between the thread parallelism and performance of a single thread. However, prior techniques ignore register allocation, a significant factor in single thread performance and, indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than the exhaustive search.