Executing applications on upcoming high-performance computing (HPC) systems introduces a variety of new challenges and amplifies many existing ones. These systems will be composed of a large number of "fat" nodes interconnected via high-performance networks, where each node consists of multiple processors on a chip with simultaneous multithreading capabilities. Traditional system software for parallel computing treats these chip multiprocessors (CMPs) as arrays of symmetric multiprocessing cores, when in fact the two differ fundamentally. This approach forfeits opportunities for optimization on CMPs. We show that efficient execution on this architecture requires support for fine-grained parallelism coupled with an integrated approach to scheduling compute and communication tasks. We propose Phoenix, a runtime system designed specifically for CMP architectures, to address the challenges of performance and programmability on upcoming HPC systems. We present an implementation of the Message Passing Interface (MPI) atop Phoenix. Micro-benchmarks and a production MPI application highlight the benefits of our implementation over traditional MPI implementations on CMP architectures.