Teraflop performance is no longer a thing of the future: complex, integrated, multiscale 3D simulations are driving supercomputer development. This tutorial addresses computation at the highest end. We give an overview of architectures (BlueGene/L, Columbia, NEC SX-8, the Cray and IBM lines, and high-performance clusters) along with the programming tools needed for application development. Parallel programming models (MPI, OpenMP, HPF, UPC, CAF) are reviewed and compared. What are the major issues facing application code developers today? How do the challenges differ between cluster computing and complex hybrid architectures built from superscalar and vector processors? What barriers must we overcome to achieve true sustained petascale performance? We address these questions and offer tips, tricks, and tools of the trade for large multiscale application development. We also cover advanced MPI topics, including dynamic process management and optimization. Finally, we draw on a series of terascale and multiscale applications to discuss specific challenges and performance issues.