Global climate modeling is one of the grand challenges of computational science, and ocean modeling plays an important role in both understanding the current climatic conditions and predicting the future climate change. Three-dimensional time-dependent ocean general circulation models (OGCMs) require a large amount of memory and processing time to run realistic simulations. Recent advances in computing hardware have dramatically affected the prospect of studying the global climate. The significant computational resources of massively parallel supercomputers promise to make such studies feasible. In addition to using advanced hardware, designing and implementing a well-optimized parallel ocean code will significantly improve the computational performance and reduce the total research time to complete these studies. In our present work, we chose the most widely used OGCM code as our base code. This OGCM is based on the Parallel Ocean Program (POP) developed in FORTRAN 90 on the Los Alamos CM-2 Connection Machine by the Los Alamos ocean modeling research group. During the first half of 1994, the code was ported to the Cray T3D by Cray Research using SHMEM-based message passing. Since the code on the T3D was still time-consuming when large problems were encountered, improving the code performance was considered essential. We have developed several general strategies to optimize the ocean general circulation model on the Cray T3D. These strategies include memory optimization, effective use of arithmetic pipelines, and usage of optimized libraries. The optimized code runs 2 to 2.5 times faster than the original code, which gives significant performance improvements for modeling large scaled ocean flows. Many test runs for both of the original and the optimized code have been carried out on the Cray T3D using various numbers of processors (1-256). Comparisons are made for a variety of real-world problems. A nearly linear scaling performance line is obtained for the optimized code, while the speed up data of the optimized code also shows excellent improvement over the original code. In addition to discussing the optimization of the code, we also address the issue of portability. Given the short life cycle of the massively parallel computer, usually on the order of three to five years, we emphasize the portability of the ocean model and the associated optimization routines across several computing platforms. Currently, the ocean modeling code has been ported successfully to the Hewlett Packard (HP)/Convex SPP-2000, and is readily portable to Cray T3E. This paper reports our efforts to optimize the parallel implementations of the oceanic model. So far, the work has focused on improving the load balancing and single node performance of the code on the Cray T3D. As a result, the atmosphere and ocean model components running side-by-side can achieve a performance level of slightly more than 10 GFLOPS on 512 processors of that machine. We have also developed a user-friendly coupling interface with atmospheric and biogeochemical models, in order to make the global climate modeling more complete and more realistic.