Collective operations are used in MPI programs to express common communication patterns, collective computations, or synchronization. In many collectives, such as MPI_Allreduce, the intra-node component lies on the critical path, because inter-node communication cannot start until the intra-node component has completed. As the number of cores per node increases, intra-node optimizations that leverage shared memory become more important. In this paper, we focus on the performance benefit of optimizing intra-node collectives using POSIX shared memory for synchronization and data sharing. We implement several collectives using basic primitives, or steps, as building blocks. Key components of our implementation include a dedicated intra-node collectives layer, careful layout of the data structures, and optimizations that exploit the memory hierarchy to balance parallelism against data-movement latency. Compared with the original MPICH implementation, MVAPICH, and Open MPI, our implementation on top of MPICH shows significant speedups.