Considerable research has been devoted to using multimodal features to better understand multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., from the visual, audio, and caption tracks of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from the raw features. In the second step, we use super-kernel fusion to determine the optimal combination of the individual modalities. We carefully analyze the tradeoffs among three design factors that affect fusion performance: modality independence, the curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which carefully balances these three design factors, can improve class-prediction accuracy over traditional techniques.
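
To make the two-step idea concrete, the sketch below pairs independent component analysis (one possible way to obtain statistically independent modalities in step one) with a fixed convex combination of per-modality RBF kernels fed to an SVM, standing in for super-kernel fusion in step two. The library calls, the kernel weights, and the toy data are illustrative assumptions, not the paper's exact procedure.

# A minimal sketch of the two-step approach, assuming FastICA for the
# independent-modality step and a fixed weighted sum of per-modality RBF
# kernels (classified with an SVM) as a stand-in for super-kernel fusion;
# function names, weights, and the toy data are illustrative only.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def independent_modalities(raw_feature_blocks, n_components=10, seed=0):
    # Step 1: map each raw feature block (one per media source) into
    # statistically independent components.
    return [FastICA(n_components=n_components, random_state=seed).fit_transform(X)
            for X in raw_feature_blocks]

def fused_kernel(modalities, weights, gamma=None):
    # Step 2: combine per-modality RBF kernels with nonnegative weights;
    # super-kernel fusion would instead learn this combination from data.
    K = sum(w * rbf_kernel(X, gamma=gamma) for w, X in zip(weights, modalities))
    return K / float(sum(weights))

# Toy usage with two synthetic "modalities" (e.g., visual and audio features).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=120)
visual = rng.normal(size=(120, 40)) + y[:, None]
audio = rng.normal(size=(120, 20)) + 0.5 * y[:, None]

modalities = independent_modalities([visual, audio])
K = fused_kernel(modalities, weights=[0.7, 0.3])
scores = cross_val_score(SVC(kernel="precomputed"), K, y, cv=5)
print("cross-validated class-prediction accuracy: %.3f" % scores.mean())

In the setting described above, the fixed weights would be replaced by a combination learned over the individual modality kernels, which is where the balance among modality independence, dimensionality, and fusion-model complexity is decided.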