In this work, we decompose a first-person action into a verb and a noun. We then study how the coupling between an action's constituent verb and noun affects a learner's ability to learn the two separately and to combine them for recognition. We compare different information fusion methods on conventional action recognition and on zero-shot learning; performance on the latter is a strong indicator of a feature's ability to capture one concept (verb or noun) without being confounded by the other. To decouple the verb and noun concepts, we extract features specialized for each, specifically improved dense trajectories and convolutional neural network activations. We show that by constructing specialized features for the decomposed concepts, our method succeeds in zero-shot learning. More surprisingly, it also outperforms previous results in conventional action recognition when the performance gap between different features on the verb and noun concepts is significant.
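
To make the verb/noun decomposition concrete, the following is a minimal sketch of how separate verb and noun predictions might be fused into an action label. It assumes a simple product-of-probabilities late fusion, which is only one of many possible fusion rules and is not claimed to be the one used in this work. The function names (`fuse_scores`, `predict_action`) and all probability values are hypothetical and chosen purely for illustration.

```python
from itertools import product


def fuse_scores(verb_probs, noun_probs):
    """Score every (verb, noun) pair as p(verb) * p(noun).

    Assumption: the verb and noun classifiers are treated as independent.
    """
    return {
        (verb, noun): p_verb * p_noun
        for (verb, p_verb), (noun, p_noun) in product(
            verb_probs.items(), noun_probs.items()
        )
    }


def predict_action(verb_probs, noun_probs, candidate_actions=None):
    """Return the highest-scoring (verb, noun) pair.

    If candidate_actions is given, only those pairs are considered.
    """
    scores = fuse_scores(verb_probs, noun_probs)
    if candidate_actions is not None:
        scores = {a: s for a, s in scores.items() if a in candidate_actions}
    return max(scores, key=scores.get)


# Hypothetical classifier outputs for a single video clip.
verb_probs = {"open": 0.7, "pour": 0.2, "stir": 0.1}
noun_probs = {"jar": 0.6, "milk": 0.3, "cup": 0.1}

# Hypothetical set of verb-noun combinations seen during training.
seen_actions = {("open", "milk"), ("pour", "milk"), ("stir", "cup")}

# Zero-shot setting: any verb-noun pair may be predicted, seen or not.
zero_shot_prediction = predict_action(verb_probs, noun_probs)

# Conventional setting: predictions are restricted to seen combinations.
conventional_prediction = predict_action(verb_probs, noun_probs, seen_actions)
```

In this toy setup the unrestricted prediction is a verb-noun pair that never occurs in the training combinations, which is possible only because each concept is scored on its own. If the verb scores were confounded by the noun (or vice versa), such unseen pairs would tend to be scored poorly.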