We propose to decompose the fine-grained human activity analysis problem into two sequential tasks of increasing granularity. First, we infer the coarse interaction status, i.e., which object is being manipulated and where it is. Since the major challenge here is frequent mutual occlusion during manipulation, we propose an 'interaction tracking' framework in which hand/object positions and interaction status are jointly tracked by explicitly modeling the contextual relationship between mutual occlusion and interaction status. Second, the inferred hand/object positions and interaction status are utilized to provide 1) more compact feature pooling, by effectively pruning the large number of motion features extracted from irrelevant spatio-temporal positions, and 2) discriminative action detection via a granularity fusion strategy. Comprehensive experiments on two challenging fine-grained activity datasets (i.e., cooking actions) show that the proposed framework achieves high accuracy and robustness in tracking multiple mutually occluded hands/objects during manipulation, as well as significant performance improvement on fine-grained action detection over state-of-the-art methods.
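To make the feature-pooling step concrete, the following is a minimal sketch of how inferred hand/object positions could prune motion features from irrelevant spatio-temporal locations. It is not the paper's implementation: the function name `prune_motion_features`, the `(x, y, t)` position layout, and the spatial `radius` threshold are all illustrative assumptions.

```python
import numpy as np

def prune_motion_features(features, positions, interaction_track, radius=40.0):
    """Keep only motion features near the tracked interaction region.

    features:          (N, D) array of motion descriptors (e.g., trajectories).
    positions:         (N, 3) array of (x, y, t) feature positions.
    interaction_track: dict mapping frame index t -> (cx, cy), the tracked
                       centre of the manipulated hand/object pair; frames with
                       no inferred interaction are absent from the dict.
    radius:            assumed spatial tolerance (pixels) around the centre.
    """
    keep = np.zeros(len(features), dtype=bool)
    for i, (x, y, t) in enumerate(positions):
        centre = interaction_track.get(int(t))
        if centre is None:
            continue  # no interaction inferred at this frame: feature is pruned
        cx, cy = centre
        keep[i] = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    return features[keep], positions[keep]
```

Under this sketch, pooling is then performed only over the retained features, which is one plausible way to realize the "more compact feature pooling" the abstract describes.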