Detecting fine-grained human actions from video sequences is challenging. In this work, we propose to decompose this difficult analysis problem into two sequential tasks of increasing granularity. First, we infer the coarse interaction status, i.e., which object is being manipulated and where the interaction occurs. To address frequent mutual occlusions during manipulation, we propose an interaction tracking framework in which hand (object) positions and interaction status are jointly tracked by explicitly modeling the occlusion context. Second, for a given query sequence, the inferred interaction status is used to efficiently identify a small set of candidate matching sequences from the annotated training set. Frame-level action labels are then transferred to the query sequence by matching it against these candidates. Comprehensive experiments on two challenging fine-grained activity datasets show that: (1) the proposed interaction tracking approach achieves high tracking accuracy for multiple mutually occluded objects (hands) during manipulation actions; and (2) the proposed multiple-granularity analysis framework achieves superior action detection performance over state-of-the-art methods.
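The two-stage idea above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the data layout, the `transfer_labels` function, and the use of a plain Euclidean nearest-neighbor match are all assumptions for exposition.

```python
# Hypothetical sketch of the two-stage, multiple-granularity pipeline:
# (1) use the coarse interaction status to shortlist candidate training
#     sequences, (2) transfer frame-level action labels nonparametrically
# from the nearest candidate frame in feature space.

def euclidean(a, b):
    # Simple Euclidean distance between two frame feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def transfer_labels(query_frames, query_status, training_set):
    # Stage 1: keep only annotated sequences whose interaction status
    # (e.g., which object is manipulated) matches the query.
    candidates = [seq for seq in training_set if seq["status"] == query_status]
    # Stage 2: for each query frame, copy the label of the closest
    # candidate frame (nonparametric label transfer).
    labels = []
    for qf in query_frames:
        best = min(
            (frame for seq in candidates for frame in seq["frames"]),
            key=lambda fr: euclidean(qf, fr["feature"]),
        )
        labels.append(best["label"])
    return labels
```

Restricting the match to status-consistent candidates is what makes the frame-level transfer efficient: the expensive per-frame comparison runs only against a small subset of the training set.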
- Fine-grained action detection
- Multiple granularity
- Multiple object tracking
- Nonparametric label transfer