TY - GEN
T1 - Action detection using multiple Spatial-Temporal Interest Point features
AU - Cao, Liangliang
AU - Tian, Ying Li
AU - Liu, Zicheng
AU - Yao, Benjamin
AU - Zhang, Zhengyou
AU - Huang, Thomas S.
PY - 2010
Y1 - 2010
N2 - This paper considers the problem of detecting actions in cluttered videos. Compared with the classical action recognition problem, this paper aims to estimate not only the scene category of a given video sequence, but also the spatial-temporal locations of the action instances. In recent years, many feature extraction schemes have been designed to describe various aspects of actions. However, due to the difficulty of action detection, e.g., the cluttered background and potential occlusions, a single type of feature cannot solve the action detection problem perfectly in cluttered videos. In this paper, we attack the detection problem by combining multiple Spatial-Temporal Interest Point (STIP) features, which detect salient patches in the video domain and describe these patches by features of local regions. The difficulty of combining multiple STIP features for action detection is twofold: First, the number of salient patches detected varies across different STIP methods. How to combine such features is not considered by existing fusion methods [13][5]. Second, detection in videos should be efficient, which excludes many slow machine learning algorithms. To handle these two difficulties, we propose a new approach which combines a Gaussian Mixture Model with Branch-and-Bound search to efficiently locate the action of interest. We build a new challenging dataset for our action detection task, and our algorithm obtains impressive results. On the classical KTH dataset, our method outperforms the state-of-the-art methods.
AB - This paper considers the problem of detecting actions in cluttered videos. Compared with the classical action recognition problem, this paper aims to estimate not only the scene category of a given video sequence, but also the spatial-temporal locations of the action instances. In recent years, many feature extraction schemes have been designed to describe various aspects of actions. However, due to the difficulty of action detection, e.g., the cluttered background and potential occlusions, a single type of feature cannot solve the action detection problem perfectly in cluttered videos. In this paper, we attack the detection problem by combining multiple Spatial-Temporal Interest Point (STIP) features, which detect salient patches in the video domain and describe these patches by features of local regions. The difficulty of combining multiple STIP features for action detection is twofold: First, the number of salient patches detected varies across different STIP methods. How to combine such features is not considered by existing fusion methods [13][5]. Second, detection in videos should be efficient, which excludes many slow machine learning algorithms. To handle these two difficulties, we propose a new approach which combines a Gaussian Mixture Model with Branch-and-Bound search to efficiently locate the action of interest. We build a new challenging dataset for our action detection task, and our algorithm obtains impressive results. On the classical KTH dataset, our method outperforms the state-of-the-art methods.
UR - http://www.scopus.com/inward/record.url?scp=78349269463&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78349269463&partnerID=8YFLogxK
U2 - 10.1109/ICME.2010.5583562
DO - 10.1109/ICME.2010.5583562
M3 - Conference contribution
AN - SCOPUS:78349269463
SN - 9781424474912
T3 - 2010 IEEE International Conference on Multimedia and Expo, ICME 2010
SP - 340
EP - 345
BT - 2010 IEEE International Conference on Multimedia and Expo, ICME 2010
T2 - 2010 IEEE International Conference on Multimedia and Expo, ICME 2010
Y2 - 19 July 2010 through 23 July 2010
ER -