TY - GEN
T1 - Action detection in complex scenes with spatial and temporal ambiguities
AU - Hu, Yuxiao
AU - Cao, Liangliang
AU - Lv, Fengjun
AU - Yan, Shuicheng
AU - Gong, Yihong
AU - Huang, Thomas S.
PY - 2009
Y1 - 2009
N2 - In this paper, we investigate the detection of semantic human actions in complex scenes. Unlike conventional action recognition in well-controlled environments, action detection in complex scenes suffers from cluttered backgrounds, heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities caused by imperfect human detection and tracking. Conventional algorithms are likely to fail with such spatial-temporal ambiguities. In this work, the candidate regions of an action are treated as a bag of instances. Then a novel multiple-instance learning framework, named SMILE-SVM (Simulated annealing Multiple Instance LEarning Support Vector Machines), is presented for learning human action detector based on imprecise action locations. SMILE-SVM is extensively evaluated with satisfactory performances on two tasks: 1) human action detection on a public video action database with cluttered backgrounds, and 2) a real world problem of detecting whether the customers in a shopping mall show an intention to purchase the merchandise on shelf (even if they didn't buy it eventually). In addition, the complementary nature of motion and appearance features in action detection are also validated, demonstrating a boosted performance in our experiments.
AB - In this paper, we investigate the detection of semantic human actions in complex scenes. Unlike conventional action recognition in well-controlled environments, action detection in complex scenes suffers from cluttered backgrounds, heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities caused by imperfect human detection and tracking. Conventional algorithms are likely to fail with such spatial-temporal ambiguities. In this work, the candidate regions of an action are treated as a bag of instances. Then a novel multiple-instance learning framework, named SMILE-SVM (Simulated annealing Multiple Instance LEarning Support Vector Machines), is presented for learning human action detector based on imprecise action locations. SMILE-SVM is extensively evaluated with satisfactory performances on two tasks: 1) human action detection on a public video action database with cluttered backgrounds, and 2) a real world problem of detecting whether the customers in a shopping mall show an intention to purchase the merchandise on shelf (even if they didn't buy it eventually). In addition, the complementary nature of motion and appearance features in action detection are also validated, demonstrating a boosted performance in our experiments.
UR - http://www.scopus.com/inward/record.url?scp=77953194241&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77953194241&partnerID=8YFLogxK
U2 - 10.1109/ICCV.2009.5459153
DO - 10.1109/ICCV.2009.5459153
M3 - Conference contribution
AN - SCOPUS:77953194241
SN - 9781424444205
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 128
EP - 135
BT - 2009 IEEE 12th International Conference on Computer Vision, ICCV 2009
T2 - 12th International Conference on Computer Vision, ICCV 2009
Y2 - 29 September 2009 through 2 October 2009
ER -