TY - GEN
T1 - Robust multi-modal cues for dyadic human interaction recognition
AU - Trabelsi, Rim
AU - Varadarajan, Jagannadan
AU - Pei, Yong
AU - Zhang, Le
AU - Jabri, Issam
AU - Bouallegue, Ammar
AU - Moulin, Pierre
N1 - Funding Information:
This work was funded by the research grant from Singapore Agency for Science, Technology and Research (A*STAR) through the ARAP program.
Publisher Copyright:
© 2017 Copyright held by the owner/author(s).
PY - 2017/10/27
Y1 - 2017/10/27
N2 - Activity analysis methods usually focus on elementary human actions and neglect more complex scenarios. In this paper, we focus on classifying interactions between two persons in a supervised fashion. We propose a robust multi-modal proxemic descriptor based on 3D joint locations and depth and color videos. The proposed descriptor incorporates inter-person and intra-person joint distances computed from 3D skeleton data, together with multi-frame dense optical flow features obtained by applying temporal convolutional neural networks (CNNs) to depth and color images. The descriptors from the three modalities are derived from sparse key-frames surrounding high-activity content and fused using a linear SVM classifier. Through experiments on two publicly available RGB-D interaction datasets, we show that our method can efficiently classify complex interactions using only short video snippets, outperforming existing state-of-the-art results.
AB - Activity analysis methods usually focus on elementary human actions and neglect more complex scenarios. In this paper, we focus on classifying interactions between two persons in a supervised fashion. We propose a robust multi-modal proxemic descriptor based on 3D joint locations and depth and color videos. The proposed descriptor incorporates inter-person and intra-person joint distances computed from 3D skeleton data, together with multi-frame dense optical flow features obtained by applying temporal convolutional neural networks (CNNs) to depth and color images. The descriptors from the three modalities are derived from sparse key-frames surrounding high-activity content and fused using a linear SVM classifier. Through experiments on two publicly available RGB-D interaction datasets, we show that our method can efficiently classify complex interactions using only short video snippets, outperforming existing state-of-the-art results.
KW - CNN features
KW - Interaction recognition
KW - Multi-modal features
KW - RGB-D
KW - Skeleton data
UR - http://www.scopus.com/inward/record.url?scp=85035747317&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85035747317&partnerID=8YFLogxK
U2 - 10.1145/3132515.3132517
DO - 10.1145/3132515.3132517
M3 - Conference contribution
AN - SCOPUS:85035747317
T3 - MUSA2 2017 - Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, co-located with MM 2017
SP - 47
EP - 53
BT - MUSA2 2017 - Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, co-located with MM 2017
PB - Association for Computing Machinery, Inc
T2 - 1st ACM MM Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, MUSA2 2017
Y2 - 27 October 2017
ER -