TY - GEN
T1 - Robust emotion recognition from low quality and low bit rate video
T2 - 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
AU - Cheng, Bowen
AU - Wang, Zhangyang
AU - Zhang, Zhaobin
AU - Li, Zhu
AU - Liu, Ding
AU - Yang, Jianchao
AU - Huang, Shuai
AU - Huang, Thomas S.
N1 - Acknowledgments Bowen Cheng, Ding Liu and Thomas Huang’s research works are supported in part by US Army Research Office grant W911NF-15-1-0317. The authors sincerely acknowledge the valuable efforts of the AVEC challenge organizers [19]. The authors would also like to acknowledge the helpful discussions with Dr. Pooya Khorrami and Dr. Thomas Paine.
PY - 2017/7/2
Y1 - 2017/7/2
N2 - Emotion recognition from facial expressions is tremendously useful, especially when coupled with smart devices and wireless multimedia applications. However, inadequate network bandwidth often limits the spatial resolution of the transmitted video, which heavily degrades recognition reliability. We develop a novel framework to achieve robust emotion recognition from low bit rate video. While video frames are downsampled at the encoder side, the decoder is embedded with a deep network model for joint super-resolution (SR) and recognition. Notably, we propose a novel max-mix training strategy, leading to a single 'One-for-All' model that is remarkably robust to a vast range of downsampling factors. That makes our framework well adapted to the varied bandwidths in real transmission scenarios, without hampering scalability or efficiency. The proposed framework is evaluated on the AVEC 2016 benchmark and demonstrates significantly improved stand-alone recognition performance, as well as rate-distortion (R-D) performance, compared with either directly recognizing from low-resolution (LR) frames or performing SR and recognition separately.
UR - https://www.scopus.com/pages/publications/85047357764
UR - https://www.scopus.com/pages/publications/85047357764#tab=citedBy
U2 - 10.1109/ACII.2017.8273580
DO - 10.1109/ACII.2017.8273580
M3 - Conference contribution
AN - SCOPUS:85047357764
T3 - 2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
SP - 65
EP - 70
BT - 2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 October 2017 through 26 October 2017
ER -