TY - GEN
T1 - Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video
T2 - 16th European Conference on Computer Vision, ECCV 2020
AU - Liu, Miao
AU - Tang, Siyu
AU - Li, Yin
AU - Rehg, James M.
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - We address the challenging task of anticipating human-object interaction in first-person videos. Most existing methods either ignore how the camera wearer interacts with objects or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation and propose a novel deep network that jointly models and predicts egocentric hand motion, interaction hotspots, and future actions. Specifically, we consider future hand motion as motor attention and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/.
AB - We address the challenging task of anticipating human-object interaction in first-person videos. Most existing methods either ignore how the camera wearer interacts with objects or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation and propose a novel deep network that jointly models and predicts egocentric hand motion, interaction hotspots, and future actions. Specifically, we consider future hand motion as motor attention and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/.
KW - Action anticipation
KW - First Person Vision
KW - Motor attention
UR - http://www.scopus.com/inward/record.url?scp=85097216015&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097216015&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58452-8_41
DO - 10.1007/978-3-030-58452-8_41
M3 - Conference contribution
AN - SCOPUS:85097216015
SN - 9783030584511
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 704
EP - 721
BT - Computer Vision – ECCV 2020 – 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I
A2 - Vedaldi, Andrea
A2 - Bischof, Horst
A2 - Brox, Thomas
A2 - Frahm, Jan-Michael
PB - Springer
Y2 - 23 August 2020 through 28 August 2020
ER -