Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

Miao Liu, Siyu Tang, Yin Li, James M. Rehg

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. Specifically, we consider the future hand motion as the motor attention, and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select the discriminative spatial-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2020 - 16th European Conference, 2020, Proceedings
Editors: Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Number of pages: 18
ISBN (Print): 9783030584511
State: Published - 2020
Externally published: Yes
Event: 16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration: Aug 23 2020 → Aug 28 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12346 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 16th European Conference on Computer Vision, ECCV 2020
Country/Territory: United Kingdom


Keywords

  • Action anticipation
  • First Person Vision
  • Motor attention

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

