TY - JOUR
T1 - Detecting attended visual targets in video
AU - Chong, Eunji
AU - Wang, Yongxin
AU - Ruiz, Nataniel
AU - Rehg, James M.
N1 - Funding Information:
We thank Caroline Dalluge and Pooja Parikh for the gaze target annotations in the VideoAttentionTarget dataset, and Stephan Lee for building the annotation tool and performing annotations. The toddler dataset used in Sec. 5.3 was collected and annotated under the direction of Agata Rozga, Rebecca Jones, Audrey Southerland, and Elysha Clark-Whitney. This study was funded in part by the Simons Foundation under grant 383667 and NIH R01 MH114999.
Publisher Copyright:
© 2020 IEEE
PY - 2020
Y1 - 2020
N2 - We address the problem of detecting attention targets in video. Our goal is to identify where each person in each frame of a video is looking, and correctly handle the case where the gaze target is out-of-frame. Our novel architecture models the dynamic interaction between the scene and head features and infers time-varying attention targets. We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior. Our experiments show that our model can effectively infer dynamic attention in videos. In addition, we apply our predicted attention maps to two social gaze behavior recognition tasks, and show that the resulting classifiers significantly outperform existing methods. We achieve state-of-the-art performance on three datasets: GazeFollow (static images), VideoAttentionTarget (videos), and VideoCoAtt (videos), and obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
UR - http://www.scopus.com/inward/record.url?scp=85094643320&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85094643320&partnerID=8YFLogxK
U2 - 10.1109/CVPR42600.2020.00544
DO - 10.1109/CVPR42600.2020.00544
M3 - Conference article
AN - SCOPUS:85094643320
SN - 1063-6919
SP - 5395
EP - 5405
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
M1 - 9157393
T2 - 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020
Y2 - 14 June 2020 through 19 June 2020
ER -