TY - GEN
T1 - Video Event Extraction via Tracking Visual States of Arguments
AU - Yang, Guang
AU - Li, Manling
AU - Zhang, Jiajie
AU - Lin, Xudong
AU - Ji, Heng
AU - Chang, Shih-Fu
N1 - Publisher Copyright:
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2023/6/27
Y1 - 2023/6/27
N2 - Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes, respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition. Our code is publicly available at https://github.com/Shinetism/VStates for research purposes.
AB - Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes, respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition. Our code is publicly available at https://github.com/Shinetism/VStates for research purposes.
UR - http://www.scopus.com/inward/record.url?scp=85167972299&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85167972299&partnerID=8YFLogxK
U2 - 10.1609/aaai.v37i3.25418
DO - 10.1609/aaai.v37i3.25418
M3 - Conference contribution
AN - SCOPUS:85167972299
T3 - Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
SP - 3136
EP - 3144
BT - AAAI-23 Technical Tracks 3
A2 - Williams, Brian
A2 - Chen, Yiling
A2 - Neville, Jennifer
PB - Association for the Advancement of Artificial Intelligence (AAAI) Press
T2 - 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Y2 - 7 February 2023 through 14 February 2023
ER -