TY - GEN
T1 - Joint Multimedia Event Extraction from Video and Article
AU - Chen, Brian
AU - Lin, Xudong
AU - Thomas, Christopher
AU - Li, Manling
AU - Yoshida, Shoya
AU - Chum, Lovish
AU - Ji, Heng
AU - Chang, Shih-Fu
N1 - Funding Information:
This research is based upon work supported by U.S. DARPA KAIROS Program No. FA8750-19-2-1004, and U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
AB - Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from video and text articles. We introduce the new task of Video MultiMedia Event Extraction (VM2E2) and propose two novel components to build the first system for this task. First, we propose the first self-supervised multimodal event coreference model that can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first multimodal transformer that extracts structured event information jointly from videos and text documents. We also construct, and will publicly release, a new benchmark consisting of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of our proposed method on this new benchmark, with absolute F-score gains of 6.0% on multimodal event coreference resolution and 5.8% on multimedia event extraction.
UR - http://www.scopus.com/inward/record.url?scp=85128064192&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128064192&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85128064192
T3 - Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
SP - 74
EP - 88
BT - Findings of the Association for Computational Linguistics: EMNLP 2021
A2 - Moens, Marie-Francine
A2 - Huang, Xuanjing
A2 - Specia, Lucia
A2 - Yih, Scott Wen-Tau
PB - Association for Computational Linguistics (ACL)
T2 - 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Y2 - 7 November 2021 through 11 November 2021
ER -