TY - GEN
T1 - Coreference by Appearance
T2 - 4th Workshop on Computational Models of Reference, Anaphora and Coreference, CRAC 2021
AU - Wang, Liming
AU - Feng, Shengyu
AU - Lin, Xudong
AU - Li, Manling
AU - Ji, Heng
AU - Chang, Shih-Fu
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
AB - Event coreference resolution is critical to understanding events in the growing volume of online news spanning multiple modalities, including text, video, and speech. However, the events and entities depicted in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging when little supervision is available. To address these issues, we propose a supervised model based on an attention mechanism and an unsupervised model based on statistical machine translation, both capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems on the event coreference resolution task. A careful analysis reveals that the performance gain of the multimodal model, especially under the unsupervised setting, comes from better learning of visually salient events.
UR - http://www.scopus.com/inward/record.url?scp=85138493132&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85138493132&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85138493132
T3 - 4th Workshop on Computational Models of Reference, Anaphora and Coreference, CRAC 2021 - Proceedings of the Workshop
SP - 132
EP - 140
BT - 4th Workshop on Computational Models of Reference, Anaphora and Coreference, CRAC 2021 - Proceedings of the Workshop
A2 - Ogrodniczuk, Maciej
A2 - Pradhan, Sameer
A2 - Poesio, Massimo
A2 - Grishina, Yulia
A2 - Ng, Vincent
PB - Association for Computational Linguistics (ACL)
Y2 - 10 November 2021 through 11 November 2021
ER -