TY - GEN
T1 - Visual attention model for name tagging in multimodal social media
AU - Lu, Di
AU - Neves, Leonardo
AU - Carvalho, Vitor
AU - Zhang, Ning
AU - Ji, Heng
N1 - Funding Information:
This work was partially supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. ARL NS-CTA No. W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Publisher Copyright:
© 2018 Association for Computational Linguistics
PY - 2018
Y1 - 2018
N2 - Every day, billions of multimodal posts containing both images and text are shared on social media sites such as Snapchat, Twitter or Instagram. This combination of image and text in a single message allows for more creative and expressive forms of communication, and has become increasingly common on such sites. This new paradigm brings new challenges for natural language understanding, as the textual component tends to be shorter and more informal, and often can only be understood in combination with the visual context. In this paper, we explore the task of name tagging in multimodal social media posts. We start by creating two new multimodal datasets: one based on Twitter posts and the other based on Snapchat captions (exclusively submitted to public and crowd-sourced stories). We then propose a novel model based on Visual Attention that not only provides deeper visual insight into the model's decisions, but also significantly outperforms other state-of-the-art baseline methods for this task.
AB - Every day, billions of multimodal posts containing both images and text are shared on social media sites such as Snapchat, Twitter or Instagram. This combination of image and text in a single message allows for more creative and expressive forms of communication, and has become increasingly common on such sites. This new paradigm brings new challenges for natural language understanding, as the textual component tends to be shorter and more informal, and often can only be understood in combination with the visual context. In this paper, we explore the task of name tagging in multimodal social media posts. We start by creating two new multimodal datasets: one based on Twitter posts and the other based on Snapchat captions (exclusively submitted to public and crowd-sourced stories). We then propose a novel model based on Visual Attention that not only provides deeper visual insight into the model's decisions, but also significantly outperforms other state-of-the-art baseline methods for this task.
UR - http://www.scopus.com/inward/record.url?scp=85055091394&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055091394&partnerID=8YFLogxK
U2 - 10.18653/v1/p18-1185
DO - 10.18653/v1/p18-1185
M3 - Conference contribution
AN - SCOPUS:85055091394
T3 - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
SP - 1990
EP - 1999
BT - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Y2 - 15 July 2018 through 20 July 2018
ER -