Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA

Han Ding, Li Erran Li, Zhiting Hu, Yi Xu, Dilek Hakkani-Tur, Zheng Du, Belinda Zeng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recent vision-language understanding approaches adopt a multi-modal transformer pre-training and fine-tuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures cross-modal alignment only through indirect signals. In this work, we propose to enhance the alignment mechanism by incorporating image scene graph structures as the bridge between the two modalities, and by learning with new contrastive objectives. In a preliminary study on the challenging compositional visual question answering task, we show that the proposed approach achieves improved results, demonstrating its potential to enhance vision-language understanding.
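
The abstract gives no implementation details, so the following is only a minimal sketch, in PyTorch, of one plausible form of the contrastive alignment objective it describes: a symmetric InfoNCE-style loss between visual region features and text embeddings of matched scene-graph nodes. The tensor shapes, the assumption that region i is paired with node i, and the temperature value are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch (not the authors' code) of a contrastive alignment objective
# between visual region features and scene-graph node text embeddings.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(region_feats, node_feats, temperature=0.07):
    """Symmetric InfoNCE loss over matched (region, scene-graph node) pairs.

    region_feats: (N, D) visual region features; row i is assumed matched to node i.
    node_feats:   (N, D) text embeddings of the corresponding scene-graph nodes.
    """
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(node_feats, dim=-1)

    # (N, N) similarity matrix; diagonal entries are the positive pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions: region -> node and node -> region.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Toy example: 8 matched region/node pairs with 256-dim features.
    regions = torch.randn(8, 256)
    nodes = torch.randn(8, 256)
    print(contrastive_alignment_loss(regions, nodes))
```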

Original language: English (US)
Title of host publication: Multimodal Artificial Intelligence, MAI Workshop 2021 - Proceedings of the 3rd Workshop
Editors: Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Candace Ross, Ruslan Salakhutdinov, Soujanya Poria, Erik Cambria, Kelly Shi
Publisher: Association for Computational Linguistics (ACL)
Pages: 74-78
Number of pages: 5
ISBN (Electronic): 9781954085251
DOIs
State: Published - 2021
Externally published: Yes
Event: 3rd NAACL Workshop on Multimodal Artificial Intelligence, MAI Workshop 2021 - Mexico City, Mexico
Duration: Jun 6 2021 → …

Publication series

Name: Multimodal Artificial Intelligence, MAI Workshop 2021 - Proceedings of the 3rd Workshop

Conference

Conference: 3rd NAACL Workshop on Multimodal Artificial Intelligence, MAI Workshop 2021
Country/Territory: Mexico
City: Mexico City
Period: 6/6/21 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
