TY - GEN
T1 - MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022
AU - Reddy, Revanth Gangi
AU - Rui, Xilin
AU - Li, Manling
AU - Lin, Xudong
AU - Wen, Haoyang
AU - Cho, Jaemin
AU - Huang, Lifu
AU - Bansal, Mohit
AU - Sil, Avirup
AU - Chang, Shih-Fu
AU - Schwing, Alexander
AU - Ji, Heng
N1 - We would like to thank Sean Kosman, Rebecca Lee, Kathryn Conger and Martha Palmer for their help on data annotations, and thank Prof. Ernest Davis (NYU) for insightful advice and feedback on our data set and paper. This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2022/6/30
Y1 - 2022/6/30
AB - Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to picking the answer from a predefined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance while considerably lagging behind human performance, leaving large room for future work on this challenging new task.
UR - http://www.scopus.com/inward/record.url?scp=85147542925&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147542925&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85147542925
T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
SP - 11200
EP - 11208
BT - AAAI-22 Technical Tracks 10
PB - Association for the Advancement of Artificial Intelligence
Y2 - 22 February 2022 through 1 March 2022
ER -