TY - GEN
T1 - Separating Skills and Concepts for Novel Visual Question Answering
AU - Whitehead, Spencer
AU - Wu, Hui
AU - Ji, Heng
AU - Feris, Rogerio
AU - Saenko, Kate
N1 - Funding Information:
We thank David Cox for the helpful discussions. From the UIUC side: This work was in part supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
We propose a new setting for generalization in VQA: measuring the ability to compose the skills needed to answer a question and the visual concepts that should be grounded to the image. We show that existing approaches have difficulty generalizing to unseen compositions of these two factors. We present a novel approach that implicitly disentangles skills and concepts, while grounding concepts visually, using a contrastive learning procedure. Our approach is able to learn from unlabeled VQA data in order to answer questions about previously unseen concepts. Results on VQA v2 show that the proposed framework can achieve state-of-the-art performance on novel skill-concept compositions as well as generalize from unlabeled data.
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Generalization to out-of-distribution data has been a problem for Visual Question Answering (VQA) models. To measure generalization to novel questions, we propose to separate them into “skills” and “concepts”. “Skills” are visual tasks, such as counting or attribute recognition, and are applied to “concepts” mentioned in the question, such as objects and people. VQA methods should be able to compose skills and concepts in novel ways, regardless of whether the specific composition has been seen in training, yet we demonstrate that existing models have much to improve upon towards handling new compositions. We present a novel method for learning to compose skills and concepts that separates these two factors implicitly within a model by learning grounded concept representations and disentangling the encoding of skills from that of concepts. We enforce these properties with a novel contrastive learning procedure that does not rely on external annotations and can be learned from unlabeled image-question pairs. Experiments demonstrate the effectiveness of our approach for improving compositional and grounding performance.
UR - https://www.scopus.com/pages/publications/85117775738
UR - https://www.scopus.com/pages/publications/85117775738#tab=citedBy
U2 - 10.1109/CVPR46437.2021.00558
DO - 10.1109/CVPR46437.2021.00558
M3 - Conference contribution
AN - SCOPUS:85117775738
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 5628
EP - 5637
BT - Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
PB - IEEE Computer Society
T2 - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
Y2 - 19 June 2021 through 25 June 2021
ER -