TY - GEN
T1 - Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
AU - Gupta, Tanmay
AU - Shih, Kevin
AU - Singh, Saurabh
AU - Hoiem, Derek
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/12/22
Y1 - 2017/12/22
N2 - An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multitask learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
UR - http://www.scopus.com/inward/record.url?scp=85041909843&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041909843&partnerID=8YFLogxK
U2 - 10.1109/ICCV.2017.452
DO - 10.1109/ICCV.2017.452
M3 - Conference contribution
AN - SCOPUS:85041909843
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 4223
EP - 4232
BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Conference on Computer Vision, ICCV 2017
Y2 - 22 October 2017 through 29 October 2017
ER -