TY - GEN
T1 - Towards General Purpose Vision Systems
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
AU - Gupta, Tanmay
AU - Kamath, Amita
AU - Kembhavi, Aniruddha
AU - Hoiem, Derek
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires nontrivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process. In this paper, we propose GPV-1, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more. We also propose evaluations of generality of architecture, skill-concept transfer, and learning efficiency that may inform future work on general purpose vision. For this work, we define concepts, skills and tasks as follows: concepts are nouns (e.g. car, person, dog), skills are operations that we wish to perform on the given inputs (e.g. classification, object detection, image captioning), and tasks are predefined combinations of a set of skills performed on a set of concepts (e.g. the ImageNet classification task involves the skill of image classification across 1000 concepts). Our experiments indicate GPV-1 is effective at multiple tasks, reuses some concept knowledge across tasks, can perform the Referring Expressions task zero-shot, and further improves upon the zero-shot performance using a few training samples.
AB - Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires nontrivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process. In this paper, we propose GPV-1, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more. We also propose evaluations of generality of architecture, skill-concept transfer, and learning efficiency that may inform future work on general purpose vision. For this work, we define concepts, skills and tasks as follows: concepts are nouns (e.g. car, person, dog), skills are operations that we wish to perform on the given inputs (e.g. classification, object detection, image captioning), and tasks are predefined combinations of a set of skills performed on a set of concepts (e.g. the ImageNet classification task involves the skill of image classification across 1000 concepts). Our experiments indicate GPV-1 is effective at multiple tasks, reuses some concept knowledge across tasks, can perform the Referring Expressions task zero-shot, and further improves upon the zero-shot performance using a few training samples.
KW - Vision + language
UR - http://www.scopus.com/inward/record.url?scp=85143490318&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143490318&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.01591
DO - 10.1109/CVPR52688.2022.01591
M3 - Conference contribution
AN - SCOPUS:85143490318
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 16378
EP - 16388
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
Y2 - 19 June 2022 through 24 June 2022
ER -