Where to look: Focus regions for visual question answering

Kevin J. Shih, Saurabh Singh, Derek Hoiem

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method maps textual queries and visual features from various regions into a shared space where they are compared for relevance with an inner product. Our method exhibits significant improvements in answering questions such as 'what color,' where it is necessary to evaluate a specific location, and 'what room,' where it selectively identifies informative image regions. Our model is tested on the recently released VQA [1] dataset, which features free-form human-annotated questions and answers.
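As a rough illustration of the relevance scoring described in the abstract (not the authors' implementation), the sketch below projects a question embedding and per-region image features into a shared space, scores each region with an inner product, and normalizes the scores into attention weights. All dimensions, variable names, and the (random) projection matrices are illustrative assumptions.

```python
# Minimal sketch of inner-product relevance scoring between a text query and
# image regions, assuming projections into a shared embedding space.
# Dimensions and projections are placeholders, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)

n_regions, d_img, d_txt, d_shared = 100, 4096, 300, 512

region_feats = rng.standard_normal((n_regions, d_img))  # e.g. CNN features per region
question_emb = rng.standard_normal(d_txt)                # e.g. pooled word embeddings

# Learned projections into the shared space (random here for illustration).
W_img = rng.standard_normal((d_img, d_shared)) * 0.01
W_txt = rng.standard_normal((d_txt, d_shared)) * 0.01

img_shared = region_feats @ W_img   # (n_regions, d_shared)
txt_shared = question_emb @ W_txt   # (d_shared,)

scores = img_shared @ txt_shared    # inner-product relevance per region
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax attention over regions

# Attention-weighted region representation used downstream to answer the question.
attended = weights @ img_shared     # (d_shared,)
print(attended.shape, int(weights.argmax()))
```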

Original language: English (US)
Title of host publication: Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Publisher: IEEE Computer Society
Pages: 4613-4621
Number of pages: 9
ISBN (Electronic): 9781467388504
DOIs
State: Published - Dec 9 2016
Event: 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 - Las Vegas, United States
Duration: Jun 26, 2016 to Jul 1, 2016

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2016-December
ISSN (Print): 1063-6919

Conference

Conference: 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Country/Territory: United States
City: Las Vegas
Period: 6/26/16 to 7/1/16

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

