Out of the box: Reasoning with graph convolution nets for factual visual question answering

Research output: Contribution to journal › Conference article

Abstract

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to 'reason' about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7% compared to the state of the art.
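The idea of scoring all entities jointly over a graph can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a single graph convolution layer with the commonly used symmetric normalization, ReLU(D^{-1/2} (A + I) D^{-1/2} H W), and the entity names, adjacency matrix, features, and weights below are hypothetical toy values chosen only for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each entity keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(norm_adj, features, weights):
    """One propagation step: ReLU(norm_adj @ features @ weights)."""
    hidden = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Hypothetical toy graph: three entities in a chain a - b - c.
entities = ["entity_a", "entity_b", "entity_c"]
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# Hypothetical one-dimensional entity features and 1x1 weight matrix.
features = [[1.0], [1.0], [0.0]]
weights = [[1.0]]

norm_adj = normalize_adjacency(adj)
scores = [row[0] for row in gcn_layer(norm_adj, features, weights)]

# Every entity receives a score that depends on its neighbours;
# the highest-scoring entity is taken as the predicted answer.
best_entity = entities[max(range(len(entities)), key=lambda i: scores[i])]
print(best_entity, scores)
```

Each entity's score here depends on its neighbours' features as well as its own, in contrast to a procedure that filters facts one at a time. A real system would stack several learned layers and build entity features from the image, question, and fact embeddings.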

Original language: English (US)
Pages (from-to): 2654-2665
Number of pages: 12
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - Jan 1 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 to Dec 8 2018

Fingerprint

Convolution

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Out of the box: Reasoning with graph convolution nets for factual visual question answering. / Narasimhan, Medhini; Lazebnik, Svetlana; Schwing, Alexander Gerhard.

In: Advances in Neural Information Processing Systems, Vol. 2018-December, 01.01.2018, p. 2654-2665.

@article{df69087976bd4be7b4a2710a4680fb3d,
title = "Out of the box: Reasoning with graph convolution nets for factual visual question answering",
abstract = "Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to 'reason' about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7{\%} compared to the state of the art.",
author = "Narasimhan, Medhini and Lazebnik, Svetlana and Schwing, {Alexander Gerhard}",
year = "2018",
month = "1",
day = "1",
language = "English (US)",
volume = "2018-December",
pages = "2654--2665",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",

}

TY - JOUR

T1 - Out of the box

T2 - Reasoning with graph convolution nets for factual visual question answering

AU - Narasimhan, Medhini

AU - Lazebnik, Svetlana

AU - Schwing, Alexander Gerhard

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to 'reason' about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7% compared to the state of the art.

UR - http://www.scopus.com/inward/record.url?scp=85064822805&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064822805&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85064822805

VL - 2018-December

SP - 2654

EP - 2665

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

ER -