High-order attention models for visual question answering

Idan Schwartz, Alexander Gerhard Schwing, Tamir Hazan

Research output: Contribution to journal › Conference article

Abstract

The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.
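To make the idea concrete, the following is a minimal sketch, not the authors' released code, of what a second-order (pairwise) attention between a question and an image could look like: pairwise correlations between word and region features are pooled and normalized into an attention over image regions. The bilinear correlation matrix, the mean-pooling over words, and the softmax normalization are illustrative assumptions; the paper's exact potentials, and its higher-order (e.g. ternary) terms, may differ.

# Hedged sketch (illustrative only): second-order attention between a question
# and an image, in the spirit of learning correlations across modalities.
# All names, shapes, and the normalization choices are assumptions.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def second_order_attention(Q, V, W):
    """Attend over image regions using question-image correlations.

    Q : (n_words, d_q)   question word embeddings
    V : (n_regions, d_v) image region features
    W : (d_q, d_v)       learned bilinear correlation matrix (assumed)
    Returns the attention weights over regions and the attended image feature.
    """
    # Pairwise (second-order) potentials between every word and every region.
    corr = Q @ W @ V.T                      # (n_words, n_regions)
    # Marginalize over words, then normalize over regions.
    region_scores = corr.mean(axis=0)       # (n_regions,)
    alpha = softmax(region_scores)          # attention over image regions
    attended = alpha @ V                    # (d_v,) attended image summary
    return alpha, attended

# Toy usage with random features.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))    # 5 question words
V = rng.normal(size=(10, 32))   # 10 image regions
W = rng.normal(size=(16, 32))
alpha, attended = second_order_attention(Q, V, W)
print(alpha.shape, attended.shape)  # (10,) (32,)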

Original language: English (US)
Pages (from-to): 3665-3675
Number of pages: 11
Journal: Advances in Neural Information Processing Systems
Volume: 2017-December
State: Published - Jan 1 2017
Event: 31st Annual Conference on Neural Information Processing Systems, NIPS 2017 - Long Beach, United States
Duration: Dec 4 2017 - Dec 9 2017

Fingerprint

Learning systems

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

High-order attention models for visual question answering. / Schwartz, Idan; Schwing, Alexander Gerhard; Hazan, Tamir.

In: Advances in Neural Information Processing Systems, Vol. 2017-December, 01.01.2017, p. 3665-3675.

Research output: Contribution to journal › Conference article

@article{066d1cb46f6049de80989f1033ff283e,
title = "High-order attention models for visual question answering",
abstract = "The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.",
author = "Idan Schwartz and Schwing, {Alexander Gerhard} and Tamir Hazan",
year = "2017",
month = "1",
day = "1",
language = "English (US)",
volume = "2017-December",
pages = "3665--3675",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",

}

TY - JOUR
T1 - High-order attention models for visual question answering
AU - Schwartz, Idan
AU - Schwing, Alexander Gerhard
AU - Hazan, Tamir
PY - 2017/1/1
Y1 - 2017/1/1
N2 - The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.
AB - The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.
UR - http://www.scopus.com/inward/record.url?scp=85047015057&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047015057&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85047015057
VL - 2017-December
SP - 3665
EP - 3675
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
SN - 1049-5258
ER -