Interpretable and globally optimal prediction for textual grounding using image concepts

Raymond A. Yeh, Jinjun Xiong, Wen-Mei W Hwu, Minh N Do, Alexander Gerhard Schwing

Research output: Contribution to journal › Conference article

Abstract

Textual grounding is an important but challenging task for human-computer interaction, robotics, and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding-box proposals obtained from deep-net-based systems. In this work, we demonstrate that the problem of textual grounding can be cast into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method considers significantly more proposals and does not rely on a successful first stage that hypothesizes bounding-box proposals. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings that capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the state-of-the-art methods on the Flickr30k Entities and ReferItGame datasets by 3.08% and 7.77%, respectively.
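The abstract's key claim is efficient search over all possible bounding boxes rather than a fixed proposal set. A common way to make exhaustive box scoring cheap is a summed-area table (integral image), which reduces each per-box score to an O(1) lookup. The sketch below illustrates that idea on a toy per-pixel score map; the function names and the brute-force enumeration are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def integral_image(score_map):
    """Summed-area table, padded with a zero top row and left column."""
    ii = np.zeros((score_map.shape[0] + 1, score_map.shape[1] + 1))
    ii[1:, 1:] = score_map.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_score(ii, top, left, bottom, right):
    """Sum of scores inside rows [top, bottom) x cols [left, right) in O(1)."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def best_box(score_map):
    """Score every axis-aligned box exhaustively; return the best one."""
    h, w = score_map.shape
    ii = integral_image(score_map)
    best, argmax = -np.inf, None
    for top in range(h):
        for bottom in range(top + 1, h + 1):
            for left in range(w):
                for right in range(left + 1, w + 1):
                    s = box_score(ii, top, left, bottom, right)
                    if s > best:
                        best, argmax = s, (top, left, bottom, right)
    return best, argmax

# Toy per-pixel evidence map: a positive patch on a negative background.
scores = np.full((6, 6), -1.0)
scores[2:4, 2:5] = 2.0
val, box = best_box(scores)
print(val, box)  # the positive 2x3 patch wins: 12.0, (2, 2, 4, 5)
```

Because each box score is constant-time, the search cost is dominated by the number of boxes, which is what makes branch-and-bound-style pruning over this space attractive in practice.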

Original language: English (US)
Pages (from-to): 1913-1923
Number of pages: 11
Journal: Advances in Neural Information Processing Systems
Volume: 2017-December
State: Published - Jan 1 2017
Event: 31st Annual Conference on Neural Information Processing Systems, NIPS 2017 - Long Beach, United States
Duration: Dec 4 2017 - Dec 9 2017


ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Interpretable and globally optimal prediction for textual grounding using image concepts. / Yeh, Raymond A.; Xiong, Jinjun; Hwu, Wen-Mei W; Do, Minh N; Schwing, Alexander Gerhard.

In: Advances in Neural Information Processing Systems, Vol. 2017-December, 01.01.2017, p. 1913-1923.

Research output: Contribution to journal › Conference article

@article{3352509636b646e7ad6a0e18a823bdc8,
title = "Interpretable and globally optimal prediction for textual grounding using image concepts",
abstract = "Textual grounding is an important but challenging task for human-computer interaction, robotics, and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding-box proposals obtained from deep-net-based systems. In this work, we demonstrate that the problem of textual grounding can be cast into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method considers significantly more proposals and does not rely on a successful first stage that hypothesizes bounding-box proposals. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings that capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the state-of-the-art methods on the Flickr30k Entities and ReferItGame datasets by 3.08{\%} and 7.77{\%}, respectively.",
author = "Yeh, {Raymond A.} and Jinjun Xiong and Hwu, {Wen-Mei W} and Do, {Minh N} and Schwing, {Alexander Gerhard}",
year = "2017",
month = jan,
day = "1",
language = "English (US)",
volume = "2017-December",
pages = "1913--1923",
journal = "Advances in Neural Information Processing Systems",
issn = "1049-5258",

}

TY - JOUR

T1 - Interpretable and globally optimal prediction for textual grounding using image concepts

AU - Yeh, Raymond A.

AU - Xiong, Jinjun

AU - Hwu, Wen-Mei W

AU - Do, Minh N

AU - Schwing, Alexander Gerhard

PY - 2017/1/1

Y1 - 2017/1/1

N2 - Textual grounding is an important but challenging task for human-computer interaction, robotics, and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding-box proposals obtained from deep-net-based systems. In this work, we demonstrate that the problem of textual grounding can be cast into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method considers significantly more proposals and does not rely on a successful first stage that hypothesizes bounding-box proposals. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings that capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the state-of-the-art methods on the Flickr30k Entities and ReferItGame datasets by 3.08% and 7.77%, respectively.

AB - Textual grounding is an important but challenging task for human-computer interaction, robotics, and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding-box proposals obtained from deep-net-based systems. In this work, we demonstrate that the problem of textual grounding can be cast into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method considers significantly more proposals and does not rely on a successful first stage that hypothesizes bounding-box proposals. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings that capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the state-of-the-art methods on the Flickr30k Entities and ReferItGame datasets by 3.08% and 7.77%, respectively.

UR - http://www.scopus.com/inward/record.url?scp=85047004330&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85047004330&partnerID=8YFLogxK

M3 - Conference article

VL - 2017-December

SP - 1913

EP - 1923

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

ER -