Resolving Referring Expressions in Images with Labeled Elements

Nevan Wichers, Dilek Hakkani-Tur, Jindong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Images may have elements containing text and an associated bounding box, for example, text identified via optical character recognition on a computer screen image, or a natural image with labeled objects. We present an end-to-end trainable architecture that combines the information from these elements with the image to segment or identify the part of the image a natural language expression refers to. We calculate an embedding for each element and then project it onto the corresponding location (i.e., the associated bounding box) of the image feature map. We show that this architecture resolves referring expressions better than using the image alone and better than other methods that incorporate the element information. We report experimental results on the referring expression datasets based on COCO and on a webpage image referring expression dataset that we developed.
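
The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project_elements` and `fuse`, the normalized box format, summing overlapping elements, and fusing by channel concatenation are all assumptions made for illustration, and the embeddings are taken as given rather than learned.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]
Grid = List[List[List[float]]]


def project_elements(
    embeddings: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    height: int,
    width: int,
) -> Grid:
    """Write each element embedding into the feature-map cells its box covers.

    Returns a height x width x dim grid. Overlapping elements are summed,
    which is an assumption; the abstract does not say how overlaps are handled.
    """
    dim = len(embeddings[0]) if embeddings else 0
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for emb, (x0, y0, x1, y1) in zip(embeddings, boxes):
        # Map normalized corners to cell indices, covering at least one cell.
        c0 = min(int(x0 * width), width - 1)
        r0 = min(int(y0 * height), height - 1)
        c1 = max(c0 + 1, min(math.ceil(x1 * width), width))
        r1 = max(r0 + 1, min(math.ceil(y1 * height), height))
        for r in range(r0, r1):
            for c in range(c0, c1):
                for k in range(dim):
                    grid[r][c][k] += emb[k]
    return grid


def fuse(image_features: Grid, element_grid: Grid) -> Grid:
    """Concatenate image and element features along the channel axis (assumed fusion)."""
    return [
        [img_cell + el_cell for img_cell, el_cell in zip(img_row, el_row)]
        for img_row, el_row in zip(image_features, element_grid)
    ]
```

In this sketch, cells outside every bounding box keep a zero vector, so a downstream network could tell which locations carry element information.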

Original language: English (US)
Title of host publication: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 800-806
Number of pages: 7
ISBN (Electronic): 9781538643341
State: Published - Jul 2 2018
Externally published: Yes
Event: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Athens, Greece
Duration: Dec 18 2018 → Dec 21 2018

Publication series

Name: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

Conference

Conference: 2018 IEEE Spoken Language Technology Workshop, SLT 2018
Country/Territory: Greece
City: Athens
Period: 12/18/18 → 12/21/18

Keywords

  • Deep Learning
  • Natural Language Processing
  • Referring Expression Resolution
  • Segmentation

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Linguistics and Language
