Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, Svetlana Lazebnik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents a framework for localization, or grounding, of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and clothing or body part mentions, as they are useful for distinguishing individuals. We automatically learn weights for combining these cues and, at test time, perform joint inference over all phrases in a caption. The resulting system achieves state-of-the-art performance on phrase localization on the Flickr30k Entities dataset [33] and visual relationship detection on the Stanford VRD dataset [27].
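
To illustrate the cue-combination idea described in the abstract, below is a minimal, hypothetical Python sketch: each candidate box for a phrase is scored by a weighted sum of single-phrase cues (appearance, size/position) plus pairwise relationship cues, and a joint assignment is chosen by brute-force search. All function names, weights, and candidate boxes are placeholder assumptions for illustration only; they are not the authors' implementation or the actual learned cues.

# Hypothetical sketch of weighted cue combination for phrase localization.
# Cue scores are random placeholders; in practice they would come from
# learned appearance, geometry, and relationship models.
import itertools
import numpy as np

def appearance_score(phrase, box):
    # Placeholder for a phrase-conditioned appearance model score.
    return np.random.rand()

def size_position_score(phrase, box):
    # Placeholder for a score of box geometry against phrase-type priors.
    return np.random.rand()

def pairwise_relation_score(rel, box_a, box_b):
    # Placeholder for a spatial/relationship score of (subject, predicate, object).
    return np.random.rand()

def localize(phrases, relations, candidates, w_single, w_pair):
    """Pick one candidate box per phrase by maximizing a weighted sum of
    single-phrase cues plus pairwise relationship cues. Brute-force joint
    inference over all assignments (feasible only for tiny candidate sets)."""
    best_boxes, best_score = None, -np.inf
    for assign in itertools.product(*[range(len(candidates[p])) for p in phrases]):
        boxes = {p: candidates[p][i] for p, i in zip(phrases, assign)}
        score = sum(
            w_single[0] * appearance_score(p, boxes[p])
            + w_single[1] * size_position_score(p, boxes[p])
            for p in phrases
        )
        score += sum(
            w_pair * pairwise_relation_score(rel, boxes[s], boxes[o])
            for (s, rel, o) in relations
        )
        if score > best_score:
            best_boxes, best_score = boxes, score
    return best_boxes, best_score

# Toy usage: two phrases linked by one relation; boxes are (x1, y1, x2, y2).
phrases = ["a man", "a red hat"]
relations = [("a man", "wearing", "a red hat")]
candidates = {p: [(0, 0, 50, 80), (10, 5, 60, 90)] for p in phrases}
print(localize(phrases, relations, candidates, w_single=(1.0, 0.5), w_pair=0.8))

In this sketch the cue weights are fixed by hand; in the paper they are learned automatically, and inference is performed jointly over all phrases in a caption.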

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1946-1955
Number of pages: 10
ISBN (Electronic): 9781538610329
DOIs
State: Published - Dec 22, 2017
Event: 16th IEEE International Conference on Computer Vision, ICCV 2017 - Venice, Italy
Duration: Oct 22, 2017 - Oct 29, 2017

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Volume: 2017-October
ISSN (Print): 1550-5499

Other

Other: 16th IEEE International Conference on Computer Vision, ICCV 2017
Country/Territory: Italy
City: Venice
Period: 10/22/17 - 10/29/17

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
