Learning deep structure-preserving image-text embeddings

Liwei Wang, Yin Li, Svetlana Lazebnik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is trained using a large-margin objective that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints inspired by metric learning literature. Extensive experiments show that our approach gains significant improvements in accuracy for image-to-text and text-to-image retrieval. Our method achieves new state-of-the-art results on the Flickr30K and MSCOCO image-sentence datasets and shows promise on the new task of phrase localization on the Flickr30K Entities dataset.
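A minimal sketch of the two ideas the abstract describes: a two-branch network of linear projections with nonlinearities mapping image and text features into a shared space, and a cross-view large-margin ranking loss. This is not the authors' code; the feature dimensions, layer sizes, and margin are illustrative assumptions, and the within-view neighborhood-preservation constraints are omitted.

```python
# Hedged sketch (PyTorch) of a two-branch image-text embedding network
# trained with a bi-directional large-margin ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Two branches of linear projections + nonlinearities that map
    image features and text features into a shared embedding space."""
    def __init__(self, img_dim=4096, txt_dim=6000, embed_dim=512):
        super().__init__()
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 2048), nn.ReLU(),
            nn.Linear(2048, embed_dim),
        )
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 2048), nn.ReLU(),
            nn.Linear(2048, embed_dim),
        )

    def forward(self, img_feats, txt_feats):
        # L2-normalize so inner products act as cosine similarities.
        x = F.normalize(self.img_branch(img_feats), dim=1)
        y = F.normalize(self.txt_branch(txt_feats), dim=1)
        return x, y

def bidirectional_ranking_loss(x, y, margin=0.1):
    """Cross-view ranking constraints: each matched image-sentence pair
    (the diagonal of the score matrix) must outscore mismatched pairs
    by a margin, in both retrieval directions. The paper's within-view
    structure-preservation terms are not included in this sketch."""
    scores = x @ y.t()                    # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)       # matched-pair scores
    cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

# Usage with random tensors standing in for CNN image descriptors and
# text feature vectors (batch of 8 matched image-sentence pairs):
model = TwoBranchEmbedding()
x, y = model(torch.randn(8, 4096), torch.randn(8, 6000))
loss = bidirectional_ranking_loss(x, y)
loss.backward()
```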

Original language: English (US)
Title of host publication: Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Publisher: IEEE Computer Society
Pages: 5005-5013
Number of pages: 9
ISBN (Electronic): 9781467388504
State: Published - Dec 9, 2016
Event: 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 - Las Vegas, United States
Duration: Jun 26, 2016 - Jul 1, 2016

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2016-December
ISSN (Print): 1063-6919

Conference

Conference: 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Country/Territory: United States
City: Las Vegas
Period: 6/26/16 - 7/1/16

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
