Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space

Research output: Contribution to journal › Conference article

Abstract

This paper explores image caption generation using conditional variational autoencoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
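For orientation, the two structured priors named in the abstract can be sketched as follows. This is a reconstruction from the abstract's wording only, not the paper's exact notation: the component weights c_k(I), component means mu_k, and the shared variance sigma^2 are symbols introduced here for illustration.

  % GMM prior: a mixture over the K content components (assumed form)
  p_{\mathrm{GMM}}(z \mid I) \;=\; \sum_{k=1}^{K} c_k(I)\,\mathcal{N}\!\bigl(z \mid \mu_k,\ \sigma^2 \mathbf{I}\bigr)

  % Additive Gaussian (AG) prior: a single Gaussian whose mean is the
  % weighted linear combination of the component means (assumed form)
  p_{\mathrm{AG}}(z \mid I) \;=\; \mathcal{N}\!\Bigl(z \;\Big|\; \sum_{k=1}^{K} c_k(I)\,\mu_k,\ \sigma^2 \mathbf{I}\Bigr)

Under this reading, the AG prior stays a single Gaussian, so the KL term in the CVAE objective keeps a closed form, whereas the GMM prior involves a mixture; that simpler form is one plausible reason the abstract singles out AG-CVAE as particularly promising.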

Original language: English (US)
Pages (from-to): 5757-5767
Number of pages: 11
Journal: Advances in Neural Information Processing Systems
Volume: 2017-December
State: Published - Jan 1 2017
Event: 31st Annual Conference on Neural Information Processing Systems, NIPS 2017 - Long Beach, United States
Duration: Dec 4 2017 - Dec 9 2017

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

@article{ffa672db11fe4637946dd67b6fcec8f9,
  title    = "Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space",
  abstract = "This paper explores image caption generation using conditional variational autoencoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a {"}vanilla{"} CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.",
  author   = "Liwei Wang and Schwing, {Alexander Gerhard} and Svetlana Lazebnik",
  year     = "2017",
  month    = "1",
  day      = "1",
  language = "English (US)",
  volume   = "2017-December",
  pages    = "5757--5767",
  journal  = "Advances in Neural Information Processing Systems",
  issn     = "1049-5258",
}
