TY - JOUR
T1 - Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space
AU - Wang, Liwei
AU - Schwing, Alexander G.
AU - Lazebnik, Svetlana
N1 - Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grants No. 1563727 and 1718221, and by the Sloan Foundation. We would like to thank Jian Peng and Yang Liu for helpful discussions.
PY - 2017
Y1 - 2017
N2 - This paper explores image caption generation using conditional variational autoencoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
AB - This paper explores image caption generation using conditional variational autoencoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
UR - http://www.scopus.com/inward/record.url?scp=85046997549&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046997549&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85046997549
SN - 1049-5258
VL - 2017-December
SP - 5757
EP - 5767
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 31st Annual Conference on Neural Information Processing Systems, NIPS 2017
Y2 - 4 December 2017 through 9 December 2017
ER -