Diverse and Coherent Paragraph Generation from Images

Moitreya Chatterjee, Alexander G. Schwing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they are not designed to generate long, informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized from a classical image captioning system, does not capture the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address these challenges, we propose to augment paragraph generation techniques with "coherence vectors," "global topic vectors," and modeling of the inherent ambiguity of associating paragraphs with images, via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.
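The abstract's generation scheme can be sketched, loosely, as a per-sentence loop: each sentence is conditioned on a global topic vector, a VAE-style latent sample, and a "coherence vector" carried over from the previous sentence. The sketch below is a toy illustration of that structure only, not the paper's model; all names (`sample_latent`, `generate_paragraph`, `coherence_weight`) and the plain-Python vector arithmetic are assumptions for readability.

```python
import random

def sample_latent(mu, sigma, rng):
    # Reparameterization-style sample: z = mu + sigma * eps, with eps ~ N(0, 1),
    # as used when training/sampling from a variational auto-encoder.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def generate_paragraph(global_topic, sentence_topics, rng, coherence_weight=0.5):
    """Toy sketch: condition each 'sentence' on the global topic vector,
    a latent sample for its own topic, and a coherence vector propagated
    from the previous sentence (here, just the previous conditioning)."""
    dim = len(global_topic)
    coherence = [0.0] * dim  # no prior sentence before the first one
    paragraph = []
    for topic in sentence_topics:
        z = sample_latent(topic, [0.1] * dim, rng)  # stochastic latent draw
        conditioning = [g + zi + coherence_weight * c
                        for g, zi, c in zip(global_topic, z, coherence)]
        paragraph.append(conditioning)  # stand-in for a decoded sentence
        coherence = conditioning        # carry coherence into the next sentence
    return paragraph

rng = random.Random(0)
para = generate_paragraph([1.0, 0.0], [[0.2, 0.1], [0.0, 0.3]], rng)
print(len(para))  # one conditioning vector per requested sentence
```

Sampling `z` anew for each decoding pass is what yields diverse paragraphs for the same image, while the propagated coherence vector ties consecutive sentences together.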

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
Editors: Martial Hebert, Yair Weiss, Vittorio Ferrari, Cristian Sminchisescu
Number of pages: 17
ISBN (Print): 9783030012151
State: Published - 2018
Event: 15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany
Duration: Sep 8 2018 – Sep 14 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11206 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 15th European Conference on Computer Vision, ECCV 2018


Keywords

  • Captioning
  • Review generation
  • Variational autoencoders

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)


