Imagine this! scripts to compositions to videos

Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, Aniruddha Kembhavi

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. Craft explicitly predicts a temporal-layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database and fuses them to generate scene videos. Our contributions include sequential training of components of Craft while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate Craft on semantic fidelity to caption, composition consistency, and visual quality. Craft outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft on Flintstones (Flintstones is available at https://prior.allenai.org/projects/craft), a new richly annotated video-caption dataset with over 25000 videos. For a glimpse of videos generated by Craft, see https://youtu.be/688Vv86n0z8.

Original languageEnglish (US)
Title of host publicationComputer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
EditorsVittorio Ferrari, Cristian Sminchisescu, Yair Weiss, Martial Hebert
PublisherSpringer
Pages610-626
Number of pages17
ISBN (Print)9783030012366
DOIs
StatePublished - 2018
Event15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany
Duration: Sep 8 2018Sep 14 2018

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume11212 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Other

Other15th European Conference on Computer Vision, ECCV 2018
Country/TerritoryGermany
CityMunich
Period9/8/189/14/18

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Fingerprint

Dive into the research topics of 'Imagine this! scripts to compositions to videos'. Together they form a unique fingerprint.

Cite this