Learning models for actions and person-object interactions with transfer to question answering

Arun Mallya, Svetlana Lazebnik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision at the level of individual person instances, and a weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple-choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions, on person activity and on person-object relationships, and show improvements over generic features trained on the ImageNet classification task.
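The abstract names two training devices, multiple instance learning and a weighted loss, without giving their formulation. The following is a minimal sketch of how they could fit together, not the authors' implementation. It assumes max-pooling over per-person scores as the multiple-instance aggregation and a single positive-class weight in a binary cross-entropy loss; the function names and the `pos_weight` parameter are hypothetical.

```python
import math


def mil_image_scores(instance_logits):
    """Aggregate per-person logits into image-level logits.

    instance_logits: one list per detected person, each holding one logit
    per activity label. Assumption: labels exist only at the image level,
    so each label takes the maximum score over all person instances.
    """
    return [max(column) for column in zip(*instance_logits)]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_bce(logits, targets, pos_weight):
    """Mean binary cross-entropy with positive terms scaled by pos_weight.

    Assumption: a weight above 1 makes rare positive labels count more,
    one common way to counter unbalanced training data.
    """
    total = 0.0
    for x, t in zip(logits, targets):
        # log(1 - sigmoid(x)) equals log(sigmoid(-x))
        total -= pos_weight * t * log_sigmoid(x) + (1 - t) * log_sigmoid(-x)
    return total / len(logits)


# Two person instances, two activity labels (illustrative values only).
image_logits = mil_image_scores([[2.0, -1.0], [0.5, 3.0]])
loss = weighted_bce(image_logits, targets=[1, 0], pos_weight=3.0)
```

With max-pooling, only the highest-scoring person for each label contributes to that label's loss, which is what allows training without per-person annotations.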

Original language: English (US)
Title of host publication: Computer Vision - 14th European Conference, ECCV 2016, Proceedings
Editors: Bastian Leibe, Jiri Matas, Nicu Sebe, Max Welling
Number of pages: 15
ISBN (Print): 9783319464473
State: Published - 2016
Event: 14th European Conference on Computer Vision, ECCV 2016 - Amsterdam, Netherlands
Duration: Oct 11 2016 → Oct 14 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9905 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 14th European Conference on Computer Vision, ECCV 2016


  • Activity prediction
  • Deep networks
  • Visual question answering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

