Towards Streaming Perception

Mengtian Li, Yu-Xiong Wang, Deva Ramanan

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by the latency of its processing pipeline. While past work has studied the algorithmic trade-off between latency and accuracy, there has not been a clear metric to compare different methods along the Pareto optimal latency-accuracy curve. We point out a discrepancy between standard offline evaluation and real-time applications: by the time an algorithm finishes processing a particular image frame, the surrounding world has changed. To this end, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as “streaming accuracy”. The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring. More broadly, building upon this metric, we introduce a meta-benchmark that systematically converts any image understanding task into a streaming perception task. We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations. Our proposed solutions and their empirical analysis demonstrate a number of surprising conclusions: (1) there exists an optimal “sweet spot” that maximizes streaming accuracy along the Pareto optimal latency-accuracy curve, (2) asynchronous tracking and future forecasting naturally emerge as internal representations that enable streaming image understanding, and (3) dynamic scheduling can be used to overcome temporal aliasing, yielding the paradoxical result that latency is sometimes minimized by sitting idle and “doing nothing”.
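The evaluation protocol implied by the abstract can be sketched as follows: at every ground-truth time instant, the benchmark scores whatever output the perception stack has most recently emitted, so frames that arrive while the stack is still computing are effectively skipped. Below is a minimal Python sketch of that pairing step; the function and variable names are illustrative assumptions, not the paper's actual benchmark toolkit.

```python
from bisect import bisect_right

def pair_predictions_with_ground_truth(pred_timestamps, predictions, gt_timestamps):
    """Pair each ground-truth time instant with the most recent output
    the perception stack had emitted by that time.

    pred_timestamps -- sorted times (seconds) at which outputs were emitted
    predictions     -- outputs aligned with pred_timestamps
    gt_timestamps   -- times at which ground truth is evaluated

    Names and structure are illustrative, not the paper's released code.
    """
    paired = []
    for t in gt_timestamps:
        # Index of the last output emitted no later than t; if the stack is
        # still busy, an older (stale) output gets evaluated, penalizing latency.
        idx = bisect_right(pred_timestamps, t) - 1
        paired.append(predictions[idx] if idx >= 0 else None)  # None: nothing emitted yet
    return paired
```

The paired outputs would then be scored with a standard task metric (e.g., detection average precision), which is how latency and accuracy end up folded into a single streaming-accuracy number.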

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2020 - 16th European Conference, 2020, Proceedings
Editors: Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher: Springer
Pages: 473-488
Number of pages: 16
ISBN (Print): 9783030585358
DOIs
State: Published - 2020
Event: 16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration: Aug 23 2020 - Aug 28 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12347 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th European Conference on Computer Vision, ECCV 2020
Country/Territory: United Kingdom
City: Glasgow
Period: 8/23/20 - 8/28/20

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
