EScALation: A framework for efficient and scalable spatio-temporal action localization

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Spatio-temporal action localization aims to detect the spatial location and the start and end times of an action in a video. The state-of-the-art approach uses convolutional neural networks to extract candidate bounding boxes for the action in each frame and then links the bounding boxes into action tubes based on the location and class-specific score of each bounding box. Although this approach achieves good localization accuracy, it is computationally intensive and usually requires high-end GPUs to reach real-time performance. In addition, it does not scale well to a large number of action classes. In this work, we present a framework, EScALation, for making spatio-temporal action localization efficient and scalable. Our framework involves two main strategies. The first is a frame sampling technique that exploits the temporal correlation between frames and selects key frame(s) from a temporally correlated set of frames on which to perform bounding box detection. The second is a class filtering technique that uses bounding box information to predict the action class before the bounding boxes are linked. We compare EScALation with the state-of-the-art approach on the UCF101-24 and J-HMDB-21 datasets. In one of our experiments, EScALation saves 72.2% of the time with only a 6.1% loss of mAP. We also show that EScALation scales better to a large number of action classes than the state-of-the-art approach.
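The two strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not say how temporal correlation is measured or how the action class is predicted from bounding box information, so everything here is an assumption made for illustration: frames are flat lists of pixel intensities, correlation is approximated by mean absolute difference against the current key frame, and class filtering simply keeps the classes with the highest summed box scores. The names `select_key_frames`, `filter_classes`, `threshold`, and `top_k` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def select_key_frames(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.1,
) -> List[int]:
    """Return indices of key frames (illustrative frame sampling).

    Assumption: a frame is "temporally correlated" with the current key
    frame while its mean absolute pixel difference stays at or below
    `threshold`. The first frame that exceeds it becomes the next key frame.
    """
    if not frames:
        return []
    key_indices = [0]
    key = frames[0]
    for i in range(1, len(frames)):
        frame = frames[i]
        diff = sum(abs(a - b) for a, b in zip(frame, key)) / len(frame)
        if diff > threshold:
            key_indices.append(i)
            key = frame
    return key_indices


def filter_classes(
    box_scores: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[int]:
    """Return the class indices kept for linking (illustrative class filtering).

    Assumption: each entry of `box_scores` holds one detected box's
    per-class scores; a class's evidence is the sum of its scores over
    all boxes, and only the `top_k` classes are passed on to linking.
    """
    if not box_scores:
        return []
    num_classes = len(box_scores[0])
    totals = [sum(scores[c] for scores in box_scores) for c in range(num_classes)]
    ranked = sorted(range(num_classes), key=lambda c: totals[c], reverse=True)
    return sorted(ranked[:top_k])


# Toy data: five 2-pixel "frames" and three boxes scored over three classes.
frames = [[0.0, 0.0], [0.02, 0.01], [0.5, 0.5], [0.52, 0.5], [1.0, 1.0]]
box_scores = [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

key_frames = select_key_frames(frames, threshold=0.1)
kept_classes = filter_classes(box_scores, top_k=2)
print(key_frames, kept_classes)
```

In this sketch, detection would run only on the selected key frames and linking would run only over the kept classes, which is the general source of the savings the abstract describes; the actual selection and prediction rules used by EScALation may differ.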

Original language: English (US)
Title of host publication: MMSys 2021 - Proceedings of the 2021 Multimedia Systems Conference
Publisher: Association for Computing Machinery
Number of pages: 12
ISBN (Electronic): 9781450384346
State: Published - Jul 15 2021
Event: 12th ACM Multimedia Systems Conference, MMSys 2021 - Virtual, Online, Turkey
Duration: Sep 28 2021 → Oct 1 2021

Publication series

Name: MMSys 2021 - Proceedings of the 2021 Multimedia Systems Conference


Conference: 12th ACM Multimedia Systems Conference, MMSys 2021
City: Virtual, Online


  • scalability
  • spatio-temporal action localization
  • video analytics

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Graphics and Computer-Aided Design

