Determining computable scenes in films and their structures using audio-visual memory models

H. Sundaram, S. F. Chang

Research output: Contribution to conference › Paper › peer-review


In this paper we present novel algorithms for computing scenes and within-scene structures in films. We begin by mapping insights from film-making rules and experimental results from the psychology of audition into a computational scene model. We define a computable scene to be a chunk of audio-visual data that exhibits long-term consistency with regard to three properties: (a) chromaticity, (b) lighting, and (c) ambient sound. Central to the computational model is the notion of a causal, finite-memory viewer model. We segment the audio and video data separately. In each case we determine the degree of correlation of the most recent data in the memory with the past. The respective scene boundaries are determined using local minima and aligned using a nearest-neighbor algorithm. We introduce a periodic analysis transform to automatically determine the structure within a scene. We then use statistical tests on the transform to determine the presence of a dialogue. The algorithms were tested on a difficult data set: the first hour of each of five commercial films. The best results were 88% recall and 72% precision for scene detection, and 91% recall and 100% precision for dialogue detection.
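The causal, finite-memory idea in the abstract can be illustrated with a short sketch. This is not the authors' implementation; the function names, window sizes, and the choice of cosine similarity over per-shot feature vectors are assumptions made for illustration. The sketch compares a summary of the most recent data in a sliding memory window against a summary of the older data in that window, and marks candidate scene boundaries at local minima of the resulting correlation curve, as the abstract describes.

```python
import numpy as np

def memory_correlation(features, memory_size=12, attention_size=3):
    """Causal finite-memory correlation (illustrative sketch).

    `features` is a sequence of per-shot feature vectors (e.g. chroma
    histograms).  At each step, the most recent `attention_size` vectors
    in the memory window are summarized and compared (cosine similarity)
    against the older `memory_size - attention_size` vectors.
    """
    features = np.asarray(features, dtype=float)
    scores = []
    for t in range(memory_size, len(features) + 1):
        window = features[t - memory_size:t]
        recent = window[-attention_size:].mean(axis=0)   # newest data in memory
        past = window[:-attention_size].mean(axis=0)     # older data in memory
        denom = np.linalg.norm(recent) * np.linalg.norm(past)
        scores.append(float(recent @ past / denom) if denom else 0.0)
    return np.array(scores)

def local_minima(scores, threshold=0.5):
    """Candidate scene boundaries: local minima of the correlation curve
    that fall below a (hypothetical) threshold."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1]
            and scores[i] < scores[i + 1]
            and scores[i] < threshold]
```

On synthetic data with two visually distinct segments, the correlation dips to zero exactly where the new segment has filled the attention buffer while the older part of memory still holds the previous segment, so the local minimum lands just after the true transition.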

Original language: English (US)
Number of pages: 10
State: Published - 2000
Externally published: Yes
Event: 8th ACM International Conference on Multimedia (ACM Multimedia 2000) - Los Angeles, CA, United States
Duration: Oct 30, 2000 to Nov 4, 2000


Other: 8th ACM International Conference on Multimedia (ACM Multimedia 2000)
Country/Territory: United States
City: Los Angeles, CA


Keywords

  • Computable scenes
  • Films
  • Memory models
  • Periodic analysis transform
  • Scene detection
  • Shot-level structure

ASJC Scopus subject areas

  • General Computer Science


