Multimodal speaker detection using error feedback dynamic Bayesian networks

Vladimir Pavlović, Ashutosh Garg, James M. Rehg, Thomas S. Huang

Research output: Contribution to journal › Conference article › peer-review

Abstract

Design and development of novel human-computer interfaces poses a challenging problem: actions and intentions of users have to be inferred from sequences of noisy and ambiguous multi-sensory data such as video and sound. Temporal fusion of multiple sensors has been efficiently formulated using dynamic Bayesian networks (DBNs), which allow the power of statistical inference and learning to be combined with contextual knowledge of the problem. Unfortunately, simple learning methods can cause such appealing models to fail when the data exhibit complex behavior. We formulate a learning framework for DBNs based on error feedback and statistical boosting theory. We apply this framework to the problem of audio/visual speaker detection in an interactive kiosk environment using 'off-the-shelf' visual and audio sensors (face, skin, texture, mouth motion, and silence detectors). Detection results obtained in this setup demonstrate the superiority of our learning framework over classical maximum-likelihood (ML) learning in DBNs.
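The abstract describes the learning framework only at a high level. As an illustrative sketch (not the paper's actual DBN formulation), the boosting-style error feedback it draws on can be shown with AdaBoost over simple decision-stump weak learners: frames the current model misclassifies are upweighted so the next round focuses on them. All function names and the toy data below are hypothetical.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=20):
    """AdaBoost with axis-aligned decision stumps.

    Hypothetical stand-in for the paper's weak learners: X holds
    per-frame sensor features, y in {-1, +1} marks speaker/non-speaker.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # per-frame sample weights
    model = []               # list of (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        # exhaustively pick the stump with lowest weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.sign(X[:, j] - thr + 1e-12)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)
        # error feedback: upweight the frames this round got wrong
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        model.append((j, thr, pol, alpha))
    return model

def adaboost_predict(model, X):
    """Weighted vote of all stumps; returns labels in {-1, +1}."""
    score = np.zeros(len(X))
    for j, thr, pol, alpha in model:
        score += alpha * pol * np.sign(X[:, j] - thr + 1e-12)
    return np.sign(score)
```

In the paper the weak learners are DBN components rather than stumps, but the reweighting loop above conveys the error-feedback idea: each training round is driven by the residual errors of the ensemble so far, rather than by a single ML fit to all data at once.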

Original language: English (US)
Pages (from-to): 34-41
Number of pages: 8
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2
State: Published - 2000
Event: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000 - Hilton Head Island, SC, USA
Duration: Jun 13 2000 - Jun 15 2000

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

