Fusing audio and visual features of speech

H. Pan, Zhi-Pei Liang, Thomas S Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this paper, the audio and visual features of speech are integrated using a novel fused-HMM. We assume that the two sets of features may have different data rates and durations. Hidden Markov models (HMMs) are first used to model them separately, and then a general Bayesian fusion method, which is optimal in the maximum entropy sense, is employed to fuse them together. In particular, an efficient learning algorithm is introduced: instead of maximizing the joint likelihood of the fused-HMM, it maximizes the likelihoods of the two HMMs separately and then fuses the HMMs together. In addition, an inference algorithm is proposed. We have tested the proposed method in person verification experiments. Results show that the proposed method significantly reduces the recognition error rates compared with the unimodal HMMs and the loosely-coupled fusion model.
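The abstract does not give the fusion equations, so the following is only a minimal sketch of the simpler setting it compares against: two discrete HMMs scored separately on streams of different lengths, with their log-likelihoods added under an independence assumption. Treating this sum as the "loosely-coupled" baseline is an assumption, and it is not the fused-HMM itself. All parameter values, the names `audio_hmm` and `video_hmm`, and the helper functions are hypothetical.

```python
import math

def forward_loglik(obs, start, trans, emit):
    """Scaled forward algorithm: returns log p(obs) under a discrete HMM."""
    n = len(start)
    # Initialise with the start distribution and the first emission.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        # Rescale to avoid underflow; the log of the scales sums to log p(obs).
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

# Hypothetical toy models: (start probabilities, transition matrix, emission matrix).
audio_hmm = (
    [0.6, 0.4],
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.9, 0.1], [0.3, 0.7]],
)
video_hmm = (
    [0.5, 0.5],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.25, 0.75]],
)

def loosely_coupled_score(audio_obs, video_obs):
    """Sum of per-stream log-likelihoods (assumes the streams are independent)."""
    return forward_loglik(audio_obs, *audio_hmm) + forward_loglik(video_obs, *video_hmm)

# The two streams may have different rates, hence different lengths.
audio_obs = [0, 0, 1, 1, 0, 1]  # six audio frames
video_obs = [0, 1, 1]           # three video frames
score = loosely_coupled_score(audio_obs, video_obs)
print(f"combined log-likelihood: {score:.4f}")
```

Because each stream is scored by its own model, the two observation sequences never need to be aligned frame by frame. The fused-HMM described in the abstract goes further by coupling the two models through a Bayesian fusion step, which this sketch does not attempt to reproduce.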

Original language: English (US)
Title of host publication: IEEE International Conference on Image Processing
State: Published - Dec 1 2000
Event: International Conference on Image Processing (ICIP 2000) - Vancouver, BC, Canada
Duration: Sep 10 2000 to Sep 13 2000


Other: International Conference on Image Processing (ICIP 2000)
City: Vancouver, BC

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
