We present a general framework for integrating multimodal sensory signals for spatio-temporal pattern recognition. Statistical methods model time-varying events collaboratively so that inter-modal co-occurrences are taken into account. We discuss various data fusion strategies, the modeling of inter-modal correlations, and the extraction of statistical parameters for multimodal models. A bimodal speech recognition system is implemented, and a speaker-independent experiment tests the audio-visual speech recognizer under several kinds of noise drawn from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved using a cross-validation scheme across different signal-to-noise ratios.
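The abstract does not reproduce the paper's fusion equations. As a minimal illustrative sketch of one common audio-visual strategy (decision-level fusion with a reliability weight that depends on the acoustic SNR), the snippet below combines per-word log-likelihoods from an audio model and a visual model. All function names and the particular SNR-to-weight mapping are assumptions for illustration, not the authors' method.

```python
def fuse_log_likelihoods(audio_ll, visual_ll, snr_db):
    """Decision-level (late) fusion of per-word log-likelihoods.

    Each modality's score is weighted by a reliability factor derived
    from the acoustic SNR; the linear SNR-to-weight mapping here is
    purely illustrative.
    """
    # Higher SNR -> trust the audio stream more; clamp weight to [0, 1].
    w = min(max((snr_db + 10.0) / 40.0, 0.0), 1.0)
    return {word: w * audio_ll[word] + (1.0 - w) * visual_ll[word]
            for word in audio_ll}

def recognize(audio_ll, visual_ll, snr_db):
    """Return the word with the highest fused score."""
    fused = fuse_log_likelihoods(audio_ll, visual_ll, snr_db)
    return max(fused, key=fused.get)
```

At high SNR the audio scores dominate; as noise increases, the visual (lip-reading) scores take over, which is the intuition behind the noise-robustness gains the abstract reports.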

Original language: English (US)
Number of pages: 4
State: Published - Dec 1 2000
Event: 2000 IEEE International Conference on Multimedia and Expo (ICME 2000) - New York, NY, United States
Duration: Jul 30 2000 - Aug 2 2000



ASJC Scopus subject areas

  • Engineering (all)


  • Cite this

    Zhang, Y., Levinson, S., & Huang, T. (2000). Speaker independent audio-visual speech recognition (pp. 1073-1076). Paper presented at the 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States.