Abstract

We present a general framework for integrating multimodal sensory signals for spatio-temporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss various data fusion strategies, the modeling of inter-modal correlations, and the extraction of statistical parameters for multimodal models. A bimodal speech recognition system is implemented. A speaker-independent experiment is carried out to test the audio-visual speech recognizer under different kinds of noise from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved using a cross-validation scheme across different signal-to-noise ratios.
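
As a rough sketch of the kind of audio-visual fusion the abstract describes, the Python snippet below implements weighted late fusion of per-word log-likelihood scores from separately trained audio and visual recognizers, with the stream weight tuned by cross-validation on held-out data. This is a common baseline under assumed inputs, not a reconstruction of the paper's exact inter-modal co-occurrence model; the names fuse_scores and tune_lambda, the input shapes, and the grid search are illustrative assumptions.

    import numpy as np

    def fuse_scores(audio_loglik, visual_loglik, lam):
        """Weighted late fusion: combine per-word stream log-likelihoods
        with audio weight lam in [0, 1]; return the best word index."""
        combined = lam * audio_loglik + (1.0 - lam) * visual_loglik
        return int(np.argmax(combined))

    def tune_lambda(dev_pairs, labels, grid=np.linspace(0.0, 1.0, 21)):
        """Pick the audio weight maximizing word recognition accuracy (WRA)
        on held-out (audio, visual) score pairs -- one cross-validation
        fold. At low signal-to-noise ratios the search would tend to favor
        the visual stream, i.e. a smaller lam."""
        best_lam, best_wra = 0.5, -1.0
        for lam in grid:
            hits = sum(fuse_scores(a, v, lam) == y
                       for (a, v), y in zip(dev_pairs, labels))
            wra = hits / len(labels)
            if wra > best_wra:
                best_lam, best_wra = lam, wra
        return best_lam, best_wra

    # Usage with synthetic scores for a 10-word vocabulary:
    rng = np.random.default_rng(0)
    dev = [(rng.normal(size=10), rng.normal(size=10)) for _ in range(50)]
    truth = [int(np.argmax(0.7 * a + 0.3 * v)) for a, v in dev]
    lam, wra = tune_lambda(dev, truth)  # picks the weight with highest held-out WRA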

Original language: English (US)
Pages: 1073-1076
Number of pages: 4
State: Published - Dec 1 2000
Event: 2000 IEEE International Conference on Multimedia and Expo (ICME 2000) - New York, NY, United States
Duration: Jul 30 2000 - Aug 2 2000

Other

Other: 2000 IEEE International Conference on Multimedia and Expo (ICME 2000)
Country: United States
City: New York, NY
Period: 7/30/00 - 8/2/00

Fingerprint

Speech recognition
Data fusion
Pattern recognition
Signal to noise ratio
Statistical methods
Experiments

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Zhang, Y., Levinson, S. E., & Huang, T. S. (2000). Speaker independent audio-visual speech recognition. 1073-1076. Paper presented at 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States.

Speaker independent audio-visual speech recognition. / Zhang, Yuanhui; Levinson, Stephen E; Huang, Thomas S.

2000. 1073-1076. Paper presented at 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States.

Research output: Contribution to conference › Paper

Zhang, Y, Levinson, SE & Huang, TS 2000, 'Speaker independent audio-visual speech recognition', Paper presented at 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States, 7/30/00 - 8/2/00, pp. 1073-1076.
Zhang Y, Levinson SE, Huang TS. Speaker independent audio-visual speech recognition. 2000. Paper presented at 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States.
Zhang, Yuanhui ; Levinson, Stephen E ; Huang, Thomas S. / Speaker independent audio-visual speech recognition. Paper presented at 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), New York, NY, United States. 4 p.
@conference{941542402c6b4598937214792735d025,
title = "Speaker independent audio-visual speech recognition",
abstract = "We present a general framework for integrating multimodal sensory signals for spatio-temporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss various data fusion strategies, the modeling of inter-modal correlations, and the extraction of statistical parameters for multimodal models. A bimodal speech recognition system is implemented. A speaker-independent experiment is carried out to test the audio-visual speech recognizer under different kinds of noise from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved using a cross-validation scheme across different signal-to-noise ratios.",
author = "Yuanhui Zhang and Levinson, {Stephen E} and Huang, {Thomas S}",
year = "2000",
month = "12",
day = "1",
language = "English (US)",
pages = "1073--1076",
note = "2000 IEEE International Conference on Multimedia and Expo (ICME 2000) ; Conference date: 30-07-2000 Through 02-08-2000",

}

TY - CONF

T1 - Speaker independent audio-visual speech recognition

AU - Zhang, Yuanhui

AU - Levinson, Stephen E

AU - Huang, Thomas S

PY - 2000/12/1

Y1 - 2000/12/1

N2 - We present a general framework for integrating multimodal sensory signals for spatio-temporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss various data fusion strategies, the modeling of inter-modal correlations, and the extraction of statistical parameters for multimodal models. A bimodal speech recognition system is implemented. A speaker-independent experiment is carried out to test the audio-visual speech recognizer under different kinds of noise from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved using a cross-validation scheme across different signal-to-noise ratios.

AB - We present a general framework for integrating multimodal sensory signals for spatio-temporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss various data fusion strategies, the modeling of inter-modal correlations, and the extraction of statistical parameters for multimodal models. A bimodal speech recognition system is implemented. A speaker-independent experiment is carried out to test the audio-visual speech recognizer under different kinds of noise from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved using a cross-validation scheme across different signal-to-noise ratios.

UR - http://www.scopus.com/inward/record.url?scp=0034502214&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034502214&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:0034502214

SP - 1073

EP - 1076

ER -