Speaker adaptive audio-visual fusion for the open-vocabulary section of AVICAR

Leda Sarı, Mark Allan Hasegawa-Johnson, S. Kumaran, Georg Stemmer, Krishnakumar N. Nair

Research output: Contribution to journal › Conference article

Abstract

This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of the AVICAR dataset, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We utilize the forced alignment of the audio modality of AVICAR to get training targets for the convolutional neural network based visual front end. Based on our observation that there is a great amount of variation between visual features of different speakers, we apply feature space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of fMLLR is sensitive to the quality of the alignment probabilities used to compute it; experimental tests compare the quality of fMLLR trained using audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly better than that of the audio-only system.
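As a rough illustration only (not the authors' implementation), the sketch below shows how a per-speaker fMLLR-style affine transform x' = Ax + b, of the kind the abstract applies to visual features, acts on a sequence of feature vectors. The transform itself is normally estimated from frame-level alignment posteriors with an ASR toolkit; the feature dimensionality, function names, and identity transform used here are placeholders, not details from the paper.

    # Minimal sketch: applying a per-speaker fMLLR-style affine transform
    # W = [A | b] to visual feature vectors. Estimating W (maximizing the
    # likelihood of the adaptation data given alignment posteriors) is not
    # shown here; only the application of a given transform is illustrated.
    import numpy as np

    def apply_fmllr(features: np.ndarray, W: np.ndarray) -> np.ndarray:
        """Apply an fMLLR transform W (d x (d+1)) to features (T x d).

        Each frame x is extended with a bias term, x_ext = [x, 1], and
        mapped to W @ x_ext, i.e. A @ x + b with A = W[:, :-1], b = W[:, -1].
        """
        T, d = features.shape
        assert W.shape == (d, d + 1)
        x_ext = np.hstack([features, np.ones((T, 1))])  # append bias dimension
        return x_ext @ W.T

    # Hypothetical usage: 100 frames of 40-dimensional visual features
    # for one speaker, with an identity transform as a stand-in for a
    # transform estimated on that speaker's adaptation data.
    rng = np.random.default_rng(0)
    visual_feats = rng.standard_normal((100, 40))
    W_speaker = np.hstack([np.eye(40), np.zeros((40, 1))])
    adapted = apply_fmllr(visual_feats, W_speaker)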

Original language: English (US)
Pages (from-to): 3524-3528
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-2359
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Audio-visual speech recognition
  • Neural networks
  • Speaker adaptation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Speaker adaptive audio-visual fusion for the open-vocabulary section of AVICAR. / Sarı, Leda; Hasegawa-Johnson, Mark Allan; Kumaran, S.; Stemmer, Georg; Nair, Krishnakumar N.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 3524-3528.

Research output: Contribution to journal › Conference article

@article{ba7d0c0013d144418e867880694399d2,
title = "Speaker adaptive audio-visual fusion for the open-vocabulary section of AVICAR",
abstract = "This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of the AVICAR dataset, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We utilize the forced alignment of the audio modality of AVICAR to get training targets for the convolutional neural network based visual front end. Based on our observation that there is a great amount of variation between visual features of different speakers, we apply feature space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of fMLLR is sensitive to the quality of the alignment probabilities used to compute it; experimental tests compare the quality of fMLLR trained using audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly better than that of the audio-only system.",
keywords = "Audio-visual speech recognition, Neural networks, Speaker adaptation",
author = "Leda Sarı and Hasegawa-Johnson, {Mark Allan} and S. Kumaran and Georg Stemmer and Nair, {Krishnakumar N.}",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-2359",
language = "English (US)",
volume = "2018-September",
pages = "3524--3528",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Speaker adaptive audio-visual fusion for the open-vocabulary section of AVICAR

AU - Sarı, Leda

AU - Hasegawa-Johnson, Mark Allan

AU - Kumaran, S.

AU - Stemmer, Georg

AU - Nair, Krishnakumar N.

PY - 2018/1/1

Y1 - 2018/1/1

N2 - This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of the AVICAR dataset, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We utilize the forced alignment of the audio modality of AVICAR to get training targets for the convolutional neural network based visual front end. Based on our observation that there is a great amount of variation between visual features of different speakers, we apply feature space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of fMLLR is sensitive to the quality of the alignment probabilities used to compute it; experimental tests compare the quality of fMLLR trained using audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly better than that of the audio-only system.

AB - This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of the AVICAR dataset, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger dataset to generate an audio-only recognition baseline for AVICAR. We utilize the forced alignment of the audio modality of AVICAR to get training targets for the convolutional neural network based visual front end. Based on our observation that there is a great amount of variation between visual features of different speakers, we apply feature space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of fMLLR is sensitive to the quality of the alignment probabilities used to compute it; experimental tests compare the quality of fMLLR trained using audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly better than that of the audio-only system.

KW - Audio-visual speech recognition

KW - Neural networks

KW - Speaker adaptation

UR - http://www.scopus.com/inward/record.url?scp=85055003600&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055003600&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2018-2359

DO - 10.21437/Interspeech.2018-2359

M3 - Conference article

VL - 2018-September

SP - 3524

EP - 3528

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -