A minimum converted trajectory error (MCTE) approach to high quality speech-to-lips conversion

Xiaodan Zhuang, Lijuan Wang, Frank Soong, Mark Hasegawa-Johnson

Research output: Contribution to conference › Paper

Abstract

High quality speech-to-lips conversion, investigated in this work, renders realistic lip movement (video) consistent with input speech (audio) without knowing its linguistic content. Instead of memoryless frame-based conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audio-visual joint Gaussian Mixture Model (GMM). We propose a minimum converted trajectory error (MCTE) approach to further refine the converted visual parameters. First, we reduce the conversion error by training the joint audio-visual GMM with weighted audio and visual likelihoods. Then MCTE uses the generalized probabilistic descent algorithm to minimize a conversion error of the visual parameter trajectories, defined on the optimal Gaussian kernel sequence determined by the input speech. We demonstrate the effectiveness of the proposed methods on the LIPS 2009 Visual Speech Synthesis Challenge dataset, without knowing the linguistic (phonetic) content of the input speech.
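For orientation, the memoryless frame-based conversion that the abstract contrasts with can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a joint GMM with full covariances whose parameters are already trained, picks the most likely Gaussian per audio frame, and maps that frame to the conditional mean of the visual part. The function name `convert_frames` and all parameter names are hypothetical, and the sketch omits dynamic features, trajectory-level maximum likelihood estimation, weighted likelihood training, and the MCTE refinement itself.

```python
import numpy as np

def convert_frames(audio, weights, means, covs, d_audio):
    """Frame-by-frame audio-to-visual mapping with a joint GMM (illustrative).

    audio:   (T, d_audio) audio feature frames
    weights: (M,) mixture weights
    means:   (M, D) joint means, audio dimensions first, D = d_audio + d_visual
    covs:    (M, D, D) joint full covariance matrices
    Returns the (T, d_visual) converted frames and the chosen component per frame.
    """
    a = slice(0, d_audio)               # audio block of the joint vector
    v = slice(d_audio, means.shape[1])  # visual block of the joint vector
    T, M = audio.shape[0], weights.shape[0]

    # Score every frame under each component's audio marginal.
    # The (2*pi)^d constant is identical across components and is omitted.
    score = np.empty((T, M))
    for m in range(M):
        diff = audio - means[m, a]
        cov_aa = covs[m, a, a]
        _, logdet = np.linalg.slogdet(cov_aa)
        maha = np.einsum("ti,ti->t", diff @ np.linalg.inv(cov_aa), diff)
        score[:, m] = np.log(weights[m]) - 0.5 * (logdet + maha)

    # Most likely Gaussian for each frame, chosen from the audio alone.
    best = score.argmax(axis=1)

    # Conditional mean of the visual part given the audio frame.
    visual = np.empty((T, means.shape[1] - d_audio))
    for t, m in enumerate(best):
        gain = covs[m, v, a] @ np.linalg.inv(covs[m, a, a])
        visual[t] = means[m, v] + gain @ (audio[t] - means[m, a])
    return visual, best
```

Because each frame is converted independently here, nothing ties neighbouring output frames together; the abstract's trajectory-level estimation targets that limitation.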

Original language: English (US)
Pages: 1736-1739
Number of pages: 4
State: Published - Dec 1 2010
Event: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010 - Makuhari, Chiba, Japan
Duration: Sep 26 2010 → Sep 30 2010

Keywords

  • Minimum conversion error
  • Minimum generation error
  • Speech-to-lips conversion
  • Visual speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing

Cite this

Zhuang, X., Wang, L., Soong, F., & Hasegawa-Johnson, M. (2010). A minimum converted trajectory error (MCTE) approach to high quality speech-to-lips conversion. 1736-1739. Paper presented at 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, Makuhari, Chiba, Japan.