TY - GEN
T1 - A minimum converted trajectory error (MCTE) approach to high quality speech-to-lips conversion
AU - Zhuang, Xiaodan
AU - Wang, Lijuan
AU - Soong, Frank
AU - Hasegawa-Johnson, Mark
N1 - Funding Information:
This research is partially funded by NSF grant IIS-0703624.
PY - 2010
Y1 - 2010
N2 - High-quality speech-to-lips conversion, investigated in this work, renders realistic lip movements (video) consistent with the input speech (audio) without knowing its linguistic content. Instead of memoryless frame-based conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audio-visual joint Gaussian Mixture Model (GMM). We propose a minimum converted trajectory error (MCTE) approach to further refine the converted visual parameters. First, we reduce the conversion error by training the joint audio-visual GMM with weighted audio and visual likelihoods. MCTE then uses the generalized probabilistic descent algorithm to minimize the conversion error of the visual parameter trajectories, defined over the optimal Gaussian kernel sequence for the input speech. We demonstrate the effectiveness of the proposed methods on the LIPS 2009 Visual Speech Synthesis Challenge dataset, without knowing the linguistic (phonetic) content of the input speech.
KW - Minimum conversion error
KW - Minimum generation error
KW - Speech-to-lips conversion
KW - Visual speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=79959844243&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959844243&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79959844243
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 1736
EP - 1739
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -