TY - JOUR
T1 - Humanoid audio-visual avatar with emotive text-to-speech synthesis
AU - Tang, Hao
AU - Fu, Yun
AU - Tu, Jilin
AU - Hasegawa-Johnson, Mark
AU - Huang, Thomas S.
N1 - Funding Information:
Manuscript received September 10, 2007; revised February 24, 2008. First published October 3, 2008; current version published October 24, 2008. This work was supported in part by the U.S. Government VACE program, in part by the National Science Foundation under Grant CCF04-26627, and in part by NIH Grant R21 DC 008090A. The views and conclusions are those of the authors, not of the U.S. Government or its agencies. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Kiyoharu Aizawa.
PY - 2008/10
Y1 - 2008/10
N2 - Emotive audio-visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio-visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture was designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
AB - Emotive audio-visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio-visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture was designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
KW - 3-D face modeling and animation
KW - Audio-visual avatar
KW - Emotive speech synthesis
KW - Human-computer interaction
KW - Multimodal system
KW - TTS
UR - http://www.scopus.com/inward/record.url?scp=54949115779&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=54949115779&partnerID=8YFLogxK
U2 - 10.1109/TMM.2008.2001355
DO - 10.1109/TMM.2008.2001355
M3 - Article
AN - SCOPUS:54949115779
SN - 1520-9210
VL - 10
SP - 969
EP - 981
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 6
M1 - 4637888
ER -