Humanoid audio-visual avatar with emotive text-to-speech synthesis

Hao Tang, Yun Fu, Jilin Tu, Mark Hasegawa-Johnson, Thomas S. Huang

Research output: Contribution to journal › Article › peer-review


Emotive audio-visual avatars are virtual computer agents with the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to build realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose a novel multimodal framework leading to a text-driven emotive audio-visual avatar, along with the technical approaches that realize it. Our work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated with a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture was implemented to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
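To illustrate what a rule-based emotive prosody stage in a Festival-MBROLA-style pipeline can look like, here is a minimal sketch that rescales the duration and pitch targets of MBROLA-style phoneme entries per emotion. The rule values, the `EMOTION_RULES` table, and the `apply_emotion` helper are hypothetical illustrations, not the authors' actual rules or parameters.

```python
# A phoneme entry mimics one line of an MBROLA .pho file:
# (phoneme, duration_ms, [(position_percent, f0_hz), ...])
EMOTION_RULES = {
    # Hypothetical rule set: (duration scale, pitch shift in Hz, pitch range scale)
    "neutral": (1.00,   0.0, 1.0),
    "happy":   (0.90,  20.0, 1.3),   # faster, higher, more varied pitch
    "sad":     (1.25, -15.0, 0.7),   # slower, lower, flatter pitch
    "angry":   (0.85,  10.0, 1.5),   # faster, wider pitch excursions
}

def apply_emotion(phonemes, emotion, baseline_f0=120.0):
    """Apply simple duration/pitch rules to a list of phoneme entries.

    Pitch targets are expanded or compressed around baseline_f0 by the
    range-scale factor, then shifted; durations are scaled uniformly.
    """
    dur_scale, f0_shift, range_scale = EMOTION_RULES[emotion]
    out = []
    for ph, dur, targets in phonemes:
        new_targets = [
            (pos, baseline_f0 + (f0 - baseline_f0) * range_scale + f0_shift)
            for pos, f0 in targets
        ]
        out.append((ph, round(dur * dur_scale), new_targets))
    return out
```

In a real pipeline, the modified entries would be written back out as a `.pho` file and passed to the MBROLA synthesizer; the design choice of operating on duration and pitch targets only is what makes a diphone back end convenient for rule-based emotion control.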

Original language: English (US)
Article number: 4637888
Pages (from-to): 969-981
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Issue number: 6
State: Published - Oct 2008


Keywords

  • 3-D face modeling and animation
  • Audio-visual avatar
  • Emotive speech synthesis
  • Human-computer interaction
  • Multimodal system
  • TTS

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

