Two-stage prosody prediction for emotional text-to-speech synthesis

Hao Tang, Xi Zhou, Matthias Odisio, Mark Allan Hasegawa-Johnson, Thomas S Huang

Research output: Contribution to journal (Conference article)

Abstract

In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis: the prosodic variations between emotional and neutral speech are decomposed into global and local components and predicted using a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the local prosodic variations are modeled by classification and regression trees (CART) and dynamic programming. The proposed two-stage prosody prediction model has been implemented as the prosodic module of a Festival-MBROLA-based emotional text-to-speech synthesis system, which is able to synthesize highly intelligible, natural and expressive speech.
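
The abstract does not spell out the formulas behind the two stages, so the following is only a minimal sketch of how such a decomposition could look, not the authors' implementation. It assumes the global stage rescales a neutral prosodic contour (here F0 in Hz) so that its mean and standard deviation match target emotional values, and that the local stage contributes per-unit residuals. The function names, the sample values, and the z-score rescaling itself are illustrative assumptions; in the paper the local variations come from CART and dynamic programming, which are not reproduced here.

```python
from statistics import mean, pstdev


def global_transform(neutral, emo_mean, emo_std):
    """Stage 1 (assumed form): rescale a neutral contour so that its mean and
    standard deviation match the target emotional statistics."""
    mu = mean(neutral)
    sigma = pstdev(neutral)
    if sigma == 0:
        # A flat contour carries no shape to preserve; return the target mean.
        return [emo_mean for _ in neutral]
    return [emo_mean + emo_std * (x - mu) / sigma for x in neutral]


def add_local_residuals(contour, residuals):
    """Stage 2 (placeholder): add per-unit local variations to the contour.
    The residuals stand in for whatever a local model would predict."""
    if len(contour) != len(residuals):
        raise ValueError("contour and residuals must have the same length")
    return [c + r for c, r in zip(contour, residuals)]


# Hypothetical values chosen only for illustration.
neutral_f0 = [110.0, 120.0, 130.0, 120.0]
shifted = global_transform(neutral_f0, emo_mean=180.0, emo_std=20.0)
emotional_f0 = add_local_residuals(shifted, [0.0, 5.0, -5.0, 0.0])
```

Keeping the two stages as separate functions mirrors the global/local split described in the abstract: the first call changes only the overall level and range of the contour, and the second adjusts individual units.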

Original language: English (US)
Pages (from-to): 2138-2141
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - Dec 1 2008
Event: INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association - Brisbane, QLD, Australia
Duration: Sep 22 2008 to Sep 26 2008

Keywords

  • CART
  • Dynamic programming
  • Prosody prediction
  • Speech synthesis
  • TTS

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Sensory Systems

Cite this

@article{33f207cd87c14c69bb2183d3e39e91fd,
title = "Two-stage prosody prediction for emotional text-to-speech synthesis",
abstract = "In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis, where the prosodic variations between emotional and neutral speech are decomposed into the global and local prosodic variations and predicted using a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the local prosodic variations are modeled by the classification and regression tree (CART) and dynamic programming. The proposed two-stage prosody prediction model has been successfully implemented as a prosodic module in a Festival-MBROLA architecture based emotional text-to-speech synthesis system, which is able to synthesize highly intelligible, natural and expressive speech.",
keywords = "CART, Dynamic programming, Prosody prediction, Speech synthesis, TTS",
author = "Hao Tang and Xi Zhou and Matthias Odisio and Hasegawa-Johnson, {Mark Allan} and Huang, {Thomas S}",
year = "2008",
month = "12",
day = "1",
language = "English (US)",
pages = "2138--2141",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR
T1 - Two-stage prosody prediction for emotional text-to-speech synthesis
AU - Tang, Hao
AU - Zhou, Xi
AU - Odisio, Matthias
AU - Hasegawa-Johnson, Mark Allan
AU - Huang, Thomas S
PY - 2008/12/1
Y1 - 2008/12/1
AB - In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis, where the prosodic variations between emotional and neutral speech are decomposed into the global and local prosodic variations and predicted using a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the local prosodic variations are modeled by the classification and regression tree (CART) and dynamic programming. The proposed two-stage prosody prediction model has been successfully implemented as a prosodic module in a Festival-MBROLA architecture based emotional text-to-speech synthesis system, which is able to synthesize highly intelligible, natural and expressive speech.
KW - CART
KW - Dynamic programming
KW - Prosody prediction
KW - Speech synthesis
KW - TTS
UR - http://www.scopus.com/inward/record.url?scp=84867192290&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867192290&partnerID=8YFLogxK
M3 - Conference article
SP - 2138
EP - 2141
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -