An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic-prosodic model

Ken Chen, Mark Allan Hasegawa-Johnson, Aaron Cohen

Research output: Contribution to journal › Conference article

Abstract

Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to develop their prediction of prosody for a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based acoustic-prosodic model that predicts prosody from acoustic-prosodic observations. Our experiments on the Radio News Corpus show that an ANN is effective in learning the stochastic mapping from the syntactic representation of word strings to prosody labels, with an accuracy of 82.7% for pitch accent labeling and 90.5% for intonational phrase boundary (IPB) labeling. When acoustic observations and reasonably accurate phoneme transcriptions are given, a GMM-based acoustic-prosodic model, coupled with the syntactic-prosodic model, can achieve 84% pitch accent recognition accuracy and 93% IPB recognition accuracy. These results are obtained using different speakers for training and testing and have considerably exceeded all previously reported results on the same corpus, especially for the task of IPB detection.
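
The abstract does not say how the two models are coupled. One common way to combine a label prior with class-conditional acoustic likelihoods is Bayes' rule, and the sketch below illustrates that under stated assumptions: the syntactic model is assumed to supply a prior over labels, each label is assumed to have its own one-dimensional GMM, and all names and numbers (`prior`, `gmms`, the feature value `x`) are hypothetical rather than taken from the paper.

```python
import math


def gmm_log_likelihood(x, components):
    """Log-density of scalar x under a 1-D Gaussian mixture.

    components: list of (weight, mean, variance) tuples, weights summing to 1.
    """
    log_terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for w, mu, var in components
    ]
    # Log-sum-exp over the mixture components for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def combine(prior, gmms, x):
    """Posterior over labels: prior(label) * p(x | label), normalized."""
    log_joint = {
        label: math.log(prior[label]) + gmm_log_likelihood(x, gmms[label])
        for label in prior
    }
    m = max(log_joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {label: math.exp(v - log_norm) for label, v in log_joint.items()}


# Hypothetical values: a prior as a syntactic model might output it for one
# word, and made-up GMM parameters for a single acoustic feature.
prior = {"accent": 0.3, "no_accent": 0.7}
gmms = {
    "accent": [(0.6, 1.0, 0.5), (0.4, 2.0, 1.0)],
    "no_accent": [(1.0, -0.5, 0.5)],
}

posterior = combine(prior, gmms, x=1.2)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

In this toy case the acoustic evidence outweighs a prior that favors the unaccented label, which is the kind of interaction a coupled system relies on. A real system would use multi-dimensional features and trained parameters.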

Original language: English (US)
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 1
State: Published - Sep 28 2004
Event: Proceedings - IEEE International Conference on Acoustics, Speech, and Signal Processing - Montreal, Que, Canada
Duration: May 17 2004 to May 21 2004

Fingerprint

Syntactics
Labeling
Acoustics
Speech synthesis
Transcription
Labels
Testing
Experiments

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

@article{c92a2091f2644828aa0452c127ea8cd8,
title = "An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic-prosodic model",
abstract = "Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to develop their prediction of prosody for a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based acoustic-prosodic model that predicts prosody from acoustic-prosodic observations. Our experiments on the Radio News Corpus show that an ANN is effective in learning the stochastic mapping from the syntactic representation of word strings to prosody labels, with an accuracy of 82.7{\%} for pitch accent labeling and 90.5{\%} for intonational phrase boundary (IPB) labeling. When acoustic observations and reasonably accurate phoneme transcriptions are given, a GMM-based acoustic-prosodic model, coupled with the syntactic-prosodic model, can achieve 84{\%} pitch accent recognition accuracy and 93{\%} IPB recognition accuracy. These results are obtained using different speakers for training and testing and have considerably exceeded all previously reported results on the same corpus, especially for the task of IPB detection.",
author = "Chen, Ken and Hasegawa-Johnson, {Mark Allan} and Cohen, Aaron",
year = "2004",
month = "9",
day = "28",
language = "English (US)",
volume = "1",
journal = "Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing",
issn = "0736-7791",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic-prosodic model

AU - Chen, Ken

AU - Hasegawa-Johnson, Mark Allan

AU - Cohen, Aaron

PY - 2004/9/28

Y1 - 2004/9/28

N2 - Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to develop their prediction of prosody for a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based acoustic-prosodic model that predicts prosody from acoustic-prosodic observations. Our experiments on the Radio News Corpus show that an ANN is effective in learning the stochastic mapping from the syntactic representation of word strings to prosody labels, with an accuracy of 82.7% for pitch accent labeling and 90.5% for intonational phrase boundary (IPB) labeling. When acoustic observations and reasonably accurate phoneme transcriptions are given, a GMM-based acoustic-prosodic model, coupled with the syntactic-prosodic model, can achieve 84% pitch accent recognition accuracy and 93% IPB recognition accuracy. These results are obtained using different speakers for training and testing and have considerably exceeded all previously reported results on the same corpus, especially for the task of IPB detection.

AB - Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to develop their prediction of prosody for a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based acoustic-prosodic model that predicts prosody from acoustic-prosodic observations. Our experiments on the Radio News Corpus show that an ANN is effective in learning the stochastic mapping from the syntactic representation of word strings to prosody labels, with an accuracy of 82.7% for pitch accent labeling and 90.5% for intonational phrase boundary (IPB) labeling. When acoustic observations and reasonably accurate phoneme transcriptions are given, a GMM-based acoustic-prosodic model, coupled with the syntactic-prosodic model, can achieve 84% pitch accent recognition accuracy and 93% IPB recognition accuracy. These results are obtained using different speakers for training and testing and have considerably exceeded all previously reported results on the same corpus, especially for the task of IPB detection.

UR - http://www.scopus.com/inward/record.url?scp=4544275067&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4544275067&partnerID=8YFLogxK

M3 - Conference article

VL - 1

JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

SN - 0736-7791

ER -