TY - GEN
T1 - A procedure for estimating gestural scores from natural speech
AU - Nam, Hosung
AU - Mitra, Vikramjit
AU - Tiede, Mark
AU - Saltzman, Elliot
AU - Goldstein, Louis
AU - Espy-Wilson, Carol
AU - Hasegawa-Johnson, Mark
N1 - Funding Information:
This research was supported by NSF Grants IIS0703859, IIS0703048, and IIS0703782. We thank Dr. Jiahong Yuan for providing the forced-aligned phone and word transcripts for the Wisconsin X-Ray Microbeam database.
PY - 2010
Y1 - 2010
AB - Speech can be represented as a constellation of constricting events, called gestures, which are defined at distinct vocal tract sites and organized in the form of a gestural score. Gestures and their output trajectories, called tract variables, are currently available only for synthetic speech, and have recently been shown to improve automatic speech recognition (ASR) performance. In this paper, we propose an iterative, analysis-by-synthesis, landmark-based time-warping architecture to obtain gestural scores for natural speech. Given an utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model was used to generate its prototype gestural score and the corresponding synthetic acoustic output. An optimal gestural score was then estimated through iterative time-warping processes such that the distance between the original and the TADA-synthesized speech is minimized. We compared the performance of our approach to that of a conventional dynamic time warping procedure using log-spectral and Itakura distance measures. We also performed a word recognition experiment using the gestural annotations to show that the gestural scores are suitable for word recognition.
KW - Articulatory phonology
KW - Gestures
KW - TADA model
KW - Time warping
KW - Vocal tract variables
KW - X-ray microbeam data
UR - http://www.scopus.com/inward/record.url?scp=79959846806&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959846806&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79959846806
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 30
EP - 33
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -