TY - GEN
T1 - FSM-based pronunciation modeling using articulatory phonological code
AU - Hu, Chi
AU - Zhuang, Xiaodan
AU - Hasegawa-Johnson, Mark
N1 - Funding Information: This research is funded by NSF grant IIS-0703624. The authors would like to thank Vikramjit Mitra and Hosung Nam for assistance with the dataset.
PY - 2010
Y1 - 2010
N2 - According to articulatory phonology, the gestural score is an invariant speech representation. Though the timing schemes, i.e., the onsets and offsets, of the gestural activations may vary, the ensemble of these activations tends to remain unchanged, informing the speech content. In this work, we propose a pronunciation modeling method that uses a finite state machine (FSM) to represent the invariance of a gestural score. Given the "canonical" gestural score (CGS) of a word with a known activation timing scheme, the plausible activation onsets and offsets are recursively generated and encoded as a weighted FSM. An empirical measure is used to prune out gestural activation timing schemes that deviate too much from the CGS. Speech recognition is achieved by matching the recovered gestural activations to the FSM-encoded gestural scores of different speech contents. We carry out pilot word classification experiments using synthesized data from one speaker. The proposed pronunciation modeling achieves over 90% accuracy for a vocabulary of 139 words with no training observations, outperforming direct use of the CGS.
KW - Articulatory phonology
KW - Finite state machine
KW - Speech gesture
KW - Speech production
UR - http://www.scopus.com/inward/record.url?scp=79959812754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959812754&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79959812754
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 2274
EP - 2277
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -