TY - JOUR
T1 - Estimation of Articulatory Trajectories Based on Gaussian Mixture Model (GMM) With Audio-Visual Information Fusion and Dynamic Kalman Smoothing
AU - Özbek, Yücel
AU - Hasegawa-Johnson, Mark
AU - Demirekler, Mübeccel
N1 - Funding Information:
Manuscript received October 07, 2009; revised April 16, 2010; accepted September 22, 2010. Date of publication October 18, 2010; date of current version May 06, 2011. This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Malcolm Slaney. İ. Y. Özbek was with the Department of Electrical and Electronics Engineering, Middle East Technical University, 06531 Ankara, Turkey. He is now with the Electrical and Electronics Engineering Department, Atatürk University, 25240 Erzurum, Turkey (e-mail: iozbek@atauni.edu.tr).
PY - 2011/7
Y1 - 2011/7
N2 - This paper presents a detailed framework for Gaussian mixture model (GMM)-based articulatory inversion equipped with special postprocessing smoothers, and with the capability to perform audio-visual information fusion. The effects of different acoustic features on the GMM inversion performance are investigated and it is shown that the integration of various types of acoustic (and visual) features improves the performance of the articulatory inversion process. Dynamic Kalman smoothers are proposed to adapt the cutoff frequency of the smoother to data and noise characteristics; Kalman smoothers also enable the incorporation of auxiliary information such as phonetic transcriptions to improve articulatory estimation. Two types of dynamic Kalman smoothers are introduced: global Kalman (GK) and phoneme-based Kalman (PBK). The same dynamic model is used for all phonemes in the GK smoother; it is shown that GK improves the performance of articulatory inversion better than the conventional low-pass (LP) smoother. However, the PBK smoother, which uses one dynamic model for each phoneme, gives significantly better results than the GK smoother. Different methodologies to fuse the audio and visual information are examined. A novel modified late fusion algorithm, designed to consider the observability degree of the articulators, is shown to give better results than either the early or the late fusion methods. Extensive experimental studies are conducted with the MOCHA database to illustrate the performance gains obtained by the proposed algorithms. The average RMS error and correlation coefficient between the true (measured) and the estimated articulatory trajectories are 1.227 mm and 0.868 using audiovisual information fusion and GK smoothing, and 1.199 mm and 0.876 using audiovisual information fusion together with PBK smoothing based on a phonetic transcription of the utterance.
KW - Audiovisual fusion
KW - Gaussian mixture model (GMM)
KW - Kalman smoother
KW - audiovisual-to-articulatory inversion
KW - maximum-likelihood trajectory estimation
UR - http://www.scopus.com/inward/record.url?scp=85008009610&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85008009610&partnerID=8YFLogxK
U2 - 10.1109/TASL.2010.2087751
DO - 10.1109/TASL.2010.2087751
M3 - Article
AN - SCOPUS:85008009610
SN - 1558-7916
VL - 19
SP - 1180
EP - 1195
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
IS - 5
ER -