TY - JOUR
T1 - Multitask Learning for Phone Recognition of Underresourced Languages Using Mismatched Transcription
AU - Do, Van Hai
AU - Chen, Nancy F.
AU - Lim, Boon Pang
AU - Hasegawa-Johnson, Mark A.
N1 - Funding Information:
Manuscript received June 4, 2017; revised October 16, 2017 and November 27, 2017; accepted November 27, 2017. Date of publication December 11, 2017; date of current version January 8, 2018. This work was supported by the research grant for the Human-Centered Cyber-physical Systems Programme at the Advanced Digital Sciences Center from Singapore's Agency for Science, Technology, and Research (A*STAR). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sabato Marco Siniscalchi. (Corresponding author: Van Hai Do.) V. H. Do was with the Advanced Digital Sciences Center, Singapore 138632. He is now with Thuyloi University, Hanoi 116705, Vietnam (e-mail: haidv@tlu.edu.vn).
Publisher Copyright:
© 2014 IEEE.
PY - 2018/3
Y1 - 2018/3
AB - It is challenging to obtain large amounts of native (matched) labels for speech audio in underresourced languages, often because of a shortage of literate speakers of the language or, in extreme cases, the absence of any universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the underresourced language of interest (the target language) in place of native speakers; they transcribe what they hear as nonsense speech in their own annotation language (≠ target language). Previous uses of mismatched transcription converted it to a probabilistic transcription (PT), but PT is limited by the errors of nonnative perception. This paper proposes, instead, a multitask learning framework in which one deep neural network (DNN) is trained to optimize two separate tasks: acoustic modeling of a small amount of matched transcription with matched target-language graphemes, and acoustic modeling of a large amount of mismatched transcription with mismatched annotation-language graphemes. We find that: first, the multitask learning framework gives significant improvement over monolingual, semisupervised learning, multilingual DNN training, and transfer learning baselines; second, a Gaussian mixture model-hidden Markov model (GMM-HMM) adapted using PT improves alignments, thereby improving training; and third, bottleneck features trained on the mismatched transcriptions lead to even better alignments, resulting in further performance gains for the multitask DNN. Our experiments are conducted on the IARPA Georgian and Vietnamese BABEL corpora as well as on our newly collected speech corpus of Singapore Hokkien, an underresourced language with no standard written form.
KW - Phone recognition
KW - mismatched transcription
KW - multi-task learning
KW - probabilistic transcription
KW - under-resourced languages
UR - http://www.scopus.com/inward/record.url?scp=85038409618&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038409618&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2017.2782360
DO - 10.1109/TASLP.2017.2782360
M3 - Article
AN - SCOPUS:85038409618
SN - 2329-9290
VL - 26
SP - 501
EP - 514
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 3
M1 - 8186239
ER -