TY - GEN
T1 - Grapheme-to-phoneme transduction for cross-language ASR
AU - Hasegawa-Johnson, Mark
AU - Rolston, Leanne
AU - Goudeseune, Camille
AU - Levow, Gina Anne
AU - Kirchhoff, Katrin
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - Automatic speech recognition (ASR) can be deployed in a previously unknown language, in less than 24 h, given just three resources: an acoustic model trained on other languages, a set of language-model training data, and a grapheme-to-phoneme (G2P) transducer to connect them. The LanguageNet G2Ps were created with the goal of being small, fast, and easy to port to a previously unseen language. Data come from pronunciation lexicons if available, but if there are no pronunciation lexicons in the target language, then data are generated from minimal resources: from a Wikipedia description of the target language, or from a one-hour interview with a native speaker of the language. Using such methods, the LanguageNet G2Ps now include simple models in nearly 150 languages, with trained finite state transducers in 122 languages, 59 of which are sufficiently well-resourced to permit measurement of their phone error rates. This paper proposes a measure of the distance between the G2Ps in different languages, and demonstrates that agglomerative clustering of the LanguageNet languages bears some resemblance to a phylogeographic language family tree. The LanguageNet G2Ps proposed in this paper have already been applied in three cross-language ASRs, using both hybrid and end-to-end neural architectures, and further experiments are ongoing.
KW - Automatic speech recognition
KW - Cross-language speech recognition
KW - Grapheme-to-phoneme transducers
KW - Under-resourced languages
UR - http://www.scopus.com/inward/record.url?scp=85092156261&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092156261&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59430-5_1
DO - 10.1007/978-3-030-59430-5_1
M3 - Conference contribution
AN - SCOPUS:85092156261
SN - 9783030594299
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 19
BT - Statistical Language and Speech Processing - 8th International Conference, SLSP 2020, Proceedings
A2 - Espinosa-Anke, Luis
A2 - Spasic, Irena
A2 - Martín-Vide, Carlos
PB - Springer
T2 - 8th International Conference on Statistical Language and Speech Processing, SLSP 2020
Y2 - 14 October 2020 through 16 October 2020
ER -