Improved ASR for under-resourced languages through multi-task learning with acoustic landmarks

Di He, Boon Pang Lim, Xuesong Yang, Mark Allan Hasegawa-Johnson, Deming Chen

Research output: Contribution to journal › Conference article

Abstract

Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9% relative, and reduces target-language word error rate by 1.9%-5.9% depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.
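
As a rough illustration of the approach described above, the sketch below shows one plausible realization of landmark-tasked multi-task learning in PyTorch: a shared feed-forward trunk over acoustic feature frames feeding a primary phone-state head and a secondary acoustic-landmark head, trained with a weighted sum of the two cross-entropy losses. This is not the authors' implementation; the layer sizes, feature dimension, landmark inventory, and the loss weight lambda_lm are illustrative assumptions.

import torch
import torch.nn as nn

class LandmarkMTLNet(nn.Module):
    """Shared trunk with a primary phone head and a secondary landmark head."""
    def __init__(self, feat_dim=440, hidden_dim=1024, n_layers=4,
                 n_phone_states=2000, n_landmark_classes=12):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.trunk = nn.Sequential(*layers)                              # shared hidden layers
        self.phone_head = nn.Linear(hidden_dim, n_phone_states)          # primary task
        self.landmark_head = nn.Linear(hidden_dim, n_landmark_classes)   # secondary task

    def forward(self, x):
        h = self.trunk(x)
        return self.phone_head(h), self.landmark_head(h)

def mtl_loss(phone_logits, landmark_logits, phone_y, landmark_y, lambda_lm=0.3):
    """Weighted sum of frame-level cross-entropy losses; lambda_lm is an assumed weight."""
    ce = nn.functional.cross_entropy
    return ce(phone_logits, phone_y) + lambda_lm * ce(landmark_logits, landmark_y)

# Source-language training step (English: both phone and landmark labels available).
model = LandmarkMTLNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(32, 440)                  # a batch of spliced acoustic frames
phone_y = torch.randint(0, 2000, (32,))       # frame-level phone-state targets
landmark_y = torch.randint(0, 12, (32,))      # frame-level landmark-class targets
opt.zero_grad()
phone_logits, landmark_logits = model(feats)
mtl_loss(phone_logits, landmark_logits, phone_y, landmark_y).backward()
opt.step()
# For the under-resourced target language (Iban, word labels only), the shared
# trunk would be kept and only the phone head adapted; the landmark head is
# dropped, since no landmark labels exist in the target language.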

Original language: English (US)
Pages (from-to): 2618-2622
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1124
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Acoustic landmarks
  • Multi-task learning
  • Under-resourced ASR

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{d8d20239167b4d57b17ca13c59ad2ab6,
title = "Improved ASR for under-resourced languages through multi-task learning with acoustic landmarks",
abstract = "Furui first demonstrated that the identity of both consonant and vowel can be perceived from the C-V transition; later, Stevens proposed that acoustic landmarks are the primary cues for speech perception, and that steady-state regions are secondary or supplemental. Acoustic landmarks are perceptually salient, even in a language one doesn't speak, and it has been demonstrated that non-speakers of the language can identify features such as the primary articulator of the landmark. These factors suggest a strategy for developing language-independent automatic speech recognition: landmarks can potentially be learned once from a suitably labeled corpus and rapidly applied to many other languages. This paper proposes enhancing the cross-lingual portability of a neural network by using landmarks as the secondary task in multi-task learning (MTL). The network is trained in a well-resourced source language with both phone and landmark labels (English), then adapted to an under-resourced target language with only word labels (Iban). Landmark-tasked MTL reduces source-language phone error rate by 2.9{\%} relative, and reduces target-language word error rate by 1.9{\%}-5.9{\%} depending on the amount of target-language training data. These results suggest that landmark-tasked MTL causes the DNN to learn hidden-node features that are useful for cross-lingual adaptation.",
keywords = "Acoustic landmarks, Multi-task learning, Under-resourced ASR",
author = "Di He and Lim, {Boon Pang} and Xuesong Yang and Hasegawa-Johnson, {Mark Allan} and Deming Chen",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1124",
language = "English (US)",
volume = "2018-September",
pages = "2618--2622",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}
