Improving DNNs trained with non-native transcriptions using knowledge distillation and target interpolation

Research output: Contribution to journal › Conference article

Abstract

It is often difficult to find native transcribers for under-resourced languages. However, Turkers (crowd workers) available in online marketplaces can serve as a valuable alternative by providing transcriptions in the target language. Since the Turkers may neither speak nor have any familiarity with the target language, their transcriptions are non-native by nature and are usually filled with incorrect labels. After some post-processing, these transcriptions can be converted to Probabilistic Transcriptions (PTs). Conventional Deep Neural Networks (DNNs) trained using PTs do not necessarily improve error rates over Gaussian Mixture Models (GMMs) due to the presence of label noise. Previously reported results have demonstrated some success by adopting Multi-Task Learning (MTL) training for PTs. In this study, we report further improvements using Knowledge Distillation (KD) and Target Interpolation (TI) to alleviate transcription errors in PTs. In the KD method, knowledge is transferred from a well-trained multilingual DNN to the target-language DNN trained using PTs. In the TI method, the confidences of the labels provided by PTs are modified using the confidences of the target-language DNN. Results show an average absolute improvement in phone error rate (PER) of about 1.9% across Swahili, Amharic, Dinka, and Mandarin using each proposed method.
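
As a rough illustration of the two strategies described in the abstract, the sketch below shows how noisy PT targets might be combined with a multilingual teacher's posteriors (KD) or with the target-language DNN's own posteriors (TI) in a frame-level cross-entropy loss. This is a minimal NumPy reconstruction based only on the abstract, not the authors' implementation; the function names and the mixing weights `kd_weight` and `ti_weight` are assumptions.

```python
import numpy as np

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    # Frame-level cross-entropy between a target distribution and predicted posteriors.
    return -np.sum(target_probs * np.log(pred_probs + eps), axis=-1).mean()

def kd_loss(pt_targets, teacher_probs, student_probs, kd_weight=0.5):
    # Knowledge Distillation (KD): mix the noisy PT targets with the soft
    # posteriors of a well-trained multilingual teacher DNN (weighting is an assumption).
    return ((1.0 - kd_weight) * cross_entropy(pt_targets, student_probs)
            + kd_weight * cross_entropy(teacher_probs, student_probs))

def ti_targets(pt_targets, student_probs, ti_weight=0.5):
    # Target Interpolation (TI): soften the PT label confidences with the
    # target-language DNN's own posteriors before computing the loss.
    return (1.0 - ti_weight) * pt_targets + ti_weight * student_probs

# Toy example: 3 frames, 4 phone classes; each row is a valid probability distribution.
rng = np.random.default_rng(0)
pt = rng.dirichlet(np.ones(4), size=3)        # probabilistic-transcription targets
teacher = rng.dirichlet(np.ones(4), size=3)   # multilingual teacher posteriors
student = rng.dirichlet(np.ones(4), size=3)   # target-language student posteriors

print("KD loss:", kd_loss(pt, teacher, student))
print("TI loss:", cross_entropy(ti_targets(pt, student), student))
```

In practice such interpolation weights would be tuned or scheduled during training; the paper's specific settings are not given in this abstract.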

Original language: English (US)
Pages (from-to): 2434-2438
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1450
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Cross-lingual speech recognition
  • Deep neural networks
  • Knowledge distillation
  • Target interpolation
  • Under-resourced

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{b9a6000dd34d4a88a7068580292f3096,
title = "Improving DNNs trained with non-native transcriptions using knowledge distillation and target interpolation",
abstract = "Often, it is quite hard to find native transcribers in under-resourced languages. However, Turkers (crowd workers) available in online marketplaces can serve as valuable alternative resources by providing transcriptions in the target language. Since the Turkers may neither speak nor have any familiarity with the target language, their transcriptions are non-native by nature and are usually filled with incorrect labels. After some post-processing, these transcriptions can be converted to Probabilistic Transcriptions (PT). Conventional Deep Neural Networks (DNNs) trained using PTs do not necessarily improve error rates over Gaussian Mixture Models (GMMs) due to the presence of label noise. Previously reported results have demonstrated some success by adopting Multi-Task Learning (MTL) training for PTs. In this study, we report further improvements using Knowledge Distillation (KD) and Target Interpolation (TI) to alleviate transcription errors in PTs. In the KD method, knowledge is transfered from a well-trained multilingual DNN to the target language DNN trained using PTs. In the TI method, the confidences of the labels provided by PTs are modified using the confidences of the target language DNN. Results show an average absolute improvement in phone error rates (PER) by about 1.9{\%} across Swahili, Amharic, Dinka, and Mandarin using each proposed method.",
keywords = "Cross-lingual speech recognition, Deep neural networks, Knowledge distillation, Target interpolation, Under-resourced",
author = "Amit Das and Hasegawa-Johnson, {Mark Allan}",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1450",
language = "English (US)",
volume = "2018-September",
pages = "2434--2438",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}
