Deep auto-encoder based multi-task learning using probabilistic transcriptions

Amit Das, Mark Allan Hasegawa-Johnson, Karel Veselý

Research output: Contribution to journal › Conference article

Abstract

We examine a scenario where we have no access to native transcribers in the target language. This is typical of language communities that are under-resourced. However, turkers (online crowd workers) available in online marketplaces can serve as valuable alternative resources for providing transcripts in the target language. We assume that the turkers neither speak nor have any familiarity with the target language. Thus, they are unable to distinguish all phone pairs in the target language; their transcripts therefore specify, at best, a probability distribution called a probabilistic transcript (PT). Standard deep neural network (DNN) training using PTs does not necessarily improve error rates. Previously reported results have demonstrated some success by adopting the multi-task learning (MTL) approach. In this study, we report further improvements by introducing a deep auto-encoder based MTL. This method leverages large amounts of untranscribed data in the target language in addition to the PTs obtained from turkers. Furthermore, to encourage transfer learning in the feature space, we also examine the effect of using monophones from transcripts in well-resourced languages. We report consistent improvement in phone error rates (PER) for Swahili, Amharic, Dinka, and Mandarin.
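To make the objective concrete, below is a minimal NumPy sketch of the kind of two-task setup the abstract describes: a shared encoder whose output feeds both a softmax phone classifier trained against soft PT target distributions and a linear decoder that reconstructs the input frame, so untranscribed frames still contribute through the auto-encoder task. The layer sizes, the tanh nonlinearity, the mixing weight alpha, and all function names are illustrative assumptions, not the paper's actual configuration.

# Hypothetical sketch of the multi-task objective described in the abstract:
# a shared encoder feeds (a) a softmax classifier trained against soft
# probabilistic-transcript (PT) targets and (b) a linear decoder that
# reconstructs the input frame, so untranscribed audio can still drive the
# auto-encoder task. All names and the weight ALPHA are illustrative.
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID, N_PHONES = 40, 256, 48   # feature dim, shared layer, phone set size
ALPHA = 0.3                           # weight on the reconstruction task

# Shared encoder plus two task-specific heads
W_enc = rng.normal(0, 0.01, (D_IN, D_HID))
W_cls = rng.normal(0, 0.01, (D_HID, N_PHONES))   # PT classification head
W_dec = rng.normal(0, 0.01, (D_HID, D_IN))       # auto-encoder head

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mtl_loss(x, pt=None):
    """Multi-task loss for one minibatch of frames x (B x D_IN).

    pt is a soft target distribution over phones (B x N_PHONES) derived
    from crowd-worker probabilistic transcripts, or None for untranscribed
    frames, in which case only the reconstruction term is active."""
    h = np.tanh(x @ W_enc)                    # shared representation
    x_hat = h @ W_dec                         # reconstruction of the input
    recon = np.mean((x - x_hat) ** 2)
    if pt is None:
        return ALPHA * recon
    p = softmax(h @ W_cls)
    ce = -np.mean(np.sum(pt * np.log(p + 1e-12), axis=-1))   # CE vs. soft PT
    return (1.0 - ALPHA) * ce + ALPHA * recon

# Usage: one transcribed batch with soft PT targets, one untranscribed batch.
x_pt = rng.normal(size=(8, D_IN))
pt = softmax(rng.normal(size=(8, N_PHONES)))  # stand-in PT distributions
x_un = rng.normal(size=(8, D_IN))
print(mtl_loss(x_pt, pt), mtl_loss(x_un))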

Original language: English (US)
Pages (from-to): 2073-2077
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOIs: 10.21437/Interspeech.2017-582
State: Published - Jan 1 2017
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: Aug 20 2017 - Aug 24 2017

Keywords

  • Cross-lingual speech recognition
  • Deep neural networks
  • Multi-task learning
  • Probabilistic transcription

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Deep auto-encoder based multi-task learning using probabilistic transcriptions. / Das, Amit; Hasegawa-Johnson, Mark Allan; Veselý, Karel.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2017-August, 01.01.2017, p. 2073-2077.

Research output: Contribution to journal › Conference article

@article{d4ae8d29ca5047baac7f7c2aa2fd40c0,
title = "Deep auto-encoder based multi-task learning using probabilistic transcriptions",
abstract = "We examine a scenario where we have no access to native transcribers in the target language. This is typical of language communities that are under-resourced. However, turkers (online crowd workers) available in online marketplaces can serve as valuable alternative resources for providing transcripts in the target language. We assume that the turkers neither speak nor have any familiarity with the target language. Thus, they are unable to distinguish all phone pairs in the target language; their transcripts therefore specify, at best, a probability distribution called a probabilistic transcript (PT). Standard deep neural network (DNN) training using PTs do not necessarily improve error rates. Previously reported results have demonstrated some success by adopting the multi-task learning (MTL) approach. In this study, we report further improvements by introducing a deep auto-encoder based MTL. This method leverages large amounts of untranscribed data in the target language in addition to the PTs obtained from turkers. Furthermore, to encourage transfer learning in the feature space, we also examine the effect of using monophones from transcripts in well-resourced languages. We report consistent improvement in phone error rates (PER) for Swahili, Amharic, Dinka, and Mandarin.",
keywords = "Cross-lingual speech recognition, Deep neural networks, Multi-task learning, Probabilistic transcription",
author = "Amit Das and Hasegawa-Johnson, {Mark Allan} and Karel Vesel{\'y}",
year = "2017",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2017-582",
language = "English (US)",
volume = "2017-August",
pages = "2073--2077",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}
