Analysis of mismatched transcriptions generated by humans and machines for under-resourced languages

Van Hai Do, Nancy F. Chen, Boon Pang Lim, Mark Allan Hasegawa-Johnson

Research output: Contribution to journal › Conference article

Abstract

When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by the source language ASR systems into source language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns, but different specific errors. (2) These two sources of information can be usefully combined in order to train a better target-language ASR. (3) The differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).
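
The pipeline the abstract describes — decode target-language speech into source-language phone sequences, collect mismatched crowd transcriptions of the same audio, and fuse the two label sources before retraining — can be illustrated with a short sketch. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the phone hypotheses, the Mandarin-to-Vietnamese mapping (man2viet), and the agreement-only fusion rule are all placeholders invented for the example.

    from difflib import SequenceMatcher

    # Hypothetical phone hypotheses for one Vietnamese utterance, both
    # expressed in the Mandarin (source-language) phone set:
    asr_hyp   = ["b", "a", "ng", "k", "o", "m"]   # cross-lingual ASR decode
    crowd_hyp = ["b", "a", "n", "g", "o", "m"]    # mismatched crowd worker

    # Assumed Mandarin -> Vietnamese phone correspondences; a real system
    # would estimate these from data, here they are hard-coded placeholders.
    man2viet = {"b": "b", "a": "a", "o": "O", "m": "m"}

    def agreed_phones(hyp_a, hyp_b):
        """Keep only phones on which both label sources agree -- one simple
        confidence filter before the labels are used for self-training."""
        sm = SequenceMatcher(a=hyp_a, b=hyp_b, autojunk=False)
        return [p
                for op, a0, a1, _, _ in sm.get_opcodes() if op == "equal"
                for p in hyp_a[a0:a1]]

    merged = agreed_phones(asr_hyp, crowd_hyp)   # -> ['b', 'a', 'o', 'm']
    labels = [man2viet[p] for p in merged]       # map into target phone set
    print("agreed source phones:", merged)
    print("target-language training labels:", labels)

An agreement filter like this simply drops the span where machine and human disagree ("ng k" vs. "n g"); a fuller system could instead resolve such spans with learned phone-confusion probabilities, which is closer in spirit to the combination the paper evaluates.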

Original language: English (US)
Pages (from-to): 3863-3867
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 08-12-September-2016
DOIs: 10.21437/Interspeech.2016-736
State: Published - Jan 1 2016
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: Sep 8 2016 → Sep 12 2016

Keywords

  • Mismatched crowdsourcing
  • Semi-supervised learning
  • Speech recognition
  • Under-resourced languages

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Analysis of mismatched transcriptions generated by humans and machines for under-resourced languages. / Do, Van Hai; Chen, Nancy F.; Lim, Boon Pang; Hasegawa-Johnson, Mark Allan.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016, 01.01.2016, p. 3863-3867.

Research output: Contribution to journal › Conference article

@article{ab81d6393bef42c6a03a6976d1f6e210,
  title    = "Analysis of mismatched transcriptions generated by humans and machines for under-resourced languages",
  abstract = "When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised learning first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by the source language ASR systems into source language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns, but different specific errors. (2) These two sources of information can be usefully combined in order to train a better target-language ASR. (3) The differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).",
  keywords = "Mismatched crowdsourcing, Semi-supervised learning, Speech recognition, Under-resourced languages",
  author   = "Do, {Van Hai} and Chen, {Nancy F.} and Lim, {Boon Pang} and Hasegawa-Johnson, {Mark Allan}",
  year     = "2016",
  month    = "1",
  day      = "1",
  doi      = "10.21437/Interspeech.2016-736",
  language = "English (US)",
  volume   = "08-12-September-2016",
  pages    = "3863--3867",
  journal  = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  issn     = "2308-457X",
}
