Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions

Mark Hasegawa-Johnson, Preethi Jyothi, Wenda Chen, Van Hai Do

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Automatic speech recognition (ASR) converts audio to text. ASR is usually trained using a large quantity of labeled data, i.e., audio with text transcription. In many languages, however, text transcription is hard to find, e.g., in both Hokkien and Dinka, we found native speakers who had received all their primary education in some other language, and who therefore had difficulty writing in their own language. Fortunately, speech in every language is produced by human mouths, and designed to be interpreted by human ears. Speakers of a majority language (English, say, or Mandarin Chinese) are therefore able to make some sense of even the strangest language (Zulu, say, or Cantonese): language-unique distinctions are mostly lost, but universal distinctions such as consonant versus vowel are, for the most part, correctly transmitted. We can decode such mismatched transcripts using an information-theoretic decoder, resulting in a low-entropy probability distribution over the possible native-language transcriptions. Mismatched transcripts can be used to train ASR: combining ten hours of mismatched transcripts with 12–48 minutes of native transcripts, if available, results in a lower phone error rate. On the other hand, if we don't even know the native phoneme inventory, mismatched transcripts in two or more annotation languages can be used to infer the native phoneme inventory (with entropy depending on the distinctive feature inventory of the annotation languages).
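The core decoding idea in the abstract is a noisy-channel model: the annotator's mismatched label is treated as a noisy observation of the native phone, and Bayes' rule recovers a posterior distribution over native phones. A minimal sketch of that step, with an invented three-phone inventory and a purely illustrative channel matrix (all names and probabilities here are hypothetical, not the paper's actual model):

```python
import numpy as np

# Hypothetical toy inventory: three native phones, and the label an
# English-speaking annotator might write when hearing each one.
native_phones = ["p", "b", "m"]
annotation_labels = ["p", "b", "m"]

# Illustrative channel model P(annotation | native phone):
# rows = native phones, columns = annotation labels (made-up values).
channel = np.array([
    [0.7, 0.2, 0.1],   # native /p/ heard as p, b, m
    [0.3, 0.5, 0.2],   # native /b/
    [0.1, 0.2, 0.7],   # native /m/
])

# Uniform prior over native phones, for lack of better knowledge.
prior = np.full(len(native_phones), 1.0 / len(native_phones))

def decode(annotation: str) -> dict:
    """Posterior P(native phone | annotation) via Bayes' rule."""
    j = annotation_labels.index(annotation)
    joint = prior * channel[:, j]      # P(phone) * P(annotation | phone)
    posterior = joint / joint.sum()
    return dict(zip(native_phones, posterior))

def entropy_bits(posterior: dict) -> float:
    """Entropy of the posterior, in bits: lower = more informative label."""
    p = np.array(list(posterior.values()))
    return float(-(p * np.log2(p)).sum())

post = decode("b")
print(post, entropy_bits(post))
```

In the paper's setting the decoder operates over whole transcript sequences (and the channel is estimated from data rather than written down by hand), but the per-symbol Bayes update above is the basic mechanism by which a mismatched label yields a low-entropy distribution over native-language transcriptions.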

Original language: English (US)
Title of host publication: Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
Editors: Michael B. Matthews
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1277-1281
Number of pages: 5
ISBN (Electronic): 9781538618233
DOIs: https://doi.org/10.1109/ACSSC.2017.8335558
State: Published - Apr 10 2018
Event: 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017 - Pacific Grove, United States
Duration: Oct 29 2017 – Nov 1 2017

Publication series

Name: Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
Volume: 2017-October

Other

Other: 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
Country: United States
City: Pacific Grove
Period: 10/29/17 – 11/1/17

ASJC Scopus subject areas

  • Control and Optimization
  • Computer Networks and Communications
  • Hardware and Architecture
  • Signal Processing
  • Biomedical Engineering
  • Instrumentation

Cite this

Hasegawa-Johnson, M., Jyothi, P., Chen, W., & Do, V. H. (2018). Mismatched crowdsourcing: Mining latent skills to acquire speech transcriptions. In M. B. Matthews (Ed.), Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017 (pp. 1277-1281). [8335558] (Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017; Vol. 2017-October). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ACSSC.2017.8335558