TY - GEN
T1 - Mismatched crowdsourcing
T2 - 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
AU - Hasegawa-Johnson, Mark
AU - Jyothi, Preethi
AU - Chen, Wenda
AU - Do, Van Hai
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
AB - Automatic speech recognition (ASR) converts audio to text. ASR is usually trained using a large quantity of labeled data, i.e., audio with text transcription. In many languages, however, text transcription is hard to find: in both Hokkien and Dinka, for example, we found native speakers who had received all their primary education in some other language and who therefore had difficulty writing in their own language. Fortunately, speech in every language is produced by human mouths and designed to be interpreted by human ears. Speakers of a majority language (English, say, or Mandarin Chinese) are therefore able to make some sense of even the strangest language (Zulu, say, or Cantonese): language-unique distinctions are mostly lost, but universal distinctions such as consonant versus vowel are, for the most part, correctly transmitted. We can decode such mismatched transcripts using an information-theoretic decoder, resulting in a low-entropy probability distribution over the possible native-language transcriptions. Mismatched transcripts can be used to train ASR. Combining ten hours of mismatched transcripts with 12-48 minutes of native transcripts, if available, results in a lower phone error rate. On the other hand, if we don't even know the native phoneme inventory, mismatched transcripts in two or more annotation languages can be used to infer the native phoneme inventory (with entropy depending on the distinctive feature inventory of the annotation languages).
UR - http://www.scopus.com/inward/record.url?scp=85050984670&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85050984670&partnerID=8YFLogxK
U2 - 10.1109/ACSSC.2017.8335558
DO - 10.1109/ACSSC.2017.8335558
M3 - Conference contribution
AN - SCOPUS:85050984670
T3 - Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
SP - 1277
EP - 1281
BT - Conference Record of 51st Asilomar Conference on Signals, Systems and Computers, ACSSC 2017
A2 - Matthews, Michael B.
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 October 2017 through 1 November 2017
ER -