Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge

Research output: Contribution to journal › Conference article

Abstract

In this work, we present a new large-vocabulary, broadcast news ASR system for Hindi. Since Hindi has a largely phonemic orthography, the pronunciation model was automatically generated from text. We experiment with several variants of this model and study the effect of incorporating word boundary information with these models. We also experiment with knowledge-based adaptations to the language model in Hindi, derived in an unsupervised manner, that lead to small improvements in word error rate (WER). Our experiments were conducted on a new corpus assembled from publicly available Hindi news broadcasts. We evaluate our techniques on an open-vocabulary task and obtain competitive WERs on an unseen test set.

Original language: English (US)
Pages (from-to): 3164-3168
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2015-January
State: Published - Jan 1 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 - Sep 10 2015

Keywords

  • Broadcast news ASR
  • Grapheme and phoneme-based models
  • Hindi LVCSR system
  • Knowledge-based language-model adaptation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{eb0d7b05418d4da5974a75d0fb873ce1,
title = "Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge",
abstract = "In this work, we present a new large-vocabulary, broadcast news ASR system for Hindi. Since Hindi has a largely phonemic orthography, the pronunciation model was automatically generated from text. We experiment with several variants of this model and study the effect of incorporating word boundary information with these models. We also experiment with knowledge-based adaptations to the language model in Hindi, derived in an unsupervised manner, that lead to small improvements in word error rate (WER). Our experiments were conducted on a new corpus assembled from publicly available Hindi news broadcasts. We evaluate our techniques on an open-vocabulary task and obtain competitive WERs on an unseen test set.",
keywords = "Broadcast news ASR, Grapheme and phoneme-based models, Hindi LVCSR system, Knowledge-based language-model adaptation",
author = "Jyothi, Preethi and Hasegawa-Johnson, {Mark Allan}",
year = "2015",
month = "1",
day = "1",
language = "English (US)",
volume = "2015-January",
pages = "3164--3168",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR

T1 - Improved Hindi broadcast ASR by adapting the language model and pronunciation model using a priori syntactic and morphophonemic knowledge

AU - Jyothi, Preethi

AU - Hasegawa-Johnson, Mark Allan

PY - 2015/1/1

Y1 - 2015/1/1

N2 - In this work, we present a new large-vocabulary, broadcast news ASR system for Hindi. Since Hindi has a largely phonemic orthography, the pronunciation model was automatically generated from text. We experiment with several variants of this model and study the effect of incorporating word boundary information with these models. We also experiment with knowledge-based adaptations to the language model in Hindi, derived in an unsupervised manner, that lead to small improvements in word error rate (WER). Our experiments were conducted on a new corpus assembled from publicly available Hindi news broadcasts. We evaluate our techniques on an open-vocabulary task and obtain competitive WERs on an unseen test set.

KW - Broadcast news ASR

KW - Grapheme and phoneme-based models

KW - Hindi LVCSR system

KW - Knowledge-based language-model adaptation

UR - http://www.scopus.com/inward/record.url?scp=84959144685&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959144685&partnerID=8YFLogxK

M3 - Conference article

VL - 2015-January

SP - 3164

EP - 3168

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -