Development of an Acoustic-Phonetic Hidden Markov Model for Continuous Speech Recognition

Andrej Ljolje, Stephen E. Levinson

Research output: Contribution to journal › Article › peer-review

Abstract

It has recently been proposed that it is possible to model the acoustic-phonetic structure of the English language using a single ergodic hidden Markov model. In this paper we report on the techniques used to develop such a model, the problems associated with representing the whole acoustic-phonetic structure, the characteristics of the model, and how it performs as a phonetic decoder for recognition of fluent speech. The continuous variable duration model was trained using 450 sentences of fluent speech, each of which was spoken by a single speaker, and segmented and labeled using a fixed number of phonemes, each of which has a direct correspondence to the states of the model. The inherent variability of each phoneme is modeled as the observable random process of the Markov chain, while the phonotactic model of the unobservable phonetic sequence is represented by the state transition matrix of the hidden Markov model. The model assumes that the observed spectral data were generated by a Gaussian source. However, an analysis of the data clearly shows that the spectra for most of the phonemes are not normally distributed and that an alternative representation would be beneficial. Also, the recognition results indicate that a form of liftering of the cepstral and delta cepstral coefficients considerably improves the recognition results. Additionally, an identical recognition experiment was performed using a traditional form of a hidden Markov model that does not have an explicit duration model, but preserved the rest of the model parameters. The absence of a correct duration model increases the error rate by 50%. It is shown that the difficulties of developing an acoustic-phonetic model are not due to the inherent deficiencies of the concept presented here, using a single ergodic hidden Markov model for acoustic-phonetic modeling. Instead, they are due to the choice of phonemes to be modeled, the selected parametrization of the data, and the appropriate choice of a variant of an ergodic hidden Markov model.
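As a rough illustration of the structure the abstract describes (one state per phoneme, a transition matrix for the phonotactics, a Gaussian source per state, and an explicit duration per state), the following minimal sketch estimates such parameters from hand-segmented, labeled speech. It is not the authors' training procedure: the function name `estimate_model`, the input layout, the diagonal-covariance simplification, and the use of a plain mean duration in place of a full duration density are all assumptions made for illustration.

```python
from collections import defaultdict


def estimate_model(utterances):
    """Estimate per-phoneme parameters from labeled segments (hypothetical helper).

    utterances: list of utterances; each utterance is a list of
    (phoneme, frames) segments, where frames is a list of equal-length
    feature vectors (lists of floats), one vector per analysis frame.
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    frames_by_state = defaultdict(list)
    durations = defaultdict(list)

    for segments in utterances:
        for i, (phoneme, frames) in enumerate(segments):
            frames_by_state[phoneme].extend(frames)
            durations[phoneme].append(len(frames))  # duration in frames
            if i + 1 < len(segments):
                # Count phoneme-to-phoneme transitions (the phonotactics).
                trans_counts[phoneme][segments[i + 1][0]] += 1

    # Row-normalize the counts into transition probabilities.
    transitions = {}
    for src, row in trans_counts.items():
        total = sum(row.values())
        transitions[src] = {dst: n / total for dst, n in row.items()}

    # Assumption: diagonal-covariance Gaussian per state (mean, variance).
    gaussians = {}
    for phoneme, frames in frames_by_state.items():
        n, dim = len(frames), len(frames[0])
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
        gaussians[phoneme] = (mean, var)

    # Assumption: summarize each state's duration by its mean only.
    mean_duration = {p: sum(ds) / len(ds) for p, ds in durations.items()}
    return transitions, gaussians, mean_duration
```

Because the state sequence is known from the labels, every quantity here reduces to counting and averaging. Dropping `mean_duration` and letting a self-transition probability govern how long a state is occupied would correspond loosely to the traditional model without explicit durations that the abstract uses for comparison.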

Original language: English (US)
Pages (from-to): 29-39
Number of pages: 11
Journal: IEEE Transactions on Signal Processing
Volume: 39
Issue number: 1
State: Published - Jan 1991
Externally published: Yes

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
