G2PU: GRAPHEME-TO-PHONEME TRANSDUCER WITH SPEECH UNITS

Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Most phoneme transcripts are generated using forced alignment: typically a grapheme-to-phoneme transducer (G2P) is applied to text sequences to generate candidate phoneme transcripts, which are then time-aligned to the waveform using an acoustic model. This paper demonstrates, for the first time, simultaneous optimization of the G2P, the acoustic model, and the acoustic alignment to a corpus. To this end, we propose G2PU, a joint CTC-attention model consisting of an encoder-decoder G2P network and an encoder-CTC unit-to-phoneme (U2P) network, where the units are extracted from speech. We demonstrate that the G2P and U2P, operating in parallel, produce lower phone error rates than those of state-of-the-art open-source G2P and forced alignment systems. Furthermore, although the G2P and U2P are trained using parallel speech and text, their synergy can be generalized to text-only test corpora if we also train a grapheme-to-unit (G2U) network that generates speech units from text in the absence of parallel speech. Our G2PU model is trained using phoneme transcripts generated by a teacher G2P tool. Our experiments on Chinese and Japanese show that G2PU reduces phoneme error rate by 7% to 29% relative compared to its teacher. Finally, we include case studies to provide insights into the system's workings.
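
The relative reduction in phoneme error rate (PER) quoted in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not the authors' evaluation code. It assumes PER is the Levenshtein edit distance between reference and hypothesized phoneme sequences divided by the total number of reference phonemes, and that relative reduction is (teacher PER - student PER) / teacher PER. The function names and the toy phoneme sequences are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference symbols to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit errors divided by total reference phonemes over a corpus."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline_per: float, system_per: float) -> float:
    """Fraction by which system_per improves on baseline_per."""
    return (baseline_per - system_per) / baseline_per


# Toy, made-up sequences (assumption: not data from the paper).
reference = [["n", "i", "h", "a", "o"]]
teacher_hyp = [["n", "i", "x", "a", "u"]]  # 2 substitutions
student_hyp = [["n", "i", "h", "a", "u"]]  # 1 substitution

teacher_per = phoneme_error_rate(reference, teacher_hyp)  # 2 errors / 5 phonemes
student_per = phoneme_error_rate(reference, student_hyp)  # 1 error / 5 phonemes
reduction = relative_reduction(teacher_per, student_per)

print(f"teacher PER={teacher_per:.2f}, student PER={student_per:.2f}, "
      f"relative reduction={reduction:.0%}")
```

Under this definition, a reported "7% to 29% relative" reduction compares the student's PER with the teacher's PER as a ratio, not as an absolute difference in percentage points.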

Original language: English (US)
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 10061-10065
Number of pages: 5
ISBN (Electronic): 9798350344851
State: Published - 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: Apr 14 2024 → Apr 19 2024

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 4/14/24 → 4/19/24

Keywords

  • g2p
  • grapheme-to-phoneme transducer
  • speech recognition

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
