Abstract

ToBI [1] is a prosody labeling system that transcribes American English prosody in terms of phonological tones and break indices. Previous work on automatic ToBI transcription requires additional information such as word boundaries and uses modular feature extraction with separately optimized feature detectors and classifiers [2]. We ask whether a neural network-based approach can also achieve high performance on automatic ToBI transcription without such additional information. To that end, we investigate pitch accent detection and phrase boundary detection using the Wav2vec 2.0 model [3] with only acoustic information. Our model is trained on the Boston University Radio News Corpus and evaluated on both the Boston University Radio News Corpus and the Boston Directions Corpus. It achieves an F1 score of 0.82 on pitch accent detection and 0.86 on phrase boundary detection. Code and model weights are available.

Original language: English (US)
Pages (from-to): 2748-2752
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: Aug 20 2023 - Aug 24 2023

Keywords

  • Prosodic boundaries
  • ToBI-label generation
  • Wav2vec2

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'Wav2ToBI: a new approach to automatic ToBI transcription'. Together they form a unique fingerprint.
