Abstract
ToBI [1] is a prosody labeling system that transcribes American English prosody in terms of phonological tones and break indices. Previous work on automatic ToBI transcription requires additional information such as word boundaries and relies on modular feature extraction, with feature detectors and classifiers optimized separately [2]. We ask whether a neural network-based approach can also achieve high performance on automatic ToBI transcription without such additional information. In this paper, we investigate pitch accent detection and phrase boundary detection using the wav2vec 2.0 model [3] with only acoustic information. Our model is trained on the Boston University Radio News Corpus and evaluated on both the Boston University Radio News Corpus and the Boston Directions Corpus. It achieves an F1 score of 0.82 on pitch accent detection and 0.86 on phrase boundary detection. Code and model weights are available.
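For readers unfamiliar with the reported metric, the following is a minimal sketch of how a binary F1 score can be computed from reference and predicted labels. It assumes one 0/1 label per unit (for example, 1 for a pitch-accented word); the abstract does not specify the evaluation granularity or any matching tolerance, so the function name `f1_score`, the label encoding, and the toy data are all illustrative assumptions rather than the authors' evaluation code.

```python
def f1_score(reference, predicted):
    """Binary F1 over two equal-length sequences of 0/1 labels.

    Illustrative helper only; not the paper's evaluation script.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct detections: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = pitch accent present); not data from the paper.
reference = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(reference, predicted)
print(f"F1 = {score:.2f}")
```

F1 is the harmonic mean of precision and recall, so it penalizes a detector that misses accents as well as one that over-predicts them.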
Original language | English (US) |
---|---|
Pages (from-to) | 2748-2752 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2023-August |
State | Published - 2023 |
Event | 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: Aug 20 2023 → Aug 24 2023 |
Keywords
- Prosodic boundaries
- ToBI-label generation
- Wav2vec2
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modeling and Simulation