Abstract

To understand why self-supervised learning (SSL) models have empirically achieved strong performance on several speech-processing downstream tasks, numerous studies have analyzed the information encoded in SSL layer representations of adult speech. Limited work has investigated how pre-training and fine-tuning affect the way SSL models encode children's speech and vocalizations. In this study, we aim to bridge this gap by probing SSL models on two relevant downstream tasks: (1) phoneme recognition (PR) on the speech of adults, older children (8-10 years old), and younger children (1-4 years old), and (2) vocalization classification (VC) distinguishing cry, fuss, and babble for infants under 14 months old. For younger children's PR, the superiority of fine-tuned SSL models is largely due to their ability to learn features that represent older children's speech and then adapt those features to the speech of younger children. For infant VC, SSL models pre-trained on large-scale home recordings learn to leverage phonetic representations at middle layers, thereby improving performance on this task.
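The keywords list canonical correlation analysis, but the abstract does not say how it is applied. As an illustration only, the sketch below shows one common way to probe layers: score each layer's frame-level features against one-hot phone labels using the mean canonical correlation. The function name `cca_score`, the toy dimensions, and the synthetic "layers" are all assumptions made for this example. They are not the authors' code or data, and the numbers it prints are not results from the paper.

```python
import numpy as np


def cca_score(X, Y, eps=1e-10):
    """Mean canonical correlation between two views (rows are frames)."""
    # Center each view so correlations are not driven by the means.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered views.
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
    # Drop numerically null directions (centered one-hot labels lose one rank).
    Ux = Ux[:, sx > eps * sx.max()]
    Uy = Uy[:, sy > eps * sy.max()]
    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n_frames, dim, n_phones = 2000, 16, 5
labels = rng.integers(0, n_phones, size=n_frames)
onehot = np.eye(n_phones)[labels]

# Toy "layers": Gaussian noise plus a label-dependent signal whose strength
# is chosen by hand to peak in the middle of the stack.
proj = rng.normal(size=(n_phones, dim))
strengths = [0.0, 1.0, 3.0, 1.0]
layers = [s * onehot @ proj + rng.normal(size=(n_frames, dim)) for s in strengths]

scores = [cca_score(h, onehot) for h in layers]
for i, score in enumerate(scores):
    print(f"layer {i}: mean canonical correlation = {score:.3f}")
```

In a real probe, the synthetic `layers` would be replaced by hidden states extracted from each layer of an SSL model and `onehot` by frame-aligned phone labels. A higher score at a given layer suggests that the layer encodes more linearly accessible phonetic information.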

Original language: English (US)
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 550-554
Number of pages: 5
ISBN (Electronic): 9798350374513
State: Published - 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Seoul, Korea, Republic of
Duration: Apr 14 2024 to Apr 19 2024

Publication series

Name: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 4/14/24 to 4/19/24

Keywords

  • canonical correlation analysis
  • children's speech
  • infant vocalizations
  • paralinguistic features
  • Self-supervised learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Signal Processing
  • Media Technology
  • Acoustics and Ultrasonics

Fingerprint

Dive into the research topics of 'Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations'. Together they form a unique fingerprint.
