Efficient Personalized Speech Enhancement Through Self-Supervised Learning

Aswin Sivaraman, Minje Kim

Research output: Contribution to journal › Article › peer-review

Abstract

This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers, personalized models can adapt to a particular speaker's voice and thus solve a narrower problem. Hence, personalization can achieve better performance in addition to reducing computational complexity. However, naive personalization methods require clean speech from the target user, which can be inconvenient to acquire, e.g., due to subpar recording conditions. To this end, we pose personalization either as a zero-shot task, in which no clean speech of the target speaker is used, or as a few-shot learning task, which aims to minimize the duration of clean speech used for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both the zero- and few-shot personalization tasks. The proposed methods learn personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) rather than from clean sources. We investigate three different self-supervised learning mechanisms. We set up a pseudo speech enhancement problem as a pretext task, which pretrains the models to estimate noisy speech as if it were the clean target. Contrastive learning and data purification methods regularize the loss function of the pseudo enhancement problem, overcoming the limitations of learning from unlabeled data. We assess our methods by personalizing the well-known ConvTasNet architecture to twenty different target speakers. The results show that self-supervision-based personalization improves on the original ConvTasNet's enhancement quality with fewer model parameters and less clean data from the target user.
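
To make the pseudo speech enhancement pretext task concrete, the following is a minimal sketch (not the authors' implementation): an in-the-wild noisy recording of the target speaker is treated as the pseudo-clean target, further contaminated with additional noise, and the model is trained to recover it. The choice of model, MSE loss, and training loop details here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pseudo_se_step(model, noisy_speech, extra_noise, optimizer):
    """One pretext-task update (hypothetical helper, not from the paper).

    noisy_speech : in-the-wild recording of the target speaker, used as the pseudo-target
    extra_noise  : additional noise waveform used to build the pseudo-mixture
    """
    pseudo_mixture = noisy_speech + extra_noise   # further-degraded input signal
    estimate = model(pseudo_mixture)              # model attempts to undo the added noise
    loss = F.mse_loss(estimate, noisy_speech)     # the noisy recording serves as the label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper, this pretext loss is further regularized by contrastive learning and data purification to mitigate the fact that the pseudo-target itself is noisy; those components are not shown in the sketch above.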

Original language: English (US)
Pages (from-to): 1342-1356
Number of pages: 15
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 16
Issue number: 6
DOIs
State: Published - Oct 1 2022
Externally published: Yes

Keywords

  • Data efficiency
  • model complexity
  • personalized speech enhancement
  • self-supervised learning

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
