CONTENTVEC: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Jeff Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Research output: Contribution to journal › Conference article › peer-review

Abstract

Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL in speech focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage from the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teachers (masked prediction labels) and the students (learned representations). We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.

Original language: English (US)
Pages (from-to): 18003-18017
Number of pages: 15
Journal: Proceedings of Machine Learning Research
Volume: 162
State: Published - 2022
Event: 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration: Jul 17 2022 to Jul 23 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
