Identify speakers in cocktail parties with end-to-end attention

Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn spectrogram masks that are optimized for the purpose of speaker identification, while residual forward connections permit dilated convolution with a sufficiently large context window to guarantee correct streaming across syllable boundaries. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes all speakers in three-speaker scenarios with 81.2% accuracy.

Original languageEnglish (US)
Title of host publicationInterspeech 2020
PublisherInternational Speech Communication Association
Pages3092-3096
Number of pages5
ISBN (Print)9781713820697
DOIs
StatePublished - 2020
Event21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: Oct 25 2020Oct 29 2020

Publication series

NameProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume2020-October
ISSN (Print)2308-457X
ISSN (Electronic)1990-9772

Conference

Conference21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/TerritoryChina
CityShanghai
Period10/25/2010/29/20

Keywords

  • Cocktail party effect
  • Source separation
  • Speaker recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'Identify speakers in cocktail parties with end-to-end attention'. Together they form a unique fingerprint.

Cite this