WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Research output: Contribution to journal › Conference article › peer-review

Abstract

Large-scale auto-regressive language models pretrained on massive text have demonstrated an impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that this few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings that function like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WAVPROMPT, where we fine-tune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WAVPROMPT is a few-shot learner that can perform speech understanding tasks better than a naïve text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show that WAVPROMPT can extract more information than just the transcriptions. The source code is available at https://github.com/Hertin/WavPrompt.
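
The idea in the abstract, an audio encoder whose outputs sit in the same embedding space as a frozen language model's text embeddings, can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Every name, dimension, and the prompt layout below (`EMBED_DIM`, `AUDIO_DIM`, `PROJ`, `encode_audio`, `build_prompt`, the "question:/answer:" template) is a hypothetical stand-in, and random vectors replace both the wav2vec features and the language model's embedding table.

```python
import random

EMBED_DIM = 8   # hypothetical width of the language model's embeddings
AUDIO_DIM = 4   # hypothetical width of the audio encoder's features

rng = random.Random(0)

# Stand-in for the frozen language model's text embedding table.
VOCAB = ["question:", "answer:", "yes", "no"]
TEXT_EMBED = {tok: [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for tok in VOCAB}

# Stand-in for the trainable part: a linear map from audio features
# into the language model's embedding space.
PROJ = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(EMBED_DIM)]


def encode_audio(frames):
    """Map each audio frame (length AUDIO_DIM) to an EMBED_DIM vector."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in PROJ]
            for frame in frames]


def embed_text(tokens):
    """Look up the frozen text embeddings for known tokens."""
    return [TEXT_EMBED[tok] for tok in tokens]


def build_prompt(shots, query_frames):
    """Concatenate few-shot demonstrations and a final query into one sequence.

    Each demonstration is (audio frames, answer token); the query has no answer.
    """
    sequence = []
    for frames, answer in shots:
        sequence += encode_audio(frames)
        sequence += embed_text(["question:", "answer:", answer])
    sequence += encode_audio(query_frames)
    sequence += embed_text(["question:", "answer:"])
    return sequence


def fake_frames(n):
    """Random placeholder for n frames of audio features."""
    return [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(n)]


shots = [(fake_frames(3), "yes"), (fake_frames(3), "no")]
prompt = build_prompt(shots, fake_frames(2))
print(len(prompt), len(prompt[0]))  # prints "16 8"
```

In a real system the resulting sequence would be passed to the language model as input embeddings, and only the audio side would receive gradient updates; the sketch stops at assembling the sequence.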

Original language: English (US)
Pages (from-to): 2738-2742
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
State: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sep 18 2022 → Sep 22 2022

Keywords

  • few-shot learning
  • language model
  • speech understanding

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
