Audio Editing with Non-Rigid Text Prompts

Francesco Paissan, Luca Della Libera, Zhepei Wang, Paris Smaragdis, Mirco Ravanelli, Cem Subakan

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this paper, we explore audio editing with non-rigid text prompts via Latent Diffusion Models. Our methodology is based on carrying out a fine-tuning step on the latent diffusion model, which increases the overall faithfulness of the generated edits to the input audio. We quantitatively and qualitatively show that our pipeline obtains results that outperform current state-of-the-art neural audio editing pipelines for addition, style transfer, and inpainting. Through a user study, we show that our method results in higher user preference compared to several baselines. We also show that the produced edits achieve better trade-offs between fidelity to the text prompt and fidelity to the input audio compared to the baselines. Finally, we benchmark the impact of LoRA on improving editing speed while maintaining edit quality.
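The LoRA approach mentioned above freezes the pretrained weights and learns a low-rank additive update, which is why it can speed up fine-tuning while preserving quality. The following is a minimal illustrative sketch of a LoRA-adapted linear layer in NumPy; it is an assumption-based example for exposition only, not the authors' actual implementation (names such as `LoRALinear` are hypothetical).

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA adapter: y = x @ (W + (alpha/r) * B @ A)^T.

    W is the frozen pretrained weight; only the low-rank factors A and B
    are trainable. B is zero-initialized, so at the start of fine-tuning
    the adapted layer reproduces the frozen model exactly.
    """

    def __init__(self, w, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                     # frozen (out_dim, in_dim)
        out_dim, in_dim = w.shape
        self.a = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable down-projection
        self.b = np.zeros((out_dim, rank))              # trainable up-projection
        self.scale = alpha / rank

    def __call__(self, x):
        # Effective weight is W + scale * B @ A.
        return x @ (self.w + self.scale * self.b @ self.a).T

    def trainable_params(self):
        # Only A and B are updated; W stays frozen.
        return self.a.size + self.b.size
```

For a 512×512 layer, a rank-4 adapter trains 2 × 512 × 4 = 4096 parameters instead of 262,144, which is the source of the editing-speed gains the abstract benchmarks.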

Original language: English (US)
Pages (from-to): 3290-3294
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: Sep 1, 2024 - Sep 5, 2024

Keywords

  • audio editing
  • generative models for audio
  • Latent diffusion

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
