Abstract
In this paper, we explore audio editing with non-rigid text prompts via Latent Diffusion Models. Our methodology is based on carrying out a fine-tuning step on the latent diffusion model, which increases the overall faithfulness of the generated edits to the input audio. We quantitatively and qualitatively show that our pipeline obtains results that outperform current state-of-the-art neural audio editing pipelines for addition, style transfer, and inpainting. Through a user study, we show that our method results in higher user preference compared to several baselines. We also show that the produced edits obtain better trade-offs in terms of fidelity to the text prompt and to the input audio compared to the baselines. Finally, we benchmark the impact of LoRA to improve editing speed while maintaining edit quality.
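The abstract mentions using LoRA to speed up editing without giving implementation details. As a generic illustration only (not the authors' method), a LoRA adapter augments a frozen pretrained weight with a trainable low-rank update; all layer sizes, the rank `r`, and the scaling `alpha` below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 512, 512, 8   # hypothetical layer sizes and LoRA rank
alpha = 16                     # common LoRA scaling hyperparameter

# Frozen pretrained weight (stands in for a diffusion-model projection layer).
W = rng.standard_normal((d_out, d_in)) * 0.02

# Trainable low-rank factors: B starts at zero, so the adapted layer
# initially reproduces the pretrained layer exactly.
A = rng.standard_normal((r, d_in)) * 0.02
B = np.zeros((d_out, r))

def lora_forward(x):
    """y = W x + (alpha/r) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Zero-initialized B means the adapted output matches the frozen layer.
assert np.allclose(lora_forward(x), W @ x)

print(f"trainable params: {A.size + B.size} vs full fine-tune {W.size}")
```

Training only `A` and `B` (here 8,192 parameters versus 262,144 for the full matrix) is what makes LoRA-style fine-tuning cheap relative to updating every weight.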
| Original language | English (US) |
|---|---|
| Pages (from-to) | 3290-3294 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| State | Published - 2024 |
| Event | 25th Interspeech Conference 2024 - Kos Island, Greece. Duration: Sep 1 2024 → Sep 5 2024 |
Keywords
- audio editing
- generative models for audio
- latent diffusion
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modeling and Simulation