Abstract
We introduce flexibility to the supervised learning-based speech enhancement framework to achieve scalable and efficient speech enhancement (SESE). To this end, SESE conducts a series of segmented speech enhancement inference routines, each of which incrementally improves the result of the preceding one. The formulation is conceptually similar to cold diffusion, but we modify the sampling process so that each step targets an easier milestone task rather than aggressively targeting the clean speech. In addition, the incremental enhancement steps are trained to recover the residual between adjacent milestones, which improves the overall enhancement performance. We show that the proposed method improves the baseline supervised model’s performance while requiring fewer diffusion steps to achieve performance comparable to that of the more complex cold diffusion-based counterpart. Furthermore, SESE’s scalability can be useful in applications where moderately suppressed non-speech interference is preferred to aggressive enhancement, e.g., boosting dialog in movie soundtracks or speech enhancement on hearing aids.
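The incremental, residual-based scheme described in the abstract can be illustrated with a small toy sketch. The abstract does not define the milestones, so everything specific here is an assumption made for illustration: the milestones are taken to be linear interpolations between the noisy input and the clean speech, the function names (`make_milestones`, `residual_targets`, `enhance`) are hypothetical, and the trained per-step networks are replaced by oracle stand-ins that simply return the true residual. It is not the authors' implementation.

```python
import numpy as np


def make_milestones(noisy, clean, num_steps):
    # ASSUMPTION: milestone k is a linear blend of noisy and clean signals,
    # with k = 0 equal to the noisy input and k = num_steps equal to clean speech.
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    return [(1.0 - a) * noisy + a * clean for a in alphas]


def residual_targets(milestones):
    # Training target for step k: the residual between adjacent milestones.
    return [milestones[k + 1] - milestones[k] for k in range(len(milestones) - 1)]


def enhance(noisy, step_models, num_steps_to_run=None):
    # Apply the step models in order; each one adds its predicted residual.
    # Running fewer steps leaves part of the interference in place.
    if num_steps_to_run is None:
        num_steps_to_run = len(step_models)
    estimate = noisy.copy()
    for model in step_models[:num_steps_to_run]:
        estimate = estimate + model(estimate)
    return estimate


# Toy signals: a sine wave as "clean speech" plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1600)
clean = np.sin(2.0 * np.pi * 5.0 * t)
noisy = clean + 0.3 * rng.standard_normal(t.shape)

milestones = make_milestones(noisy, clean, num_steps=4)
targets = residual_targets(milestones)

# ASSUMPTION: oracle stand-ins for trained networks; each ignores its input
# and returns the true residual for its step.
step_models = [(lambda x, r=r: r) for r in targets]

full = enhance(noisy, step_models)        # all 4 steps
partial = enhance(noisy, step_models, 2)  # stop halfway for moderate suppression


def mse(a, b):
    return float(np.mean((a - b) ** 2))


print("noisy   MSE:", mse(noisy, clean))
print("partial MSE:", mse(partial, clean))
print("full    MSE:", mse(full, clean))
```

With oracle steps, stopping after two of four steps lands exactly on the halfway milestone, which mirrors the use case in the abstract where moderate suppression is preferred to aggressive enhancement. Real step models would only approximate these residuals.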
Original language | English (US) |
---|---|
Pages (from-to) | 1216-1220 |
Number of pages | 5 |
Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
State | Published - 2024 |
Event | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Republic of Korea. Duration: Apr 14 2024 → Apr 19 2024 |
Keywords
- cold diffusion
- model compression
- scalability
- speech enhancement
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering