TY - GEN
T1 - AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
T2 - 18th European Conference on Computer Vision, ECCV 2024
AU - Qiu, Ri-Zhao
AU - Wang, Yu-Xiong
AU - Hauser, Kris
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Apart from creative applications, can we use such models to synthesize samples that aid the few-shot training of discriminative models? In this work, we propose AlignDiff, a general framework for synthesizing training images and masks for few-shot segmentation. We identify two crucial misalignments that arise when utilizing pre-trained diffusion models in segmentation tasks, which need to be addressed to create realistic training samples and align the synthetic data distribution with the real training distribution: 1) instance-level misalignment, where generated samples of rare categories are often misaligned with target tasks, and 2) annotation-level misalignment, where diffusion models are limited to generating images without pixel-level annotations. AlignDiff overcomes both challenges by leveraging a few real samples to guide the generation, thus improving novel IoU over baseline methods in few-shot segmentation and generalized few-shot segmentation on Pascal-5i and COCO-20i by up to 80%. Notably, AlignDiff is capable of augmenting the learning of out-of-distribution uncommon categories on FSS-1000, while a naïve diffusion model generates samples that diminish segmentation performance.
AB - Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Apart from creative applications, can we use such models to synthesize samples that aid the few-shot training of discriminative models? In this work, we propose AlignDiff, a general framework for synthesizing training images and masks for few-shot segmentation. We identify two crucial misalignments that arise when utilizing pre-trained diffusion models in segmentation tasks, which need to be addressed to create realistic training samples and align the synthetic data distribution with the real training distribution: 1) instance-level misalignment, where generated samples of rare categories are often misaligned with target tasks, and 2) annotation-level misalignment, where diffusion models are limited to generating images without pixel-level annotations. AlignDiff overcomes both challenges by leveraging a few real samples to guide the generation, thus improving novel IoU over baseline methods in few-shot segmentation and generalized few-shot segmentation on Pascal-5i and COCO-20i by up to 80%. Notably, AlignDiff is capable of augmenting the learning of out-of-distribution uncommon categories on FSS-1000, while a naïve diffusion model generates samples that diminish segmentation performance.
KW - Data Synthesis
KW - Semantic Segmentation
KW - Text-to-Image Diffusion
UR - http://www.scopus.com/inward/record.url?scp=85210155208&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85210155208&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-72940-9_22
DO - 10.1007/978-3-031-72940-9_22
M3 - Conference contribution
AN - SCOPUS:85210155208
SN - 9783031729393
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 384
EP - 400
BT - Computer Vision – ECCV 2024: 18th European Conference, Proceedings
A2 - Leonardis, Aleš
A2 - Ricci, Elisa
A2 - Roth, Stefan
A2 - Russakovsky, Olga
A2 - Sattler, Torsten
A2 - Varol, Gül
PB - Springer
Y2 - 29 September 2024 through 4 October 2024
ER -