TY  - CONF
T1  - Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models
T2 - SIGGRAPH 2024 Conference Papers
AU - Bao, Zhipeng
AU - Li, Yijun
AU - Singh, Krishna Kumar
AU  - Wang, Yu-Xiong
AU - Hebert, Martial
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/7/13
Y1 - 2024/7/13
AB  - Despite significant recent strides by diffusion-based Text-to-Image (T2I) models, current systems still struggle to ensure compositional generation that is well aligned with text prompts, particularly for multi-object generation. In this work, we first identify the fundamental reasons for such misalignment: low attention activation and overlapping attention masks. We then propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, which reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, can directly perform inference on an arbitrary multi-object prompt, which improves scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model on a diverse range of concepts enables it to generalize effectively to novel concepts, outperforming models trained on individual concept pairs.
KW - Diffusion Models
KW - Image Generation
UR - http://www.scopus.com/inward/record.url?scp=85199917080&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85199917080&partnerID=8YFLogxK
U2 - 10.1145/3641519.3657527
DO - 10.1145/3641519.3657527
M3 - Conference contribution
AN - SCOPUS:85199917080
T3 - Proceedings - SIGGRAPH 2024 Conference Papers
BT - Proceedings - SIGGRAPH 2024 Conference Papers
A2 - Spencer, Stephen N.
PB - Association for Computing Machinery
Y2 - 28 July 2024 through 1 August 2024
ER -
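
Purely for orientation, the two objectives named in the abstract can be pictured with a short, self-contained PyTorch sketch. Everything below is an assumption for illustration, not the authors' implementation: the function name, the tensor layout, and the exact formulas (shared probability mass as the overlap penalty for the Separate loss, a hinge on each object's peak activation for the Enhance loss) are placeholders in the spirit of the abstract; consult the paper at the DOI above for the actual losses.

    import torch
    import torch.nn.functional as F

    def separate_and_enhance_losses(attn_maps: torch.Tensor,
                                    object_token_ids: list[int]):
        """Hypothetical sketch of the two objectives from the abstract.

        attn_maps: (num_pixels, num_tokens) cross-attention maps from one
            denoising step, softmaxed over the token dimension.
        object_token_ids: prompt token indices of the object nouns.
        Returns (separate_loss, enhance_loss) as scalar tensors.
        """
        # Per-object spatial maps, each renormalized to sum to 1 so they
        # can be compared as spatial distributions.
        maps = []
        for t in object_token_ids:
            m = attn_maps[:, t]
            maps.append(m / (m.sum() + 1e-8))

        # Separate loss (assumed form): penalize pairwise spatial overlap,
        # measured here as the mass shared by the two distributions.
        separate = attn_maps.new_zeros(())
        for i in range(len(maps)):
            for j in range(i + 1, len(maps)):
                separate = separate + torch.minimum(maps[i], maps[j]).sum()

        # Enhance loss (assumed form): push each object's peak attention
        # activation toward 1 with a hinge, so weakly activated objects
        # contribute gradient while well-activated ones do not.
        enhance = attn_maps.new_zeros(())
        for t in object_token_ids:
            enhance = enhance + F.relu(1.0 - attn_maps[:, t].max())

        return separate, enhance

    # Example: a 64x64 attention grid over 77 prompt tokens, with two
    # hypothetical object tokens at positions 2 and 5.
    attn = torch.softmax(torch.randn(64 * 64, 77), dim=-1)
    sep_loss, enh_loss = separate_and_enhance_losses(attn, [2, 5])

In a finetuning loop, one would presumably average such losses over the relevant cross-attention layers and diffusion timesteps and combine them (with weights) with the standard denoising objective, updating only the critical parameters the abstract mentions.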