Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models

Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu Xiong Wang, Martial Hebert

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Despite recent significant strides by diffusion-based Text-to-Image (T2I) models, current systems still struggle to ensure compositional generation that is aligned with text prompts, particularly for multi-object generation. In this work, we first show the fundamental reasons for such misalignment by identifying issues related to low attention activation and mask overlaps. We then propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, which reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, can directly perform inference given an arbitrary multi-object prompt, which improves scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model with a diverse range of concepts enables it to generalize effectively to novel concepts, exhibiting enhanced performance compared to models trained on individual concept pairs.

Original language: English (US)
Title of host publication: Proceedings - SIGGRAPH 2024 Conference Papers
Editors: Stephen N. Spencer
Publisher: Association for Computing Machinery
ISBN (Electronic): 9798400705250
DOIs
State: Published - Jul 13 2024
Externally published: Yes
Event: SIGGRAPH 2024 Conference Papers - Denver, United States
Duration: Jul 28 2024 to Aug 1 2024

Publication series

Name: Proceedings - SIGGRAPH 2024 Conference Papers

Conference

Conference: SIGGRAPH 2024 Conference Papers
Country/Territory: United States
City: Denver
Period: 7/28/24 to 8/1/24

Keywords

  • Diffusion Models
  • Image Generation

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Visual Arts and Performing Arts
  • Computer Graphics and Computer-Aided Design
