LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Mohamed Bin Zayed University of Artificial Intelligence · KAUST · Australian National University
  ICLR 2024

Current state-of-the-art text-to-image models struggle with lengthy, detailed text prompts, often omitting objects and fine-grained details. Our approach captures all of the described objects, preserving the intricate features and spatial characteristics outlined in the two white boxes.

Abstract

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling at generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines the content of each box to align it with its textual description, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.

Methodology

Overview of the LLM-Blueprint design.
Global Scene Generation: Given a long text prompt describing a complex scene, we query an LLM to generate k candidate layouts, which are then interpolated into a single layout to ensure spatially accurate object placement. Alongside the layouts, we also query the LLM for a detailed description of each object and a concise background prompt summarizing the scene's essence. A layout-to-image model then transforms this layout into an initial image.

Iterative Refinement Scheme: The content of each box proposal is evaluated and refined using a diffusion model conditioned on a box mask, a generated reference image for the box, and the source image, guided by a multi-modal signal; a sketch of the full pipeline follows below.
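To make the two phases concrete, the Python sketch below outlines the pipeline end to end. It is a minimal sketch, not the released implementation: the helpers `query_llm`, `layout_to_image`, `text_to_image`, `box_text_score`, and `inpaint_box` are hypothetical placeholders standing in for, respectively, the LLM interface, a layout-conditioned diffusion model, a per-object reference generator, a multi-modal alignment score (e.g., an image-text similarity), and a mask-conditioned diffusion inpainter; the hyperparameter values are illustrative.

```python
# Minimal sketch of the two-phase pipeline described above.
# All helper functions are hypothetical placeholders, not the authors' API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BoxProposal:
    name: str                 # object label, e.g. "wooden bookshelf"
    description: str          # detailed per-object text from the LLM
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalized

def average_layouts(layouts: List[List[BoxProposal]]) -> List[BoxProposal]:
    """Interpolate k LLM-proposed layouts into a single layout by averaging
    each object's box coordinates (assumes every sampled layout lists the
    same objects in the same order)."""
    merged = []
    for proposals in zip(*layouts):
        boxes = [p.box for p in proposals]
        avg = tuple(sum(b[i] for b in boxes) / len(boxes) for i in range(4))
        merged.append(BoxProposal(proposals[0].name, proposals[0].description, avg))
    return merged

def generate(prompt: str, k: int = 4, max_rounds: int = 3, threshold: float = 0.3):
    # --- Phase 1: Global Scene Generation ---
    layouts = [query_llm(prompt, task="layout") for _ in range(k)]
    layout = average_layouts(layouts)
    background = query_llm(prompt, task="background_summary")
    image = layout_to_image(layout, background)  # layout-conditioned diffusion

    # --- Phase 2: Iterative Refinement Scheme ---
    for _ in range(max_rounds):
        consistent = True
        for obj in layout:
            # Multi-modal check: does the box content match its description?
            if box_text_score(image, obj.box, obj.description) < threshold:
                consistent = False
                reference = text_to_image(obj.description)  # per-box reference image
                image = inpaint_box(image, obj.box, reference, obj.description)
        if consistent:
            break
    return image
```

Averaging the k sampled layouts, rather than trusting a single LLM response, is what the first phase relies on for spatially plausible object placement; the refinement loop then only recomposes boxes whose content fails the alignment check.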

Qualitative Comparisons

Qualitative comparisons with state-of-the-art methods. The underlined text in the text prompts represents the objects, their characteristics, and spatial properties. Red text highlights missing objects, purple signifies inaccuracies in object positioning, and black text points out implausible or deformed elements. Baseline methods often omit objects and struggle with spatial accuracy (first four columns), while our approach excels at capturing all objects and preserving spatial attributes (last column).

BibTeX

@inproceedings{gani2023llm,
      title={LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts}, 
      author={Hanan Gani and Shariq Farooq Bhat and Muzammal Naseer and Salman Khan and Peter Wonka},
      booktitle={Twelfth International Conference on Learning Representations},
      year={2024}
}