Image generation continued its rapid march forward with a new version of Google’s flagship text-to-image model.
What’s new: Google introduced Imagen 3, a proprietary model that improves upon the previous version’s image quality and prompt adherence, with features like inpainting and outpainting to be added in the future. Imagen 3 is available via Google’s ImageFX web user interface and Vertex AI Platform. It follows closely upon the releases of Black Forest Labs’ Flux.1 family (open to varying degrees), Midjourney v6.1, and Stability AI Stable Diffusion XL 1 (open weights) — all in the last month.
How it works: The accompanying paper does not describe the model’s architecture and training procedures in detail. The authors trained a diffusion model on an unspecified “large” dataset of images, text, and associated annotations that was filtered to remove unsafe, violent, low-quality, generated, and duplicate images as well as personally identifying information. Google’s Gemini large language model generated some image captions used in training to make their language more diverse.
Results: Imagen 3 mostly outperformed competing models in head-to-head comparisons based on prompts from datasets including GenAI-Bench, DrawBench, and DALL-E 3 Eval. The team compared Imagen 3 to Midjourney v6.0, OpenAI DALL-E 3, Stable Diffusion 3 Large, and Stable Diffusion XL 1.0. More than 3,000 evaluators from 71 countries rated the models’ responses in side-by-side comparisons. The raters evaluated image quality, preference regardless of the prompt, adherence to the prompt, adherence to a highly detailed prompt, and ability to generate the correct numbers of objects specified in a prompt. Their ratings (between 1 and 5) were used to compute Elo ratings.
- Imagen 3 swept the overall preference tests. On GenAI-Bench and DrawBench, Imagen 3 (1,099 Elo and 1,068 Elo respectively) beat the next-best Stable Diffusion 3 (1,047 Elo and 1,053 Elo respectively). On DALL-E 3 Eval, Imagen 3 (1,079 Elo) beat the next-best MidJourney v6.0 (1,068 Elo).
- Likewise, Imagen 3 swept the prompt-image alignment benchmarks. On GenAI-Bench and DrawBench, Imagen 3 (1,083 Elo and 1,064 Elo respectively) outperformed the next-best Stable Diffusion 3 (1,047 Elo for both datasets). On DALL-E 3 Eval, Imagen 3 (1,078) narrowly edged out DALL-E 3 (1,077 Elo) and Stable Diffusion 3 (1,069 Elo).
- Imagen 3 showed exceptional strength in following detailed prompts in the DOCCI dataset (photographs with detailed descriptions that averaged 136 words). In that category, Imagen 3 (1,193 Elo) outperformed next-best Midjourney v6.0 (1,079 Elo).
- Although none of the models tested did very well at generating specified numbers of objects from the GeckoNum dataset, Imagen 3 (58.6 Elo) outperformed the next-best DALL-E 3 (46.0 Elo).
- Imagen 3 lost to Midjourney v6.0 across the board in tests of visual appeal regardless of the prompt. It was slightly behind on GenAI-Bench (1,095 Elo versus 1,101 Elo), farther behind on DrawBench (1,063 Elo versus 1,075 Elo), and well behind on DALL-E 3 Eval (1,047 Elo versus 1,095 Elo).
Why it matters: Each wave of advances makes image generators more useful for a wider variety of purposes. Google’s emphasis on filtering the training data for safety may limit Imagen 3’s utility in some situations (indeed, some users complained that Imagen 3 is more restrictive than Imagen 2, while the Grok2 large language model’s use of an unguardrailed version of Flux.1 for image generation has garnered headlines). Nonetheless, precautions are an important ingredient in the evolving text-to-image recipe.
We’re thinking: It’s difficult to compare the benchmarks reported for Imagen 3 and the recently released Flux.1, which claims similar improvements over earlier models. In any case, Google has yet to publish a benchmark for generating text, a valuable capability for commercial applications. The Flux.1 models, two of which are open to some degree, may prove to be formidable rivals in this area.