nabata

Posted on Nov 2, 2024 • Edited on Nov 8, 2024

Comparing Prompt Accuracy Across Various Image Generation AIs (Stable Diffusion 3.5, FLUX1.1, Imagen 3, DALL·E 3, Adobe Firefly)

Introduction

Recently, Stability AI introduced Stable Diffusion 3.5.

Today we are introducing Stable Diffusion 3.5. This open release includes multiple model variants, including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo. Additionally, Stable Diffusion 3.5 Medium will be released on October 29th.

Reference: Stable Diffusion 3.5 — Stability AI

The release announcement also states, "Additionally, our analysis shows that Stable Diffusion 3.5 Large leads the market in prompt adherence and rivals much larger models in image quality." This article tests that claim by comparing prompt adherence across several popular image generation models.

Please note that this evaluation is subjective. It is intended as a reference for how these models handle straightforward prompts, which do not always yield ideal results.

Image Generation AIs Used in This Comparison

The models tested include:

Stable Diffusion 3.5 Large by Stability AI
FLUX1.1 [pro] by Black Forest Labs
Imagen 3 by Google
DALL·E 3 by OpenAI
Adobe Firefly by Adobe

Each model was tested once per prompt. If multiple images were generated simultaneously, I selected the "top-left" result.

Stable Diffusion 3.5 Large and FLUX1.1 [pro] images were generated through their Web APIs, while the others were created directly in the browser. Imagen 3 was accessed through ImageFX and DALL·E 3 through ChatGPT.

For Firefly, I used Firefly Image 3 with Fast mode turned off, then upscaled the images after generation. As a result, Firefly images are 2048x2048, while all other images are 1024x1024.

The code used to generate FLUX1.1 [pro] images is adapted from the article "Using the Web API for FLUX 1.1 [pro]: The Latest Image Generation AI Model by the Original Team of Stable Diffusion" with the size updated to 1024x1024.

Below is the code used for Stable Diffusion 3.5 Large. The STABILITY_API_KEY environment variable stores the API key. For more details, see the API Reference.

import os
import requests
import time

api_host = os.getenv('API_HOST', 'https://api.stability.ai')
api_key = os.getenv("STABILITY_API_KEY")
prompt = "Describe the prompt here"

# Ensure API Key is available
if api_key is None:
    raise Exception("Missing Stability API key.")

# API call
response = requests.post(
    f"{api_host}/v2beta/stable-image/generate/sd3",
    headers={
        "Accept": "image/*",
        "Authorization": f"Bearer {api_key}"
    },
    # No file is uploaded; the dummy "files" entry makes requests send multipart/form-data
    files={"none": ''},
    data={
        "prompt": prompt,
        "output_format": "png",
        "model": "sd3.5-large"
    },
)

# Save image with timestamped filename
if response.status_code == 200:
    with open(f"./{int(time.time())}.png", "wb") as file:
        file.write(response.content)
else:
    raise Exception(str(response.json()))
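
To run the same request for several prompts without overwriting files, the script above could be wrapped in a small helper. This is only a sketch: the function names `output_path` and `generate`, the `out` directory, the file-naming scheme, and the 120-second timeout are my own assumptions, while the endpoint, headers, and form fields are the ones used in the script above.

```python
import os
import re
from pathlib import Path


def output_path(label: str, model: str = "sd3.5-large", out_dir: str = "out") -> Path:
    """Build a readable file name from a test label (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return Path(out_dir) / f"{model}_{slug}.png"


def generate(prompt: str, path: Path, model: str = "sd3.5-large") -> None:
    """Send one request to the same endpoint as the script above and save the PNG."""
    import requests  # third-party; imported here so output_path works without it

    response = requests.post(
        "https://api.stability.ai/v2beta/stable-image/generate/sd3",
        headers={
            "Accept": "image/*",
            "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
        },
        files={"none": ""},
        data={"prompt": prompt, "output_format": "png", "model": model},
        timeout=120,  # assumed value; adjust as needed
    )
    response.raise_for_status()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response.content)


# Example (requires STABILITY_API_KEY and network access):
# generate("A single banana on a table.", output_path("Single Banana"))
```

Naming files after the test label rather than a timestamp keeps the images for each test easy to find and avoids collisions when two requests finish within the same second.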

Now, let’s dive into the results.

No.1 - A Single Banana

One known issue in AI image generation is The Lone Banana Problem.

The bias to two bananas in a picture is, I believe, an example of a subtle bias (OK, it’s not that subtle, but it is more subtle than many of the more concerning news-grabbing biases that we regularly read about). A naïve explanation may be that in the training dataset there have been many pictures of bananas added to Midjourney’s database that have been labelled “banana” but not labelled “two bananas”. It may also be that Midjourney has never seen an individual banana, so it doesn’t know that a single banana is possible.

Reference: The Lone Banana Problem. Or, the new programming: “speaking” AI - TL;DR - Digital Science

A similar phenomenon, known as The Strawberry Problem, has also recently become a topic of interest.

To see how each model addresses this issue, I started with the following prompt:

Prompt

There is a single banana on the table.There is a single banana hanging from the ceiling.There is a single banana placed on the chair.There is a man with a single banana on his head.There is a woman washing a single banana.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

Comments

Unfortunately, none of the models performed well on this prompt.

It’s possible the prompt was too complex. My apologies.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
✗	✗	✗	✗	✗

No.2 - Retrying the Single Banana

I simplified the initial prompt and tried again to see if reducing complexity would improve results.

Prompt

A single banana placed in the center of a white background. The banana should be ripe, with a bright yellow peel and a few brown spots, indicating its ripeness. The shape of the banana should be curved in a natural way, and it should be clearly identifiable as one piece of fruit without any additional objects or bananas in the image.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

The following error prevented image generation:

I couldn't generate the requested image because it didn't align with the content policy. If you have another idea or request, feel free to share, and I'll do my best to create it!

Adobe Firefly

Comments

Stable Diffusion 3.5 Large produced some unusual results here, as with the previous attempt, highlighting potential limitations in handling simpler prompts.

Imagen 3 generated a banana that appears slightly under-ripe, and Firefly’s result has a subtle unnatural quality. However, both images reasonably reflect the prompt’s intent.

It’s unclear what aspect of the prompt conflicted with DALL·E 3’s content policy.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
✗	✓	✓	-	✓

No.3 - Space Battle

To explore each AI's handling of more fantastical themes, I introduced a space battle scenario.

Prompt

A large-scale space battle between two fleets of futuristic spaceships. Lasers and missiles are being fired, with explosions happening in the background. The scene takes place in deep space, with a distant galaxy visible in the background and some debris floating nearby.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

Comments

Direct comparisons were challenging due to varying interpretations by each AI. It’s difficult to identify both lasers and missiles in every image, and some results lack a strong sense of combat.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
~	~	~	~	~

No.4 - Steampunk Invention

Next, I tried a prompt centered on the steampunk genre to see how well each AI captures this distinct aesthetic.

Steampunk is a subgenre of science fiction that incorporates retrofuturistic technology and aesthetics inspired by, but not limited to, 19th-century industrial steam-powered machinery.

Reference: Steampunk - Wikipedia

Prompt

An intricate steampunk device on a workbench, made of brass, gears, and glass tubes. The device is emitting a faint steam cloud, with tiny dials and gauges displaying various readings. Nearby, a pair of leather gloves and a set of old blueprints are scattered on the wooden table.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

Comments

Most images represented the prompt well, though DALL·E 3 missed the scattered blueprints and did not render the gloves as a pair.

Firefly also did not include leather gloves.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
✓	✓	✓	✗	✗

No.5 - Chibi-Style Character

For this test, I prompted each AI to generate a distinctive character in a chibi style.

Prompt

A chibi-style character of a smiling young girl with big eyes, short pink hair, and a school uniform. She is holding a small cat in her arms, standing on a grassy hill under a bright blue sky with fluffy clouds.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

Comments

Imagen 3 did not meet the prompt specification for pink hair, and DALL·E 3 omitted the cat the girl was supposed to be holding.

In Firefly, the character was given cat ears, and both the cat and the girl’s hands are somewhat awkwardly rendered.

Stable Diffusion 3.5 Large mostly captures the prompt details, though some aspects, like the cat’s body shape, appear slightly unnatural, so I rated it ~.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
~	✓	✗	✗	✗

No.6 - Colorful Coral Reef

Next, I asked the AIs to generate a serene and vibrant underwater scene.

Prompt

A colorful underwater scene featuring a coral reef filled with vibrant fish, sea turtles, and a few small sharks. Sunlight beams are penetrating through the water's surface, illuminating the sea life and creating a beautiful, serene atmosphere.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

This prompt returned a processing error, so Firefly could not generate an image.

Comments

FLUX1.1 [pro] was missing sea turtles, Imagen 3 did not show multiple turtles or sharks, and DALL·E 3 did not include any sharks.

It’s unclear what caused Firefly’s processing error.

Incidentally, I’ve noticed that Imagen 3 frequently fails to generate images, even with other prompts.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
✓	✗	✗	✗	-

No.7 - Japanese Tea Ceremony

For the final test, I chose a prompt with a specific cultural theme to see how well each model captures details from a traditional Japanese tea ceremony.

Prompt

A traditional Japanese tea ceremony taking place in a tatami room. A woman in a kimono is gracefully preparing tea, while a guest kneels in front of her, observing respectfully. The room is decorated with traditional Japanese art and sliding shoji doors.

Stable Diffusion 3.5 Large

FLUX1.1 [pro]

Imagen 3

DALL·E 3

Adobe Firefly

Comments

Strict accuracy was not evaluated here, as the details of tea ceremony protocol could disqualify all the images.

Stable Diffusion 3.5 Large produced a somewhat ambiguous tea preparation scene and used an unconventional shoji door style.

DALL·E 3 displayed notably distorted tatami and other room elements.

Firefly lacked the observing guest, and its shoji doors and tatami differed from traditional interpretations.

SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
✗	✓	✓	✗	✗

In Conclusion

The following table summarizes the results (✓: matched the prompt, ~: partially matched, ✗: did not match, -: no image was generated).

		SD 3.5 Large	FLUX1.1 [pro]	Imagen 3	DALL·E 3	Firefly
1	Single Banana	✗	✗	✗	✗	✗
2	Single Banana Retry	✗	✓	✓	-	✓
3	Space Battle	~	~	~	~	~
4	Steampunk Invention	✓	✓	✓	✗	✗
5	Chibi Character	~	✓	✗	✗	✗
6	Coral Reef	✓	✗	✗	✗	-
7	Tea Ceremony	✗	✓	✓	✗	✗

This was a subjective review with a single image per prompt, but based on these results, Stable Diffusion 3.5 Large does not clearly outperform the other models in prompt adherence.

Here is a rough grouping by adherence accuracy:

Prompt Adherence Level S
- FLUX1.1 [pro], Imagen 3
Prompt Adherence Level A
- Stable Diffusion 3.5 Large
Prompt Adherence Level B
- DALL·E 3, Firefly
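
As a rough cross-check of this grouping, the summary table can be tallied numerically. The sketch below is not the method used for the grouping above; it assumes weights of 1 for ✓, 0.5 for ~, and 0 for ✗, and it skips "-" entries (no image generated) rather than counting them as failures.

```python
# Summary table transcribed from the article; one mark per model, in header order.
MODELS = ["SD 3.5 Large", "FLUX1.1 [pro]", "Imagen 3", "DALL·E 3", "Firefly"]
RESULTS = {
    "Single Banana": "✗✗✗✗✗",
    "Single Banana Retry": "✗✓✓-✓",
    "Space Battle": "~~~~~",
    "Steampunk Invention": "✓✓✓✗✗",
    "Chibi Character": "~✓✗✗✗",
    "Coral Reef": "✓✗✗✗-",
    "Tea Ceremony": "✗✓✓✗✗",
}
# Assumed weights (not from the article): full, partial, and failed adherence.
WEIGHTS = {"✓": 1.0, "~": 0.5, "✗": 0.0}


def tally(results, models, weights):
    """Return {model: (points, rated_tests)}, skipping marks without a weight ('-')."""
    totals = {model: [0.0, 0] for model in models}
    for marks in results.values():
        for model, mark in zip(models, marks):
            if mark in weights:
                totals[model][0] += weights[mark]
                totals[model][1] += 1
    return {model: (points, count) for model, (points, count) in totals.items()}


scores = tally(RESULTS, MODELS, WEIGHTS)
# Print models from highest to lowest share of available points.
for model, (points, count) in sorted(
    scores.items(), key=lambda item: item[1][0] / item[1][1], reverse=True
):
    print(f"{model:14s} {points:.1f} / {count}")
```

With these assumed weights, FLUX1.1 [pro] scores 4.5 of 7 and Imagen 3 scores 3.5 of 7, ahead of Stable Diffusion 3.5 Large at 3.0 of 7, with Firefly (1.5 of 6) and DALL·E 3 (0.5 of 6) behind. That ordering is consistent with the grouping, although different weights could shift the boundaries.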

One takeaway is that getting an AI to generate an image that perfectly matches the intent remains extremely challenging.

However, the pace of AI’s improvement is undeniably impressive.

Original Japanese Article

プロンプトがどれだけ正確に反映されるのかを様々な画像生成AIで比較してみた（Stable Diffusion 3.5、FLUX1.1、Imagen 3、DALL·E 3、Adobe Firefly）