What is ComfyUI?
ComfyUI is a powerful and flexible user interface for Stable Diffusion, allowing users to create complex image generation workflows through a node-based system. While ComfyUI comes with a variety of built-in nodes, its true strength lies in its extensibility. Custom nodes enable users to add new functionality, integrate external services, and tailor it to their specific needs.
In this blog post, we will walk through the process of creating a custom node for image captioning using ComfyUI. This node will take an image as input and return a generated caption using an external API.
We will be using the Google Gemini API to generate the captions.
Here is the complete code for the image captioning node. You can copy it into any file under the custom_nodes folder in ComfyUI; I have named mine gemini-caption.py.
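Assuming a default ComfyUI checkout, the file ends up here:

ComfyUI/
└── custom_nodes/
    └── gemini-caption.py

ComfyUI scans this folder on startup, so restart the server after adding the file for the node to appear.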
Complete code for generating image captions
import numpy as np
from PIL import Image
import requests
import io
import base64


class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True

    def caption_image(self, image, api_key):
        # Convert the image tensor to a PIL Image
        image = Image.fromarray(
            np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
        )

        # Convert the image to base64
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"

        # Prepare the request payload
        payload = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": "Generate a caption for this image in as much detail as possible. Don't send anything else apart from the caption."
                        },
                        {"inline_data": {"mime_type": "image/png", "data": img_str}},
                    ]
                }
            ]
        }

        # Send the request to the Gemini API
        try:
            response = requests.post(api_url, json=payload)
            response.raise_for_status()
            caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
        except requests.exceptions.RequestException as e:
            caption = f"Error: Unable to generate caption. {str(e)}"

        print(caption)
        return (caption,)


NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}
Here is how the node looks in the ComfyUI interface:
Let's go over it section by section to understand how you would create a similar node for your own use case. The general pattern is to wrap whatever your node should do in a single function that ComfyUI can call, as I did here with my caption_image function.
Importing the necessary libraries
import numpy as np
from PIL import Image
import requests
import io
import base64
These lines import the necessary libraries for my Image Captioning node:
- numpy for numerical operations
- PIL (Python Imaging Library) for image processing
- requests for making HTTP requests to the Gemini API
- io for handling byte streams
- base64 for encoding the image
Defining the class for your ComfyUI node
class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }
In my case, I have named it ImageCaptioningNode, since it does what it says.
The INPUT_TYPES class method defines the inputs for our node:
- An "image" input of type "IMAGE"
- An "api_key" input of type "STRING" with an empty default value, needed for authenticating requests to the Gemini API
RETURN_TYPES = ("STRING",)
FUNCTION = "caption_image"
CATEGORY = "image"
OUTPUT_NODE = True
These class variables define:
- The return type (a string)
- The main function to be called ("caption_image")
- The category in which the node will appear in ComfyUI
- That this node can be an output node
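A node is not limited to a single output either. As a hypothetical illustration, a variant that returned both the caption and the raw API response would declare two return types, and could label the output sockets with the optional RETURN_NAMES attribute:

# Hypothetical variant: two string outputs instead of one
RETURN_TYPES = ("STRING", "STRING")
RETURN_NAMES = ("caption", "raw_response")  # optional labels for the output sockets

The function named in FUNCTION would then have to return a matching two-element tuple.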
def caption_image(self, image, api_key):
    # Convert the image tensor to a PIL Image
    image = Image.fromarray(
        np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
    )

    # Convert the image to base64
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()

    api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"

    # Prepare the request payload
    payload = {
        "contents": [
            {
                "parts": [
                    {
                        "text": "Generate a caption for this image in as much detail as possible. Don't send anything else apart from the caption."
                    },
                    {"inline_data": {"mime_type": "image/png", "data": img_str}},
                ]
            }
        ]
    }

    # Send the request to the Gemini API
    try:
        response = requests.post(api_url, json=payload)
        response.raise_for_status()
        caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
    except requests.exceptions.RequestException as e:
        caption = f"Error: Unable to generate caption. {str(e)}"

    print(caption)
    return (caption,)
This is the function that does the actual work. It takes an image tensor as input, converts it to a PIL image, and base64-encodes it so it can be sent in the request body to the Gemini API along with the API key. The prompt instructs Gemini to caption the image in as much detail as possible. The response from the API is parsed, printed to the console, and returned as a tuple (ComfyUI requires node outputs to be tuples).
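Because caption_image is an ordinary method, you can smoke-test it outside ComfyUI before restarting the server. A minimal sketch, assuming torch is installed and "YOUR_API_KEY" is replaced with a valid Gemini key:

import torch

node = ImageCaptioningNode()
# ComfyUI passes images as [batch, height, width, channels] float tensors in 0..1
dummy_image = torch.rand(1, 64, 64, 3)
(caption,) = node.caption_image(dummy_image, api_key="YOUR_API_KEY")
print(caption)

With a random-noise image, Gemini will return something unexciting, but this confirms the request and the response parsing work end to end.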
NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}
This dictionary maps the class name to the class itself, which is used by ComfyUI to register the custom node.
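ComfyUI also reads an optional NODE_DISPLAY_NAME_MAPPINGS dictionary, which lets you show a friendlier label in the node menu than the raw class name. For example:

# Optional: a friendlier label for the node search menu
NODE_DISPLAY_NAME_MAPPINGS = {"ImageCaptioningNode": "Image Captioning (Gemini)"}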
Conclusion
Creating custom nodes for ComfyUI opens up a world of possibilities for extending and enhancing your image generation workflows. In this article, we've walked through the process of building a custom image captioning node, demonstrating how to:
- Define input and output types
- Integrate with external APIs (in this case, the Gemini API for image captioning)
- Register the node with ComfyUI via NODE_CLASS_MAPPINGS
By following these steps, you can create your own custom nodes to add virtually any functionality you need to ComfyUI. Whether you're integrating new AI models, adding specialized image processing techniques, or creating shortcuts for common tasks, custom nodes allow you to tailor ComfyUI to your specific requirements.
Remember that while we've focused on image captioning in this example, the same principles can be applied to create nodes for a wide variety of tasks. The key is to understand the structure of a ComfyUI node and how to interface with the expected inputs and outputs.
If you still have any questions about this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.
If you run an organization and want me to write for you, please connect with me on my Socials 🙃
Top comments (6)
Congratulations on the post, you were very clear in the explanation. I have a question: what does ComfyUI add for structuring workflows compared to Airflow?
One observation, if I may offer a tip: put a link to the tool in the post, to help readers find the reference.
(Embedded repository card: comfyanonymous / ComfyUI, the most powerful and modular stable diffusion GUI, api and backend with a graph/nodes interface.)
Thanks @sc0v0ne, I have added the repo link now.
I am not really sure about Apache Airflow, as I haven't worked with it in the past.
great article!
Thanks @axorax
no problem!
This is a fantastic guide to extending ComfyUI’s functionality with custom nodes! The detailed explanation of setting up inputs and outputs and working with external APIs like the Gemini API really clarifies the process. Custom nodes seem like a powerful way to adapt ComfyUI to specific needs, especially for unique workflows in image generation. Are there any other APIs you’d recommend for different types of image processing tasks, or would you suggest experimenting with any particular techniques when building custom nodes? Thanks for the detailed guide!