Wladislav Radchenko

How Video Generation Works in the Open Source Project Wunjo CE

Getting to Know the Functionality

If you’re interested in exploring the full functionality, detailed parameter explanations, and usage instructions of Wunjo CE, a comprehensive video will be attached at the end of this article. For those who prefer to skip the technical intricacies, a video demonstrating the video generation process and a comparison with Pika and Gen-2 from Runway will also be provided.

Anatole France once said, “You can only learn joyfully… To digest knowledge, you must absorb it with appetite.” This sentiment perfectly captures the essence of Wunjo CE: it offers something valuable for everyone, allowing users to engage with the material at their own pace and interest level.

Specifications for Video Generation

Videos can be created from text and images using Stable Diffusion models. Custom Stable Diffusion 1.5 models can be added, and support can be extended to XL. However, in my opinion, extending to XL might not be the best approach, since the video generation model cannot preserve the quality that Stable Diffusion XL delivers, while XL requires significantly more compute.

Generation Parameters

  • FPS: The maximum is 24 FPS, which allows for up to 4 seconds of video. Lowering the FPS increases the video length (the sketch after this list shows the arithmetic).
  • VRAM: Video generation is feasible with 8 GB of VRAM.
  • Formats: Various aspect ratios are available for text-based generation: 16:9 (YouTube), 9:16 (Shorts), 1:1, 5:2, 4:5, and 4:3. When generating from an image, the original aspect ratio is maintained.
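
The relationship between FPS and clip length follows from a fixed frame budget. A rough sketch in Python (the 96-frame budget is an illustrative assumption, not a value taken from the project):

# The underlying model produces a fixed number of frames per clip, so a lower
# FPS simply stretches the same frames over more seconds.
TOTAL_FRAMES = 96  # assumed budget: roughly four stitched 24-frame passes

for fps in (24, 12, 8):
    print(f"{fps} FPS -> {TOTAL_FRAMES / fps:.1f} s of video")
# 24 FPS -> 4.0 s, 12 FPS -> 8.0 s, 8 FPS -> 12.0 s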

Generate Video from Text

Initially, an image is generated in your chosen format. You have the option to view the image, regenerate it, or modify details before proceeding to video creation.

Generate Video from Image

The original aspect ratio of the image is preserved, but you can fine-tune its elements before generating the video.


Manual setup

If the automatic download of the necessary repositories fails, you will need to fetch them manually; plan for about 85–90 GB of free space on your hard drive. You can choose where the models are stored yourself; more details are in the GitHub Wiki.

The program downloads all repositories and models automatically, but downloads from Hugging Face can fail without a VPN. In that case, go to the .wunjo/all_models directory and add the files yourself.

Download manually

runwayml: This repository includes the necessary components such as vae, safety_checker, text_encoder, unet, and others. Create a runwayml directory and download the models into the corresponding subfolders.

You can also use a console command to automate this process.

git clone https://huggingface.co/runwayml/stable-diffusion-v1-5

Before cloning, make sure Git LFS is installed so that large model files can be downloaded.

git lfs install
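
If cloning with git is inconvenient or LFS downloads keep failing, the huggingface_hub client is a possible alternative; the target directory below is an assumption based on the layout described above.

from huggingface_hub import snapshot_download

# Download the full runwayml/stable-diffusion-v1-5 repository; the local
# directory is an assumed example of where the models are expected to live.
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir=".wunjo/all_models/runwayml",
)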

Downloading Custom Stable Diffusion Models

You can download various Stable Diffusion 1.5 models into the .wunjo/all_models/diffusion directory; they will be used for image generation. Such models are publicly available, for example, on Hugging Face or Civitai.

Setting up custom_diffusion.json

In the file .wunjo/all_models/custom_diffusion.json, specify the paths to your models. Example configuration:

[
    {
        "name": "Fantasy World",
        "model": "fantasyWorld_v10.safetensors",
        "img": "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/2e321e71-c144-4ba6-8b55-3f62756fc0a1/width=1024,quality=90/01828-3669354034-giant%20pink%20balloon,cities%20floating%20on%20balloons,magnificent%20clouds,pink%20theme,.jpeg",
        "url": ""
    }
]

If you specify the url, then the model is downloaded automatically.

This step can be skipped as two Stable Diffusion 1.5 models are enabled by default, which is sufficient for creating both realistic and hand-drawn content.
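
For illustration only, reading such a configuration could look roughly like this; the helper below is hypothetical and not the project's actual code.

import json
from pathlib import Path

MODELS_DIR = Path(".wunjo/all_models")  # assumed location, as described above

def list_custom_models(config_path=MODELS_DIR / "custom_diffusion.json"):
    """Return (name, weights_path) pairs for every configured custom model."""
    entries = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [(entry["name"], MODELS_DIR / "diffusion" / entry["model"]) for entry in entries]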

Switching to the diffusers library

Previously, the code worked only with the necessary parts of the generative models, which saved time and reduced the amount of unnecessary code. However, to expand the functionality, I decided to switch to the diffusers library. Specifically, the runwayml repository is used to generate images.

Example code from the project:

import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler, StableDiffusionPipeline
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from transformers import CLIPTextModel, CLIPTokenizer, CLIPImageProcessor
from safetensors.torch import load_file

# Define the runwayml model components
vae = AutoencoderKL.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained(sd_path, subfolder="text_encoder", torch_dtype=torch.float16)
tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
safety_checker = StableDiffusionSafetyChecker.from_pretrained(sd_path, subfolder="safety_checker", torch_dtype=torch.float16)  # filters NSFW content
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")

# If a custom 1.5 model is specified, load its weights into the UNet
if weights_path:
    weights = load_file(weights_path)  # path to the .safetensors checkpoint
    unet.load_state_dict(weights, strict=False)

You can swap Stable Diffusion 1.5 for a more powerful model, since the project uses pipelines from the diffusers library. To do this, it is enough to change sd_path and StableDiffusionPipeline.

pipe = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=DDIMScheduler.from_pretrained(sd_path, subfolder="scheduler"),
    safety_checker=safety_checker,  # The value None disables SD model censoring
    requires_safety_checker=False,
    feature_extractor=feature_extractor,
).to(device)
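
For context, a minimal call to the assembled pipeline might look like this; the prompt, resolution, and step count are illustrative, not the application's defaults.

# Generate a single 16:9 frame; all parameters here are example values.
image = pipe(
    prompt="a castle floating in the clouds, fantasy art",
    negative_prompt="blurry, low quality",
    width=1024, height=576,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("frame.png")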

See also the documentation on Stable Diffusion XL and Turbo.
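
As a rough sketch of that swap, assuming the public SDXL base checkpoint rather than anything bundled with the project:

import torch
from diffusers import StableDiffusionXLPipeline

# Replace the 1.5 pipeline with the XL one; the rest stays pipeline-driven.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")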

ControlNet

ControlNet and the corresponding pipelines are used to modify image elements in the application. Example setup:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel

def make_canny_condition(image):
    # Build a 3-channel Canny edge map used as the ControlNet condition
    image = np.array(image)
    image = cv2.Canny(image, 100, 200)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    image = Image.fromarray(image)
    return image

if controlnet_type == "canny":
    control_image = make_canny_condition(init_image)
else:
    control_image = init_image

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16).to(device)

By default, the canny method is available, but you can extend the application by adding other ControlNet methods. For example, for XL models you can replace ControlNet with IP-Adapter or T2I-Adapter.
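
For illustration, the controlnet loaded above would typically be plugged into a ControlNet pipeline roughly like this; the inpainting variant and its arguments are an assumption, not the project's exact code.

from diffusers import StableDiffusionControlNetInpaintPipeline

pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    sd_path, controlnet=controlnet, torch_dtype=torch.float16
).to(device)

result = pipe(
    prompt="replace the sky with a sunset",
    image=init_image,             # the original frame
    mask_image=mask_image,        # hypothetical mask: white where the element should change
    control_image=control_image,  # Canny edges keep the overall composition
    num_inference_steps=30,
).images[0]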

Downloading ControlNet Models

In the .wunjo/all_models directory, create a controlnet_canny directory and download the models from the sd-controlnet-canny repository.

git clone https://huggingface.co/lllyasviel/sd-controlnet-canny

Also create a controlnet_tile directory and download the models from the control_v11f1e_sd15_tile repository.

git clone https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile

Why ControlNet Tile? More on that later.

Video generation

To generate videos, I use the stabilityai repository and the corresponding pipeline. The question arises: if the model is limited to 576×1024 images and generates no more than 25 frames, how can we use any aspect ratio and get 4 seconds of video at 24 FPS?

import torch
from diffusers import (AutoencoderKLTemporalDecoder, EulerDiscreteScheduler,
                       UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline)
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

# Components
vae = AutoencoderKLTemporalDecoder.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(sd_path, subfolder="image_encoder", torch_dtype=torch.float16)
scheduler = EulerDiscreteScheduler.from_pretrained(sd_path, subfolder="scheduler")
unet = UNetSpatioTemporalConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")
# Init
pipe = StableVideoDiffusionPipeline(
    vae=vae,
    image_encoder=image_encoder,
    scheduler=scheduler,
    unet=unet,
    feature_extractor=feature_extractor
)
pipe.enable_model_cpu_offload()            # offload idle components to CPU to reduce VRAM usage
pipe.unet.enable_forward_chunking()        # trade speed for memory inside the UNet
pipe.enable_xformers_memory_efficient_attention()

Preparing images

Before feeding the image to the video generation model, I outpaint it to the 576×1024 format without changing the content of the user's frame. After outpainting, I use controlnet_tile with an appropriate mask to improve the quality of the added zones. The better the quality of the filled-in zones, the better the animation: generation then focuses on the movement of objects rather than on the artificial zones.
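
A simplified sketch of that preparation step, using PIL; the exact padding and masking logic in the project may differ.

from PIL import Image

TARGET_W, TARGET_H = 1024, 576  # the 576×1024 resolution expected by the video model

def pad_to_target(user_image):
    """Center the user's frame on the target canvas and return the canvas plus
    a mask that is white only over the newly added (outpainted) zones."""
    scale = min(TARGET_W / user_image.width, TARGET_H / user_image.height)
    new_size = (int(user_image.width * scale), int(user_image.height * scale))
    resized = user_image.resize(new_size)
    canvas = Image.new("RGB", (TARGET_W, TARGET_H), "black")
    mask = Image.new("L", (TARGET_W, TARGET_H), 255)
    offset = ((TARGET_W - new_size[0]) // 2, (TARGET_H - new_size[1]) // 2)
    canvas.paste(resized, offset)
    box = (offset[0], offset[1], offset[0] + new_size[0], offset[1] + new_size[1])
    mask.paste(0, box)  # keep the user's area untouched during outpainting
    return canvas, mask  # canvas goes to outpainting, mask to controlnet_tile refinement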

Generation iterations

The video cannot be extended indefinitely, because the model adds noise to each final frame; after the third iteration you get a mess of pixels. So I generate the first two iterations, reverse them, generate the iterations again, and combine everything, cropping back to the user's aspect ratio. These tricks extend what the stabilityai model can do.
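
A simplified sketch of that chaining idea, reusing pipe from the earlier snippet; prepared_image stands for the outpainted 576×1024 frame. This is an approximation of the approach described above, not the project's exact code.

import torch

def svd_pass(pipe, image, num_frames=25, seed=42):
    """One Stable Video Diffusion pass conditioned on a single image."""
    generator = torch.manual_seed(seed)
    return pipe(image, num_frames=num_frames, decode_chunk_size=8, generator=generator).frames[0]

# Two forward passes, each conditioned on the last frame of the previous one.
part_a = svd_pass(pipe, prepared_image)
part_b = svd_pass(pipe, part_a[-1])
# Reverse them so the sequence ends on a clean frame, then continue with two
# more passes from that frame before stitching and cropping to the user's aspect ratio.
reversed_part = (part_a + part_b)[::-1]
part_c = svd_pass(pipe, reversed_part[-1])
part_d = svd_pass(pipe, part_c[-1])
full_clip = reversed_part + part_c + part_d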

Improving the quality of video generation

You can replace the stable-video-diffusion-img2vid-xt repository with stable-video-diffusion-img2vid-xt-1-1 on Hugging Face to get better video quality than the model I use.

Comparison

For you, I have collected in one video various examples of generation in comparison with Pika and Gen-2. I have not included Gen-3, Luma, Sora models, since open source models cannot compete with them. From the entire list, I was able to use only Luma for free, and even then with restrictions – for example, no more than 5 generations per day. In this comparison, I focused on approaches that give approximately the same generation time. It is important to understand that Wunjo CE is a completely free and unlimited open source solution.

Features of the models

The models are listed in the order their results appear in the video.

  • Pika: The model does not add much movement, but it smooths the result, sometimes too much. It can also add sound to the video.
  • Wunjo CE: Preserves the original image quality and adds interesting movement to some objects. However, the direction of that movement can change randomly within the frame, and generating one video takes 15 minutes on an NVIDIA GeForce RTX 3070 with 8 GB of VRAM.
  • Gen-2: Adds more realism, but can create distortions for unusual objects. The duration of the result can be extended, but the quality decreases with each iteration.

Examples of generation

The prompts were simple, like “robotic hand” or “dogs riding in a car”. I ran only one generation for each approach, without cherry-picking or enhancing the results, to show the real potential of the models.

Dogs are riding in the car

Robotic hand

This particular generation made me laugh: the model created the frames in such a way that it looks like the girl is angrily cursing at life for reasons known only to her.

Lofi girl

See also the full set of examples comparing Pika, Wunjo CE, and Gen-2.

To explain the video generation parameters and the use of the application in detail, I made a tutorial on YouTube.

Additional Functionality

You can learn about the rest of Wunjo CE's functionality from the video playlist provided. These videos cover the main features, installation instructions (both from the code and installer), and how to use the API for your projects.

Support the Project

If you'd like to support the project, visit the GitHub page and bookmark it to stay updated with the latest developments. Future plans include adding audio generation for videos and creating talking-head animations from images. You can download installers from the official website wunjo.online and from Boosty. On Boosty, you can vote for which features' code should be open-sourced and made available on GitHub. Your interest drives these decisions.

Alternatives

No discussion about video generation would be complete without mentioning alternatives. One intriguing open-source project is ToonCrafter, which generates hand-drawn animations. It creates motion between the first and last frame, rather than from text or a single image. While the resolution is quite low at 320×512, and I haven't tested its video memory requirements, it's a promising alternative with room for improvement. The ToonCrafter model's ability to add motion to animations is particularly appealing. I collect all interesting solutions for video, voice cloning, and more on my favorites page.

Your Suggestions

Please share your open-source video generation alternatives in the comments. Your suggestions will help improve current approaches and contribute to a knowledge base for this exciting new field.

A Bit of Philosophy

Generating video from text and images isn't just a technological achievement; it's a new form of creativity and self-expression. Open-source and commercial projects alike are enabling new ways to express ideas, where simple text can produce videos that would otherwise be complex and costly to create.

Imagine adapting one video for different countries and regions with a single request, altering skin tones, faces, objects, and captions, and adding new elements and videos. The future promises even more realistic and detailed models. Video generation is evolving from a mere tool into a new philosophy in creativity, blending technology, simplicity, and art.
