Generative AI models like DALL-E and Stable Diffusion generate jaw-dropping images. While the results may seem magical, understanding how these models work usually involves delving into mathematical concepts. In this article, I will break down how these models work for non-technical readers and demystify the complexity behind the magic.
Diffusion models
Models like DALL-E belong to a class of machine learning models called diffusion models. They are currently among the best-performing models for generating realistic images.
Diffusion models were inspired by the physical phenomenon known as diffusion. In physics, diffusion is the process by which particles move from an area of higher concentration to an area of lower concentration.
For instance, when you spray paint on a wall, the paint moves from the container, where it is highly concentrated, to the wall, where there isn't any.
In machine learning, diffusion models aren't so different. The process starts by adding noise to an image incrementally, over a series of time steps, until the image is barely recognizable.
This process is called forward diffusion. In the picture above, at time step 1, the image is clean, but starting from time step 2, we begin incrementally adding noise. This is similar to spraying paint on a wall, which is a form of diffusion in physics. In our case, we are spraying the "paint" on the image.
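For readers who are comfortable with a little code, here is a minimal Python sketch of forward diffusion. It is only an illustration under my own assumptions, not how DALL-E is implemented: the "image" is just a short list of pixel values, and the function name `forward_diffusion`, the noise amount `beta`, and the choice of four time steps are all made up for this example.

```python
import math
import random


def forward_diffusion(image, num_steps=4, beta=0.3, seed=0):
    """Return a list of progressively noisier copies of `image`.

    `image` is a flat list of pixel values. `beta` is an assumed value that
    controls how much random noise is mixed in at each time step.
    """
    rng = random.Random(seed)
    steps = [list(image)]  # time step 1: the clean image
    for _ in range(num_steps - 1):
        previous = steps[-1]
        # Keep most of the previous image and blend in a bit of Gaussian noise.
        noisy = [
            math.sqrt(1.0 - beta) * pixel + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for pixel in previous
        ]
        steps.append(noisy)
    return steps


clean_image = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
trajectory = forward_diffusion(clean_image)
for t, img in enumerate(trajectory, start=1):
    print(f"time step {t}:", [round(p, 2) for p in img])
```

Each printed row is the same image with one more layer of "paint" sprayed on it, so the original pattern of ones and zeros becomes harder to spot as the time step increases.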
You might be wondering how this principle helps AI generate new images. Well, forward diffusion itself doesn't, but the reverse process does.
Reverse diffusion
Reverse diffusion is the process of taking a noisy image and reverting it to the previous time step, where it had slightly less noise.
The image above shows the noise being reversed from time step 4 back to time step 3. So, how is this helpful in generating new images? Well, let's walk through one scenario.
A friend presents us with the image above and challenges us to restore it to its original form. We look at the image. Since parts of it are still visible, it shouldn't be too hard to infer the missing parts. So, we accept their challenge.
Here's how we are going to fill in the gaps. We start by breaking the noise removal process into time steps, and at each time step, we decide what part of the image to fill. Let's divide the process into 5 time steps. The initial noisy image is at time step 5, and time step 1 is when we have removed all the noise present in the image.
In our initial noisy image, which is at time step 5, we notice the stick figure is holding something in its left arm. We can make the assumption that it is a suitcase or a box, so we can remove the noise by completing the drawing. This will result in a new image, now at time step 4.
At time step 4, we have less noise. Let's remove the noise at the stick figure's foot. Since we can't see the direction of its right foot, we can assume it is facing the same direction as its other foot. So, we remove the noise and draw the new foot based on that assumption. This gives us the image at time step 3.
At time step 3, we have to decide how to draw the stick figure's second arm. In what direction should we place the arm? Let's look at the noise. From the shape of the noise, we can assume the figure's arm is raised. The figure is apparently holding a suitcase, so it might be a businessman. What do businessmen tend to do? They tend to make a lot of calls. So, we can assume this figure is making a phone call with its other arm. We draw this, clean off the noise, and move on to time step 2.
At time step 2, our drawing is now pretty clear, with just a little noise left, which we can easily remove. This gives us our final image at time step 1, with no noise at all.
How were we able to fill in the gaps? Well, it came from our knowledge of the world. Over the course of our lives, we have seen a lot of stick figures, and we have an understanding of how stick figures should appear. We made assumptions based on our past experiences in life.
Our friend who gave us this challenge provides us with the original image and suggests that we compare it with what we had generated through noise removal.
Oops, we were wrong; the stick figure wasn't making a call with its other hand. But would you look at that? While trying to fill in the gaps in the noise, we created a totally new image. In theory, that's how diffusion models generate new data.
If we want a diffusion model to solve this task as we did, we need to show it a lot of images of stick figures. By a lot, I mean thousands or even millions.
Given access to such a dataset, the model develops an understanding of how stick figures should look. So, when we give it an image with noise, it behaves much like we did and works out, based on its training data, what most likely belongs behind that noise.
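To make "based on its training data" a bit more concrete, here is a deliberately tiny Python sketch. It is not a diffusion model or a neural network; it is a simple frequency count that I am using as a stand-in. The three four-pixel "training images", the use of `None` for a pixel hidden by noise, and the function names are all assumptions made up for this example.

```python
def learn_pixel_frequencies(dataset):
    """For each pixel position, the fraction of training images where it is on (1)."""
    count = len(dataset)
    width = len(dataset[0])
    return [sum(img[i] for img in dataset) / count for i in range(width)]


def fill_gaps(image, frequencies):
    """Replace hidden pixels (None) with the value seen most often in training."""
    return [
        (1 if frequencies[i] >= 0.5 else 0) if pixel is None else pixel
        for i, pixel in enumerate(image)
    ]


# Hypothetical training data: three tiny black-and-white images.
training_images = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
frequencies = learn_pixel_frequencies(training_images)

noisy_image = [1, None, None, 1]  # the two middle pixels are hidden by noise
restored = fill_gaps(noisy_image, frequencies)
print(restored)
```

The second pixel is on in two of the three training images, so it gets filled in as 1; the third pixel is never on, so it gets filled in as 0. A real model learns far richer patterns than this, but the idea of guessing what is most likely, given what it has seen before, is the same.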
The Need for Time Steps
While generating our image, we divided the process into time steps. Why? Breaking the task into time steps lets the model work through it gradually, much like we did. This is part of what makes diffusion models so good at generating images compared to other models. At every time step, the model only looks at the current image to produce the next one; it doesn't need to remember any of the earlier steps. A process with this property is known as a Markov chain.
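Here is a rough Python sketch of that step-by-step idea. The function `denoise_step` is a toy stand-in that I made up to play the role of a trained model: it simply nudges every pixel toward black (0) or white (1), whichever is closer. The point to notice is that each step receives only the current image and nothing from earlier steps, which is the Markov chain property. The five time steps match the example above; everything else is an assumption.

```python
def denoise_step(image, strength=0.5):
    """Toy stand-in for a trained model: pull each pixel toward 0 or 1."""
    return [
        pixel + strength * ((1.0 if pixel >= 0.5 else 0.0) - pixel)
        for pixel in image
    ]


def reverse_diffusion(noisy_image, num_steps=5):
    """Walk from time step `num_steps` down to time step 1."""
    image = list(noisy_image)
    history = [image]  # history[0] is the image at the noisiest time step
    for _ in range(num_steps - 1):
        # The next image depends only on the current one (Markov property).
        image = denoise_step(image)
        history.append(image)
    return history


history = reverse_diffusion([0.9, 0.2, 0.7, 0.4])
for t, img in zip(range(5, 0, -1), history):
    print(f"time step {t}:", [round(p, 3) for p in img])
```

Because every step only makes a small correction, no single step has to get the whole image right, which is the same reason we filled in the stick figure one limb at a time.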
Image Synthesis
In our last task, we focused on restoring an already existing image that was just noisy. Now, let's explore the challenge of creating an image without any preexisting image. The process of generating images from scratch is called image synthesis. This process isn't new; several other ML algorithms, such as GANs and VAEs, can also perform it. However, diffusion models are currently considered among the best at generating realistic images.
So, how can we get our model to make images out of thin air, or should I say, out of thin noise? Before we delve into that, let's consider how humans would achieve this task.
Our friend from before presents us with a highly noisy image, asking us to clean the image once again.
We take a look at the image and tell them it is indeed very noisy. They reply that they know, but that we should examine the image and use the patterns in the noise to recreate it.
Like the previous example, we will divide our noise removal process into time steps. From the noisy image above, I can see a couple of black dots. Let's try to connect these dots and see what we can come up with.
After connecting the dots, we obtain two lines for this current time step. We then move on to the next time step.
In this time step, we connect our other two dots to the currently existing lines. I can definitely see a stick figure now. We can argue that our stick figure has limbs now, but then there's still one dot to connect.
We connect our existing lines to that dot, and we have what appears to be a stick figure with no head. Now, we have to carefully consider where to place its head.
Let's place the head at the bottom of the image. Although it is unlikely for a stick figure to be upside down, it is even more unlikely for us to have a stick figure with longer arms than legs. We can say the stick figure is somersaulting.
We then remove the remaining noise and present the figure to our friend, declaring that we are done. We then ask for the original image so we can compare it with our generated image. However, our friend replies that there never was an original image; all they gave us was noise.
That's unexpected; I could have sworn I saw a stick figure somersaulting in that noise. Well, this is what we humans do. We sometimes see patterns where there are none, and it can lead to some of the most artistic results. Take the ancient Greeks, for instance.
They saw these dots in the night sky and envisioned a hunter holding a club and a shield. This interpretation gave birth to the Orion constellation, even though there wasn't a hunter in the sky; it was just noise in the form of stars.
Over the course of generations (time steps), various artists have reimagined the Orion constellation, making it look more realistic.
This is essentially what diffusion models do. They start from pure noise, but thanks to their vast training data, they start seeing patterns that aren't really there. These apparent patterns are enough for them to create novel and realistic images.
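As a final illustration, here is a small Python sketch of generating an "image" from nothing but noise. As before, `denoise_step` is a toy rule I invented to stand in for a trained model, and the eight-pixel image size is arbitrary. I also use simple uniform random numbers for the starting noise to keep the code short, whereas real diffusion models start from Gaussian noise.

```python
import random


def denoise_step(image, strength=0.5):
    """Toy stand-in for a trained model: pull each pixel toward 0 or 1."""
    return [
        pixel + strength * ((1.0 if pixel >= 0.5 else 0.0) - pixel)
        for pixel in image
    ]


def generate(num_pixels=8, num_steps=5, seed=0):
    """Start from pure noise and denoise step by step into a black-and-white image."""
    rng = random.Random(seed)
    image = [rng.random() for _ in range(num_pixels)]  # pure noise, no original
    for _ in range(num_steps - 1):
        image = denoise_step(image)
    return [round(pixel) for pixel in image]  # snap to clean 0/1 pixels


for seed in range(3):
    print(f"seed {seed}:", generate(seed=seed))
```

There was never an original image here, only random numbers. Changing the seed changes the starting noise, which will usually lead to a different final image, much like two people seeing different shapes in the same cloud.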
You are probably asking, 'When I use DALL-E, I provide a prompt, and then it generates an image. How does this work?' Well, it is a text-guided diffusion model. That will be the subject of the next article.
If you found this article interesting or useful, please drop a ❤️. Share it with anyone who is curious about Generative AI. Feel free to drop a comment if you have any questions.