DEV Community

bakenator

Supervisor Intensity, what is it?

Did you know that Elixir Supervisors will stop trying to restart a child process if they detect something has gone haywire in the child?

They do!

This leads to the next question: how do they know when a child process is a lost cause?

Supervisor Intensity!

What is it?

The intensity setting of a Supervisor is how many restarts it will tolerate within a certain period of time. The count covers all of the Supervisor's children together, not each child separately. If more restarts than that are needed, the Supervisor stops all of its child processes and then exits itself.

The intensity and period are optional settings that can be set in Elixir like so:

opts = [..., max_restarts: 3, max_seconds: 5]
Supervisor.start_link(children, opts)

Note that I have used the values that Elixir currently defaults to, 3 failures over a 5 second period.
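
To make the snippet above concrete, here is a minimal sketch you can paste into iex or a script. It assumes an empty child list and a :one_for_one strategy purely for illustration; neither comes from my real app.

```elixir
# Minimal sketch: a supervisor with its restart intensity spelled out.
# `children` is empty here purely for illustration.
children = []

opts = [
  strategy: :one_for_one, # required by Supervisor.start_link/2
  max_restarts: 3,        # tolerate at most 3 restarts...
  max_seconds: 5          # ...within a 5 second period
]

{:ok, sup} = Supervisor.start_link(children, opts)

# With no children, every counter in the returned map is 0.
IO.inspect(Supervisor.count_children(sup))
```

Leaving out max_restarts and max_seconds gives the same behavior, since these are the default values.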

When Do You Run Into This?

I figured this out when stepping through a nice bug. At the top of one of my child processes I had this line

def start_link(port: port, dispatch: dispatch) do
    {:ok, socket} = :gen_tcp.listen(port, active: false, packet: :http_bin, reuseaddr: true)
    ...

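I was crashing the process on purpose, and afterwards the entire app would shut down. This confused me, because shouldn't the Supervisor restart the process?
The issue was that :gen_tcp.listen returns an error, {:error, :eaddrinuse}, if the port is already in use by another socket.

So on every restart after the first start, the {:ok, socket} match fails on the first line of start_link and the child never comes up. The original listening socket was most likely still open: start_link runs inside the Supervisor process, so the Supervisor owned that socket and it outlived the crashed child. The Supervisor very quickly tries and fails on that same bug until it exceeds its restart limit, and then the Supervisor itself shuts down.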
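
You can reproduce the failure outside of a Supervisor. This is a small sketch, not my server code: it asks the OS for a free port by listening on port 0, then tries to listen on that same port a second time. The socket options here are chosen for the demo only.

```elixir
# Bind an OS-assigned free port (port 0), then look up which port we got.
{:ok, first} = :gen_tcp.listen(0, [:binary, active: false])
{:ok, port} = :inet.port(first)

# A second listener on the same port is refused while the first is open.
second = :gen_tcp.listen(port, [:binary, active: false])
IO.inspect(second, label: "second listen")

# A strict {:ok, socket} match turns that error tuple into a crash.
outcome =
  try do
    {:ok, _socket} = :gen_tcp.listen(port, [:binary, active: false])
    :listening
  rescue
    error in MatchError -> {:crashed, error.term}
  end

IO.inspect(outcome, label: "strict match")

:gen_tcp.close(first)
```

Inside start_link that MatchError is what the Supervisor sees on each restart attempt.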

Why Do Supervisors Do This?

In short, it is to prevent infinite loops of processes being restarted over and over.

However, note that the Supervisor does not just stop trying to restart the child. If the intensity limits are surpassed, the Supervisor shuts itself down.

Why?

It shuts down so that any possible Supervisor of that Supervisor can be notified that something wacky is going on and try to fix the issue. The original Supervisor already gave it its best effort and is now basically passing the issue up a level to ask for help.
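
Here is a rough sketch of that hand-off, assuming two levels of supervision with default intensity settings. The Escalation.Crasher and Escalation.Inner modules are hypothetical names made up for this demo. The inner Supervisor reports each of its starts to the script, so you can see the outer Supervisor restarting it before the outer one gives up as well.

```elixir
defmodule Escalation.Crasher do
  # Hypothetical worker that exits abnormally 20 ms after it starts.
  def child_spec(_arg) do
    %{id: __MODULE__, start: {__MODULE__, :start_link, []}}
  end

  def start_link do
    pid =
      spawn_link(fn ->
        Process.sleep(20)
        exit(:boom)
      end)

    {:ok, pid}
  end
end

defmodule Escalation.Inner do
  use Supervisor

  def start_link(observer), do: Supervisor.start_link(__MODULE__, observer)

  @impl true
  def init(observer) do
    # Tell the observing process every time this supervisor is (re)started.
    send(observer, :inner_started)
    Supervisor.init([Escalation.Crasher], strategy: :one_for_one)
  end
end

# Trap exits so this process survives when the linked outer supervisor stops.
Process.flag(:trap_exit, true)

{:ok, outer} =
  Supervisor.start_link([{Escalation.Inner, self()}], strategy: :one_for_one)

ref = Process.monitor(outer)

outer_exit =
  receive do
    {:DOWN, ^ref, :process, _pid, reason} -> reason
  after
    5_000 -> :still_running
  end

# Count the start notifications still sitting in this process's mailbox.
{:messages, messages} = Process.info(self(), :messages)
inner_starts = Enum.count(messages, &(&1 == :inner_started))

IO.inspect(outer_exit, label: "outer supervisor exit reason")
IO.inspect(inner_starts, label: "times the inner supervisor was started")
```

The inner Supervisor gets started more than once, which is the outer Supervisor trying to help, and the chain only ends when the outer one runs out of patience too.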

Try at Home!

Here is a nice little module you can try out in your own app.

Drop this line into your Supervisor child start list

{FailTwoSeconds, []}

And add this module to your project

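defmodule FailTwoSeconds do
    # Called by the Supervisor on every (re)start of this child.
    def start_link([]) do
        IO.inspect "New Process Starting"

        # The linked process lives for 2 seconds, then crashes.
        pid = spawn_link(fn ->
            Process.sleep(2000)
            raise "Failing Now"
        end)
        {:ok, pid}
    end

    # Tells the Supervisor how to start this child.
    def child_spec(opts) do
        %{
            id: FailTwoSeconds,
            start: {FailTwoSeconds, :start_link, [opts]}
        }
    end
end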

We can see that the module exists just to start up, wait 2 seconds, and crash.

But crashing every 2 seconds is within the default intensity settings for Elixir. So this module will go on crashing and being restarted til the cows come home.

If you switch the sleep time to 1000, the fourth crash lands inside the 5 second window. That trips the Supervisor's intensity limits, and the Supervisor will say "I've had too much!" and shut down.
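
The arithmetic behind the 2000 versus 1000 difference can be sketched as a tiny helper. IntensityMath is a hypothetical module, and this is a simplified model: it assumes evenly spaced crashes and ignores the time a restart takes and how the Supervisor rounds its clock.

```elixir
defmodule IntensityMath do
  # Simplified model, not the Supervisor's real bookkeeping.
  # The limit is exceeded once (max_restarts + 1) restarts fit into one
  # window, and that many evenly spaced restarts span max_restarts intervals.
  def trips?(interval_ms, max_restarts \\ 3, max_seconds \\ 5) do
    interval_ms * max_restarts <= max_seconds * 1000
  end
end

IO.inspect(IntensityMath.trips?(2000), label: "crash every 2000 ms trips the limit?")
IO.inspect(IntensityMath.trips?(1000), label: "crash every 1000 ms trips the limit?")
```

With a 2000 ms interval, four restarts need 6 seconds, which does not fit in the 5 second window. With 1000 ms they need only 3 seconds.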

Thanks for reading and hope this saves you a headache one day!

Top comments (4)

Jan Wedel

Thanks for the article! So how did you eventually configure the supervisor for the tcp socket example? I guess it could happen hundreds of times per second, couldn't it?

bakenator

For this failing example, within my server I was using spawn_link(process_request_function). You can get a lot of added safety by switching this to spawn(process_request_function), so that a crashing request handler does not take the server process down with it.

But I did figure out a way to restart the server safely for fun.

def start_link(port: port, dispatch: dispatch) do
    case :gen_tcp.listen(port, active: false, packet: :http_bin, reuseaddr: true) do 
        {:ok, socket} ->
            Logger.info("Accepting connections on port #{port}")
            # saving socket to close in case of error
            MyApp.SocketStore.set(socket)
            {:ok, spawn_link(Http, :accept, [socket, dispatch])}
        _ ->
            :gen_tcp.close(MyApp.SocketStore.get())
    end
end

Right after the server starts, I save the socket in a singleton GenServer.
Then on failure I close the socket and let the start attempt fail.
It fails because start_link does not return an {:ok, pid} tuple.

Then next time the Supervisor restarts the process, it should succeed since there is no socket bound to the port.

Not the most robust, but it works for this toy example.

Jan Wedel

Thanks for the example. I remember that sometimes the port stays occupied for a longer time. I've seen this a couple of times, even when no process is actually using it. There is just some delay until it's freed by the OS. So in that case it still would not help, right?

bakenator

I haven't used this handmade server very much, but what you are saying about the delay in closing the socket sounds like it could happen.

There may be a blocking command that I am unfamiliar with to check whether the port is free. Otherwise I would think of putting a sleep command in there to give the OS time to free it.
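
A rough sketch of that sleep idea, assuming a hypothetical ListenRetry helper that is not part of :gen_tcp. It retries only on :eaddrinuse, waits a fixed delay between attempts, and gives up after a set number of tries. The attempt count and delay are made-up defaults.

```elixir
defmodule ListenRetry do
  # Hypothetical helper: retry :gen_tcp.listen/2 while the port is still busy.
  def listen(port, opts, attempts \\ 5, delay_ms \\ 200) do
    case :gen_tcp.listen(port, opts) do
      {:ok, socket} ->
        {:ok, socket}

      {:error, :eaddrinuse} when attempts > 1 ->
        # Give the OS (or the previous owner) time to release the port.
        Process.sleep(delay_ms)
        listen(port, opts, attempts - 1, delay_ms)

      {:error, reason} ->
        {:error, reason}
    end
  end
end

# Usage: hold a port, watch the retries give up, then free it and try again.
{:ok, holder} = :gen_tcp.listen(0, [:binary, active: false])
{:ok, port} = :inet.port(holder)

busy = ListenRetry.listen(port, [:binary, active: false], 2, 10)
:ok = :gen_tcp.close(holder)
free = ListenRetry.listen(port, [:binary, active: false], 2, 10)

IO.inspect(busy, label: "while the port is held")
IO.inspect(free, label: "after the port is released")
```

Since start_link runs inside the Supervisor, any sleeping done there blocks the Supervisor too, so the total wait should probably stay short.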