Recently, while working on an Elixir project, I ran into an interesting gotcha with Agents that caused massive resource usage. Read on to find out what happened.
What are Agents in Elixir?
Agents are a simple abstraction around state.
Often in Elixir there is a need to share or store state that must be accessed from different processes or by the same process at different points in time.
The Agent module provides a basic server implementation that allows state to be retrieved and updated via a simple API.
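As a minimal sketch of that API (the map state and the `:greeting` key are made up for illustration), starting an agent, updating its state and reading it back could look like this:

```elixir
# Start an agent whose initial state is an empty map.
{:ok, agent} = Agent.start_link(fn -> %{} end)

# Update the state; the function runs inside the agent process.
:ok = Agent.update(agent, fn state -> Map.put(state, :greeting, "hello") end)

# Read a value back; only the function's result is sent to the caller.
greeting = Agent.get(agent, fn state -> Map.get(state, :greeting) end)

IO.inspect(greeting)
# prints "hello"
```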
In Elixir all data is immutable and processes share nothing by default. This has many benefits, but it also means that when you do want to share data between processes you need to do some extra work. Fortunately, Elixir provides a lot of great building blocks for this, such as Agents, ETS and Mnesia.
So what’s the problem?
There are two ways to use the state stored in an agent:
- By operating on the data from within the Agent's process:
  # Compute in the agent/server
  def get_something(agent) do
    Agent.get(agent, fn state -> do_something_expensive(state) end)
  end
- By pulling the data into the client process and operating on it there:
  # Compute in the client
  def get_something(agent) do
    Agent.get(agent, & &1) |> do_something_expensive()
  end
If you look at the code the differences are very subtle. The difference in behaviour, however, is not subtle.
In approach #1 the data will remain in the Agent process. However, an Agent handles one request at a time, so if you perform expensive operations there the agent will be blocked for the entire duration of the operation, meaning no other process can access the data until the operation is finished. Using this model to respond to HTTP requests kills performance.
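The blocking behaviour of approach #1 is easy to reproduce. The following is a small sketch, not code from the project: the 200 ms sleep stands in for an expensive computation, and the list of integers stands in for real state.

```elixir
{:ok, agent} = Agent.start_link(fn -> Enum.to_list(1..1_000) end)
parent = self()

# A slow caller keeps the agent busy for about 200 ms.
slow =
  Task.async(fn ->
    Agent.get(agent, fn state ->
      # Tell the parent that the agent has started the slow work.
      send(parent, :agent_busy)
      Process.sleep(200)
      length(state)
    end)
  end)

# Wait until the agent is definitely busy.
receive do
  :agent_busy -> :ok
end

# This trivial read has to wait until the slow function returns.
{micros, count} = :timer.tc(fn -> Agent.get(agent, &length/1) end)
blocked_ms = div(micros, 1000)

Task.await(slow)
IO.puts("cheap read of #{count} items waited about #{blocked_ms} ms")
```

The cheap read itself takes almost no time; nearly all of the measured wait is spent queued behind the slow call in the agent's mailbox.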
In approach #2 the Agent will not be blocked, but the data will be copied into the process that is accessing the data. When the amount of data is small this is not really a problem, but if you start storing larger amounts of data this becomes really expensive real quick.
Real life example
The impact of this can be huge as I will demonstrate in the case below.
In the project I’m working on we were storing a set of rules in an Agent. A rule is a struct with 27 fields and we were storing approximately 5000 rules in the Agent. There’s an HTTP endpoint that uses these rules on every request to determine the response.
For a while this was fine, but when the load started increasing we noticed the server running out of memory. To debug this I started throwing load at the endpoint using wrk. Results below:
Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.23s   553.43ms   3.00s    57.78%
    Req/Sec    59.72    121.98   680.00     88.64%
  Latency Distribution
     50%    2.24s
     75%    2.67s
     90%    2.88s
     99%    3.00s
  5551 requests in 1.00m, 11.80MB read
  Socket errors: connect 0, read 5971, write 3, timeout 5506
  Non-2xx or 3xx responses: 4575
Requests/sec:     92.38
Transfer/sec:    201.05KB
As you can see 92 requests per second are handled and a lot of requests time out (take more than 3 seconds). During the test, the Elixir process consumed around 10GB of memory.
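To get a feel for why memory grows this fast, you can estimate how much data a single `Agent.get(agent, & &1)` has to copy. The sketch below is an assumption-laden approximation: the rules are made-up maps with 27 placeholder fields rather than the real structs, and `:erts_debug.flat_size/1` is a debugging helper that reports the size of a term in machine words (large binaries, which live off-heap, are not fully counted).

```elixir
# Hypothetical rule: a map with 27 placeholder fields (not the real struct).
build_rule = fn id ->
  Map.new(1..27, fn n -> {:"field_#{n}", "value #{id}-#{n}"} end)
end

rules = Enum.map(1..5_000, build_rule)

# Size of the whole term in words, converted to bytes.
words = :erts_debug.flat_size(rules)
bytes_per_copy = words * :erlang.system_info(:wordsize)

# Rough upper bound if 1000 in-flight requests each held their own copy.
bytes_for_1000_requests = 1_000 * bytes_per_copy

IO.puts("one copy: ~#{div(bytes_per_copy, 1024)} KiB")
IO.puts("1000 copies: ~#{div(bytes_for_1000_requests, 1024 * 1024)} MiB")
```

The numbers depend entirely on what the rules actually contain, so treat the output as an order-of-magnitude estimate, not a measurement of the project described here.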
Solutions
As we’ve seen in the previous section, storing these amounts of data in an Agent requires a lot of memory and performance is frankly not great.
Looking at the code and reading the Agent documentation, I quickly realised that the root cause of this issue was the fact that all rules were copied to the process handling the HTTP request, for every request. So how can we prevent this?
‘Shared nothing’ is a core principle of Elixir/Erlang, so the short answer is that you can’t prevent the data from being copied if you want to share it between processes. This affects all ways of storing data in memory, not just Agents.
There are workarounds, like fast_global. It works by dynamically compiling a module at runtime, but it’s not without drawbacks.
So the solution is to make sure the data does not have to be shared between processes. There are a variety of ways to do this. The approach I took was to create a pool of worker processes (with Poolboy) that handle executing the rules. When an HTTP request comes in, the rule matching is handled by one of the worker processes.
In code this looks roughly like this (simplified):
defmodule Worker do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, nil, [])
  end

  # Copy the rules from the Agent once, when the worker starts.
  def init(_) do
    rules = State.get()
    {:ok, rules}
  end

  # Matching runs inside the worker, against its own copy of the rules.
  def handle_call({:match_rules, input}, _from, rules) do
    matches = match_rules(rules, input)
    {:reply, matches, rules}
  end
end
When a worker starts it loads (copies) the rules from the Agent (State is a module wrapping the Agent) into the worker process. Each worker process contains a copy of the rules so the memory usage is predictable.
If the rules change at runtime, the processes are simply killed and restarted so the new rules will be used automatically. Poolboy takes care of starting N workers and selecting a worker from the pool.
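To make the data flow concrete, here is a self-contained sketch of the same idea without the pool. Everything in it is illustrative: `Sketch.State` is a stand-in for the State wrapper, the rules are tiny maps, and the matching logic (comparing a `:path` field to the input) is a placeholder for the real rule matching. Poolboy is left out so the example runs with the standard library alone.

```elixir
defmodule Sketch.State do
  # Hypothetical stand-in for the State module: an Agent holding the rules.
  use Agent

  def start_link(rules), do: Agent.start_link(fn -> rules end, name: __MODULE__)

  # Copies the full rule set into the calling process.
  def get, do: Agent.get(__MODULE__, & &1)
end

defmodule Sketch.Worker do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, [])

  # Client API: only `input` and the matches cross the process boundary.
  def match(worker, input), do: GenServer.call(worker, {:match_rules, input})

  @impl true
  def init(_), do: {:ok, Sketch.State.get()}

  @impl true
  def handle_call({:match_rules, input}, _from, rules) do
    {:reply, match_rules(rules, input), rules}
  end

  # Placeholder logic: a rule matches when its :path equals the input.
  defp match_rules(rules, input), do: Enum.filter(rules, &(&1.path == input))
end

rules = for i <- 1..5_000, do: %{id: i, path: "/page/#{i}"}
{:ok, _state} = Sketch.State.start_link(rules)
{:ok, worker} = Sketch.Worker.start_link(nil)

matches = Sketch.Worker.match(worker, "/page/42")
IO.inspect(matches)
```

The full rule set is copied once per worker at startup. Each request then only sends the input to a worker and receives the matching rules back, so the amount of data copied per request stays small.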
End result
With that in place, wrk results started looking as follows:
Running 1m test @ http://localhost:4001
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.04s   270.71ms   1.66s    69.89%
    Req/Sec   221.14     65.34   405.00     66.08%
  Latency Distribution
     50%    1.07s
     75%    1.25s
     90%    1.36s
     99%    1.47s
  52823 requests in 1.00m, 14.41MB read
  Socket errors: connect 0, read 1014, write 0, timeout 0
Requests/sec:    879.16
Transfer/sec:    245.55KB
As you can see the throughput increased from 92 req/sec to 879 req/sec. Average latency went down from 2.23s to 1.04s. Memory used went down from 10GB to 400MB.
Not bad!
Top comments (2)
That was a really cool read! Thanks for sharing!
Could you talk a little bit about how you got to the hypothesis that Agent was the bottleneck? You said you tested the endpoint and your memory usage went up, but how did you know it was your Rules agents? Observers?
Thanks, that's a great question!
I noticed the memory increasing so rapidly that I figured it had to be copying the full set of rules somewhere. So reading the code and the Agent docs I stumbled on this snippet in the docs [1]:
Before reading this I didn't realize that using an Agent like this copies the whole state upon reading. It actually makes a lot of sense, it's just something I hadn't run into earlier.
[1] hexdocs.pm/elixir/Agent.html#modul...