### What you will learn in this blog
Our RabbitMQ consultancy customers come from a wide range of industries. As a result, we have seen almost all of the unexpected behaviours that RabbitMQ can throw at you. RabbitMQ is a complex piece of software that employs concurrency and distributed computing (via Erlang), so debugging it is not always straightforward. To get to the root cause of an unexpected (and unwanted) behaviour, you need the right tools and the right methodology. In this article, we will demonstrate both to help you learn the craft of debugging in RabbitMQ.
### The problem of debugging RabbitMQ
The inspiration for this blog comes from a real-life example. One of our customers had the RabbitMQ Management HTTP API serving crucial information to their system. The system relied heavily on the API, specifically on the `/api/queues` endpoint, because it needed to know the number of messages ready in each queue in a RabbitMQ cluster. The problem was that sometimes an HTTP request to that endpoint took up to tens of seconds (in the worst case, they weren't even able to get a response from the API at all).
So what caused some requests to take so much time? To answer that question, we tried to reproduce the issue through load testing.
### Running load tests
We use a platform that we created for MongooseIM to run our Continuous Load Testing. Here are some of the most important aspects of the platform:
- all the services involved in a load test run inside Docker containers
- the load is generated by Amoc, an open-source tool written in Erlang for generating massively parallel loads of any kind (AMQP in our case); a minimal scenario skeleton is sketched after this list
- metrics from the system under test and from the Amoc side are collected for further analysis.
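For a flavour of what driving the load looks like, here is a minimal skeleton of an Amoc scenario module. This is only a sketch: the callback names follow the Amoc version we were using at the time and may differ in newer releases, and the actual AMQP logic is only indicated by a comment.

```
-module(amqp_publish_scenario).
-behaviour(amoc_scenario).

-export([init/0, start/1]).

%% Called once before any simulated users are started; shared set-up goes here.
init() ->
    ok.

%% Called for every simulated user; Id lets each user publish to its own queue.
start(Id) ->
    io:format("starting publisher ~p~n", [Id]),
    %% connect to RabbitMQ, declare a dedicated queue, publish in a loop...
    ok.
```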
The diagram below depicts a logical architecture of an example load test with RabbitMQ:
In the diagram, the left-hand side shows a cluster of Amoc nodes that emulate AMQP clients which, in turn, generate the load against RabbitMQ. On the other side, we can see a RabbitMQ cluster that serves the AMQP clients. All the metrics from both the Amoc and RabbitMQ services are collected and stored in an InfluxDB database.
### Slow Management HTTP API queries
We tried to reproduce the slow queries to the Management HTTP API in our load tests. The test scenario was fairly simple: a bunch of publishers published messages to the default exchange. Messages from each publisher were routed to a dedicated queue (each publisher had a dedicated queue), and there were also consumers attached to each queue. Queue mirroring was enabled.
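To illustrate the publishing side of that scenario, this is roughly what publishing to the default exchange with the RabbitMQ Erlang client looks like (a sketch; the host and payload are placeholders, and error handling is omitted):

```
-module(default_exchange_publisher).

-include_lib("amqp_client/include/amqp_client.hrl").

-export([publish_one/1]).

%% Publishing to the default exchange (<<>>) routes the message to the queue
%% whose name equals the routing key, i.e. one dedicated queue per publisher.
publish_one(QueueName) ->
    {ok, Connection} =
        amqp_connection:start(#amqp_params_network{host = "rabbitmq-1"}),
    {ok, Channel} = amqp_connection:open_channel(Connection),
    #'queue.declare_ok'{} =
        amqp_channel:call(Channel, #'queue.declare'{queue = QueueName}),
    ok = amqp_channel:cast(Channel,
                           #'basic.publish'{exchange = <<>>,
                                            routing_key = QueueName},
                           #amqp_msg{payload = <<"payload">>}),
    ok = amqp_channel:close(Channel),
    ok = amqp_connection:close(Connection).
```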
For concrete values, check the table below:
That setup stressed the RabbitMQ servers on our infrastructure, as seen in the graphs below:
Every RabbitMQ node consumed about 6 (out of 7) CPU cores and roughly 1.4GB of RAM, except for `rabbitmq-1`, which consumed significantly more than the others. That was likely because it had to serve more of the Management HTTP API requests than the other two nodes.
During the load test, the `/api/queues` endpoint was queried every two seconds for the list of all queues together with the corresponding `messages_ready` values. A query looked like this:
```
http://rabbitmq-1:15672/api/queues?columns=name,messages_ready
```
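To reproduce that polling outside of the load test, something along these lines can be run from an Erlang node (a sketch; the host, port and default guest:guest credentials match our test setup):

```
-module(api_queues_poller).
-export([poll/0]).

%% Poll /api/queues every two seconds and print how long each request takes;
%% uses OTP's built-in httpc client and the default guest:guest credentials.
poll() ->
    inets:start(),
    Url = "http://rabbitmq-1:15672/api/queues?columns=name,messages_ready",
    Auth = {"authorization",
            "Basic " ++ base64:encode_to_string("guest:guest")},
    loop(Url, [Auth]).

loop(Url, Headers) ->
    {Micros, {ok, {{_, Status, _}, _RespHeaders, _Body}}} =
        timer:tc(fun() -> httpc:request(get, {Url, Headers}, [], []) end),
    io:format("HTTP ~p in ~p ms~n", [Status, Micros div 1000]),
    timer:sleep(2000),
    loop(Url, Headers).
```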
Here are the results from the test:
The figure above shows the query time during the load test. It's clear that things are very slow: the median equals 1.5s, while the 95th, 99th and 99.9th percentiles and the max reach 20s.
### Debugging
Once the issue is confirmed and can be reproduced, we are ready to start debugging. The first idea was to find the Erlang function that is called when a request to the RabbitMQ Management HTTP API comes in and determine where that function spends its execution time. If we were able to do this, it would allow us to locate the most time-expensive code behind the API.
### Finding the entrypoint function
To find the function we were looking for, we took the following steps:
- looked through the RabbitMQ Management Plugin to find the appropriate "HTTP path to function" mapping,
- used the Erlang tracing feature to verify that the function we found is really called when a request comes in.
The management plugin uses cowboy (an Erlang HTTP server) underneath to serve the API requests. Each HTTP endpoint requires a cowboy callback module, so we easily found the `rabbit_mgmt_wm_queues:to_json/2` function, which seemed to handle requests coming to `/api/queues`. We confirmed that with tracing (using the recon library, which is shipped with RabbitMQ by default).
```
root@rmq-test-rabbitmq-1:/rabbitmq_server-v3.7.9# erl -remsh rabbit@rmq-test-rabbitmq-1 -sname test2 -setcookie rabbit
Erlang/OTP 21 [erts-10.1] [source] [64-bit] [smp:22:7] [ds:22:7:10] [async-threads:1]
Eshell V10.1 (abort with ^G)
(rabbit@rmq-test-rabbitmq-1)1> recon_trace:calls({rabbit_mgmt_wm_queues, to_json, 2}, 1).
1
11:0:48.464423 <0.1294.15> rabbit_mgmt_wm_queues:to_json(#{bindings => #{},body_length => 0,cert => undefined,charset => undefined,
has_body => false,
headers =>
#{<<"accept">> => <<"*/*">>,
<<"authorization">> => <<"Basic Z3Vlc3Q6Z3Vlc3Q=">>,
<<"host">> => <<"10.100.10.140:53553">>,
<<"user-agent">> => <<"curl/7.54.0">>},
host => <<"10.100.10.140">>,host_info => undefined,
media_type => {<<"application">>,<<"json">>,[]},
method => <<"GET">>,path => <<"/api/queues">>,path_info => undefined,
peer => {{10,100,10,4},54136},
pid => <0.1293.15>,port => 53553,qs => <<"columns=name,messages_ready">>,
ref => rabbit_web_dispatch_sup_15672,
resp_headers =>
#{<<"content-security-policy">> => <<"default-src 'self'">>,
<<"content-type">> => [<<"application">>,<<"/">>,<<"json">>,<<>>],
<<"vary">> =>
[<<"accept">>,
[<<", ">>,<<"accept-encoding">>],
[<<", ">>,<<"origin">>]]},
scheme => <<"http">>,
sock => {{172,17,0,4},15672},
streamid => 1,version => 'HTTP/1.1'}, {context,{user,<<"guest">>,
[administrator],
[{rabbit_auth_backend_internal,none}]},
<<"guest">>,undefined})
Recon tracer rate limit tripped.
```
The snippet above shows that we enabled tracing for `rabbit_mgmt_wm_queues:to_json/2` first, then we manually sent a request to the Management API (using curl; not visible in the snippet), which generated the trace event. That's how we found our entrypoint for further analysis.
### Using flame graphs
Having found a function that serves the requests, we can now check how that function spends its execution time. The ideal technique to do this is [Flame Graphs](http://www.brendangregg.com/flamegraphs.html). One of its [definitions](http://www.brendangregg.com/flamegraphs.html) states:
*Flame graphs are a visualisation of profiled software, allowing the most frequent code-paths to be identified quickly and accurately.*
In our case, we could use flame graphs to visualise the stack trace of the function or, in other words, which functions are called inside the traced function, and how much time it takes (relative to the traced function's execution time) for these functions to execute. This visualisation helps to identify suspicious functions in the code quickly.
For Erlang, there is a library called [eflame](https://github.com/proger/eflame) that has tools for both gathering traces from an Erlang system and building a flame graph from the data. But how do we inject that library into Rabbit for our load test?
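Before answering that, here is roughly how eflame is driven once it is available. This is a sketch based on our reading of the eflame README, so treat the exact arities and the helper script name as assumptions; the module and function below are placeholders.

```
%% Trace a single call and dump the raw stack samples to a file.
eflame:apply(normal_with_children, "stacks.out", some_module, some_function, []).
%% Then, outside of Erlang, turn the samples into an SVG flame graph:
%% $ stack_to_flame.sh < stacks.out > flames.svg
```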
### Building a custom RabbitMQ Docker image
As we mentioned previously, all the services in our load testing platform run inside Docker containers. Hence, we had to build a custom RabbitMQ Docker image with the eflame library included in the server code. We created a [rabbitmq-docker repository](https://github.com/esl/rabbitmq-docker) that makes it easy to build a Docker image with modified RabbitMQ source code.
### Profiling with eflame
Once we had a modified RabbitMQ Docker image with eflame included, we could run another load test (the specification was the same as for the previous test) and start the actual profiling. These were the results:
![Flame graph of a fast Management API query](https://i.imgur.com/gGz7pQc.png)

![Flame graph of a slow Management API query](https://i.imgur.com/AlhCzIX.png)
We ran a number of measurements and got two types of results, as presented above. The main difference between these graphs is in the `rabbit_mgmt_util:run_run_augmentation/2` function. What does that difference mean?
From the results of the previous load tests and manual code analysis, we know that there are slow and fast queries. The slow requests can take up to twenty seconds, while the fast ones take only a few. This is consistent with the query time chart above: the 50th percentile is about 1.5s, while the 95th (and higher) percentiles reach up to 20s. Moreover, we manually measured the execution time of both cases using [timer:tc/3](http://erlang.org/doc/man/timer.html#tc-3) and the results were consistent.
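For reference, such a measurement is just a `timer:tc/3` call wrapped around the function in question; here it is applied to a throwaway call purely to show the shape of the result:

```
%% timer:tc/3 returns {Microseconds, Result} for a single M, F, A invocation.
{Micros, _Result} = timer:tc(lists, seq, [1, 1000000]),
io:format("took ~p ms~n", [Micros div 1000]).
```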
The difference between fast and slow queries exists because there is a [cache](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db_cache.erl) in the Management plugin. When the cache is valid, the requests are served much faster as the data has already been collected, but when it's invalid, all the necessary information needs to be gathered.
Despite the fact that the graphs have the same width in the picture, they represent different execution times (fast vs slow), so it's hard to guess which graph shows which query without actually taking a measurement. The first graph shows a fast query, while the second shows a slow one. In the slow-query graph, the `rabbit_mgmt_util:augment/2 -> rabbit_mgmt_db:submit_cached/4 -> gen_server:call/3 -> …` stack takes so much time because the cache is invalid and fresh data needs to be collected. So what happens when the data is collected?
### Profiling with fprof
You might ask: "Why don't we see the data collection function(s) in the flame graphs?" This happens because the cache is implemented as another Erlang process and the data collection happens inside the cache [process](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db_cache.erl#L101-L119). The `gen_server:call/3` function visible in the graphs makes a call to the cache process and waits for a response. Depending on the cache state (valid or invalid), that response can come back quickly or slowly.
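To make that mechanism concrete, here is a minimal sketch of the caching pattern described above. It is a simplified illustration, not the actual `rabbit_mgmt_db_cache` code: a gen_server keeps the last computed value with a timestamp and, on each call, either replies immediately or recomputes when the value is older than the TTL.

```
-module(simple_cache).
-behaviour(gen_server).

-export([start_link/1, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% TtlMs: time in milliseconds after which the cached value is considered stale.
start_link(TtlMs) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, TtlMs, []).

%% Callers block in gen_server:call/3 until the cache process replies, which is
%% why a cache miss makes the whole HTTP request look slow.
fetch(ComputeFun, Timeout) ->
    gen_server:call(?MODULE, {fetch, ComputeFun}, Timeout).

init(TtlMs) ->
    {ok, #{ttl => TtlMs, value => undefined, computed_at => 0}}.

handle_call({fetch, ComputeFun}, _From,
            #{ttl := Ttl, value := Value, computed_at := At} = State) ->
    Now = erlang:monotonic_time(millisecond),
    case Value =/= undefined andalso Now - At < Ttl of
        true ->
            %% Cache hit: answer immediately with the stored data.
            {reply, {ok, Value}, State};
        false ->
            %% Cache miss: run the expensive data collection, then store it.
            NewValue = ComputeFun(),
            {reply, {ok, NewValue},
             State#{value := NewValue, computed_at := Now}}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.
```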
Collecting the data is implemented in the [`rabbit_mgmt_db:list_queue_stats/3`](https://github.com/rabbitmq/rabbitmq-management/blob/v3.7.9/src/rabbit_mgmt_db.erl#L357-L368) function, which is invoked from the cache process. Naturally, we should profile that function. We tried eflame and after **several dozen minutes** this is the result we got:
```
eheap_alloc: Cannot allocate 42116020480 bytes of memory (of type "old_heap").
```
The Erlang heap memory allocator tried to allocate **42GB** of memory (in fact, the space was needed for the [garbage collector](https://www.erlang-solutions.com/blog/erlang-19-0-garbage-collector.html) to operate) and crashed the server. As eflame takes advantage of Erlang tracing to generate flame graphs, it was most probably simply overwhelmed by the number of trace events generated by the traced function. That's where [fprof](http://erlang.org/doc/man/fprof.html) comes into play.
According to the official Erlang documentation fprof is:
*a Time Profiling Tool using trace to file for minimal runtime performance impact.*
That's very true. The tool dealt with the data-collecting function smoothly; however, it took several minutes to produce the result. The output was quite big, so only the crucial lines are listed below:
```
(rabbit@rmq-test-rabbitmq-1)96> fprof:apply(rabbit_mgmt_db, list_queue_stats, [RA, B, 5000]).
...
(rabbit@rmq-test-rabbitmq-1)97> fprof:profile().
...
(rabbit@rmq-test-rabbitmq-1)98> fprof:analyse().
...
% CNT ACC OWN
{[{{rabbit_mgmt_db,'-list_queue_stats/3-lc$^1/1-1-',4}, 803,391175.593, 105.666}],
{ {rabbit_mgmt_db,queue_stats,3}, 803,391175.593, 105.666}, %
[{{rabbit_mgmt_db,format_range,4}, 3212,390985.427, 76.758},
{{rabbit_mgmt_db,pick_range,2}, 3212, 58.047, 34.206},
{{erlang,'++',2}, 2407, 19.445, 19.445},
{{rabbit_mgmt_db,message_stats,1}, 803, 7.040, 7.040}]}.
```
The output consists of many of these entries. The function marked with the `%` character is the one that the current entry concerns. The functions below it are the ones that were called from the marked function. The third column (`ACC`) shows the total execution time of the marked function (the function's own execution time plus that of its callees) in milliseconds. For example, in the entry above, the total execution time of the `rabbit_mgmt_db:pick_range/2` function equals 58.047ms. For a detailed explanation of the fprof output, check the [official fprof documentation](http://erlang.org/doc/man/fprof.html#analysis-format).
The entry above is the top-level entry concerning `rabbit_mgmt_db:queue_stats/3`, which was called from the traced function. That function spent most of its execution time in the `rabbit_mgmt_db:format_range/4` function. We can go to the entry concerning that function and check where it, in turn, spent its execution time. This way, we can work through the output and find potential causes of the Management API slowness issue.
Reading through the fprof output in a top-down fashion we ended up with this entry:
```
{[{{exometer_slide,'-sum/5-anonymous-6-',7}, 3713,364774.737, 206.874}],
{ {exometer_slide,to_normalized_list,6}, 3713,364774.737, 206.874}, %
[{{exometer_slide,create_normalized_lookup,4},3713,213922.287, 64.599}, %% SUSPICIOUS
{{exometer_slide,'-to_normalized_list/6-lists^foldl/2-4-',3},3713,145165.626, 51.991}, %% SUSPICIOUS
{{exometer_slide,to_list_from,3}, 3713, 4518.772, 201.682},
{{lists,seq,3}, 3713, 837.788, 35.720},
{{erlang,'++',2}, 3712, 70.038, 70.038},
{{exometer_slide,'-sum/5-anonymous-5-',1}, 3713, 51.971, 25.739},
{garbage_collect, 1, 1.269, 1.269},
{suspend, 2, 0.151, 0.000}]}.
```
The entry concerns the `exometer_slide:to_normalized_list/6` function, which in turn called two "suspicious" functions from the same module. Going deeper, we found this:
```
{[{{exometer_slide,'-create_normalized_lookup/4-anonymous-2-',5},347962,196916.209,35453.182},
{{exometer_slide,'-sum/5-anonymous-4-',2}, 356109,16625.240, 4471.993},
{{orddict,update,4}, 20268881, 0.000,172352.980}],
{ {orddict,update,4}, 20972952,213541.449,212278.155}, %
[{suspend, 9301, 682.033, 0.000},
{{exometer_slide,'-sum/5-anonymous-3-',2}, 31204, 420.574, 227.727},
{garbage_collect, 99, 160.687, 160.687},
{{orddict,update,4}, 20268881, 0.000,172352.980}]}.
```
and:
```
{[{{exometer_slide,'-to_normalized_list/6-anonymous-5-',3},456669,133229.862, 3043.145},
{{orddict,find,2}, 19369215, 0.000,129761.708}],
{ {orddict,find,2}, 19825884,133229.862,132804.853}, %
[{suspend, 4754, 392.064, 0.000},
{garbage_collect, 22, 33.195, 33.195},
{{orddict,find,2}, 19369215, 0.000,129761.708}]}.
```
A lot of the execution time was consumed by `orddict:update/4` and `orddict:find/2` functions. These two combined accounted for **86%** of the total execution time.
This led us to the [`exometer_slide`](https://github.com/rabbitmq/rabbitmq-management-agent/blob/v3.7.9/src/exometer_slide.erl) module from the [RabbitMQ Management Agent Plugin](https://github.com/rabbitmq/rabbitmq-management-agent/tree/v3.7.9). If you look into the module, you'll find all the functions above and the connections between them.
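To see why these two calls dominate, recall that an orddict is just a sorted list, so `orddict:update/4` and `orddict:find/2` are linear in the size of the dict; with roughly twenty million invocations that adds up quickly. A hedged way to get a feel for the cost is a tiny comparison against maps on a similar update-heavy workload (illustrative only, not the exometer_slide workload itself):

```
-module(orddict_vs_maps).
-export([run/1]).

%% Insert and update N integer keys in both structures and print the timings.
run(N) ->
    Keys = lists:seq(1, N),
    {OrddictUs, _} =
        timer:tc(fun() ->
                     lists:foldl(fun(K, D) ->
                                     orddict:update(K, fun(V) -> V + 1 end, 1, D)
                                 end, orddict:new(), Keys)
                 end),
    {MapsUs, _} =
        timer:tc(fun() ->
                     lists:foldl(fun(K, M) ->
                                     maps:update_with(K, fun(V) -> V + 1 end, 1, M)
                                 end, #{}, Keys)
                 end),
    io:format("orddict: ~p ms, maps: ~p ms~n",
              [OrddictUs div 1000, MapsUs div 1000]).
```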
We decided to close the investigation at this stage, as this was clearly the issue. Now that we've shared our thoughts on the issue with the community in this blog, who knows, maybe we'll come up with a new solution together.
### The observer effect
There is one last thing that is essential to consider when it comes to debugging/observing systems: [the observer effect](https://en.wikipedia.org/wiki/Observer_effect_(physics)). The observer effect is the theory that if we monitor some kind of phenomenon, the process of observation changes that phenomenon.
In our example, we used tools that take advantage of tracing. Tracing has an impact on a system as it generates, sends and processes a lot of events.
Execution times of the aforementioned functions increased substantially when they were called with profiling enabled. Pure calls took several seconds, while calls with profiling enabled took several minutes. However, the difference between the slow and fast queries seemed to remain unchanged.
The observer effect was not evaluated in the scope of the experiment described in this blog post.
### A workaround solution
The issue can also be approached in a slightly different manner: is there another way of obtaining the queue names together with the number of messages ready in them?
There is the [`rabbit_amqqueue:emit_info_all/5`](https://github.com/rabbitmq/rabbitmq-server/blob/v3.7.9/src/rabbit_amqqueue.erl#L758) function that allows us to retrieve exactly the information we are interested in, directly from a queue process. We could use that API from a custom RabbitMQ plugin and expose an HTTP endpoint that returns that data when queried.
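At its core, the idea is to ask the queue processes directly instead of going through the management database. A simplified sketch of that (using `rabbit_amqqueue:list/1` and `rabbit_amqqueue:info/2`, which we assume are available in RabbitMQ 3.7.x internals; the real plugin uses `rabbit_amqqueue:emit_info_all/5` and aggregates the results asynchronously) could look like this:

```
%% Collect {queue name, messages_ready} for every queue in the given vhost by
%% querying each queue process directly, bypassing the management database.
%% Simplified illustration; see rabbitmq-queue-info for the real implementation.
queue_ready_counts(VHost) ->
    [begin
         Info = rabbit_amqqueue:info(Q, [name, messages_ready]),
         {proplists:get_value(name, Info),
          proplists:get_value(messages_ready, Info)}
     end || Q <- rabbit_amqqueue:list(VHost)].
```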
We turned that idea into reality and built a proof of concept plugin called [`rabbitmq-queue-info`](https://github.com/esl/rabbitmq-queue-info) that does exactly what's described above.
The plugin was even load tested (the test specification was exactly the same as for the management plugin test earlier in the blog). The results are below, and they speak for themselves:
![Query times with the rabbitmq-queue-info plugin](https://i.imgur.com/MQnUb8B.png)
### Want more
Want to know more about tracing in RabbitMQ, Erlang & Elixir? Check out WombatOAM, an intuitive system that makes monitoring and maintenance of your systems easy. [Get your free 45 day trial of WombatOAM now](https://www.erlang-solutions.com/products/wombatoam.html).
### Appendix
Version 3.7.9 of RabbitMQ was used in all the load tests mentioned in this blog post.
Special thanks go to Szymon Mentel and Andrzej Teleżyński for all the help with that publication.