Lately, I have been troubleshooting memory issues on a process my team owns.
I started by watching a video about memory leaks by Benoît Jacquemont.
I learned a lot in the process, but also noticed the nice memory graphs in the video and figured it would be hard to troubleshoot anything if I didn't have them.
When using PHP extensions such as Benoît's or Arnaud Le Blanc's to take a snapshot of the memory, it's important to think about the most appropriate moment to take that snapshot, so that you actually capture the memory leak you might be hunting.
Sure, you can use Monolog's MemoryUsageProcessor to that end, but I thought it would be more useful to get something a bit more ✨visual✨.
In our environments, we use Datadog, but I don't have that on my development setup.
There are several metrics you can track when troubleshooting a memory issue: some are provided by the OS and typically reported by Datadog, others by PHP through memory_get_usage() (which I don't currently have a way to monitor in production).
Measuring from PHP
PHP provides several functions to understand what is happening memory-wise. First, you have memory_get_usage(), which takes a boolean argument. Depending on that argument, the function returns either the memory actually used by PHP (false) or the memory allocated to PHP by the system (true). When freeing some memory, you will typically see the former decrease while the latter stays stable.
Then, you have memory_get_peak_usage(), which takes the same boolean argument and reports the highest value of used or allocated memory since the beginning of the script. That's useful because it can help the developer figure out that they are not calling memory_get_usage() at the point where memory usage is at its highest.
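To make this concrete, here is a minimal sketch (the exact figures will vary with your machine and PHP version) that allocates a large array, frees it, and prints all four values; only memory_get_usage(false) drops back down after the unset():
<?php

$fmt = fn (int $bytes): string => sprintf('%.2f MB', $bytes / 1024 / 1024);

$big = range(1, 1_000_000); // allocate a large array...
unset($big);                // ...then free it again

// Used memory drops back after unset(); allocated memory usually
// does not, and the peaks keep the highest value seen so far.
echo 'used:           ' . $fmt(memory_get_usage(false)) . "\n";
echo 'allocated:      ' . $fmt(memory_get_usage(true)) . "\n";
echo 'peak used:      ' . $fmt(memory_get_peak_usage(false)) . "\n";
echo 'peak allocated: ' . $fmt(memory_get_peak_usage(true)) . "\n";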
Producing metrics
In the case of the script I was troubleshooting, I had a main loop that was frequently executed (but not at an even rate). That's still a good candidate for gathering metrics, as we will see.
Here is what I put inside that loop:
<?php

// Columns: timestamp, allocated MB, used MB, peak allocated MB, peak used MB.
file_put_contents(
    'memory.tsv',
    time() . "\t" .
    memory_get_usage(true) / 1024 / 1024 . "\t" .
    memory_get_usage(false) / 1024 / 1024 . "\t" .
    memory_get_peak_usage(true) / 1024 / 1024 . "\t" .
    memory_get_peak_usage(false) / 1024 / 1024 . "\n",
    FILE_APPEND
);
This produces a TSV file that looks like this:
1660831187 20 8.1252593994141 20.359375 19.608978271484
1660831187 20 8.1281814575195 20.359375 19.608978271484
1660831187 20 8.131103515625 20.359375 19.608978271484
1660831187 20 8.134033203125 20.359375 19.608978271484
1660831187 20 8.1369552612305 20.359375 19.608978271484
1660831187 20 8.1398773193359 20.359375 19.608978271484
1660831190 22 8.2328033447266 24.42578125 22.869613647461
1660831190 22 8.2357330322266 24.42578125 22.869613647461
1660831190 22 8.2386627197266 24.42578125 22.869613647461
1660831190 22 8.2415924072266 24.42578125 22.869613647461
Looking at the first column, what stands out is that there are groups of lines several seconds apart: the metrics are not produced at a regular pace at all.
Plotting graphs
Then, to create the graph, I turned to gnuplot, which seems like a whole universe of its own, as well as a very robust piece of software. I started by creating the following configuration file:
# config.plt
set term png small size 800,600
set output "/tmp/memory_get_usage-graph.png"
set ylabel "memory in MB"
set yrange [0:*]
set xdata time # x is not just a random number
set timefmt "%s" # we use UNIX timestamps
plot "memory.tsv" using 1:2 with lines axes x1y1 title "memory_get_usage(true) in MB", \
"memory.tsv" using 1:3 with lines axes x1y1 title "memory_get_usage(false) in MB", \
"memory.tsv" using 1:4 with lines axes x1y1 title "memory_get_peak_usage(true) in MB", \
"memory.tsv" using 1:5 with lines axes x1y1 title "memory_get_peak_usage(false) in MB"
As you can see, it's possible to let gnuplot know that the X axis represents time, which ensures its labels are nicely formatted.
The graph is created by running gnuplot config.plt.
Rendering the graph
What would be handy is a graph that refreshes over time. For that, you will need two tiny programs: watch and feh.
Run watch gnuplot config.plt to ensure a png is created every 2 seconds (which is watch's default).
In parallel, run feh /tmp/memory_get_usage-graph.png to display the png file. What's great about feh is that it refreshes automatically, so you don't need to do anything special to get your live graph. 🤯 feh does very little, but does it well.
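Put together, with each command in its own terminal:
# terminal 1: regenerate the png every 2 seconds (watch's default)
watch gnuplot config.plt

# terminal 2: display the png; feh picks up the new file automatically
feh /tmp/memory_get_usage-graph.png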
Two interesting things to note here:
- Only the graph for memory_get_usage(false) goes down, but it does go down, so there is no memory leak.
- The defaults of gnuplot are a bit ugly, and I am no frontend developer, so it will stay ugly.
Measuring from Linux
Producing metrics
Here, to produce the metrics, you can use ps.
while true; do
    ps --pid "$(pgrep -f some_string_that_identifies_your_process)" \
        -o pid=,%mem=,vsz= >> /tmp/mem.log
    gnuplot config.plt
    sleep 1
done
Note that you can of course use this for any process, not just PHP processes.
Plotting graphs
This time it's a bit trickier: I am telling gnuplot to plot two metrics that have different units on the same graph. The left Y axis will have a scale for the first metric, and the right Y axis will have a scale for the second. I do not configure the X axis this time, since I'm producing metrics at a regular pace.
This is all shamelessly stolen from Stack Overflow.
set term png small size 800,600
set output "/tmp/mem-graph.png"
set ylabel "VSZ"
set y2label "%MEM"
set ytics nomirror
set y2tics nomirror in
set yrange [0:*]
set y2range [0:*]
plot "/tmp/mem.log" using 3 with lines axes x1y1 title "VSZ", \
"/tmp/mem.log" using 2 with lines axes x1y2 title "%MEM"
Here you can see that the figures differ from the ones reported inside PHP (note that ps reports VSZ in KiB). I will not get into why, because that is off-topic, but when troubleshooting memory issues, it can also be important to compare both views.
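If you want both views side by side from inside the script, here is a minimal sketch; it is Linux-only, since it reads the VmRSS field (the resident set size, as the kernel sees it) from /proc/self/status:
<?php

// Linux only: read the OS view of this process's memory from procfs.
$status = file_get_contents('/proc/self/status');
preg_match('/^VmRSS:\s+(\d+)\s+kB/m', $status, $matches);

printf("VmRSS (OS view):         %.2f MB\n", $matches[1] / 1024);
printf("memory_get_usage(true):  %.2f MB\n", memory_get_usage(true) / 1024 / 1024);
printf("memory_get_usage(false): %.2f MB\n", memory_get_usage(false) / 1024 / 1024);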
Takeaway
Those graphs helped me understand the differences between memory_get_usage(true) and memory_get_usage(false), and gave me a better understanding of my application. In particular, I understood that the batch processing I was doing relied on batches of objects that were not all the same size, and that making sure they were all roughly the same size would help avoid situations where a series of big objects caused an out-of-memory error.
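For illustration, here is a minimal sketch of that idea: capping batches by an estimated size rather than by a fixed count. The estimateSize() helper is hypothetical; in practice you would use whatever size proxy your domain offers (serialized length, number of child records, ...).
<?php

// Hypothetical helper: returns a rough size estimate for one item.
// (Replace with a domain-specific proxy: serialized length, row count, ...)
function estimateSize(object $item): int
{
    return strlen(serialize($item));
}

/** Yield batches whose estimated total size stays under $maxBytes. */
function batchBySize(iterable $items, int $maxBytes): \Generator
{
    $batch = [];
    $size = 0;
    foreach ($items as $item) {
        $itemSize = estimateSize($item);
        if ($batch !== [] && $size + $itemSize > $maxBytes) {
            yield $batch; // flush before the batch grows too big
            $batch = [];
            $size = 0;
        }
        $batch[] = $item;
        $size += $itemSize;
    }
    if ($batch !== []) {
        yield $batch;
    }
}
That way, a burst of unusually large objects flushes early instead of piling up in one oversized batch.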