First of all, what's the Zalgo Effect?
Zalgo Effect is a term used to describe the unexpected outcomes of mixing sync and async JavaScript code - meaning if you mix these two approaches, ~SOMETHING~ weird will happen.
It's one of those things you kinda don't understand until you see it in real production systems.
Well, on the internet and in pop culture it's more than that. Think of it like a monster.
So what does it have to do with Resource Leakage?
Occasionally, our SRE team received a couple of PagerDuty alerts claiming our services were restarting and unable to work properly due to Error: No channels left to allocate - i.e. RabbitMQ connections were maxing out their channel allocation. (See the RabbitMQ documentation on Channels and Connections for reference.)
(Of course, the screenshot is just an old print.)
It was clear some code was leaking channel creations. No one knew what it could be - but God, I had heard about this Zalgo Effect somewhere.
This "somewhere" is called Node.js Design Patterns: Design and Implement Production-grade Node.js Applications Using Proven Patterns and Techniques - a Packt book.
How was I so sure this "Zalgo" was the culprit?
The service throwing that error was only responsible for fanning out a couple of messages to a lot of other services - so reproducing it was as easy as creating our internal Queue object and running N promises concurrently to publish some messages, as the first thing the script ran.
The RabbitMQ Management UI showed me that we created roughly one channel for each promise.
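A minimal sketch of that repro - the Queue require path and the publish signature here are assumptions, not our actual internal API:

```js
const Queue = require('./queue'); // hypothetical path to our internal abstraction

async function main() {
  const queue = new Queue('amqp://localhost');

  // Fire N publishes concurrently - each promise ended up creating
  // its own confirm channel instead of sharing a single one.
  await Promise.all(
    Array.from({ length: 50 }, (_, i) =>
      queue.publish('some.exchange', { messageId: i })
    )
  );
}

main().catch(console.error);
```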
But why did it only happen in some scenarios?
That's where the Zalgo Effect pops in.
The PubSub code was built back in ~2015 - Node 4 - when the callback style was the go-to. Our engineers created the Queue abstraction, which by itself handled almost 50% of our Event-Driven Architecture. It was a very naive implementation - pointing to a single RabbitMQ node.
Our legacy code used async.series everywhere - so messages were always published sequentially. Later, when parallel processing was added, issues started to appear more frequently. Until then, it had always been a silent issue.
So the code assumed the following:
- Assert the exchange, queues and other necessary resources using one channel - which we could call consumeChannel.
- The consumeChannel is created whenever the connection is made - and it is always created.
- Our confirmChannel - i.e. the channel we used to publish events - was lazily created, only when we needed to publish some message via that connection.
- This code mixed async and sync code heavily.
Imagine the following:
We call assertConfirmChannel, which calls the AmqpLibConfirmChannelSingleton, and it then creates the channel. If the Singleton still has no inner instance to refer to, it fires a promise to create the confirmChannel and attaches the error listeners. Code example below:
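The original snippet was a screenshot, so here is a hedged reconstruction - the class and function names follow the ones mentioned above; everything else is an assumption:

```js
class AmqpLibConfirmChannelSingleton {
  static instance = null;

  static async getInstance(connection) {
    // BUG: this synchronous check and the async channel creation are
    // not atomic - `instance` is only assigned after the await resolves.
    if (!AmqpLibConfirmChannelSingleton.instance) {
      const channel = await connection.createConfirmChannel();
      channel.on('error', (err) => console.error('confirmChannel error:', err));
      AmqpLibConfirmChannelSingleton.instance = channel;
    }
    return AmqpLibConfirmChannelSingleton.instance;
  }
}

async function assertConfirmChannel(connection) {
  return AmqpLibConfirmChannelSingleton.getInstance(connection);
}
```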
What happens if two concurrent promises reach AmqpLibConfirmChannelSingleton.getInstance at the same time - in the same event-loop tick?
The code would reach the channel-creation line inside amqpLibChannelDecorator.createConfirmChannel N times while the Promise was not resolved yet.
It means we would have N promises created - mistakenly creating useless channels and then de-referencing them.
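You can see it happen with two calls in the same tick (again, a sketch - assume this runs inside an async function with an open connection):

```js
// Both calls check `instance` before either await has resolved, so
// both create a channel. The singleton keeps only the last one; the
// other stays alive inside the connection, unused.
const [a, b] = await Promise.all([
  AmqpLibConfirmChannelSingleton.getInstance(connection),
  AmqpLibConfirmChannelSingleton.getInstance(connection),
]);
console.log(a === b); // false - one channel just leaked
```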
The Node.js GC won't collect them either - the channel is just an abstraction and still has a reference inside the connection.
This is where the code was leaking channels.
Fixing the problem
The hotfix
Due to production hours, our hotfix was to simply await the first promise and then fan out the other promises.
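Roughly like this - a sketch of the idea, not the actual patch (the publish signature is assumed):

```js
// Hotfix: publish the first message alone so the confirmChannel is
// created exactly once, then fan out the rest concurrently.
await queue.publish('some.exchange', messages[0]);
await Promise.all(
  messages.slice(1).map((msg) => queue.publish('some.exchange', msg))
);
```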
The real fix - below - was shipped with the PubSub refactoring.
How did we fix this in the long term?
If you want a real solution, here's what the V2 looked like - the idea is to create the Promise and assign it to a variable, instead of doing await on it right away. Example below:
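A hedged sketch of that shape - the real refactoring was larger, and only amqp.connect here comes from the actual amqplib API:

```js
const amqp = require('amqplib');

class Queue {
  constructor(amqpUrl) {
    this.amqpUrl = amqpUrl;
    this.connection = null; // will hold a Promise, not a connection
  }

  getConnection() {
    // Memoize the promise itself, synchronously - no await here.
    if (!this.connection) {
      this.connection = amqp.connect(this.amqpUrl);
    }
    return this.connection;
  }
}
```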
- Notice how this.connection holds an unawaited promise.
This easily fixes the problem - by assigning the promise to a variable and checking for its existence before creating a new one.
For a more robust example - where you actually need to initialize a couple of resources - you could do something like the sketch after this list, which is the exact same idea:
- Create a function that executes the entire initialization Promise.
- Set up a reference to it.
- If the same resources are requested again, just reuse the same Promise.
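A sketch under the same assumptions (the resource names are illustrative; createChannel and createConfirmChannel are amqplib calls):

```js
const amqp = require('amqplib');

let initPromise = null; // shared reference to the one-and-only init

async function initResources(amqpUrl) {
  const connection = await amqp.connect(amqpUrl);
  const consumeChannel = await connection.createChannel();
  const confirmChannel = await connection.createConfirmChannel();
  return { connection, consumeChannel, confirmChannel };
}

function getResources(amqpUrl) {
  if (!initPromise) {
    // Store the promise before anyone awaits it - callers in the same
    // tick (or any later one) all share this single promise.
    initPromise = initResources(amqpUrl);
  }
  return initPromise;
}
```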
Ok - but why does it fix the problem?
Node.js runs upon the famous event loop.
TL;DR: it is an ever-running while(true) with some well-defined, ordered steps that pull functions from queues and execute them.
Basically, what happens in this example is too much synchronous code executing before the loop ever reaches the Poll phase - where the I/O callbacks are executed. Every getInstance call runs in the same tick, before the channel-creation callback can fire. Memoizing the promise makes the check synchronous, so every caller in that tick sees the same promise and shares the single channel.
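As a tiny illustration with the fixed getResources above:

```js
// Both calls run back to back, synchronously, before the event loop
// reaches the poll phase - so the second call already sees the
// promise the first one stored, and no second channel is requested.
const p1 = getResources('amqp://localhost');
const p2 = getResources('amqp://localhost');
console.log(p1 === p2); // true - one shared promise
```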