Scenario
We need to make a POST request to a third-party API. The application is serverless and built as an event-driven architecture (EDA), so everything is asynchronous.
However, this API occasionally suffers unexpected outages, and we are exploring ways to deal with that.
This is a typical "trust no one" problem that arises when integrating with external partners or suppliers. The first thing that came to my mind was a circuit breaker. However, that pattern is usually associated with synchronous applications, and my application is entirely asynchronous. How can I achieve the same thing? First of all, let's recap.
What is a circuit breaker?
Visualise your app as a rock star holding an arena concert, and the Circuit Breaker is that guy backstage who knows when you've just gone too far. You'll fry the stage wiring if you keep up like that.
So, your application is singing at the top of its lungs, and each note is a service call. But then suddenly, the guitar amp (your database) needs a rest. Rather than letting your app cry out for more solos, the Circuit Breaker, like a cool roadie, keeps an eye on things and says:
"Calm down there. The amp needs time to breathe."
It's like making sure your concert doesn't descend into a riotous frenzy, allowing the guitar amp (your company) to cool down.
Who wants to hear the tale of a concert fiasco?
In the curtain call, the Circuit Breaker is your backup plan for staying gigable even when glitches strike you out midair. So, you can rock on, but with a safety net.
Back to the tech buzzwords
In a nutshell, a circuit breaker is designed for synchronous applications because it can swiftly interrupt the sequence of operations when one of them fails. In synchronous settings, operations occur linearly, which lets the circuit breaker prevent cascading failures effectively.
In contrast, asynchronous applications, with concurrent and independent operations, often rely on different error-handling mechanisms better suited to their non-linear and non-blocking nature.
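To make the recap concrete, here is a minimal sketch of a classic synchronous circuit breaker. The class name, the thresholds, and the injectable `clock` are all assumptions made for illustration; this is not taken from any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        # Once the cool-down has elapsed, allow a single trial call.
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; skipping call")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-trip if the
            # half-open trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The caller wraps each outbound request in `breaker.call(...)` and treats `CircuitOpenError` as "do not even try right now". That immediate, in-line rejection is exactly the part that has no obvious equivalent when the work arrives through a queue or a stream.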
Is it possible to accomplish an asynchronous circuit breaker?
I've been brainstorming and coming up with some out-of-the-box ways to achieve our goal.
I'm excited to share these two approaches with you!
1 - A self-healing Lambda function that adapts its throughput based on performance
This unbeatable article walks through a way of handling third parties using Kinesis and a ventilator Lambda function. Below is a small piece of this treasure, written by the marvellous Serverless Hero Yan Cui on the OG BurningMonk Blog.
The ventilator function can, therefore, self-adjust its batch size by updating the Kinesis event source mapping.
As the provider API’s response time goes up, the ventilator can respond by reducing its batch size. This gives the API a chance to catch its breath and recover. When the response time returns to acceptable levels, then the ventilator can gradually increase its batch size back to previous levels
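Here is a rough sketch of what that self-adjustment could look like. It is not Yan Cui's implementation: the latency threshold, the batch limits, and the "halve when slow, add 10 when healthy" rule are my own assumptions. The only AWS call used is `update_event_source_mapping` from the boto3 Lambda client, which is passed in as a parameter.

```python
from typing import Any

MIN_BATCH = 1
MAX_BATCH = 100             # hypothetical ceiling for this sketch
SLOW_THRESHOLD_MS = 1000.0  # hypothetical "the API is struggling" latency


def next_batch_size(current: int, avg_latency_ms: float) -> int:
    """Halve the batch when the API is slow, grow it gradually otherwise."""
    if avg_latency_ms > SLOW_THRESHOLD_MS:
        return max(MIN_BATCH, current // 2)
    return min(MAX_BATCH, current + 10)


def adjust_batch_size(
    lambda_client: Any, mapping_uuid: str, current: int, avg_latency_ms: float
) -> int:
    """Apply the new batch size to the Kinesis event source mapping.

    `lambda_client` is expected to be boto3.client("lambda");
    `mapping_uuid` is the UUID of the ventilator's event source mapping.
    """
    new_size = next_batch_size(current, avg_latency_ms)
    if new_size != current:
        lambda_client.update_event_source_mapping(
            UUID=mapping_uuid, BatchSize=new_size
        )
    return new_size
```

Backing off quickly and recovering slowly is a deliberate choice in this sketch: it gives the struggling API room fast, and avoids hammering it again the moment it shows signs of life.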
2 - Take a break for your downstream 😅
I had a quick catch-up with a 🇵🇱 Polish Wizard, a.k.a. Łukasz Kiedrowski 🧙‍♂️, who shared some wisdom and sent me a quote that basically blew my mind. This quote, written by him, clarified what I had been wondering about: a way to achieve the desired async circuit breaker using basically only a DLQ.
In EDA architectures DLQ (maxReceiveCount + dynamic visibility timeout) will act as a circuit breaker in Synchronous architectures
This approach will protect you from the third party's failures, but it won't actually act as a "breaker": it reduces the invocations instead of stopping them entirely.
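A minimal sketch of the "dynamic visibility timeout" part, assuming an SQS-triggered Lambda with partial batch responses (`ReportBatchItemFailures`) enabled and a redrive policy with `maxReceiveCount` already configured on the queue. The base delay, the function names, and the `call_third_party` callback are hypothetical; the SQS call is `change_message_visibility` from boto3.

```python
from typing import Any, Callable, Dict, List

BASE_DELAY_S = 30         # hypothetical first retry delay
MAX_VISIBILITY_S = 43200  # SQS upper bound for a visibility timeout (12 hours)


def backoff_seconds(receive_count: int) -> int:
    """Exponential backoff: 30s, 60s, 120s, ... capped at the SQS maximum."""
    return min(MAX_VISIBILITY_S, BASE_DELAY_S * 2 ** max(0, receive_count - 1))


def process_batch(
    event: Dict[str, Any],
    sqs_client: Any,
    queue_url: str,
    call_third_party: Callable[[str], None],
) -> Dict[str, List[Dict[str, str]]]:
    """Process an SQS batch; push failed messages further into the future.

    `sqs_client` is expected to be boto3.client("sqs").
    """
    failures: List[Dict[str, str]] = []
    for record in event["Records"]:
        try:
            call_third_party(record["body"])
        except Exception:
            count = int(record["attributes"]["ApproximateReceiveCount"])
            # The more often a message has failed, the longer it stays hidden.
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record["receiptHandle"],
                VisibilityTimeout=backoff_seconds(count),
            )
            failures.append({"itemIdentifier": record["messageId"]})
    # Only the failed messages go back to the queue.
    return {"batchItemFailures": failures}
```

While the third party is down, each retry waits longer than the previous one, so the call rate drops on its own. Once a message exceeds `maxReceiveCount`, SQS moves it to the DLQ.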
This post presents some alternatives for implementing a circuit breaker within an asynchronous serverless environment.
Please let me know if you would like to see an implementation of this.
Of course, I would highly appreciate constructive discussion. If you don't agree with anything mentioned here, please feel free to leave your comments below.
Cheers!
Top comments (7)
Yes, implementing asynchronous circuit breakers in a serverless architecture is feasible. Serverless computing relies on event-driven functions, and asynchronous circuit breakers can be designed to monitor and manage such events. By leveraging serverless features like AWS Lambda or Azure Functions, developers can implement circuit-breaking patterns to handle asynchronous communication and prevent system failures. This approach enhances fault tolerance by temporarily isolating faulty services, allowing them to recover independently. Asynchronous circuit breakers in serverless architectures contribute to robust, scalable, and resilient systems, ensuring smooth operation even in the face of unpredictable events or failures.
Lots of ways to skin the cat here but ultimately you need some flag controlling the status of the third party service, updated by your Lambda. From there you could use either AWS Systems Manager or AWS Config to control concurrency/batch size of the Lambda function, reacting to changes in the flag. Think of it as a CloudWatch Alarm or a Feature Flag.
That's only half the battle of course, as it's still letting messages through. So configure SQS to push to a DLQ, but also ensure the DLQ is set to redrive those messages back to the source queue, with timings that make sense for the process. Then it's completely automatic and you don't need to intervene or worry about it, other than to consider whether you need FIFO or not, what sort of throughput you need, and the resulting backlog that will cause.
If your app is truly EDA, then it should also respect idempotency to prevent inadvertent duplication. And back to the earlier point on CloudWatch Alarms: if your backlog ends up high, you can increase batch size and concurrency based on queue depth so you're not waiting ages for the processes to catch up following the outage.
Hope that helps!
I was wondering if it's possible to throttle the first lambda when a CloudWatch alarm is triggered
Kind of. Two options...
In your Lambda, your external service has failed, your batch is now 1, you get the next message, and it fails. Before dropping to the DLQ you check that feature flag, and if it's in an error state you throw in a 30 second sleep. This will really slow down batch retrieval, but it will also increase Lambda runtime and cost. Make sure concurrency is 1 or you can get 100 functions sitting for 30 seconds!!
You don't want to lose messages, which is why you have SQS, but you might not want to lose the FIFO ordering either, which DLQ usage will break. In that case you can "turn off" SQS as an event source temporarily. It doesn't turn off SQS: you'll still get messages from your publisher, but Lambda won't trigger. In this scenario, you would then need the feature flag to trigger another function that checks the response of your external service regularly, and when it's working again it updates the feature flag and enables SQS as the event source. This is useful when that external service can be down for hours at a time. I had this with a vendor and didn't want to just cycle through the queue over and over, so this is the "proper" circuit breaker approach, in that the circuit is disabled and nothing passes through it until you turn it back on again.
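For readers who want to see the second option in code, here is a minimal sketch. It assumes the feature flag lives in some key-value store (a plain dict stands in for it here), that `lambda_client` is `boto3.client("lambda")`, and that `is_healthy` is a hypothetical probe of the external service. The function names are made up for illustration; the AWS call is `update_event_source_mapping` with its `Enabled` parameter.

```python
from typing import Any, Callable, MutableMapping

FLAG_KEY = "third_party_status"  # hypothetical flag name


def open_circuit(
    lambda_client: Any, mapping_uuid: str, flags: MutableMapping[str, str]
) -> None:
    """Stop Lambda from polling the queue; messages keep piling up in SQS."""
    flags[FLAG_KEY] = "error"
    lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=False)


def probe_and_close(
    lambda_client: Any,
    mapping_uuid: str,
    flags: MutableMapping[str, str],
    is_healthy: Callable[[], bool],
) -> bool:
    """Run on a schedule while the circuit is open.

    Re-enables the event source only once the external service responds.
    Returns True if the circuit was closed by this call.
    """
    if flags.get(FLAG_KEY) != "error":
        return False  # circuit is not open, nothing to do
    if not is_healthy():
        return False  # still down, keep the circuit open
    flags[FLAG_KEY] = "ok"
    lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=True)
    return True
```

The consumer Lambda would call `open_circuit` once it decides the vendor is down, and a scheduled function would call `probe_and_close` until it returns `True`.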
Wow, man! Thanks a lot for sharing. This could help clarify everything for everyone who drops by. Hehe.
You may want to check out "Circuit Breaker Solution for AWS Lambda Functions" medium.com/@ch.gerkens/circuit-bre...
Such an amazing approach, thanks for sharing it!