DEV Community

Why We Moved From Lambda to ECS

Taylor Reece on April 21, 2021

After many months of development, my team just announced the general availability of our platform. That milestone seems like a perfect opportunity ...


Libert S • Apr 21 '21

Thanks for the insights. I wasn't aware of the process isolation issue; as far as I knew, every new request starts a new Lambda environment in a new container.

Taylor Reece • Apr 21 '21

That's what surprised us, too - we thought every Lambda got a new environment in a new container. It turns out if you invoke a Lambda that you haven't in a while, it "cold starts", so you get a new environment. Then, that Lambda sits around "warm" waiting for more invocations. That same environment might be used several times before it gets removed from the pool.

That's usually fine, since Lambdas tend to be stateless for most use cases, but in our case state could potentially be mucked with by a user's custom code that we execute.

K • Apr 22 '21

Did you try a custom runtime?

Taylor Reece • Apr 22 '21

We didn't. What has your experience been with custom runtimes in Lambda?

K • Apr 22 '21

I don't have much, but as far as I understand it, a custom runtime is basically an HTTP API that passes event data to a "function", whatever that may be.

I'd guessed that you could have used a customer ID in the event data and have the custom runtime spin up isolated "functions" for every customer.

Lou (🚀 Open Up The Cloud ☁️) • Apr 28 '21

I wrote a thread about this a while back:

twitter.com/loujaybee/status/13463...

Ambar • Apr 21 '21

"To get around the SQS size limit issues we faced, we swapped in a Redis-backed queuing service."

Interesting. Wouldn't this Redis queue have solved this particular issue within Lambda-land too? Or does it only work for ECS?

Taylor Reece • Apr 21 '21

Great question - the Redis queue would have solved the size issue in Lambda, for sure. Is there a great way to invoke the next Lambda in line using a Redis queue? Looking through StackOverflow, it seems like people suggest leveraging SQS to chain Lambdas in series and pass data between them, but there may be better ways I'm not aware of that leverage a Redis queue.

Ambar • Apr 21 '21 • Edited

I've heard good things about RSMQ (simple, fast queue abstractions on top of plain old redis). Their TLDR is:

"If you run a Redis server and currently use Amazon SQS or a similar message queue you might as well use this fast little replacement. Using a shared Redis server multiple Node.js processes can send / receive messages."

Sounds like it could be a good match for your requirements. Perhaps you could set up an AWS Step Functions state machine to trigger the appropriate next Lambda (with the RSMQ queue ID and message ID in the payload to the subsequent Lambda function). We use AWS Step Functions to great success for similar asynchronous Lambda processing in a state machine.
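A rough sketch of that idea, passing a small reference between steps instead of the full message. This is an illustration under assumptions, not RSMQ's API: a plain in-memory object stands in for the Redis-backed queue, and putMessage, takeMessage, and the MessageRef shape are hypothetical names.

```typescript
// Stand-in for a Redis-backed queue; a real setup would talk to Redis instead.
const store: Record<string, string> = {};
let nextId = 0;

// The small pointer that travels in the payload between steps (hypothetical shape).
interface MessageRef {
  queueName: string;
  messageId: string;
}

// Park the large body in the store and return only a reference to it.
function putMessage(queueName: string, body: string): MessageRef {
  nextId += 1;
  const messageId = String(nextId);
  store[queueName + ":" + messageId] = body;
  return { queueName: queueName, messageId: messageId };
}

// The next step resolves the reference back into the full body and removes it.
function takeMessage(ref: MessageRef): string | undefined {
  const key = ref.queueName + ":" + ref.messageId;
  const body = store[key];
  delete store[key];
  return body;
}

// A body far larger than the reference that points at it.
const largeBody = new Array(1000001).join("x");
const ref = putMessage("steps", largeBody);

// Only this small JSON string would need to pass between functions.
const payloadForNextStep = JSON.stringify(ref);
const received = takeMessage(JSON.parse(payloadForNextStep) as MessageRef);
```

The payload handed to the next function stays a few dozen bytes no matter how large the message body is, which is what would sidestep the size limit.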

Rolf Streefkerk • Apr 23 '21

With the addition of EFS for Lambda, a lot of your problems can be solved within Lambda.
Latency for network disk operations should be much lower with EFS when you provision throughput and configure it for high IOPS.

Process isolation can be an issue if you have code that executes outside the handler function: whatever it sets up remains until the Lambda container is thrown away. If you require that isolation, you need to cut that code down and keep it inside your execution handler.
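To illustrate the difference, here is a minimal sketch that simulates one cold start followed by two warm invocations in a single process. It does not use the Lambda API; the handler signature and counter names are hypothetical.

```typescript
// Module scope: initialized once per cold start, then shared by every
// invocation that lands on the same warm environment.
let moduleScopeCount = 0;

// Hypothetical handler; a real Lambda handler would receive an event and context.
function handler(): { moduleScope: number; handlerScope: number } {
  // Handler scope: recreated on every invocation, so nothing carries over.
  let handlerScopeCount = 0;

  moduleScopeCount += 1;
  handlerScopeCount += 1;
  return { moduleScope: moduleScopeCount, handlerScope: handlerScopeCount };
}

// Simulate a cold start followed by two warm invocations in the same process.
const results = [handler(), handler(), handler()];
```

The module-scope counter keeps climbing across calls, while the handler-scope counter starts fresh each time.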

Taylor Reece • Apr 23 '21 • Edited

That's a good point. EFS in Lambda is exciting.

WRT the process isolation thing, try running a test of this code in Lambda twice. The first time, you get a nice logged "Hello, world!". The second time you run it, console.log has been redefined and you get a less desirable "Your message has been hijacked".

gist.github.com/taylorreece/70ed16...
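For anyone who wants to see the effect without deploying anything, here is a minimal local sketch of the behavior described above. It is not the contents of the linked gist: it simulates two invocations in one process, and the handler and the recording helper are hypothetical.

```typescript
const logged: string[] = [];
const originalLog = console.log;

// Record output in an array so the effect is easy to inspect afterwards.
const record = (...args: unknown[]): void => {
  logged.push(args.map(String).join(" "));
};
console.log = record;

// Hypothetical handler that also runs some "untrusted" code.
function handler(): void {
  console.log("Hello, world!");
  // Simulated untrusted code: replace the global console.log. The process
  // outlives the invocation, so the replacement does too.
  console.log = (): void => {
    record("Your message has been hijacked");
  };
}

handler(); // first invocation
handler(); // second invocation, same process

console.log = originalLog; // restore the real logger
```

After both calls, the recorded output holds the greeting from the first invocation and the hijacked message from the second.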

Matt Morgan • Apr 23 '21

There aren't a lot of languages or runtimes where you'd want to allow end users to hack the global scope. You can certainly use Lambda safely with respect to process isolation by not creating globals and by creating and setting any runtime variables inside your handler. Moving to ECS won't solve your problem. Polite suggestion: don't allow your customers to attach things to the global scope. Node.js has a vm module for isolating code, or you can just regex the code.

Taylor Reece • Apr 23 '21

Hey Matt, thanks for linking the vm module - it's good to know about. It seems like that should work, though the docs note:

"The vm module enables compiling and running code within V8 Virtual Machine contexts. The vm module is not a security mechanism. Do not use it to run untrusted code."

For our use case, where our platform runs customers' code which could contain anything, we've had to be a bit more heavy-handed with isolating our runtime environments. We ended up creating chroot jails and distinct Node processes within our ECS containers to run our customers' code, so each run is guaranteed to not interact with any other.

Matt Morgan • Apr 23 '21

That makes sense, and it's obvious that your business puts you in a position to do something that most apps would not want to do (execute untrusted end-user code). My comment was really in response to your gist above. The behavior of globals in Lambda is well documented and predictable. This didn't fit your rather unusual use case, but for most users, a quick read of the docs will arm them with what they need to understand process isolation in Lambda.

Scott Simontis • Apr 27 '21

I think you did a great job of highlighting what I would say is the ideal path for AWS development nowadays. Start with Lambda, move to ECS if Lambda doesn't fit your needs, and only consider EC2-based applications as a last resort or as part of a lift-and-shift migration strategy.

Taylor Reece • Apr 27 '21

Thanks, Scott! Lambda sure makes it easy to iterate quickly, without needing to sink hours into DevOps tasks, but for some use cases it definitely does make sense to invest the time needed to run on ECS/EC2/etc.

Omri Gabay • Apr 21 '21

Great read. Would AWS Step Functions have worked for your business use cases?

Taylor Reece • Apr 21 '21

Hey Omri - thanks, and great question! I'll start with a disclaimer that I'm no expert in AWS Step Functions, so correct me if I get anything wrong :-)

Similar to Step Functions, Prismatic's platform allows you to build workflows (we call them integrations) where data flows through a series of steps. The integration's steps do a variety of things, like interact with third party APIs, mutate data, manage files, etc. Users can create branches and loops that run based on some conditionals, etc. - all of that is pretty similar to what you'd see in a Step Function state machine definition.

Our platform differs from Step Functions in a number of ways, too. Most notably, our users (typically B2B software companies) can develop one integration and deploy instances of the integration to their own customers. Those instances are configurable, and can be driven by customer-specific configuration variables and credentials, so one integration can handle a variety of customer-specific setups. We also offer tooling - things like logging, monitoring and alerting - so our users can easily track instance executions and can configure if/how they are notified if something goes wrong in their integrations.

It might have been possible to create some sort of shim that dynamically converted a Prismatic integration to a Step Functions state machine definition - I'd need to look more into Step Functions to figure out what difficulties we'd have there. The biggest thing keeping us from doing something like that is probably vendor lock-in. We have customers who would like to run Prismatic on other cloud platforms (or on their on-premises stacks), and implementing our integration runner as a container gives us more flexibility to migrate cloud providers as needed.

Omri Gabay • Apr 21 '21

Good looking out on the vendor lock-in issue. And I was under the impression that Prismatic was only being sold as a cloud solution, but self-hosted options are great as well. Thanks for responding!

Kimmo Sääskilahti • Apr 23 '21

That's a very interesting read, thanks a lot!

Are you using Bull as queue library over Redis?

My experience with SQS is that it's great for decoupling services and queueing jobs when you don't have an end user waiting for the job to complete. If you need a near-real-time user experience, I'd similarly go for e.g. Redis, RabbitMQ or even Kafka. Does that sound reasonable?

Taylor Reece • Apr 23 '21

Yep! We're big fans of Bull :-)

Luítame de Oliveira • Apr 25 '21

It gave me a lot of insights. Thanks for sharing.