DEV Community

How do you update backend web services without downtime?

Meghan (she/her) on May 31, 2018

This is a bit of a r/NoStupidQuestions kind of post. I've made a handful of projects in the past that use PHP and as someone who has recently fall...

Chris James • May 31 '18 • Edited

Going to describe it as simply as I can.

Generally for availability users of your web service will hit it via a load balancer (like nginx) which then routes requests to a pool of instances of your application.

This means if one instance falls over, you have availability because the load balancer will just send the traffic to the instances that are working.

For that reason, when you deploy you can take down one instance, upgrade, then another, etc until they are all updated. During this you will always have at least one running.

In addition this lets your application "horizontally scale" very easily as it's trivial to add more instances of your application behind the load balancer.

There are more involved ways of doing this, but they all mainly work on the premise of some kind of load balancer / router managing traffic so that there is always a running app available.
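A minimal sketch of the one-at-a-time update described above, assuming a pool modelled as plain Python objects. The `Instance` class, the `in_rotation` flag, and the instance names are hypothetical stand-ins for whatever your load balancer and deploy tooling actually expose.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    in_rotation: bool = True   # True while the load balancer routes traffic here


def rolling_update(pool, new_version):
    """Upgrade instances one at a time.

    Returns the smallest number of instances that were serving traffic
    at any point during the rollout.
    """
    min_serving = len(pool)
    for instance in pool:
        # Count the instances that would keep serving while this one is out.
        others = sum(1 for i in pool if i.in_rotation and i is not instance)
        if others == 0:
            raise RuntimeError("refusing to take down the last serving instance")
        instance.in_rotation = False       # stop routing traffic to it
        min_serving = min(min_serving, others)
        instance.version = new_version     # stand-in for the real upgrade step
        instance.in_rotation = True        # put it back into the pool
    return min_serving


pool = [Instance("app-1", "v1"), Instance("app-2", "v1"), Instance("app-3", "v1")]
lowest = rolling_update(pool, "v2")
print(lowest, [i.version for i in pool])
```

The guard makes the constraint explicit: with a single instance there is nothing left to serve traffic during the upgrade, so a pool of one cannot be updated this way without downtime.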

Albrecht Scheidig • Jun 1 '18

What would you recommend, if those instances share a database, and the database schema depends on the version of the application?
Meaning: when updating the first instance, it updates the database schema, making it incompatible with the running, outdated instances (so they tend to run into errors).

Is your recommendation only applicable to architectures without a central database?

Timur Zurbaev • Jun 1 '18

If you need to release some critical DB schema updates, try to roll updates in several parts. For example, if you need to move a column from one table to another (drop column in first table, create column in second table), consider this scenario:

Add column to the second table & update your code to read/write from new column;
Release first part - now all of your production instances are not touching old column at all;
Remove column from the first table and deploy changes - no matter how many instances you're running, they won't produce errors.
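The two-release sequence above could look roughly like this with an in-memory SQLite database. The table and column names (`users`, `profiles`, `phone`) are made up for illustration, and the backfill `UPDATE` is an added assumption: existing values have to be copied somewhere before the old column goes away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, bio TEXT);
    INSERT INTO users VALUES (1, 'ada', '555-0100');
    INSERT INTO profiles VALUES (1, 'hello');
""")

# Release 1: add the new column as nullable, so instances still running
# the old code can keep inserting rows, then copy the existing values over.
conn.execute("ALTER TABLE profiles ADD COLUMN phone TEXT")
conn.execute("""
    UPDATE profiles
    SET phone = (SELECT phone FROM users WHERE users.id = profiles.user_id)
""")
conn.commit()


def read_phone(user_id):
    """New code path: only touches the new column."""
    row = conn.execute(
        "SELECT phone FROM profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


# Release 2: once no running instance reads or writes users.phone, remove it.
# A table rebuild is used here so the sketch also runs on SQLite versions
# that lack ALTER TABLE ... DROP COLUMN.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users_new SELECT id, name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")

user_columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(read_phone(1), user_columns)
```

The important property is the ordering: the destructive change only ships after every instance has stopped depending on the old column.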

Pert Soomann • Jun 1 '18

Actually very good point.

Code updates are usually trivial: with PHP it could be just pulling changes from a Git repo; with Node you probably have to rebuild on each instance?

But with DBs, once you get a decent amount of data in the tables, changing the table definition could take a very long time to rebuild.

There are a few ways to work around this. Like Timur explained, you could implement a backwards-compatible approach (i.e. the new column defaults to NULL, so old code can still insert new entries into the DB without falling over).

Another option is a graceful maintenance mode, which is what we use at my current place. During an update, real users see a maintenance screen instead of half-updated code, and we don't have to worry about legacy and new code running concurrently depending on which instance a user ends up on.

I know it's technically "downtime", but when it's built into the project from the ground up, it's much easier than trying to achieve the same thing with networking and re-pointing servers etc., and it's not a bad user experience, IMHO.

Albrecht Scheidig • Jun 1 '18

We do "maintenance page"-like updates here, too, but I dream of having smart updates without downtime / maintenance page. And as things turn out, this is not possible in my scenario: shared DB, lots of schema changes in every new release.
Timur's approach is interesting, but seems to add a lot of complexity and testing effort.

Pert Soomann • Jun 1 '18 • Edited

I think it's OK to find a reasonable solution that doesn't annoy your userbase or break your dev-team, even if it's not dream no-noticeable-downtime :)

Abel Wang • Jun 1 '18

Another thing you can do to have no downtime with your DB is to version your schemas somehow and store that version in the DB. Then, in your code, wrap new features in feature flags. Part of the flag logic is whether the switch is on or off, and part of it is which version the DB is at. Based on those values, your code routes to the new or old code path, hitting new or old SQL calls. It does add complexity and debt, as you are now using feature flags and will need to religiously clean them up or things can spiral out of control. However, that's a small price to pay for the benefits you get: the speed at which you can deploy new changes, not worrying about what order to deploy your microservices and DBs in, and the ease of rolling features in and out. And zero downtime. Another added benefit of using flags is that you can now do trunk-based development, which simplifies things tremendously compared to complex branching schemes, but that's a whole other topic.
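A rough sketch of that flag logic, assuming the schema version lives in a one-row `schema_version` table. The flag name `display_name`, the `REQUIRED_SCHEMA` mapping, and the table layout are all hypothetical.

```python
import sqlite3

# Hypothetical mapping: flag name -> minimum schema version it needs.
REQUIRED_SCHEMA = {"display_name": 2}


def schema_version(conn):
    return conn.execute("SELECT version FROM schema_version").fetchone()[0]


def flag_enabled(name, switches, conn):
    """A flag is active only if it is switched on AND the schema is new enough."""
    return switches.get(name, False) and schema_version(conn) >= REQUIRED_SCHEMA[name]


def greeting(conn, switches, user_id):
    if flag_enabled("display_name", switches, conn):
        sql = "SELECT display_name FROM users WHERE id = ?"   # new SQL, new column
    else:
        sql = "SELECT name FROM users WHERE id = ?"           # old SQL
    return "Hello, " + conn.execute(sql, (user_id,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schema_version (version INTEGER NOT NULL);
    INSERT INTO schema_version VALUES (1);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users VALUES (1, 'ada');
""")

switches = {"display_name": True}
before = greeting(conn, switches, 1)   # switch is on, but schema is still v1

# The migration runs independently of the code deploy.
conn.executescript("""
    ALTER TABLE users ADD COLUMN display_name TEXT;
    UPDATE users SET display_name = 'Ada L.';
    UPDATE schema_version SET version = 2;
""")
after = greeting(conn, switches, 1)    # same code, now takes the new path
print(before, "|", after)
```

Because the code checks the schema version at run time, the code deploy and the migration can happen in either order without the new query ever hitting a column that does not exist yet.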

Adam Bullmer • Jun 2 '18

If you're asking these questions, it may mean that you are going down a path that you might not need to. I understand the desire for perfectly maintained and groomed DB schemas. But at the end of the day, you've got to take into account business requirements while solving the problems in software.

Does the business have an SLA on downtime? No? Great, throw up a maintenance page, run a migration, and move on with your life.

Yes? Maybe think about how you can migrate data and deploy new features in 2 or more deploys/migrations.

Do you have slow traffic hours? Maybe it is acceptable to have downtime at 3am, when your usage is low to none.

The downtime plans have the advantage of not requiring you to engineer a flawless plan, which takes time, so they free you to continue developing. But they are only viable if your application wouldn't suffer during any sort of outage.

Lastly, the option I don't hear people advocating for, is taking on technical debt. Mind you, debt is bad and I would recommend seeking the above alternatives first. In a pinch this is a solution.

Tech debt is the process of making an informed concession on architecture/design/schema in the interest of time, where all parties in the business agree that you will be given time later to do it the right way. It's called debt because it is almost always more total work to do it later than now, but the shortcut is less work now than the right way.

So for you, you might just deal with this in your software: a column that isn't mapped to the right table, or a new table, or whatever outage-free solution you come up with. Then later run a migration and code deploy. And things could have changed by then to your benefit, like more coders, different SLAs, different patterns, different technology, or even a rewrite. There is no hard and fast rule on how you deploy, only that some solutions fit your particular case better.

Albrecht Scheidig • Jun 2 '18

Adam, thanks for taking the time.
We sell a product. We have hundreds of installations worldwide and support updates from 15 year old versions as well as 15 h old versions. So the tech debt approach is not feasible because it does not solve the problem fundamentally.
"Updates without downtimes" would definitely add value we could sell or use as USPs. Or fulfill stricter SLAs and what not. But there seems to be no silver bullet to add this into our existing legacy code base with a centralized database.

Adam Bullmer • Jun 3 '18

Yikes, 15 year old software? That sounds rough. If the service can't be interrupted, I've seen the staged release work well:

1. Forwards-compatible DB alter
2. Ship code to read/write from both places (gross, but it's going away)
3. Any data migration necessary (no alters allowed here)
4. Update code to stop looking in the old place
5. Confirm everything is working as planned, then blank out the old data or run alters to clean up your schema.

Step 2 is the hardest because there's a lot of options of how you can handle this. I've personally had my software write to 2 different places, ran a migration to move the old data to the new place, and then updated the service again to only read/write from the new place.
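One way the dual-write part of that staged release could be sketched, using two plain dicts as stand-ins for the old and new storage locations. The `DualStore` class and `backfill` helper are hypothetical names, not anything from a particular framework.

```python
class DualStore:
    """Reads and writes a value that is moving from `old` to `new`."""

    def __init__(self, old, new, use_old=True):
        self.old = old
        self.new = new
        self.use_old = use_old        # True while dual-writing, False afterwards

    def write(self, key, value):
        self.new[key] = value
        if self.use_old:
            self.old[key] = value     # keep the old place in sync for rollback

    def read(self, key):
        if key in self.new:
            return self.new[key]
        return self.old.get(key) if self.use_old else None


def backfill(old, new):
    """Data migration: copy rows that only exist in the old place."""
    for key, value in old.items():
        new.setdefault(key, value)    # never overwrite a newer dual-written value


old_place = {"alice": "a@example.com"}    # data written before the migration
new_place = {}                            # new location exists but is empty

store = DualStore(old_place, new_place)   # code that reads/writes both places
store.write("bob", "b@example.com")
before_backfill = store.read("alice")     # still served from the old place

backfill(old_place, new_place)            # data migration, no alters
store = DualStore(old_place, new_place, use_old=False)   # stop using the old place
old_place.clear()                         # clean up the old data
after_cleanup = store.read("alice")
print(before_backfill, after_cleanup, sorted(new_place))
```

Using `setdefault` in the backfill matters: anything written during the dual-write phase is already in the new place and should win over the older copy.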

Hopefully you're sourcing some good ideas from the community suggestions; these have all been good ideas if they solve your problem effectively.

I also can't stress enough the importance of DB backups, testing that your theory of deploys works at every step in a non production environment, and having your plan written down and reviewed by your peers. DB alters in prod are hard, and carry a lot of risk if something unexpected happens.

Meghan (she/her) • May 31 '18

Thanks for the quick helpful response! :D

Adrian B.G. • Jun 1 '18 • Edited

Hello, sorry that your first language is PHP. I was stuck in it for my first 5-6 yrs as a web developer so I can help you by doing a timeline (of the advancements done in the meantime):

A. Monolith age (1 server/VM)

1. One version. You connect through FTP and overwrite the source code. || With a small project, a small number of users and some prayers, you can achieve no downtime. Cons: around 1000 reasons, don't do it.
2. N versions. You create a new folder for every release, and nginx/apache points to a symlink. When you finish uploading the code you just switch the symlink to point to the new version. || You can do rollbacks and staging tests. The versions are immutable. See Capistrano.
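The symlink switch can be sketched as below, assuming a POSIX filesystem where renaming one symlink over another is atomic. The `releases` / `current` layout and the `activate_release` helper are hypothetical, loosely in the style of what Capistrano-like tools do.

```python
import os
import tempfile


def activate_release(releases_dir, version, current_link):
    """Point `current_link` at releases_dir/version in a single rename."""
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)               # clean up after an interrupted deploy
    os.symlink(target, tmp_link)
    # Renaming the new link over the old one means the web server sees either
    # the old release or the new one, never a missing path.
    os.replace(tmp_link, current_link)


root = tempfile.mkdtemp()
releases = os.path.join(root, "releases")
for v in ("v1", "v2"):
    os.makedirs(os.path.join(releases, v))
    with open(os.path.join(releases, v, "VERSION"), "w") as f:
        f.write(v)

current = os.path.join(root, "current")   # the path nginx/apache would serve
activate_release(releases, "v1", current)
activate_release(releases, "v2", current)
with open(os.path.join(current, "VERSION")) as f:
    live = f.read()

activate_release(releases, "v1", current)  # rollback is the same operation
rolled_back_to = os.path.basename(os.readlink(current))
print(live, rolled_back_to)
```

Deleting the old link and then creating a new one would leave a short window with no `current` at all, which is why the sketch builds a temporary link and renames it into place.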

B. Horizontally scaled (multiple servers/VMs)

From this one on we add a new layer of complexity (besides the local web server that listens for requests, we have a load balancer that captures the user requests and redirects them to the web servers). This allows us to have 0 downtime if the update is done correctly and the new version works.

3. You apply 1 (hope not) or 2, but on multiple machines at the same time.
4. Blue green deployment, LB and immutable: for each new release you create new servers, and you point the load balancer to the new version. First for only 10% of the traffic for 1 hour (random numbers). If everything is ok with the new version you put it to 50% and so on. You remove the old servers after a while.
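A small sketch of the gradual traffic shift, assuming the load balancer can route by a stable hash of some user identifier. The pool names `old` and `new` and the `choose_pool` helper are made up; real load balancers usually express this as backend weights instead.

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_pool(user_id, new_percent):
    """Send roughly `new_percent` percent of users to the new servers."""
    return "new" if bucket(user_id) < new_percent else "old"


users = ["user-%d" % n for n in range(1000)]

on_new_at_10 = {u for u in users if choose_pool(u, 10) == "new"}
on_new_at_50 = {u for u in users if choose_pool(u, 50) == "new"}
on_new_at_100 = {u for u in users if choose_pool(u, 100) == "new"}

print(len(on_new_at_10), len(on_new_at_50), len(on_new_at_100))
```

Hashing rather than picking at random keeps each user on the same side between requests, and raising the percentage only ever moves users from the old pool to the new one, never back.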

C. Containers

Instead of servers, you apply method 4 to containers (you can have multiple of these "mini virtual machines" on the same machine).

Servers -> VMs -> Containers -> and now cloud functions, read more about them and you will understand why and how.

PS: everything is oversimplified to make a point.
PS2: things get more complex when you update a relational database schema for the new version.

Gunnar Gissel • May 31 '18

In the Java world, the load balancer approach Chris James describes is probably the best.

For development work, you can hot reload your app server with a tool like spring-loaded or JRebel.

If you tried the second in prod, I expect you'd get a memory leak sooner or later. Maybe some kind of classpath weirdness.

With docker, what you get is a completely configured environment for your code to run. It's convenient because you can bundle environmental changes with code changes. Docker alone isn't going to handle zero downtime deploys. Here's an article that talks about zero downtime deploys with docker

Nancy Deschenes • Jun 1 '18

Typically, after you reload too many classes too many times, what you get is a java.lang.OutOfMemoryError: PermGen space error. That's why on development environments it is usually a good idea to boost the PermGen pool significantly. I run with -Xmx2048m -XX:MaxPermSize=1024m (PermGen is allocated in addition to the -Xmx heap, so make sure the machine has enough memory for PermGen AND for all the other things that the heap will use). Note that this applies to Java 7 and earlier; Java 8 replaced PermGen with Metaspace, which is capped with -XX:MaxMetaspaceSize instead.

I don't know if that ratio (heap/PermGen) is ideal, but it works.

Alan Barr • May 31 '18

Microservices and health check endpoints are pretty good at this. There are various strategies like blue/green deployments and others to enable using a load balancer to have high availability.
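A health check endpoint can be as small as the sketch below, which uses only the Python standard library. The `/healthz` path, the `healthy` class attribute, and the `is_healthy` helper are assumptions for illustration; a real load balancer would do the polling itself.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    healthy = True   # set to False while the instance is draining or upgrading

    def do_GET(self):
        if self.path == "/healthz":
            status = 200 if type(self).healthy else 503
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"ok" if status == 200 else b"unavailable")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass         # keep the example quiet


# Bypass any proxy configured in the environment for this local check.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(base_url, timeout=1.0):
    """What a load balancer's probe boils down to: 200 means keep routing."""
    try:
        with _opener.open(base_url + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


server = HTTPServer(("127.0.0.1", 0), HealthHandler)   # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d" % server.server_port

up = is_healthy(url)
HealthHandler.healthy = False      # instance announces it is going away
down = is_healthy(url)

server.shutdown()
server.server_close()
print(up, down)
```

Flipping the endpoint to 503 before stopping the process gives the load balancer a chance to stop sending traffic first, which is what makes the restart invisible to users.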