Zero Downtime Deployment is Solved, Data Migration Isn't

The anatomy of downtime

We’ve all been there - it’s starting to get late and it’s one of those days where things just seem to be taking forever. Continuous delivery has pushed a new code version out, but error signals are slowly starting to accumulate. At first you dismiss them, but then you look closely and release a shout heard by every engineer on your team: “F%$#ing s@#t - who renamed a database column without telling anyone?” Steve, one of the junior devs, who up until a second ago was so excited about his code going to production for the first time, answers: “It was me, but I changed all the references so there shouldn’t be any problem.” You sigh and facepalm, yet thank Eric, the guy who finally convinced you to fix rollback last week. Just another story of averted downtime.

We’re engineers. Downtime is an admission of failure. “What, you couldn’t change data centers, migrate all the data, and deploy a re-architected version of your system with no downtime??” (For the record, we have done that in the past.) And with web apps serving users across multiple time zones, maintenance windows are no longer a valid option.

Why am I telling you this? Well, because it’s time we had patterns and tools that allow us to avoid downtime. We’re definitely moving in the right direction, for example…

Zero downtime deployment is a solved problem

Yes, you heard me - ZERO DOWNTIME DEPLOYMENT IS A SOLVED PROBLEM. It goes by different names and is handled at different scales, but almost everybody’s using some variant of the blue green deployment pattern.

In short, blue green deployment works by creating a new copy of the environment that has your most recent release (the blue one), while traffic is still handled by the existing copy with the previous release (the green one). When the blue copy is ready, it starts accepting new requests, while the green copy is still handling requests that were already initiated. Once the green copy has finished processing its existing requests (sometimes called “bled out”), you can either take it down, or keep it for rollback purposes.
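To make the switch-and-drain idea concrete, here is a minimal sketch in Python. It is a toy model, not how any real load balancer is implemented: `Environment`, `Router`, and all of their methods are hypothetical names I made up for illustration, and "requests" are just a counter.

```python
class Environment:
    """One full copy of the app (hypothetical stand-in for a server pool)."""

    def __init__(self, name, release):
        self.name = name
        self.release = release
        self.in_flight = 0  # requests started but not yet finished

    def start_request(self):
        self.in_flight += 1

    def finish_request(self):
        self.in_flight -= 1


class Router:
    """Sends every new request to whichever environment is currently live."""

    def __init__(self, live):
        self.live = live
        self.draining = None

    def switch_to(self, new_env):
        # New requests go to new_env; the old copy only finishes what it has.
        self.draining, self.live = self.live, new_env

    def start_request(self):
        self.live.start_request()
        return self.live  # the caller finishes the request on this copy

    def can_retire_old(self):
        # The old copy has "bled out" once nothing is in flight on it.
        return self.draining is not None and self.draining.in_flight == 0


green = Environment("green", "previous release")
blue = Environment("blue", "new release")
router = Router(live=green)

old_request = router.start_request()  # lands on green
router.switch_to(blue)
new_request = router.start_request()  # lands on blue

print(router.can_retire_old())  # False: green still has a request in flight
old_request.finish_request()
print(router.can_retire_old())  # True: green has bled out
```

The point of the sketch is that the switch itself never kills a request: it only changes where new requests go, and retiring the old copy is a separate decision made after it is idle.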

What blue green means for us at Handybook is that you don’t hear people say “ok, new code is up, time to restart the server” - a restart that would mean a tradeoff between the number of in-flight requests you’re going to kill and the number of new requests that are going to be queued up or not handled. To do blue green effectively we need to be in full control of our hosting environment - another reason why PaaS is not a solution for systems at scale. (Heroku does have a blue green feature, but it is not generally available at the time of writing.)

You might be creating a whole new environment and then flipping the switch. Of course, you could argue that flipping the DNS switch is far different from flipping the load balancer switch, and you might be gradually pushing more traffic rather than flipping the switch at once, but at the end of the day, if you’re deploying without downtime, you’re on the blue green train.

“But we’re not doing any of that!” you say. “We’re all Ruby, and we just send our Unicorn masters a signal to restart.” Well guess what - you’re on the blue green train. Unicorn starts a new master with a bunch of workers, while letting the existing workers handle the requests that are already in process.

So deploying with zero downtime is a solved problem, but our story doesn’t end there. In the next part of this series, I’ll talk about the data aspects of zero downtime migrations, and why NoSQL is just a partial solution.

Zero downtime data/schema migrations are not

Deployment is rarely just about code. Data schemas change, the semantics of data change, and your worry-free, downtime-less setup just went down the drain.

There’s a good chance that dealing with data changes is the most complicated aspect of your ops setup, and timing those changes with code changes adds another level of complexity.

Let’s say, for example, you’re removing a column. If you want to avoid errors being thrown at you faster than an Aroldis Chapman fastball, you need to make sure your deployed code is not using this column anywhere. Since we’ve established you’re on some kind of a blue green setup, that means you have to wait for the greens if you don’t want to get the blues: the column can only go once no copy running the previous release is still serving requests.
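One way to picture the "wait for the greens" rule is as a check you run before dropping the column. The sketch below is purely illustrative: `releases_blocking_drop` and the `live` mapping are hypothetical, and in a real system you would have to work out which columns each release touches yourself.

```python
def releases_blocking_drop(column, live_releases):
    """Return the live releases whose code still references `column`.

    `live_releases` maps a release name to the set of columns its code reads
    or writes (made-up data for this example).
    """
    return sorted(name for name, cols in live_releases.items() if column in cols)


live = {
    "green (previous release)": {"id", "email", "nickname"},
    "blue (new release)": {"id", "email"},
}

# While green is still draining, dropping "nickname" would break it.
print(releases_blocking_drop("nickname", live))
# ['green (previous release)']

# Once green has bled out and been retired, nothing blocks the drop.
del live["green (previous release)"]
print(releases_blocking_drop("nickname", live))
# []
```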

On the other hand, let’s say you’re adding a column - you now have the reverse problem. Your data migration must run before any new code is deployed.
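Here is a small sketch of why the order flips when you add a column. It uses Python's built-in `sqlite3` module with an in-memory database; the `users` table, the `nickname` column, and the `old_code` / `new_code` / `works` helpers are all invented for the example and stand in for your two releases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")


def old_code(db):
    # Previous release: only knows about the original columns.
    return db.execute("SELECT id, email FROM users").fetchall()


def new_code(db):
    # New release: expects the added column to exist.
    return db.execute("SELECT id, email, nickname FROM users").fetchall()


def works(query, db):
    try:
        query(db)
        return True
    except sqlite3.OperationalError:  # e.g. "no such column: nickname"
        return False


# (old release works?, new release works?) before and after the migration
before = (works(old_code, conn), works(new_code, conn))
conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")  # the migration
after = (works(old_code, conn), works(new_code, conn))

print(before)  # (True, False)
print(after)   # (True, True)
```

The old release is happy on both sides of the migration because it names its columns explicitly, so running the migration first is the only order in which both releases work at every step.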

But, wait - what about those schema-less NoSQL databases we’ve been hearing about? In my next post, I’ll explain why NoSQL is just a partial solution.