# Zero Downtime Deployments

Another important part of your application lifecycle is deployment time. There are lots of strategies for deploying software. As with anything, there are pros and cons to each, so I will run through a few options from least complex to most complex, and as you may imagine, the most complex deployment types tend to come with the highest guarantees of uptime and the least disruption to your customers.

You may be asking why it's important to consider how we deploy our applications, since the vast majority of an application's lifecycle is spent in the "running" state, and we could therefore focus our time on strategies that support our running application's resilience. My answer is: have you ever been on-call? Almost all incidents are caused by code releases or changes. The first thing I do when I'm on-call and paged to an incident is check what was recently deployed. I focus my main attention on that component, and more often than not it was to blame.

We also need to consider that some of these deployment strategies will require us to make specific code changes or application architecture decisions to allow us to support the deployment in question.

### Rolling Deployments

One of the simplest deployment strategies is a rolling deployment. This is where we slowly, one by one (or batch by batch, depending on how many instances of a service you have), replace old tasks with new ones. We can check that the new tasks are healthy before moving on to the next, so that only a few tasks are unhealthy at any one time.

This is the default deployment strategy in Kubernetes. It actually borrows some characteristics from surge deployments, which are coming next: it starts slightly more new tasks than strictly necessary and waits for them to be healthy before removing old ones.

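To make the mechanics concrete, here is a minimal Python sketch of the loop a rolling deployment follows. The orchestrator hooks (`start_instance`, `stop_instance`, `is_healthy`) are toy stand-ins for illustration, not a real API; in Kubernetes the equivalent knobs are `maxSurge` and `maxUnavailable` on a Deployment's rolling update strategy.

```python
import itertools
import time

counter = itertools.count()
running = {f"app-v1-{n}" for n in range(4)}  # toy fleet of old tasks

def start_instance(version):
    name = f"app-{version}-{next(counter)}"
    running.add(name)
    return name

def stop_instance(name):
    running.discard(name)

def is_healthy(name):
    return True  # toy check: real code would probe a health endpoint

def rolling_deploy(old_tasks, version, max_unavailable=1, poll_seconds=1):
    """Replace old tasks in batches of `max_unavailable`, waiting for each
    batch of replacements to pass health checks before stopping the tasks
    they replace, so only a few tasks are in flux at any one time."""
    for i in range(0, len(old_tasks), max_unavailable):
        batch = old_tasks[i:i + max_unavailable]
        replacements = [start_instance(version) for _ in batch]
        while not all(is_healthy(r) for r in replacements):
            time.sleep(poll_seconds)
        for name in batch:
            stop_instance(name)

rolling_deploy(sorted(running), "v2")
print(running)  # only v2 tasks remain
```
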
### Surge Deployments

Surge deployments are exactly what they sound like. We start a large number of new tasks before cutting traffic over to those tasks and then draining traffic from our old tasks. This is a good strategy for high-usage applications that may not cope well with any reduction in their availability. Usually surge deployments can be configured to run a certain percentage more tasks than the existing count, waiting for them to be healthy before doing the cutover.

The problem with surge deployments is that we need a large amount of spare compute capacity to spin up many new tasks before rolling over and removing the old ones. This can work well where you have very elastic compute, such as AWS Fargate, where you don't need to provision the extra capacity yourself.

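The same toy style as before shows how a surge differs from a rolling update: the extra capacity is started up front and traffic moves only once everything is healthy. The `shift_traffic` and `drain` hooks here are hypothetical stand-ins for whatever your load balancer actually provides.

```python
import math
import time

def surge_deploy(old_tasks, version, surge_percent,
                 start_task, is_healthy, shift_traffic, drain,
                 poll_seconds=1):
    """Start `surge_percent` of the old fleet's size in new tasks up front,
    wait for all of them to pass health checks, cut traffic over in one
    step, then drain the old tasks."""
    new_count = math.ceil(len(old_tasks) * surge_percent / 100)
    new_tasks = [start_task(version) for _ in range(new_count)]
    while not all(is_healthy(t) for t in new_tasks):
        time.sleep(poll_seconds)
    shift_traffic(new_tasks)  # cutover happens only once everything is up
    drain(old_tasks)          # old tasks finish their in-flight requests
```
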
### Blue/Green

The idea behind a Blue/Green deployment is that your entire stack (or application) is spun up and tested, and then, once you are happy, you change configuration to send traffic to the entire new deployment. Some companies will always have both a Blue and a Green stack running. This is a good strategy where you need very fast rollback and recovery to a known good state, as you can leave your "old" stack running for any amount of time once you are running on your new stack.

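Since the cutover is just repointing traffic, a toy model of the router makes the rollback story obvious. This is a sketch, not a real load balancer API:

```python
from dataclasses import dataclass

@dataclass
class BlueGreenRouter:
    """Toy router: all traffic goes to whichever colour is currently live."""
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cut_over(self) -> None:
        # Cutover (and rollback) is a single pointer swap, which is what
        # makes recovery to a known good state so fast.
        self.live = self.idle

router = BlueGreenRouter()
# ... deploy the new version to the idle (green) stack and test it here ...
router.cut_over()
print(router.live)  # green: all traffic now hits the new stack
router.cut_over()   # rolling back is the same swap in reverse
print(router.live)  # blue again
```
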
### Canary

This is possibly one of the most complicated deployment strategies. It involves deploying a small number of instances of your new application, sending a small portion of load to them, checking that nothing has broken by monitoring application performance and metrics such as 4XX and 5XX error rates, and then deciding whether to continue with the deployment. In advanced setups the canary controller can perform automatic rollbacks if error thresholds are exceeded.

This approach does involve a lot more configuration, code and effort.

Interestingly, the name comes from coal mining and the phrase "canary in the coal mine". Canary birds have a lower tolerance to toxic gases than humans, so they were used to alert miners when those gases reached dangerous levels inside the mine.

We use our metrics and monitoring to decide if our "canary" deployment is healthy and, if it is, we then proceed with a larger rollout.

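Here is a minimal sketch of the decision loop a canary controller runs, with hypothetical `route` and `error_rate` hooks standing in for a real service mesh and metrics store:

```python
import random

def canary_release(route, error_rate, steps=(1, 5, 25, 50, 100),
                   max_error_rate=0.01, sample=1000):
    """Shift traffic to the canary in stages, checking the observed error
    rate at each step; roll back immediately if it crosses the threshold."""
    for percent in steps:
        route(percent)
        observed = sum(error_rate() for _ in range(sample)) / sample
        if observed > max_error_rate:
            route(0)  # automatic rollback: all traffic back to the old version
            return False
    return True  # every step stayed healthy, canary fully promoted

ok = canary_release(
    route=lambda percent: print(f"canary now takes {percent}% of traffic"),
    error_rate=lambda: random.random() < 0.001,  # simulated 0.1% error rate
)
print("promoted" if ok else "rolled back")
```
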
## Application design considerations

You may have worked out by now that the more advanced deployment strategies require you to have both old and new versions of your application running at once. This means that we need to ensure backwards compatibility with all the other software running at the time. For instance, you couldn't use a database migration to rename a table or column in a single step, because the old deployment would no longer work.

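The usual way around this is an expand/contract migration: add the new column alongside the old one, backfill it, run both versions against both columns, and only drop the old column in a later migration once nothing reads it. A sketch using sqlite3, with hypothetical table and column names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
db.execute("INSERT INTO users (username) VALUES ('alice')")

# Expand: add the new column alongside the old one, so old and new
# versions of the application can run at the same time.
db.execute("ALTER TABLE users ADD COLUMN login_name TEXT")

# Backfill: copy the existing data into the new column.
db.execute("UPDATE users SET login_name = username")

# New code reads and writes login_name; old code keeps using username.
# Contract: dropping username is a separate, later migration, run only
# once no deployed version reads the old column any more.
print(db.execute("SELECT id, username, login_name FROM users").fetchall())
```
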
Additionally, our canary deployment strategy requires our application to have health checks, metrics, good logging and monitoring so that we can detect a problem in the canary deployment specifically. Without these signals we would be unable to programmatically decide whether our new application works.

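At minimum that means exposing an endpoint the orchestrator or canary controller can poll. Below is a toy example using only the Python standard library; `/healthz` is a common convention rather than a requirement, and a real check would also verify dependencies such as the database:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # A real check would verify downstream dependencies here.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```
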
Both of these considerations, along with others, mean that we need to spend extra time on our application code, our deployment code, and our monitoring and alerting stacks to take advantage of the most robust deployment strategies.