From 0ba59699e16ee89c6942d872346fb7e3d2d9c997 Mon Sep 17 00:00:00 2001
From: Alistair Hey
Date: Fri, 24 Mar 2023 15:32:21 +0000
Subject: [PATCH] Add day 86

Signed-off-by: Alistair Hey
---
 2023.md       |  2 +-
 2023/day86.md | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/2023.md b/2023.md
index 3a22d96..db2cff3 100644
--- a/2023.md
+++ b/2023.md
@@ -158,7 +158,7 @@ Or contact us via Twitter, my handle is [@MichaelCade1](https://twitter.com/Mich
 - [] 👷🏻‍♀️ 84 > [](2023/day84.md)
 - [] 👷🏻‍♀️ 85 > [](2023/day85.md)
-- [] 👷🏻‍♀️ 86 > [](2023/day86.md)
+- [] 👷🏻‍♀️ 86 > [Designing for Resilience, Redundancy and Reliability](2023/day86.md)
 - [] 👷🏻‍♀️ 87 > [](2023/day87.md)
 - [] 👷🏻‍♀️ 88 > [](2023/day88.md)
 - [] 👷🏻‍♀️ 89 > [](2023/day89.md)
diff --git a/2023/day86.md b/2023/day86.md
index e69de29..b58ad07 100644
--- a/2023/day86.md
+++ b/2023/day86.md
@@ -0,0 +1,95 @@
+# Designing for Resilience, Redundancy and Reliability
+
+We now have an application which uses asynchronous, queue-based messaging to communicate. This gives us some real
+flexibility in how we design our system to withstand failures. We are going to look at failure zones, replication,
+contract testing, logging and tracing to build more robust systems.
+
+## Failure Zones
+
+Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably,
+that server fails? Your application goes offline and your customers won't be happy! There is a fix for this: spread
+your workloads so that no single point of failure can take everything down. This doesn't just apply to your
+application instances; you can build redundancy into every layer of your system.
+
+Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors?
+For starters, they have generators on-site for when they inevitably suffer a power cut, and at least two internet
+connections into each datacentre. They also often provide multiple availability zones in each region. Take Amazon's
+eu-west-2 (London) region. It has three availability zones: eu-west-2a, eu-west-2b and eu-west-2c. These are three
+physically separated datacentres that interconnect to form a single region. By deploying services across these
+availability zones (AZs) we gain redundancy and resilience against events such as a fire in one of the facilities. If
+you run a homelab you could distribute work over failure zones by running more than one computer, placing those
+computers in separate parts of the house, or buying two internet connections to stay online if one goes down. (Do
+check that the connections don't share the same infrastructure as soon as they leave your door; at my house I can get
+fibre as well as a connection over the old phone lines.)
+
+## Replication
+
+This links nicely into having multiple replicas of our "stuff". That doesn't just mean running two or three copies of
+our application across the availability zones mentioned above; we also need to consider our data. If we ran a database
+in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline,
+potentially permanently? There have been instances of cloud datacentres being completely destroyed by fire, where many
+customers had not been backing up their data or running across multiple AZs. Your data in this case is… gone. Do you
+think the business you work in could survive if its data, or its customers' data, just disappeared overnight?
+
+Many of our data storage tools can be configured to replicate across multiple failure zones. NATS.io, which we used
+previously, can replicate messages across multiple instances so that it survives the failure of one or more zones.
+PostgreSQL databases can be configured to stream their WAL (Write-Ahead Log), which stores an in-order history of all
+transactions, to a standby instance running somewhere else. If the primary fails, the most we can lose is whatever
+part of the WAL had not yet been transferred to the replica: far less data loss than having no replication at all.
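+
+As a concrete illustration of the NATS side, here is a minimal sketch using the official nats.go client. The server
+URLs and stream name are invented for this example, and `Replicas: 3` assumes a JetStream-enabled cluster of at least
+three nodes; treat it as a starting point rather than production configuration.
+
+```go
+package main
+
+import (
+    "log"
+    "time"
+
+    "github.com/nats-io/nats.go"
+)
+
+func main() {
+    // Give the client every seed URL so it can fail over if one node
+    // (or the zone it lives in) disappears. The URLs are hypothetical.
+    nc, err := nats.Connect(
+        "nats://nats-az1:4222,nats://nats-az2:4222,nats://nats-az3:4222",
+        nats.MaxReconnects(-1), // keep trying to reconnect forever
+        nats.ReconnectWait(2*time.Second),
+    )
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer nc.Close()
+
+    js, err := nc.JetStream()
+    if err != nil {
+        log.Fatal(err)
+    }
+
+    // Ask JetStream to keep three copies of every message on this
+    // stream, ideally one per availability zone.
+    if _, err := js.AddStream(&nats.StreamConfig{
+        Name:     "PDF_REQUESTS",
+        Subjects: []string{"pdf.requests"},
+        Replicas: 3,
+    }); err != nil {
+        log.Fatal(err)
+    }
+}
+```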
+
+## Contract Testing
+
+We are going to change direction now and look at making our applications more reliable. This means reducing the chance
+of failures. You may appreciate that the time your system is most likely to fail is during a deployment: new code hits
+the system, and if it has not been tested thoroughly in production-like environments we may end up with undefined
+behaviour.
+
+There is a concept called contract testing, where we test the interfaces between our applications at development and
+build time. This allows us to rapidly get feedback (a core principle of DevOps) on our progress. A minimal sketch of
+the idea follows the queue example below.
+
+## Async programming and queues
+
+We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our
+changes become smaller, less likely to impact other areas of our application and easier to roll back.
+
+If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, then we
+can place a message onto a queue and wait for the consumer of that message to eventually get round to it. Imagine
+thousands of people requesting their PDFs at once: an application that generates them synchronously would just fall
+over, run out of memory and collapse. That would leave all our customers without their PDFs and needing to request
+them again, and without developer intervention we may never get back to a state where the service can recover.
+
+Using a queue, we can work away in the background generating these PDFs, and could even scale the service
+automatically when we notice the queue getting longer. Each new instance just takes the next message from the queue
+and works through the backlog. Once demand drops off we can automatically scale these services back down as the queue
+depth approaches 0.
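+
+Here is a minimal sketch of that pattern with nats.go. The subject and queue-group names are invented for this
+example; the important part is that every worker replica subscribes with the same queue group, so each message is
+handled by exactly one worker, and scaling out is just a case of starting more copies of the process.
+
+```go
+package main
+
+import (
+    "log"
+
+    "github.com/nats-io/nats.go"
+)
+
+func main() {
+    nc, err := nats.Connect(nats.DefaultURL)
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer nc.Close()
+
+    // All replicas join the "pdf-workers" queue group; NATS delivers each
+    // message on "pdf.requests" to exactly one member of the group.
+    _, err = nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
+        log.Printf("generating PDF for request: %s", m.Data)
+        // ... render the PDF, store it, notify the customer ...
+    })
+    if err != nil {
+        log.Fatal(err)
+    }
+
+    // Elsewhere, a producer fires the request onto the queue and returns
+    // to its user immediately instead of rendering synchronously.
+    if err := nc.Publish("pdf.requests", []byte(`{"customer":"42"}`)); err != nil {
+        log.Fatal(err)
+    }
+
+    select {} // block forever so the subscriber keeps running (sketch only)
+}
+```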
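+
+And here is the contract-testing sketch promised above. Real projects often reach for a dedicated tool such as Pact
+for consumer-driven contract testing; to stay dependency-free, this hand-rolled Go test simply pins the response shape
+the consumer relies on. The endpoint, handler and field names are all invented for illustration.
+
+```go
+package contract_test
+
+import (
+    "encoding/json"
+    "net/http"
+    "net/http/httptest"
+    "testing"
+)
+
+// pdfStatus is the response shape the consumer and provider agreed on.
+type pdfStatus struct {
+    ID     string `json:"id"`
+    Status string `json:"status"`
+}
+
+// statusHandler stands in for the provider's real HTTP handler.
+func statusHandler() http.Handler {
+    mux := http.NewServeMux()
+    mux.HandleFunc("/pdf/status", func(w http.ResponseWriter, r *http.Request) {
+        w.Header().Set("Content-Type", "application/json")
+        json.NewEncoder(w).Encode(pdfStatus{ID: "123", Status: "done"})
+    })
+    return mux
+}
+
+// The test fails at build/test time if the provider drifts away from the
+// shape the consumer depends on, long before anything is deployed.
+func TestStatusEndpointHonoursContract(t *testing.T) {
+    srv := httptest.NewServer(statusHandler())
+    defer srv.Close()
+
+    resp, err := srv.Client().Get(srv.URL + "/pdf/status")
+    if err != nil {
+        t.Fatal(err)
+    }
+    defer resp.Body.Close()
+
+    if resp.StatusCode != http.StatusOK {
+        t.Fatalf("contract expects 200 OK, got %d", resp.StatusCode)
+    }
+
+    var body pdfStatus
+    dec := json.NewDecoder(resp.Body)
+    dec.DisallowUnknownFields() // surface fields the consumer doesn't know about
+    if err := dec.Decode(&body); err != nil {
+        t.Fatalf("response no longer matches the agreed contract: %v", err)
+    }
+    if body.ID == "" || body.Status == "" {
+        t.Fatal("contract requires non-empty id and status")
+    }
+}
+```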
+
+## Logging and Tracing
+
+It goes without saying that our applications and systems will fail. What we are doing in this section of
+90DaysOfDevOps is thinking about what we can do to make our lives less painful when they do. Logging and tracing are
+two of the most important tools in our toolbox for keeping ourselves happy.
+
+If we log frequently, with both success and failure messages, we can follow the progress of requests and customer
+journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on the
+applications logging warning or error messages. My general rule is that you can't log too much - it's not possible! It
+is, however, important to use a logging framework that lets you tune the level that gets printed. For example, with
+the five common log levels - TRACE, DEBUG, INFO, WARN and ERROR - every application should have a mechanism for
+setting which levels get printed. Most of the time we only want WARN and ERROR to be visible, to quickly show us
+there's an issue. However, when developing or debugging a specific application, it's important to be able to turn up
+the verbosity and see those INFO, DEBUG or TRACE messages.
+
+Tracing is the concept of attaching a unique identifier to a request that then gets logged throughout that request's
+journey. We could see an HTTP request hit a load balancer and be given a correlation ID; we then want to see that
+correlation ID in every log line as the request's actions percolate through our system. A small sketch combining both
+ideas follows the conclusion below.
+
+## Conclusion
+
+It's hard to build a fully fault-tolerant system. It involves learning from our own, and other people's, mistakes. I
+have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in
+our application.
+
+We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by
+leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
+to spread replicas around. We won't in this example as we have much more to cover!
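+
+As promised, here is a minimal sketch of the logging and tracing ideas above using Go's standard log/slog package
+(slog has no TRACE level, so DEBUG is the most verbose shown). The LOG_LEVEL environment variable and X-Correlation-ID
+header are common conventions rather than anything prescribed earlier in this series.
+
+```go
+package main
+
+import (
+    "crypto/rand"
+    "encoding/hex"
+    "log/slog"
+    "net/http"
+    "os"
+)
+
+// logLevel maps a LOG_LEVEL environment variable onto slog's levels, so each
+// deployment can turn verbosity up or down without a rebuild.
+func logLevel() slog.Level {
+    switch os.Getenv("LOG_LEVEL") {
+    case "DEBUG":
+        return slog.LevelDebug
+    case "INFO":
+        return slog.LevelInfo
+    case "ERROR":
+        return slog.LevelError
+    default:
+        return slog.LevelWarn // quiet by default: WARN and ERROR only
+    }
+}
+
+// withCorrelationID reuses an incoming X-Correlation-ID header or mints a new
+// one, then tags every log line for that request with the same identifier.
+func withCorrelationID(logger *slog.Logger, next func(*slog.Logger, http.ResponseWriter, *http.Request)) http.Handler {
+    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+        id := r.Header.Get("X-Correlation-ID")
+        if id == "" {
+            buf := make([]byte, 8)
+            rand.Read(buf) // error ignored: sketch only
+            id = hex.EncodeToString(buf)
+        }
+        reqLogger := logger.With("correlation_id", id)
+        reqLogger.Info("request received", "path", r.URL.Path)
+        next(reqLogger, w, r)
+    })
+}
+
+func main() {
+    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: logLevel()}))
+    handler := withCorrelationID(logger, func(l *slog.Logger, w http.ResponseWriter, r *http.Request) {
+        l.Debug("doing the real work") // only visible when LOG_LEVEL=DEBUG
+        w.Write([]byte("ok"))
+    })
+    logger.Error("server stopped", "error", http.ListenAndServe(":8080", handler))
+}
+```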