Add day 86
Signed-off-by: Alistair Hey <alistair@heyal.co.uk>
This commit is contained in:
parent
8657e7f413
commit
0ba59699e1
2023.md
@@ -158,7 +158,7 @@ Or contact us via Twitter, my handle is [@MichaelCade1](https://twitter.com/Mich
 - [] 👷🏻‍♀️ 84 > [](2023/day84.md)
 - [] 👷🏻‍♀️ 85 > [](2023/day85.md)
-- [] 👷🏻‍♀️ 86 > [](2023/day86.md)
+- [] 👷🏻‍♀️ 86 > [Designing for Resilience, Redundancy and Reliability](2023/day86.md)
 - [] 👷🏻‍♀️ 87 > [](2023/day87.md)
 - [] 👷🏻‍♀️ 88 > [](2023/day88.md)
 - [] 👷🏻‍♀️ 89 > [](2023/day89.md)
2023/day86.md (new file)
@@ -0,0 +1,95 @@
# Designing for Resilience, Redundancy and Reliability

We now have an application which uses asynchronous, queue-based messaging to communicate. This gives us some real flexibility in how we design our system to withstand failures. We are going to look at the concepts of Failure Zones, Replication, Contract Testing, Logging and Tracing to build more robust systems.
## Failure Zones

Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, that server fails? Your application goes offline and your customers won't be happy! There is a fix for this: spread your workloads over multiple points of failure. This doesn't just apply to your application instances; you can build multiple redundant points for every aspect of your system.

Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors? For starters, they have generators on-site for when they inevitably suffer a power cut. They have at least two internet connections into the datacentres, and they often provide multiple availability zones in each region. Take Amazon's eu-west-2 (London) region. It has 3 availability zones: eu-west-2a, eu-west-2b and eu-west-2c. These are 3 physically separated datacentres that all inter-connect to provide a single region. By deploying services across these availability zones (AZs) we increase our redundancy and resilience against events such as a fire in one of those facilities. If you run a homelab you could distribute work over failure zones by running more than one computer, placing those computers in separate parts of the house, or buying 2 internet connections to stay online if one goes down (this does mean checking that the connections don't just run over the same infrastructure as soon as they leave your door; at my house I can get fibre as well as a connection over the old phone lines).
## Replication

This links nicely into having multiple replicas of our "stuff". This doesn't just mean running 2 or 3 copies of our application across the previously mentioned availability zones; we also need to consider our data. If we ran a database in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline, potentially permanently? There have been instances of cloud datacentres being completely destroyed by fire, and many customers had not been backing up their data or running across multiple AZs. Your data in that case is… gone. Do you think the business you work in could survive if their data, or their customers' data, just disappeared overnight?

Many of our data storage tools can be configured for replication across multiple failure zones. NATS.io, which we used previously, can be configured to replicate messages across multiple instances to survive the failure of one or more zones. PostgreSQL databases can be configured to stream their WAL (write-ahead log), which stores an in-order history of all transactions, to a standby instance running somewhere else. If our primary fails then the most we would lose is the data in the WAL that had not yet been transferred to the replica: much less data loss than if we had no replication at all.
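
As a rough sketch of the NATS side of this (the cluster URLs, stream name and replica count below are illustrative assumptions, using the official `nats.go` client), a JetStream stream can be asked to keep a copy of every message on several servers:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a NATS cluster; the URLs and 3-node cluster are assumptions for this sketch.
	nc, err := nats.Connect("nats://nats-0:4222,nats://nats-1:4222,nats://nats-2:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Replicas: 3 asks JetStream to store every message on 3 servers,
	// so the stream survives the loss of a single failure zone.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "PDF_JOBS", // hypothetical stream name
		Subjects: []string{"pdf.requests"},
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

With 3 replicas, losing a single zone still leaves two copies of every queued message, so consumers can keep working while the failed zone recovers.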

## Contract Testing

We are going to change direction now and look at making our applications more reliable. This means reducing the chance of failures. You may appreciate that the time most likely to cause failures in your system is during deployments. New code hits our system, and if it has not been tested thoroughly in production-like environments then we may end up with undefined behaviours.

There's a concept called contract testing where we can test the interfaces between our applications at development and build time. This allows us to rapidly get feedback (a core principle of DevOps) on our progress.
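
There are dedicated tools for this (Pact is a well-known one), but as a minimal hand-rolled sketch of the idea (the endpoint, field names and handler below are illustrative assumptions, using only Go's standard `httptest` package): the consumer encodes the response shape it depends on as a test, so a provider change that breaks the contract fails at build time rather than in production.

```go
package contract

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// pdfStatusResponse is the shape the consumer relies on; renaming or removing
// a field on the provider side should make this test fail.
type pdfStatusResponse struct {
	JobID  string `json:"job_id"`
	Status string `json:"status"`
}

// providerHandler stands in for the real provider; in a real contract test this
// would be the provider's actual handler or a recorded interaction.
func providerHandler(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(pdfStatusResponse{JobID: "1234", Status: "queued"})
}

func TestPDFStatusContract(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(providerHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/pdf/1234/status")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	var got pdfStatusResponse
	if err := json.NewDecoder(resp.Body).Decode(&got); err != nil {
		t.Fatalf("response no longer matches the agreed contract: %v", err)
	}
	if got.JobID == "" || got.Status == "" {
		t.Fatalf("contract fields missing from response: %+v", got)
	}
}
```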

## Async programming and queues

We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our changes become smaller, less likely to impact other areas of our application and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, we can place a message onto a queue and wait for the consumer of that message to eventually get round to it. We could see a situation where thousands of people request their PDF at once; an application that handled this synchronously would just fall over, run out of memory and collapse. That would leave all our customers without their PDFs and needing to take action again to wait for them to be generated. Without developer intervention we may not get back to a state where the service can recover.

Using a queue we can slowly work away in the background to generate these PDFs, and could even scale the service automatically when we notice the queue getting longer. Each new application instance would just take the next message from the queue and work through the backlog. Once demand drops, we could automatically scale these services back down when our queue depth reaches 0.
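
As a rough sketch of that pattern (the subject name `pdf.requests`, queue group `pdf-workers` and single-process demo are illustrative assumptions, again using the `nats.go` client): the producer only enqueues work, and every worker replica in the same queue group shares the backlog.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Worker side: every replica joins the same queue group, so each request
	// is handled by exactly one worker and the backlog is shared between them.
	_, err = nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
		log.Printf("generating PDF for request: %s", string(m.Data))
		// ... render the PDF in the background here ...
	})
	if err != nil {
		log.Fatal(err)
	}

	// Producer side: the web handler would only publish the request and return
	// immediately, so a spike of requests just makes the queue longer.
	if err := nc.Publish("pdf.requests", []byte(`{"customer":"1234"}`)); err != nil {
		log.Fatal(err)
	}

	time.Sleep(time.Second) // give the worker a moment to drain the queue in this single-process demo
}
```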

## Logging and Tracing

It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps is thinking about what we can do to make our lives less bad when they do. Logging and tracing are really important tools in our toolbox for keeping ourselves happy.

If we log frequently, with both success and failure messages, we can follow the progress of our requests and customer journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on the applications that are logging warning or error messages. My general rule is that you can't log too much - it's not possible! It is, however, important to use a logging framework that lets you tune the log level that gets printed. For example, if I have 5 log levels (as is common) - TRACE, DEBUG, INFO, WARN and ERROR - we should have a mechanism for every application to set the level of logs we want to print. Most of the time we only want WARN and ERROR to be visible in our logs, to quickly show us there's an issue. However, if we are developing or debugging a specific application it's important to be able to turn up the verbosity and see those INFO, DEBUG or TRACE messages.
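
As an illustrative sketch of that kind of knob (the `LOG_LEVEL` environment variable and the use of Go's standard `log/slog` package are assumptions, not something we have set up so far): the level is read at startup, so the same binary runs quietly in production and verbosely while debugging.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to WARN so only warnings and errors are visible in normal operation.
	level := slog.LevelWarn
	// Turn verbosity up or down via an environment variable (an assumption for
	// this sketch), e.g. LOG_LEVEL=DEBUG while debugging a single service.
	switch os.Getenv("LOG_LEVEL") {
	case "TRACE", "DEBUG":
		level = slog.LevelDebug // slog has no TRACE level; map it to DEBUG
	case "INFO":
		level = slog.LevelInfo
	case "ERROR":
		level = slog.LevelError
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
	slog.SetDefault(logger)

	slog.Debug("connecting to queue", "subject", "pdf.requests") // hidden unless LOG_LEVEL=DEBUG
	slog.Warn("queue depth is growing", "depth", 120)            // visible at the default level
}
```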

Tracing is the concept of attaching a unique identifier to a request in our system that then gets propagated and logged throughout that request's journey. We could see an HTTP request hit a load balancer, get a correlation ID, and then we want to see that correlation ID in every log line as the request's actions percolate through our system.
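
A minimal sketch of that idea (the header name `X-Correlation-ID` and the hand-rolled middleware are illustrative assumptions; real systems often use a tracing library instead): the middleware reuses an incoming ID or creates one, and every log line written for that request carries it.

```go
package main

import (
	"fmt"
	"log"
	"log/slog"
	"math/rand"
	"net/http"
)

// withCorrelationID reuses the caller's correlation ID if one was sent,
// otherwise generates one, and makes sure it appears in the response headers
// and in every log line written while handling the request.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = fmt.Sprintf("%08x", rand.Uint32()) // sketch only; use UUIDs in practice
		}
		w.Header().Set("X-Correlation-ID", id)
		slog.Info("request received", "correlation_id", id, "path", r.URL.Path)
		next.ServeHTTP(w, r)
		slog.Info("request finished", "correlation_id", id, "path", r.URL.Path)
	})
}

func main() {
	pdfHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "PDF generation queued")
	})
	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(pdfHandler)))
}
```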

## Conclusion

It's hard to build a fully fault-tolerant system. It involves learning from our, and other people's, mistakes. I have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) to spread replicas around. We won't in this example as we have much more to cover!