Add day 86

Signed-off-by: Alistair Hey <alistair@heyal.co.uk>
2 changed files with 96 additions and 1 deletion


@@ -158,7 +158,7 @@ Or contact us via Twitter, my handle is [@MichaelCade1](https://twitter.com/Mich
 - [] 👷🏻‍♀️ 84 > [](2023/day84.md)
 - [] 👷🏻‍♀️ 85 > [](2023/day85.md)
-- [] 👷🏻‍♀️ 86 > [](2023/day86.md)
+- [] 👷🏻‍♀️ 86 > [Designing for Resilience, Redundancy and Reliability](2023/day86.md)
 - [] 👷🏻‍♀️ 87 > [](2023/day87.md)
 - [] 👷🏻‍♀️ 88 > [](2023/day88.md)
 - [] 👷🏻‍♀️ 89 > [](2023/day89.md)


@@ -0,0 +1,95 @@
# Designing for Resilience, Redundancy and Reliability

We now have an application that uses asynchronous, queue-based messaging to communicate. This gives us real flexibility in how we design our system to withstand failures. We are going to look at Failure Zones, Replication, Contract Testing, Logging and Tracing to build more robust systems.

## Failure Zones

Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, that server fails? Your application goes offline and your customers won't be happy! There is a fix for this: spread your workloads over multiple points of failure. This doesn't just apply to your application instances; you can build redundancy into every aspect of your system.

Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors? For starters, they have generators on-site for when they inevitably suffer a power cut, they have at least two internet connections into each datacenter, and they often provide multiple availability zones in each region. Take Amazon's eu-west-2 (London) region. It has three availability zones: eu-west-2a, eu-west-2b and eu-west-2c. These are three physically separated datacenters that inter-connect to form a single region, which means that by deploying services across these availability zones (AZs) we increase our redundancy and resilience against events such as a fire in one of the facilities.

If you run a homelab you can distribute work over failure zones too: run more than one computer, place those computers in separate parts of the house, or buy two internet connections so you stay online if one goes down. (Do check that those connections don't run over the same infrastructure as soon as they leave your door; at my house I can get fibre and also a connection over the old phone lines.)
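
The same idea applies at the application layer: a client that knows about one server per zone can fail over when a zone disappears. Here is a minimal sketch in Go using the NATS client we used earlier in this series; the `nats-az1`/`nats-az2`/`nats-az3` hostnames are made up for this example.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// One NATS server per availability zone (hypothetical hostnames):
	// if the current server's zone dies, the client fails over to another.
	nc, err := nats.Connect(
		"nats://nats-az1:4222,nats://nats-az2:4222,nats://nats-az3:4222",
		nats.MaxReconnects(-1),            // never give up reconnecting
		nats.ReconnectWait(2*time.Second), // pause between attempts
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	log.Printf("connected to %s", nc.ConnectedUrl())
}
```
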
## Replication

This links nicely into having multiple replicas of our "stuff". That doesn't just mean running two or three copies of our application across the previously mentioned availability zones; we also need to consider our data. If we ran a database in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline, potentially permanently? There have been instances of cloud datacenters being completely destroyed by fire, and many customers had not been backing up their data or running across multiple AZs. Your data, in that case, is… gone. Do you think the business you work in could survive if its data, or its customers' data, just disappeared overnight?

Many of our data storage tools can be configured to replicate across multiple failure zones. NATS.io, which we used previously, can replicate messages across multiple instances to survive the failure of one or more zones. PostgreSQL can be configured to stream its WAL (Write-Ahead Log), an in-order history of all transactions, to a standby instance running somewhere else. If our primary fails, the most we would lose is the part of the WAL that had not yet been transferred to the replica: much less data loss than if we had no replication at all.
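
To make the NATS side concrete, JetStream (the persistence layer of NATS) lets you declare how many copies of a stream to keep. Here is a rough sketch, assuming a hypothetical `PDF_JOBS` stream and `pdf.requests` subject; with `Replicas: 3` and servers spread over three AZs, the stream survives the loss of any one zone.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Keep three copies of every message, one per availability zone,
	// so the stream survives the loss of any single zone.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "PDF_JOBS",               // hypothetical stream name
		Subjects: []string{"pdf.requests"}, // hypothetical subject
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
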
## Contract Testing

We are going to change direction now and look at making our applications more reliable, which means reducing the chance of failures. You may appreciate that the time your system is most likely to fail is during deployments: new code hits the system, and if it has not been tested thoroughly in production-like environments we may end up with undefined behaviour.

There's a concept called contract testing, where we test the interfaces between our applications at development and build time. This gives us rapid feedback (a core principle of DevOps) on whether our services still agree with each other, long before anything reaches production.
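
Dedicated tools such as Pact exist for this, but the core idea fits in a few lines. Below is a deliberately simple sketch in Go, with a made-up `PDFRequest` contract and sample payload: the consumer pins down the message shape it relies on, and the test fails at build time if the producer's output no longer satisfies it.

```go
package contract

import (
	"encoding/json"
	"testing"
)

// PDFRequest is the contract the consumer relies on (hypothetical fields).
type PDFRequest struct {
	CorrelationID string `json:"correlation_id"`
	DocumentID    string `json:"document_id"`
}

func TestProducerHonoursContract(t *testing.T) {
	// In a real setup this sample would come from the producer's build
	// or a shared contract broker, not be hard-coded in the test.
	producerPayload := []byte(`{"correlation_id":"abc-123","document_id":"doc-42"}`)

	var req PDFRequest
	if err := json.Unmarshal(producerPayload, &req); err != nil {
		t.Fatalf("payload no longer matches the contract: %v", err)
	}
	if req.CorrelationID == "" || req.DocumentID == "" {
		t.Fatal("required contract fields are missing or empty")
	}
}
```
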
## Async programming and queues

We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our changes become smaller, less likely to impact other areas of the application, and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, we can place a message onto a queue and wait for the consumer of that message to eventually get round to it. Imagine thousands of people requesting their PDFs at once: an application doing this synchronously could simply fall over, run out of memory and collapse. That would leave all our customers without their PDFs and needing to act again to have them generated, and without developer intervention we might never get back to a state where the service can recover.

Using a queue we can work away in the background generating these PDFs, and we could even scale the service automatically when we notice the queue getting longer. Each new instance just takes the next message from the queue and works through the backlog. Once demand drops, we can automatically scale these services back down as the queue depth returns to 0.
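
Here is a sketch of such a worker in Go, again using NATS: every instance joins the same queue group, so each message is delivered to exactly one worker, and adding instances (by hand or via an autoscaler watching queue depth) simply drains the backlog faster. The subject, group name and `generatePDF` helper are assumptions for illustration.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// generatePDF stands in for the real (slow) PDF-rendering work.
func generatePDF(documentID string) {
	log.Printf("generated PDF for %s", documentID)
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// All instances join the "pdf-workers" queue group, so each message
	// on the subject is handled by exactly one of them.
	_, err = nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
		generatePDF(string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever; messages are handled in the callback
}
```
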
## Logging and Tracing

It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps is thinking about what we can do to make our lives easier when they do. Logging and tracing are some of the most important tools in our toolbox.

If we log frequently, with both success and failure messages, we can follow the progress of requests and customer journeys through our system. When things go wrong we can quickly rule out specific services, or focus on the applications logging warning or error messages. My general rule is that you can't log too much! It is, however, important to use a logging framework that lets you tune which log levels actually get printed. For example, with the five common log levels, TRACE, DEBUG, INFO, WARN and ERROR, every application should have a mechanism for setting the level of logs we want to print. Most of the time we only want WARN and ERROR to be visible, to quickly show us there's an issue; but when developing or debugging a specific application it's important to be able to turn up the verbosity and see the INFO, DEBUG or TRACE lines too.

Tracing is the idea of attaching a unique identifier to a request as it enters our system and then logging that identifier throughout the request's journey. For example, an HTTP request hits a load balancer and is given a correlation ID, and we then want to see that correlation ID in every log line as the request's actions percolate through our system.
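
As a sketch of both ideas in Go, the standard library's `log/slog` supports exactly this kind of level tuning, and attaching a correlation ID once makes it appear on every subsequent line. Reading the level from a `LOG_LEVEL` environment variable is an assumed convention here, not something `slog` does for you.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to WARN; turn verbosity up via LOG_LEVEL when debugging.
	// (slog has no TRACE level; DEBUG is its most verbose.)
	level := slog.LevelWarn
	switch os.Getenv("LOG_LEVEL") {
	case "DEBUG":
		level = slog.LevelDebug
	case "INFO":
		level = slog.LevelInfo
	case "ERROR":
		level = slog.LevelError
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))

	// Attach the correlation ID once; it then appears on every log line
	// this logger writes, so we can follow one request across services.
	reqLogger := logger.With("correlation_id", "abc-123") // would come from the incoming request

	reqLogger.Info("pdf generation started")  // hidden unless LOG_LEVEL=INFO or DEBUG
	reqLogger.Warn("pdf template deprecated") // visible at the default level
}
```
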
## Conclusion

It's hard to build a fully fault-tolerant system, and it involves learning from our own, and other people's, mistakes. I have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) to spread replicas around, but we won't in this example as we have much more to cover!