
Designing for Resilience, Redundancy and Reliability

We now have an application which uses asynchronous, queue-based messaging to communicate. This gives us some real flexibility in how we design our system to withstand failures. We are going to look at the concepts of Failure Zones, Replication, Contract Testing, and Logging and Tracing to build more robust systems.

Failure Zones

Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, that server fails? Your application goes offline and your customers won't be happy! There's a fix for this: spread your workloads over multiple points of failure. This doesn't just apply to your application instances; you can build in multiple redundant points for every aspect of your system.

Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors? For starters, they have generators on-site for when they inevitably suffer a power cut. They have at least two internet connections into their datacentres, and they often provide multiple availability zones in each region. Take Amazon's eu-west-2 (London) region: it has 3 availability zones, eu-west-2a, eu-west-2b and eu-west-2c. These are 3 physically separated datacentres that inter-connect to form a single region. By deploying services across these availability zones (AZs) we increase our redundancy and resilience against events such as a fire in one of the facilities. If you run a homelab you could distribute work over failure zones by running more than one computer, placing those computers in separate parts of the house, or buying 2 internet connections to stay online if one goes down (this does mean checking that the connections don't run over the same infrastructure as soon as they leave your door - at my house I can get fibre as well as a connection over the old phone lines).

Replication

This links nicely into having multiple replicas of our “stuff”. This doesn't just mean running 2 or 3 copies of our application across the availability zones mentioned above; we also need to consider our data. If we ran a database in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline, potentially permanently? There have been instances of cloud datacentres being completely destroyed by fire, where many customers had not been backing up their data or running across multiple AZs. Your data in that case is… gone. Do you think the business you work in could survive if its data, or its customers' data, just disappeared overnight?

Many of our data storage tools can be configured for replication across multiple failure zones. NATS.io, which we used previously, can be configured to replicate messages across multiple instances to survive the failure of one or more zones. PostgreSQL databases can be configured to stream their WAL (write-ahead log), which stores an in-order history of all transactions, to a standby instance running somewhere else. If our primary fails, the most we would lose is the portion of the WAL that had not yet been transferred to the replica - much less data loss than if we had no replication at all.
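
As a rough sketch of what that can look like with NATS JetStream and the nats.go client (the server URL, stream name and subject below are placeholders, and your cluster needs at least three servers for three replicas), we can ask JetStream to keep multiple copies of every message:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the NATS cluster (the URL is a placeholder for your own deployment).
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Keep 3 replicas of every message, spread across the servers in the
	// cluster, so losing one failure zone does not lose the stream's data.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "PDF_REQUESTS", // hypothetical stream name
		Subjects: []string{"pdf.requests"},
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```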

Contract Testing

We are going to change direction now and look at making our applications more reliable, which means reducing the chance of failures. You may appreciate that the time most likely to cause failures in your system is during deployments. New code hits our system, and if it has not been tested thoroughly in production-like environments then we may end up with undefined behaviour.

There's a concept called Contract testing, where we test the interfaces between our applications at development and build time. This allows us to rapidly get feedback (a core principle of DevOps) on our progress.
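
Dedicated tools such as Pact exist for this, but as a minimal, hand-rolled sketch of the idea (the handler and field names here are hypothetical), a provider-side Go test can check that a response still contains the fields its consumers rely on, so a breaking change fails the build rather than production:

```go
// This would live in a _test.go file alongside the provider code.
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// pdfStatusHandler is a hypothetical provider endpoint that consumers call.
func pdfStatusHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{
		"id":     "1234",
		"status": "queued",
	})
}

// TestPDFStatusContract asserts the response still contains the fields our
// consumers have agreed to depend on ("id" and "status"). If a change drops
// or renames a field, the build fails long before the change is deployed.
func TestPDFStatusContract(t *testing.T) {
	req := httptest.NewRequest(http.MethodGet, "/pdf/1234/status", nil)
	rec := httptest.NewRecorder()

	pdfStatusHandler(rec, req)

	var body map[string]any
	if err := json.NewDecoder(rec.Body).Decode(&body); err != nil {
		t.Fatalf("response was not valid JSON: %v", err)
	}

	for _, field := range []string{"id", "status"} {
		if _, ok := body[field]; !ok {
			t.Errorf("contract broken: response is missing %q", field)
		}
	}
}
```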

Async programming and queues

We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our changes become smaller, less likely to impact other areas of our application, and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, then we can place a message onto a queue and wait for the consumer of that message to eventually get round to it. We could see a situation where thousands of people request their PDF at once; an application that handles this synchronously would just fall over, run out of memory and collapse. This would leave all our customers without their PDFs and needing to take action again to wait for them to be generated. Without developer intervention we may not get back to a state where the service can recover.
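
A minimal sketch of the producer side, assuming NATS as the queue (the subject name and payload are illustrative): the HTTP handler publishes a message and immediately returns 202 Accepted instead of generating the PDF inside the request:

```go
package main

import (
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Instead of generating the PDF inside the request, publish a message
	// describing the work and tell the caller we have accepted it.
	http.HandleFunc("/pdf", func(w http.ResponseWriter, r *http.Request) {
		if err := nc.Publish("pdf.requests", []byte(`{"customer":"1234"}`)); err != nil {
			http.Error(w, "could not queue request", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted) // 202: "we'll get to it"
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```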

Using a queue we can work away in the background to generate these PDFs, and we could even scale the service automatically when we notice the queue getting longer. Each new instance would just take the next message from the queue and work through the backlog. Once demand drops, we could automatically scale these services back down when the queue depth reaches 0.
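
And a matching sketch of a worker, again assuming NATS (the subject and queue-group names are illustrative): every replica that joins the same queue group takes the next message off the backlog, so scaling out simply shares the work:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Every worker in the "pdf-workers" queue group receives a share of the
	// messages; adding more replicas of this process spreads the backlog
	// across them, and removing replicas simply concentrates it again.
	_, err = nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
		log.Printf("generating PDF for request: %s", string(m.Data))
		// ... do the slow PDF generation here ...
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive while the subscription runs
}
```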

Logging and Tracing

It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps is thinking about what we can do to make our lives less painful when they do. Logging and tracing are really important tools in our toolbox to keep ourselves happy.

If we log frequently, with both success and failure messages, we can follow the progress of requests and customer journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on the applications that are logging warning or error messages. My general rule is that you can't log too much - it's not possible! It is, however, important to use a logging framework that lets you tune the log level that gets printed. For example, if I have 5 log levels (as is common) - TRACE, DEBUG, INFO, WARN and ERROR - we should have a mechanism for every application to set the level of logs we want to print. Most of the time we only want WARN and ERROR to be visible in logs, to quickly show us there's an issue. However, if we are in development or debugging a specific application, it's important to be able to turn up the verbosity of our logs and see those INFO, DEBUG or TRACE levels.
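
As a minimal example using Go's standard log/slog package (which offers DEBUG, INFO, WARN and ERROR levels), the level can be driven by an environment variable - the LOG_LEVEL name here is just a convention of this sketch - so verbosity can be turned up without a rebuild:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to WARN so only problems show up in normal operation;
	// setting LOG_LEVEL=debug turns the verbosity up for debugging.
	level := slog.LevelWarn
	if os.Getenv("LOG_LEVEL") == "debug" {
		level = slog.LevelDebug
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))

	logger.Debug("connecting to queue", "url", "nats://localhost:4222") // hidden unless LOG_LEVEL=debug
	logger.Warn("queue depth is growing", "depth", 250)                 // always visible
	logger.Error("failed to generate PDF", "customer", "1234")          // always visible
}
```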

Tracing is the concept of attaching a unique identifier to a request in our system, which then gets propagated and logged throughout that request's journey. We could see an HTTP request hit a load balancer and get a correlationID, and then we want to see that correlationID in every log line as the request's actions percolate through our system.
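
A small sketch of what that can look like as Go HTTP middleware (the X-Correlation-ID header name is a common convention rather than a standard, and the route is hypothetical): reuse an incoming ID if the load balancer already set one, otherwise generate a new one, and include it in every log line:

```go
package main

import (
	"log"
	"net/http"

	"github.com/google/uuid"
)

// withCorrelationID reuses an incoming X-Correlation-ID header if an upstream
// component already set one, otherwise it generates a new ID, echoes it back
// on the response, and logs it so the same ID can be followed downstream.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = uuid.NewString()
		}
		w.Header().Set("X-Correlation-ID", id)
		log.Printf("correlationID=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/pdf", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(mux)))
}
```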

Conclusion

It's hard to build a fully fault-tolerant system. It involves learning from our, and other people's, mistakes. I have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by leveraging topology spread constraints to spread replicas around. We won't in this example, as we have much more to cover!