Add day 86
Signed-off-by: Alistair Hey <alistair@heyal.co.uk>
This commit is contained in:
parent 8657e7f413
commit 0ba59699e1
2023.md
@@ -158,7 +158,7 @@ Or contact us via Twitter, my handle is [@MichaelCade1](https://twitter.com/MichaelCade1)
- [] 👷🏻♀️ 84 > [](2023/day84.md)
- [] 👷🏻♀️ 85 > [](2023/day85.md)
- [] 👷🏻♀️ 86 > [](2023/day86.md)
- [] 👷🏻♀️ 86 > [Designing for Resilience, Redundancy and Reliability](2023/day86.md)
- [] 👷🏻♀️ 87 > [](2023/day87.md)
- [] 👷🏻♀️ 88 > [](2023/day88.md)
- [] 👷🏻♀️ 89 > [](2023/day89.md)
2023/day86.md (new file)
@@ -0,0 +1,95 @@
# Designing for Resilience, Redundancy and Reliability
We now have an application which uses asynchronous, queue-based messaging to communicate. This gives us some real flexibility in how we design our system to withstand failures. We are going to look at the concepts of Failure Zones, Replication, Contract Testing, Logging and Tracing to build more robust systems.
## Failure Zones
Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, that server fails? Your application goes offline and your customers won't be happy! There's a fix for this: spread your workloads over multiple points of failure. This doesn't just apply to your application instances; you can build in redundancy for every aspect of your system.

Ever wonder what the large cloud providers do to keep their services running despite external and unpredictable factors? For starters, they have generators on-site for when they inevitably suffer a power cut. They have at least two internet connections into their datacentres, and they often provide multiple availability zones in each region. Take Amazon's eu-west-2 (London) region: it has 3 availability zones, eu-west-2a, eu-west-2b and eu-west-2c. These are 3 physically separate datacentres that all interconnect to provide a single region. By deploying services across these availability zones (AZs) we increase our redundancy and resilience against events such as a fire in one of those facilities. If you run a homelab you can distribute work over failure zones too, by running more than one computer, placing those computers in separate parts of the house, or buying 2 internet connections to stay online if one goes down. (That does mean checking that the connections don't just run over the same infrastructure as soon as they leave your door. At my house I can get fibre to the house and also a connection over the old phone lines.)
## Replication
This links nicely into having multiple replicas of our "stuff". It doesn't just mean running 2 or 3 copies of our application across the previously mentioned availability zones; we also need to consider our data. If we ran a database in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline, potentially permanently? There have been instances of cloud datacentres being completely destroyed by fire, where many customers had not been backing up their data or running across multiple AZs. Your data in that case is… gone. Do you think the business you work in could survive if its data, or its customers' data, just disappeared overnight?

Many of our data storage tools can be configured for replication across multiple failure zones. NATS.io, which we used previously, can be configured to replicate messages across multiple instances to survive the failure of one or more zones. PostgreSQL databases can be configured to stream their WAL (Write-Ahead Log), which stores an in-order history of all transactions, to a standby instance running somewhere else. If our primary fails, the most we would lose is the portion of the WAL that had not yet been transferred to the replica: much less data loss than if we had no replication at all.
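
As a rough illustration, here is what that could look like on the NATS side using JetStream. This is a sketch only: it assumes a 3-node NATS cluster spread across zones, and the stream name and subject below are invented for this example.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Ask JetStream to keep 3 copies of every message, each on a different
	// server, so losing one node (or the zone it runs in) loses no data.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "PDF_REQUESTS",           // hypothetical stream name
		Subjects: []string{"pdf.requests"}, // hypothetical subject
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Setting `Replicas: 3` trades a little extra storage and write latency for the ability to lose a node without losing messages, which is the same trade-off PostgreSQL makes when shipping its WAL to a standby.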
## Contract Testing
We are going to change direction now and look at making our applications more reliable. This means reducing the chance of failures. You may appreciate that the time your system is most likely to fail is during deployments. New code hits our system, and if it has not been tested thoroughly in production-like environments then we may end up with undefined behaviour.

There's a concept called contract testing where we can test the interfaces between our applications at development and build time. This allows us to rapidly get feedback (a core principle of DevOps) on our progress.
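
Below is a very stripped-down sketch of the idea in Go (tools such as Pact formalise this far better; the response type and JSON fields here are invented for illustration): the consumer writes down exactly which fields it relies on, and the test fails the build if a sample provider response no longer satisfies that expectation.

```go
package contract_test

import (
	"encoding/json"
	"testing"
)

// PDFJobResponse is the consumer's "contract": the fields it depends on
// from the provider's response (hypothetical names for this sketch).
type PDFJobResponse struct {
	JobID  string `json:"job_id"`
	Status string `json:"status"`
}

func TestProviderHonoursPDFJobContract(t *testing.T) {
	// In a real setup this payload would be produced by the provider's
	// build rather than hard-coded; here it stands in for that response.
	providerPayload := []byte(`{"job_id":"123","status":"queued","extra":"ignored"}`)

	var resp PDFJobResponse
	if err := json.Unmarshal(providerPayload, &resp); err != nil {
		t.Fatalf("response does not match the consumer contract: %v", err)
	}
	if resp.JobID == "" || resp.Status == "" {
		t.Fatalf("contract fields missing from response: %+v", resp)
	}
}
```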
## Async programming and queues
We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our changes become smaller, less likely to impact other areas of our application, and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, then we can place a message onto a queue and wait for the consumer of that message to eventually get round to it. Imagine a situation where thousands of people request their PDF at once: an application that handled this synchronously would just fall over, run out of memory and collapse. This would leave all our customers without their PDFs and needing to make the request again and wait for them to be generated. Without developer intervention we may not get back to a state where the service can recover.

Using a queue, we can work away in the background generating these PDFs and can even scale the service automatically when we notice the queue getting longer. Each new application instance simply takes the next message from the queue and works through the backlog. Once demand drops, we can automatically scale these services back down when our queue depth reaches 0.
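
A minimal sketch of that pattern with the NATS client used earlier in this section (the subject, queue group and payload names are invented here): the producer publishes and returns straight away, and any number of workers in the same queue group share the backlog between them.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Worker: every instance subscribing with the same queue group name
	// ("pdf-workers") takes the next message off the subject, so adding
	// more instances drains the backlog faster.
	if _, err := nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
		log.Printf("generating PDF for %s", string(m.Data))
		// ... the slow PDF generation would happen here ...
	}); err != nil {
		log.Fatal(err)
	}

	// Producer: an HTTP handler would just publish the request and return
	// immediately rather than making the user wait for the PDF.
	if err := nc.Publish("pdf.requests", []byte("invoice-42")); err != nil {
		log.Fatal(err)
	}

	time.Sleep(time.Second) // give the worker a moment to run in this demo
}
```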
## Logging and Tracing
It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps is thinking about what we can do to make our lives less painful when they do. Logging and tracing are some really important tools in our toolbox for keeping ourselves happy.

If we log frequently, with both success and failure messages, we can follow the progress of requests and customer journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on the applications that are logging warning or error messages. My general rule is that you can't log too much; it's just not possible! It is, however, important to use a logging framework that lets you tune the log level that gets printed. For example, with the 5 common log levels (TRACE, DEBUG, INFO, WARN and ERROR), every application should have a mechanism to set the level of logs it prints. Most of the time we only want WARN and ERROR to be visible in the logs, to quickly show us there's an issue. However, if we are in development or debugging a specific application, it's important to be able to turn up the verbosity and see those INFO, DEBUG or TRACE levels.
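
As one example of that, Go's standard log/slog package (assuming Go 1.21 or newer; the LOG_LEVEL environment variable name is just a convention picked for this sketch) lets each application decide at startup how verbose its logs should be:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to WARN so only warnings and errors reach the logs.
	level := slog.LevelWarn
	if os.Getenv("LOG_LEVEL") == "DEBUG" {
		level = slog.LevelDebug // turn the verbosity up while debugging
	}

	logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: level}))

	logger.Debug("connecting to queue", "subject", "pdf.requests") // hidden by default
	logger.Info("worker started", "queue_group", "pdf-workers")    // hidden by default
	logger.Warn("queue depth growing", "depth", 1500)              // always printed
	logger.Error("failed to generate PDF", "invoice", "invoice-42")
}
```

slog has no built-in TRACE level, but the same approach applies with frameworks that do.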
Tracing is the concept of attaching a unique identifier to a request in our system, which is then propagated and logged throughout that request's journey. We might see an HTTP request hit a load balancer and get assigned a correlationID, and then we want to see that correlationID in every log line as the request's actions percolate through our system.
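
Here is a small sketch of that idea using Go's standard HTTP library (the X-Correlation-ID header is a common convention rather than anything specific to our application, and the /pdf handler is invented for illustration): middleware assigns an ID if the request doesn't already carry one, and every log line for that request includes it.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// withCorrelationID makes sure every request carries an ID we can search for
// in the logs of every service the request passes through.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			b := make([]byte, 8)
			_, _ = rand.Read(b)
			id = hex.EncodeToString(b)
			r.Header.Set("X-Correlation-ID", id)
		}
		w.Header().Set("X-Correlation-ID", id)
		log.Printf("correlation_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/pdf", func(w http.ResponseWriter, r *http.Request) {
		// Downstream calls would forward the same header so the ID follows
		// the request through the rest of the system.
		log.Printf("correlation_id=%s queueing PDF request", r.Header.Get("X-Correlation-ID"))
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(mux)))
}
```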
## Conclusion
It's hard to build a fully fault-tolerant system. It involves learning from our own, and other people's, mistakes. I have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) to spread replicas around. We won't in this example, as we have much more to cover!