Add day 86
Signed-off-by: Alistair Hey <alistair@heyal.co.uk>
This commit is contained in:
parent
8657e7f413
commit
0ba59699e1
2023.md
@@ -158,7 +158,7 @@ Or contact us via Twitter, my handle is [@MichaelCade1](https://twitter.com/Mich
 - [] 👷🏻‍♀️ 84 > [](2023/day84.md)
 - [] 👷🏻‍♀️ 85 > [](2023/day85.md)
-- [] 👷🏻‍♀️ 86 > [](2023/day86.md)
+- [] 👷🏻‍♀️ 86 > [Designing for Resilience, Redundancy and Reliability](2023/day86.md)
 - [] 👷🏻‍♀️ 87 > [](2023/day87.md)
 - [] 👷🏻‍♀️ 88 > [](2023/day88.md)
 - [] 👷🏻‍♀️ 89 > [](2023/day89.md)
2023/day86.md (new file)
@@ -0,0 +1,95 @@
# Designing for Resilience, Redundancy and Reliability

We now have an application which uses asynchronous, queue-based messaging to communicate. This gives us some real flexibility in how we design our system to withstand failures. We are going to look at the concepts of Failure Zones, Replication, Contract Testing, Logging and Tracing to build more robust systems.
## Failure Zones

Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, that server fails? Your application goes offline and your customers won't be happy! There is a fix for this: spread your workloads over multiple points of failure. This doesn't just apply to your application instances; you can build multiple redundant points for every aspect of your system.

Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors? For starters, they have generators on-site for when they inevitably suffer a power cut. They have at least two internet connections into the datacentres, and they often provide multiple availability zones in each region. Take Amazon's eu-west-2 (London) region. It has 3 availability zones: eu-west-2a, eu-west-2b and eu-west-2c. These are 3 physically separated datacentres that all inter-connect to provide a single region. By deploying services across these availability zones (AZs) we increase our redundancy and resilience against events such as a fire in one of those facilities. If you run a homelab you could distribute work over failure zones by running more than one computer, placing those computers in separate parts of the house, or buying 2 internet connections to stay online if one goes down (this does mean checking that the connections don't just run over the same infrastructure as soon as they leave your door; at my house I can get fibre as well as a connection over the old phone lines).
## Replication

This links nicely into having multiple replicas of our "stuff". This doesn't just mean running 2 or 3 copies of our application across the previously mentioned availability zones; we also need to consider our data. If we ran a database in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 went offline, potentially permanently? There have been instances of cloud datacentres being completely destroyed by fire, and many customers had not been backing up their data or running across multiple AZs. Your data in that case is… gone. Do you think the business you work in could survive if their data, or their customers' data, just disappeared overnight?

Many of our data storage tools can be configured for replication across multiple failure zones. NATS.io, which we used previously, can be configured to replicate messages across multiple instances to survive the failure of one or more zones. PostgreSQL databases can be configured to stream their WAL (write-ahead log), which stores an in-order history of all transactions, to a standby instance running somewhere else. If our primary fails then the most we would lose is the data in the WAL that had not yet been transferred to the replica: much less data loss than if we had no replication at all.
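
As a rough sketch of the NATS side of this (the cluster URLs, stream name and replica count below are illustrative assumptions, using the official `nats.go` client), a JetStream stream can be asked to keep a copy of every message on several servers:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a NATS cluster; the URLs and 3-node cluster are assumptions for this sketch.
	nc, err := nats.Connect("nats://nats-0:4222,nats://nats-1:4222,nats://nats-2:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Replicas: 3 asks JetStream to store every message on 3 servers,
	// so the stream survives the loss of a single failure zone.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "PDF_JOBS", // hypothetical stream name
		Subjects: []string{"pdf.requests"},
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

With 3 replicas, losing a single zone still leaves two copies of every queued message, so consumers can keep working while the failed zone recovers.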

## Contract Testing

We are going to change direction now and look at making our applications more reliable. This means reducing the chance of failures. You may appreciate that the time most likely to cause failures in your system is during deployments. New code hits our system, and if it has not been tested thoroughly in production-like environments then we may end up with undefined behaviours.

There's a concept called contract testing where we can test the interfaces between our applications at development and build time. This allows us to rapidly get feedback (a core principle of DevOps) on our progress.
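
There are dedicated tools for this (Pact is a well-known one), but as a minimal hand-rolled sketch of the idea (the endpoint, field names and handler below are illustrative assumptions, using only Go's standard `httptest` package): the consumer encodes the response shape it depends on as a test, so a provider change that breaks the contract fails at build time rather than in production.

```go
package contract

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// pdfStatusResponse is the shape the consumer relies on; renaming or removing
// a field on the provider side should make this test fail.
type pdfStatusResponse struct {
	JobID  string `json:"job_id"`
	Status string `json:"status"`
}

// providerHandler stands in for the real provider; in a real contract test this
// would be the provider's actual handler or a recorded interaction.
func providerHandler(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(pdfStatusResponse{JobID: "1234", Status: "queued"})
}

func TestPDFStatusContract(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(providerHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/pdf/1234/status")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	var got pdfStatusResponse
	if err := json.NewDecoder(resp.Body).Decode(&got); err != nil {
		t.Fatalf("response no longer matches the agreed contract: %v", err)
	}
	if got.JobID == "" || got.Status == "" {
		t.Fatalf("contract fields missing from response: %+v", got)
	}
}
```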

## Async programming and queues

We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our changes become smaller, less likely to impact other areas of our application and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, we can place a message onto a queue and wait for the consumer of that message to eventually get round to it. We could see a situation where thousands of people request their PDF at once; an application that handled this synchronously would just fall over, run out of memory and collapse. That would leave all our customers without their PDFs and needing to take action again to wait for them to be generated. Without developer intervention we may not get back to a state where the service can recover.

Using a queue we can slowly work away in the background to generate these PDFs, and could even scale the service automatically when we notice the queue getting longer. Each new application instance would just take the next message from the queue and work through the backlog. Once demand drops, we could automatically scale these services back down when our queue depth reaches 0.
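
As a rough sketch of that pattern (the subject name `pdf.requests`, queue group `pdf-workers` and single-process demo are illustrative assumptions, again using the `nats.go` client): the producer only enqueues work, and every worker replica in the same queue group shares the backlog.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Worker side: every replica joins the same queue group, so each request
	// is handled by exactly one worker and the backlog is shared between them.
	_, err = nc.QueueSubscribe("pdf.requests", "pdf-workers", func(m *nats.Msg) {
		log.Printf("generating PDF for request: %s", string(m.Data))
		// ... render the PDF in the background here ...
	})
	if err != nil {
		log.Fatal(err)
	}

	// Producer side: the web handler would only publish the request and return
	// immediately, so a spike of requests just makes the queue longer.
	if err := nc.Publish("pdf.requests", []byte(`{"customer":"1234"}`)); err != nil {
		log.Fatal(err)
	}

	time.Sleep(time.Second) // give the worker a moment to drain the queue in this single-process demo
}
```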

## Logging and Tracing

It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps is thinking about what we can do to make our lives less bad when they do. Logging and tracing are really important tools in our toolbox for keeping ourselves happy.

If we log frequently, with both success and failure messages, we can follow the progress of our requests and customer journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on the applications that are logging warning or error messages. My general rule is that you can't log too much - it's not possible! It is, however, important to use a logging framework that lets you tune the log level that gets printed. For example, if I have 5 log levels (as is common) - TRACE, DEBUG, INFO, WARN and ERROR - we should have a mechanism for every application to set the level of logs we want to print. Most of the time we only want WARN and ERROR to be visible in our logs, to quickly show us there's an issue. However, if we are developing or debugging a specific application it's important to be able to turn up the verbosity and see those INFO, DEBUG or TRACE messages.
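
As an illustrative sketch of that kind of knob (the `LOG_LEVEL` environment variable and the use of Go's standard `log/slog` package are assumptions, not something we have set up so far): the level is read at startup, so the same binary runs quietly in production and verbosely while debugging.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to WARN so only warnings and errors are visible in normal operation.
	level := slog.LevelWarn
	// Turn verbosity up or down via an environment variable (an assumption for
	// this sketch), e.g. LOG_LEVEL=DEBUG while debugging a single service.
	switch os.Getenv("LOG_LEVEL") {
	case "TRACE", "DEBUG":
		level = slog.LevelDebug // slog has no TRACE level; map it to DEBUG
	case "INFO":
		level = slog.LevelInfo
	case "ERROR":
		level = slog.LevelError
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))
	slog.SetDefault(logger)

	slog.Debug("connecting to queue", "subject", "pdf.requests") // hidden unless LOG_LEVEL=DEBUG
	slog.Warn("queue depth is growing", "depth", 120)            // visible at the default level
}
```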

Tracing is the concept of attaching a unique identifier to a request in our system that then gets propagated and logged throughout that request's journey. We could see an HTTP request hit a load balancer, get a correlation ID, and then we want to see that correlation ID in every log line as the request's actions percolate through our system.
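
A minimal sketch of that idea (the header name `X-Correlation-ID` and the hand-rolled middleware are illustrative assumptions; real systems often use a tracing library instead): the middleware reuses an incoming ID or creates one, and every log line written for that request carries it.

```go
package main

import (
	"fmt"
	"log"
	"log/slog"
	"math/rand"
	"net/http"
)

// withCorrelationID reuses the caller's correlation ID if one was sent,
// otherwise generates one, and makes sure it appears in the response headers
// and in every log line written while handling the request.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			id = fmt.Sprintf("%08x", rand.Uint32()) // sketch only; use UUIDs in practice
		}
		w.Header().Set("X-Correlation-ID", id)
		slog.Info("request received", "correlation_id", id, "path", r.URL.Path)
		next.ServeHTTP(w, r)
		slog.Info("request finished", "correlation_id", id, "path", r.URL.Path)
	})
}

func main() {
	pdfHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "PDF generation queued")
	})
	log.Fatal(http.ListenAndServe(":8080", withCorrelationID(pdfHandler)))
}
```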

## Conclusion

It's hard to build a fully fault-tolerant system. It involves learning from our, and other people's, mistakes. I have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure's failure zones by leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) to spread replicas around. We won't in this example as we have much more to cover!