mirror of
https://github.com/MichaelCade/90DaysOfDevOps.git
synced 2025-01-13 00:04:57 +07:00
0ba59699e1
Signed-off-by: Alistair Hey <alistair@heyal.co.uk>
# Designing for Resilience, Redundancy and Reliability

We now have an application that communicates using asynchronous, queue-based messaging. This gives us some real
flexibility in how we design our system to withstand failures. We are going to look at failure zones, replication,
contract testing, logging and tracing to build more robust systems.

## Failure Zones

Imagine building an application and deploying everything onto a single VM or server. What happens when, inevitably, this
server fails? Your application goes offline and your customers won’t be happy! There’s a fix for this: spread your
workloads over multiple failure zones so that no single failure takes everything down. This doesn’t just apply to your
application instances; you can build redundancy into every aspect of your system.

Ever wonder what large cloud providers do to keep their services running despite external and unpredictable factors?
For starters, they will have generators on-site for when they inevitably suffer a power cut. They have at least two
internet connections into their datacenters and they often provide multiple availability zones in each region. Take
Amazon’s eu-west-2 (London) region. This has 3 availability zones: eu-west-2a, eu-west-2b and eu-west-2c. These are 3
physically separate locations (each made up of one or more datacenters) that all interconnect to provide a single
region. This means that by deploying services across these availability zones (AZs) we increase our redundancy and
resilience against events such as a fire in one of these facilities. If you run a homelab you could distribute work over
failure zones by running more than one computer, placing these computers in separate parts of the house, or buying 2
internet connections to stay online if one goes down. (This does mean checking that these connections don’t just run on
the same infrastructure as soon as they leave your door. At my house I can get fibre and also a connection over the old
phone lines.)
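To get a rough feel for why spreading across zones helps, here is a back-of-the-envelope sketch. It assumes every zone fails independently and with the same probability, which is a simplification (real outages can be correlated), and the 1% figure is invented purely for illustration.

```python
def probability_all_zones_down(zone_failure_probability: float, zones: int) -> float:
    """Chance that every zone is down at once, assuming independent failures."""
    if not 0.0 <= zone_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if zones < 1:
        raise ValueError("need at least one zone")
    # Independent events: multiply the per-zone probability once per zone.
    return zone_failure_probability ** zones


if __name__ == "__main__":
    # Hypothetical figure: each zone is unavailable 1% of the time.
    for zones in (1, 2, 3):
        p = probability_all_zones_down(0.01, zones)
        print(f"{zones} zone(s): everything down {p:.6%} of the time")
```

Under these assumptions each extra zone multiplies the chance of a total outage by the per-zone failure probability, which is why the same reasoning applies to power feeds, internet connections and homelab machines.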

## Replication

This links nicely into having multiple replicas of our “stuff”. This doesn’t just mean running 2 or 3 copies of our
application across our previously mentioned availability zones; we also need to consider our data. If we ran a database
in AZ1 and our application code across AZ1, AZ2 and AZ3, what do you think would happen if AZ1 were to go offline,
potentially permanently? There have been instances of cloud datacenters being completely destroyed by fire, and many
customers had not been backing up their data or running across multiple AZs. Your data in this case is… gone. Do you
think the business you work in could survive if its data, or its customers’ data, just disappeared overnight?

Many of our data storage tools can be configured for replication across multiple failure zones. NATS.io, which we used
previously, can be configured to replicate messages across multiple instances to survive the failure of one or more
zones. PostgreSQL databases can be configured to stream their WAL (write-ahead log), which stores an in-order history
of all transactions, to a standby instance running somewhere else. If our primary fails then the most we would lose is
the data in our WAL that had not yet been transferred to the replica. That is much less data loss than if we didn’t
have any replication.
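The idea that we only lose whatever had not yet reached the replica can be shown with a toy model. This is not how PostgreSQL implements streaming replication; `Primary`, `Standby` and `records_lost_on_failover` are hypothetical names used only to illustrate the principle.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary: every write is appended to an in-order log (the 'WAL')."""
    wal: list = field(default_factory=list)

    def write(self, record: str) -> None:
        self.wal.append(record)


@dataclass
class Standby:
    """Toy standby: holds the WAL records it has received so far."""
    applied: list = field(default_factory=list)

    def stream_from(self, primary: Primary, max_records: int) -> None:
        # Copy at most max_records not-yet-received records, preserving order.
        start = len(self.applied)
        self.applied.extend(primary.wal[start:start + max_records])


def records_lost_on_failover(primary: Primary, standby: Standby) -> list:
    """Records that exist only on the primary and vanish if it is lost right now."""
    return primary.wal[len(standby.applied):]


if __name__ == "__main__":
    primary, standby = Primary(), Standby()
    for i in range(5):
        primary.write(f"tx-{i}")
    standby.stream_from(primary, max_records=3)  # the replica lags behind
    print("lost if the primary dies now:", records_lost_on_failover(primary, standby))
```

The smaller the replication lag, the shorter that list gets; with no standby at all, the whole log is at risk.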

## Contract Testing

We are going to change direction now and look at making our applications more reliable. This means reducing the chance
of failures. You may appreciate that the time most likely to cause failures in your system is during deployments. New
code hits our system and, if it has not been tested thoroughly in production-like environments, we may end up with
undefined behaviour.

There’s a concept called contract testing, where we test the interfaces between our applications at development and
build time, checking that what one application sends still matches what the other expects. This allows us to rapidly
get feedback (a core principle of DevOps) on our progress.
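As a minimal, hand-rolled sketch of the idea (not how any particular contract-testing tool works), a consumer can write down the fields it relies on and the producer's build can check its messages against that list. The message shape, field names and `build_pdf_request` function are all assumptions made up for this example.

```python
# Hypothetical contract: the fields (and types) a consumer relies on
# in a "PDF requested" message. Names are illustrative only.
PDF_REQUEST_CONTRACT = {
    "request_id": str,
    "user_id": int,
    "document": str,
}


def contract_violations(message: dict, contract: dict) -> list:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in message:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(message[field_name], expected_type):
            actual = type(message[field_name]).__name__
            problems.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


def build_pdf_request(request_id: str, user_id: int, document: str) -> dict:
    """Stand-in for the producer code that would publish this message."""
    return {"request_id": request_id, "user_id": user_id, "document": document}


if __name__ == "__main__":
    message = build_pdf_request("req-1", 42, "invoice")
    # Run at build time: fail fast if the producer drifts from the contract.
    assert contract_violations(message, PDF_REQUEST_CONTRACT) == []
    print("producer satisfies the consumer's contract")
```

If someone later renames `user_id` or changes its type in the producer, this check fails in the build rather than in production.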

## Async programming and queues

We have already discussed how breaking the dependencies within our system can lead to increased reliability. Our
changes become smaller, less likely to impact other areas of our application, and easier to roll back.

If our users are not expecting an immediate response to an event, such as requesting that a PDF be generated, then we
can place a message onto a queue and wait for the consumer of that message to eventually get round to it. We could see
a situation where thousands of people request their PDF at once, and an application that does this synchronously would
just fall over, run out of memory and collapse. This would leave all our customers without their PDFs and needing to
request them again. Without developer intervention we may not get back to a state where the service can recover.

Using a queue we can slowly work away in the background to generate these PDFs, and could even scale the service
automatically when we notice the queue getting longer. Each new instance of the application would just take the next
message from the queue and work through the backlog. Once we were seeing less demand, we could automatically scale
these services back down as our queue depth reached 0.
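The scaling behaviour described above can be sketched in a single process. Here threads stand in for separate consumer instances and an in-memory `queue.Queue` stands in for the message broker; the `messages_per_worker` and `max_workers` thresholds are arbitrary assumptions, and a real autoscaler would re-evaluate the queue depth continuously rather than once.

```python
import math
import queue
import threading


def desired_workers(queue_depth: int, messages_per_worker: int, max_workers: int) -> int:
    """Scale with the backlog; drop to zero workers when the queue is empty."""
    if queue_depth <= 0:
        return 0
    return min(max_workers, math.ceil(queue_depth / messages_per_worker))


def worker(jobs: queue.Queue, done: list, lock: threading.Lock) -> None:
    """Keep taking the next message until the backlog is cleared."""
    while True:
        try:
            request = jobs.get_nowait()
        except queue.Empty:
            return
        result = f"{request}.pdf"  # stand-in for the slow PDF generation
        with lock:
            done.append(result)
        jobs.task_done()


def drain(requests: list, messages_per_worker: int = 100, max_workers: int = 8) -> list:
    """Queue every request, start as many workers as the depth calls for, and wait."""
    jobs: queue.Queue = queue.Queue()
    for request in requests:
        jobs.put(request)
    done: list = []
    lock = threading.Lock()
    count = desired_workers(jobs.qsize(), messages_per_worker, max_workers)
    threads = [threading.Thread(target=worker, args=(jobs, done, lock)) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


if __name__ == "__main__":
    finished = drain([f"req-{i}" for i in range(1000)])
    print(f"generated {len(finished)} PDFs")
```

Because the requests sit in the queue rather than in the memory of a synchronous handler, a burst only makes the backlog longer; it does not take the service down.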

## Logging and Tracing

It goes without saying that our applications and systems will fail. What we are doing in this section of 90DaysOfDevOps
is thinking about what we can do to make our lives less bad when they do. Logging and tracing are really important
tools in our toolbox to keep ourselves happy.

If we log frequently, with both success and failure messages, we can follow the progress of our requests and customer
journeys through our system. Then, when things go wrong, we can quickly rule out specific services or focus on
applications that are logging warning or error messages. My general rule is that you can’t log too much - it’s not
possible! It is, however, important to use a logging framework that lets you tune the log level that gets printed to
your logs. For example, if we have 5 log levels (as is common), TRACE, DEBUG, INFO, WARN and ERROR, we should have a
mechanism for every application to set the level of logs we want to print. Most of the time we only want WARN and ERROR
to be visible in logs, to quickly show us there’s an issue. However, if we are in development or debugging a specific
application, it’s important to be able to turn up the verbosity of our logs and see those INFO, DEBUG or TRACE levels.

Tracing is the concept of attaching a unique identifier to a request in our system that then gets propagated and logged
throughout that request’s journey. We could see an HTTP request hit a load balancer and get a correlationID, and then
we want to see that correlationID in every log line as our request’s actions percolate through our system.
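Both ideas, a tunable log level and a correlationID on every log line, can be sketched with Python's standard `logging` module. The `LOG_LEVEL` environment variable, the `pdf-service` logger name and the log messages are assumptions for illustration. Python's standard library has no built-in TRACE level, so this sketch stops at DEBUG.

```python
import contextvars
import logging
import os
import uuid

# Holds the correlation ID for whichever request is currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def configure_logging(logger_name: str = "pdf-service") -> logging.Logger:
    # LOG_LEVEL is an assumed variable name; unknown values fall back to WARNING.
    level = LEVELS.get(os.environ.get("LOG_LEVEL", "WARN").upper(), logging.WARNING)
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s correlationID=%(correlation_id)s %(message)s")
    )
    logger = logging.getLogger(logger_name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Assign an ID at the edge; every log line written below carries it."""
    request_id = str(uuid.uuid4())
    token = correlation_id.set(request_id)
    try:
        logger.info("request received")
        logger.warning("PDF generation is slow")
        return request_id
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(configure_logging())
```

Run with the default settings, only the warning is printed; setting `LOG_LEVEL=INFO` also shows the informational line, and both carry the same correlationID. Across real services the ID would additionally need to travel with the request, for example in a message header.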

## Conclusion

It’s hard to build a fully fault-tolerant system. It involves learning from our own, and other people’s, mistakes. I
have personally made many changes to company infrastructure after we discovered a previously unknown failure zone in
our application.

We could change our application running in Kubernetes to tolerate our underlying infrastructure’s failure zones by
leveraging [topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
to spread replicas around. We won’t in this example as we have much more to cover!