What is the difference between reliability and availability?
Availability measures the percentage of time a system is in an operable state, while reliability measures how long it performs its intended function without breaking down.
The two go hand in hand: an increase in reliability usually translates to an increase in availability. Even so, keep in mind that the two metrics can give different results for the same system.
You might have a highly available machine that is not reliable.
For example, consider a commercial blender that is operating close to its maximum capacity. The motor can run for several hours a day, which implies high availability.
However, it may need to cool down every half hour to resolve operational problems. Despite its high availability, the blender is not a highly reliable piece of equipment.
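To make the blender example concrete, here is a minimal Python sketch. It assumes availability is computed as uptime divided by total time, and it uses the average run time between interruptions as a rough reliability measure, treating each forced cool-down as an interruption. The 8-hour shift and 2-minute cool-down are hypothetical figures chosen for illustration; only the half-hour run time comes from the example above.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total time the system was operable."""
    return uptime_hours / (uptime_hours + downtime_hours)


def mean_time_between_interruptions(uptime_hours: float, interruptions: int) -> float:
    """Average run time between interruptions, a rough reliability measure."""
    return uptime_hours / interruptions


# Hypothetical blender: over an 8-hour shift it runs in 30-minute stretches,
# each followed by a 2-minute cool-down (shift length and cool-down are assumptions).
run_minutes, cool_minutes = 30, 2
cycles = (8 * 60) // (run_minutes + cool_minutes)  # full run + cool-down cycles
uptime = cycles * run_minutes / 60                 # hours spent running
downtime = cycles * cool_minutes / 60              # hours spent cooling

blender_availability = availability(uptime, downtime)
blender_mtbi = mean_time_between_interruptions(uptime, cycles)

print(f"availability: {blender_availability:.1%}")
print(f"mean time between interruptions: {blender_mtbi:.2f} hours")
```

Under these assumed numbers the blender is available well over 90% of the time, yet it never runs longer than half an hour without stopping, which is the gap between the two metrics that the example describes.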
Best practices to improve system availability and reliability
The goal of high availability is to minimize system downtime and the time needed to recover from an outage. The following practices help achieve this:
Build with failure in mind - Always plan for your application and services to fail. As Werner Vogels, the CTO of Amazon, says, “Everything fails all the time”. Design constructs such as simple try-catch blocks, retry logic, and circuit breakers let you catch errors, limit the scope of a problem, and keep your app working even when parts of it are failing. Circuit breaker patterns are especially useful for handling dependency failures, since they can greatly reduce the impact a failing dependency has on your system.
Always think about scaling - An application that receives a certain amount of traffic today might receive a lot more sooner than you anticipate. As you build your app, don’t build it for today’s traffic, but for tomorrow’s. Design it so that you can easily add servers and increase the size and capacity of your databases when needed.
Reduce single points of failure - Eliminate single points of failure from your application infrastructure. Since all hardware fails at some point, limit the impact that a failure will have on your application. This means having a backup for everything you expect could fail: servers, routers, switches, power sources, etc.
Monitor the application - Make sure your application is instrumented so that you can see how it is performing. Instrumentation tools monitor the health of servers, monitor the performance of applications and services, run synthetic tests (which examine in real time how the app is working from the user's perspective), and alert the appropriate personnel when problems occur so that they can be resolved quickly.
Respond to downtime in a predictable way - Monitoring is useless unless you are prepared to act on the issues it surfaces. Establish processes that your team follows to diagnose and fix common failure scenarios. Prepare these standard processes ahead of time so that, during an outage, the owner of the affected service is alerted and can restore it quickly.
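The retry logic and circuit breaker mentioned under "Build with failure in mind" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `call_with_retry`, and `flaky_dependency`, along with the thresholds and timeouts, are all hypothetical and chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call because the dependency looks down."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying would only hammer a dependency known to be down
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def flaky_dependency():
    """Stand-in for a failing downstream service (hypothetical)."""
    raise ConnectionError("service down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
try:
    # sleep is stubbed out so the example runs instantly.
    call_with_retry(lambda: breaker.call(flaky_dependency), attempts=5,
                    sleep=lambda seconds: None)
except CircuitOpenError:
    outcome = "failed fast"

print(outcome)
```

Here the first two attempts reach the failing dependency. Once the breaker opens, the third attempt is rejected immediately instead of waiting on a service that is known to be down, which is how a circuit breaker limits the impact of a dependency failure on the rest of the system.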