System reliability is the probability that a system will perform correctly for a specified period of time. A system is reliable when it meets its defined performance specifications and requires no repair during that period.
Hardware degrades over time, which reduces a system's reliability. Software reliability, on the other hand, is harder to measure: responses to client requests could slow down yet still be accurate.
A reliable system should continue working even when the software or hardware components fail. Any failing component should be replaced immediately with a healthy one to ensure the completion of a requested task.
For instance, one of the primary requirements of a large online store like Amazon is that a transaction should never be canceled because the node running it failed.
If a user adds an item to a shopping cart and proceeds to payment, the system is expected not to lose the cart even if the server handling the transaction fails. A reliable system should therefore be fault tolerant, i.e., it should detect failures and migrate the transaction to a redundant server for completion. A resilient system should also eliminate every single point of failure.
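The failover behavior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each server is modeled as a plain function, and the names run_with_failover, ServerFailure, failing_server and healthy_server are hypothetical, not part of any real library. A production system would also need failure detection across a network, timeouts, and safeguards against running the same transaction twice.

```python
from typing import Callable, Sequence


class ServerFailure(Exception):
    """Raised by a (hypothetical) server that cannot finish a task."""


def run_with_failover(task: str, servers: Sequence[Callable[[str], str]]) -> str:
    """Try each redundant server in order until one completes the task."""
    for server in servers:
        try:
            return server(task)
        except ServerFailure:
            # Failure detected: migrate the task to the next redundant server.
            continue
    raise RuntimeError("all redundant servers failed")


def failing_server(task: str) -> str:
    # Stand-in for a node that crashes while handling the transaction.
    raise ServerFailure("node crashed")


def healthy_server(task: str) -> str:
    # Stand-in for a healthy redundant node.
    return f"completed: {task}"


result = run_with_failover("checkout cart 42", [failing_server, healthy_server])
print(result)  # completed: checkout cart 42
```

From the caller's point of view the first server's crash is invisible: the task still completes, which is the behavior the shopping cart example asks for.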
A common way to measure reliability is Mean Time Between Failures (MTBF). MTBF is the average time a system runs between breakdowns; the higher the MTBF, the more reliable the system.
MTBF is calculated by taking the total time a system is running (uptime) and dividing it by the number of failures. For instance, if a system is observed over a 100-hour period and breaks down twice, once for 3 hours and once for 4 hours, its uptime is 100 hours minus 7 hours of downtime, and the MTBF can be calculated as follows:
MTBF = (100 hours - 7 hours) / 2 breakdowns = 93 hours / 2 breakdowns = 46.5 hours
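The same calculation can be written as a small Python function. This is a minimal sketch: the function name mtbf and its parameters are chosen here for illustration, and it assumes the observation window and every outage are measured in hours.

```python
def mtbf(total_hours: float, downtime_hours: list[float]) -> float:
    """Mean Time Between Failures: uptime divided by the number of failures."""
    if not downtime_hours:
        raise ValueError("MTBF is undefined when no failures were observed")
    # Uptime is the observation window minus the time spent broken down.
    uptime = total_hours - sum(downtime_hours)
    return uptime / len(downtime_hours)


# A 100-hour window with two breakdowns lasting 3 and 4 hours.
print(mtbf(100, [3, 4]))  # 46.5
```

Passing the individual outage durations, rather than a single downtime total, gives the function both numbers it needs: their sum is the downtime and their count is the number of failures.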