Chapter 5. Reliability metrics
Reliability metrics express the reliability of a software product in quantitative terms. Which metric you use depends on the type of system the metric applies to and on the requirements of the application domain. From a Site Reliability Engineering perspective, there are a few key metrics to focus on for Java applications.
Mean Time to Failure
Mean Time to Failure (MTTF) is the average time interval between two successive failures. The time units you use to measure MTTF depend on the system, and MTTF can also be expressed as a number of transactions. For systems with large transactions, MTTF is typically consistent.
Mean Time to Repair
Mean Time to Repair (MTTR) is the average time it takes to track down the errors that cause a failure and repair them.
Mean Time Between Failures
Mean Time Between Failures (MTBF) is the sum of the MTTF and MTTR metrics: MTBF = MTTF + MTTR. MTBF is measured in real (wall-clock) time, not in the execution time that is used for MTTF.
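As a minimal sketch of how MTTF, MTTR, and MTBF relate, the following Java class averages a list of failure cycles. The class name FailureTimingSketch and the Cycle type are hypothetical and exist only for illustration. The sketch also assumes that every duration is recorded on a single time base, which simplifies the real-time versus execution-time distinction.

```java
import java.time.Duration;
import java.util.List;

/** Sketch: derive MTTF, MTTR, and MTBF from a hypothetical list of failure cycles. */
final class FailureTimingSketch {

    /** Hypothetical value type: time in service before a failure, then time spent repairing it. */
    static final class Cycle {
        final Duration uptime;
        final Duration repair;

        Cycle(Duration uptime, Duration repair) {
            this.uptime = uptime;
            this.repair = repair;
        }
    }

    /** Mean time to failure: the average uptime per cycle. */
    static Duration mttf(List<Cycle> cycles) {
        requireNonEmpty(cycles);
        Duration total = Duration.ZERO;
        for (Cycle c : cycles) {
            total = total.plus(c.uptime);
        }
        return total.dividedBy(cycles.size());
    }

    /** Mean time to repair: the average repair time per cycle. */
    static Duration mttr(List<Cycle> cycles) {
        requireNonEmpty(cycles);
        Duration total = Duration.ZERO;
        for (Cycle c : cycles) {
            total = total.plus(c.repair);
        }
        return total.dividedBy(cycles.size());
    }

    /** Mean time between failures: MTTF plus MTTR. */
    static Duration mtbf(List<Cycle> cycles) {
        return mttf(cycles).plus(mttr(cycles));
    }

    private static void requireNonEmpty(List<Cycle> cycles) {
        if (cycles.isEmpty()) {
            throw new IllegalArgumentException("at least one failure cycle is required");
        }
    }
}
```

For example, two cycles of 100 hours up with 2 hours of repair and 140 hours up with 4 hours of repair give an MTTF of 120 hours, an MTTR of 3 hours, and an MTBF of 123 hours.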
Rate of Occurrence of Failure
The Rate of Occurrence of Failure (ROCOF) is the number of failures that occur in a unit time interval. It focuses on the likelihood of frequently occurring, unexpected events.
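The ROCOF definition reduces to a single division, which can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes that you already have a failure count and the length of the observation window in your chosen time unit.

```java
/** Sketch: rate of occurrence of failure as failures per unit of observed time. */
final class RocofSketch {

    /**
     * @param failureCount      number of failures seen in the observation window
     * @param observedTimeUnits length of the window in the chosen unit (for example, hours)
     * @return failures per time unit
     */
    static double rocof(long failureCount, double observedTimeUnits) {
        if (failureCount < 0) {
            throw new IllegalArgumentException("failureCount must not be negative");
        }
        if (!(observedTimeUnits > 0)) {
            throw new IllegalArgumentException("observedTimeUnits must be positive");
        }
        return failureCount / observedTimeUnits;
    }
}
```

For example, 2 failures observed over 1,000 operating hours give a ROCOF of 0.002 failures per hour.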
Probability of Failure on Demand
Probability of Failure on Demand (POFOD) is the probability that a system will fail when a service request is made. POFOD is an essential measure for safety-critical systems, and it is relevant for protection systems whose services are demanded only occasionally.
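One way to estimate POFOD from observed data is to divide the number of failed requests by the total number of requests. The following Java sketch shows that estimate. The names are hypothetical, and the result is only an empirical approximation of the underlying probability.

```java
/** Sketch: empirical probability of failure on demand. */
final class PofodSketch {

    /**
     * @param failedDemands number of service requests that ended in failure
     * @param totalDemands  total number of service requests made
     * @return the observed fraction of demands that failed, between 0.0 and 1.0
     */
    static double pofod(long failedDemands, long totalDemands) {
        if (totalDemands <= 0) {
            throw new IllegalArgumentException("totalDemands must be positive");
        }
        if (failedDemands < 0 || failedDemands > totalDemands) {
            throw new IllegalArgumentException("failedDemands must be between 0 and totalDemands");
        }
        return (double) failedDemands / totalDemands;
    }
}
```

For example, 1 failure in 1,000 requests gives a POFOD of 0.001.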
Availability
Availability measures the probability that the system is available for use at any given time. The measure must take into account both the repair time and the restart time for the system.
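A common way to estimate availability is the ratio MTTF / (MTTF + MTTR). The following Java sketch uses that ratio and assumes that restart time is folded into the MTTR value you pass in. The class and method names are hypothetical.

```java
import java.time.Duration;

/** Sketch: steady-state availability estimated from MTTF and MTTR. */
final class AvailabilitySketch {

    /**
     * @param mttf mean time to failure
     * @param mttr mean time to repair, assumed here to include restart time
     * @return estimated fraction of time the system is available, between 0.0 and 1.0
     */
    static double availability(Duration mttf, Duration mttr) {
        long up = mttf.toMillis();
        long down = mttr.toMillis();
        if (up < 0 || down < 0 || up + down == 0) {
            throw new IllegalArgumentException("durations must be non-negative and not both zero");
        }
        return (double) up / (up + down);
    }

    /** Expected downtime over a period for a given availability, rounded to the nearest millisecond. */
    static Duration expectedDowntime(double availability, Duration period) {
        if (availability < 0.0 || availability > 1.0) {
            throw new IllegalArgumentException("availability must be between 0.0 and 1.0");
        }
        long downMillis = Math.round((1.0 - availability) * period.toMillis());
        return Duration.ofMillis(downMillis);
    }
}
```

For example, an MTTF of 999 hours and an MTTR of 1 hour give an availability of 0.999, which corresponds to roughly 8.76 hours of expected downtime over a 365-day year.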