Chapter 5. Reliability metrics

Reliability metrics express the reliability of a software product in quantitative terms. Which metric you use depends on the type of system the metric applies to and on the requirements of the application domain. From a Site Reliability Engineering (SRE) perspective, there are a few key metrics to focus on for Java applications.

Mean Time to Failure

Mean Time to Failure (MTTF) is the average time interval between two successive failures. The units you use to measure MTTF depend on the system, and MTTF can also be expressed as a number of transactions instead of elapsed time. For systems with large transactions, MTTF is typically consistent.

Mean Time to Repair

Mean Time to Repair (MTTR) is the average time it takes to track down the errors that cause a failure and repair them.

Mean Time Between Failures

Combining the MTTF and MTTR metrics gives the Mean Time Between Failures (MTBF): MTBF = MTTF + MTTR. MTBF is measured in real (wall-clock) time, not in the execution time that is used for MTTF.
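The relationship between the three time-based metrics can be sketched in a few lines of Java. This is a minimal illustration, not a Red Hat or JDK API: the `ReliabilityTimes` class, the `Cycle` record, and the sample figures are all hypothetical, and the code assumes Java 16 or later and that you already record the uptime before each failure and the repair time after it.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper for illustration only.
final class ReliabilityTimes {

    // One failure cycle: how long the system ran before it failed,
    // and how long it then took to repair.
    record Cycle(Duration uptime, Duration repair) {}

    // MTTF: average uptime that preceded each failure.
    static Duration mttf(List<Cycle> cycles) {
        return average(cycles.stream().map(Cycle::uptime).toList());
    }

    // MTTR: average time spent repairing after each failure.
    static Duration mttr(List<Cycle> cycles) {
        return average(cycles.stream().map(Cycle::repair).toList());
    }

    // MTBF = MTTF + MTTR.
    static Duration mtbf(List<Cycle> cycles) {
        return mttf(cycles).plus(mttr(cycles));
    }

    private static Duration average(List<Duration> durations) {
        if (durations.isEmpty()) {
            throw new IllegalArgumentException("at least one cycle is required");
        }
        Duration total = Duration.ZERO;
        for (Duration d : durations) {
            total = total.plus(d);
        }
        return total.dividedBy(durations.size());
    }

    public static void main(String[] args) {
        // Made-up sample data: two failure cycles.
        List<Cycle> cycles = List.of(
                new Cycle(Duration.ofHours(100), Duration.ofHours(2)),
                new Cycle(Duration.ofHours(140), Duration.ofHours(4)));
        System.out.println("MTTF hours: " + mttf(cycles).toHours());
        System.out.println("MTTR hours: " + mttr(cycles).toHours());
        System.out.println("MTBF hours: " + mtbf(cycles).toHours());
    }
}
```

With the sample data, MTTF is 120 hours, MTTR is 3 hours, and MTBF is their sum, 123 hours.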

Rate of Occurrence of Failure

The Rate of Occurrence of Failure (ROCOF) is the number of failures that occur in a unit time interval. It focuses on the likelihood of frequently occurring, unexpected events.
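As a rough sketch, ROCOF can be estimated by dividing an observed failure count by the length of the observation window. The `Rocof` class and the choice of one hour as the unit interval are assumptions made for this example, not part of any standard API.

```java
import java.time.Duration;

// Hypothetical helper for illustration only.
final class Rocof {

    // Failures per hour over the observation window.
    static double failuresPerHour(long failureCount, Duration window) {
        if (failureCount < 0) {
            throw new IllegalArgumentException("failureCount must not be negative");
        }
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        double hours = window.toMillis() / 3_600_000.0;
        return failureCount / hours;
    }

    public static void main(String[] args) {
        // Made-up sample data: 6 failures observed over 1000 hours of operation.
        System.out.println(failuresPerHour(6, Duration.ofHours(1000)));
    }
}
```

Six failures over 1000 hours correspond to a ROCOF of about 0.006 failures per hour.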

Probability of Failure on Demand

Probability of Failure on Demand (POFOD) is the probability that a system fails when a service request is made. POFOD is an essential measure for safety-critical systems and is relevant for protection systems whose services are demanded only occasionally.
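One simple way to estimate POFOD from observed data is to divide the number of failed service requests by the total number of requests. The sketch below assumes you already count both values. The `Pofod` class is hypothetical, and the result is only a point estimate, which becomes more trustworthy as the number of observed demands grows.

```java
// Hypothetical helper for illustration only.
final class Pofod {

    // Fraction of service requests (demands) that ended in failure.
    static double estimate(long failedDemands, long totalDemands) {
        if (totalDemands <= 0) {
            throw new IllegalArgumentException("totalDemands must be positive");
        }
        if (failedDemands < 0 || failedDemands > totalDemands) {
            throw new IllegalArgumentException("failedDemands must be between 0 and totalDemands");
        }
        return (double) failedDemands / totalDemands;
    }

    public static void main(String[] args) {
        // Made-up sample data: 2 failures in 1000 service requests.
        System.out.println(estimate(2, 1000));
    }
}
```

Two failures in 1000 requests give an estimated POFOD of about 0.002, that is, roughly one failure per 500 demands.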

Availability

Availability measures the probability that the system is available for use at any given time. You must take into account the repair time and the restart time for the system.
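A common way to approximate availability is to take the fraction of time the system is up: MTTF divided by MTTF plus average downtime, where downtime covers both repair and restart. The following sketch assumes that definition. The `Availability` class and its parameter names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper for illustration only.
final class Availability {

    // Availability = uptime / (uptime + downtime),
    // where downtime = mean repair time + mean restart time.
    static double steadyState(Duration mttf, Duration meanRepair, Duration meanRestart) {
        double up = mttf.toMillis();
        double down = meanRepair.plus(meanRestart).toMillis();
        if (up <= 0 || down < 0) {
            throw new IllegalArgumentException("mttf must be positive and downtime must not be negative");
        }
        return up / (up + down);
    }

    public static void main(String[] args) {
        // Made-up sample data: 999 hours of uptime per failure,
        // 50 minutes to repair and 10 minutes to restart.
        double availability = steadyState(
                Duration.ofHours(999), Duration.ofMinutes(50), Duration.ofMinutes(10));
        System.out.println(availability);
    }
}
```

With the sample figures, each cycle consists of 999 hours of uptime and 1 hour of downtime, so availability is about 0.999, or 99.9%.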
