Chapter 11. Network observability health rules
The Network Observability Operator provides alerts by using built-in metrics and the OpenShift Container Platform monitoring stack to report cluster network health.
Network observability health alerts require OpenShift Container Platform 4.16 or later.
11.1. Network observability rules for health and performance
Network observability includes a system for managing Prometheus-based rules. Use these rules to monitor the health and performance of OpenShift Container Platform applications and infrastructure.
The Network Observability Operator converts these rules into a PrometheusRule resource. The Network Observability Operator supports the following rule types:
- Alerting rules: Rules managed by the Prometheus Alertmanager to provide notification of network anomalies or infrastructure failures.
- Recording rules: Rules that pre-compute complex Prometheus Query Language (PromQL) expressions into new time series to improve dashboard performance and visualization.
View the PrometheusRule resource in the netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -o yaml
11.1.1. Network health monitoring and alerting rules
The Network Observability Operator includes a rule-based system to detect network anomalies and infrastructure failures. By converting configurations into alerting rules, the Operator enables automated monitoring and troubleshooting through the OpenShift Container Platform web console.
11.1.1.1. Monitoring outcomes
The Network Observability Operator surfaces network status in the following areas:
- Alerting UI: Specific alerts appear in Observe → Alerting, where notifications are managed through the Prometheus Alertmanager.
- Network Health dashboard: A specialized dashboard in Observe → Network Health provides a high-level summary of cluster network status.
The Network Health dashboard categorizes violations into tabs to isolate the scope of an issue:
- Global: Aggregate health of the entire cluster.
- Nodes: Violations specific to infrastructure nodes.
- Namespaces: Violations specific to individual namespaces.
- Workloads: Violations specific to resources, such as Deployments or DaemonSets.
11.1.1.2. Predefined health rules
The Network Observability Operator provides default rules for common networking scenarios. These rules are active only if the corresponding feature is enabled in the FlowCollector custom resource (CR).
The following list contains a subset of available default rules:
- PacketDropsByDevice: Triggers on a high percentage of packet drops from network devices. It is based on standard node-exporter metrics and does not require the PacketDrop agent feature.
- PacketDropsByKernel: Triggers on a high percentage of packet drops by the kernel. Requires the PacketDrop agent feature.
- IPsecErrors: Triggers when IPsec encryption errors are detected. Requires the IPSec agent feature.
- NetpolDenied: Triggers when traffic denied by network policies is detected. Requires the NetworkEvents agent feature.
- LatencyHighTrend: Triggers when a significant increase in TCP latency is detected. Requires the FlowRTT agent feature.
- DNSErrors: Triggers when DNS errors are detected. Requires the DNSTracking agent feature.
Operational alerts for the Network Observability Operator:
- NetObservNoFlows: Triggers when the pipeline is active but no flows are observed.
- NetObservLokiError: Triggers when flows are dropped because of Loki errors.
For a complete list of rules and runbooks, see the Network Observability Operator runbooks.
11.1.1.3. Rule dependencies and feature requirements
The Network Observability Operator creates rules based on the features enabled in the FlowCollector custom resource (CR).
For example, packet drop-related rules are created only if the PacketDrop agent feature is enabled. Rules are built on metrics; if the required metrics are missing, configuration warnings might appear. Configure metrics in the spec.processor.metrics.includeList object of the FlowCollector resource.
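For illustration, the relevant FlowCollector fragment could look like the following sketch. The metric names shown are examples; verify them against the FlowCollector API reference for your version:

```yaml
# Sketch: enabling metrics that health rules depend on.
spec:
  processor:
    metrics:
      includeList:
        - node_ingress_bytes_total
        - workload_ingress_bytes_total
        - namespace_drop_packets_total  # example drop metric; requires the PacketDrop agent feature
```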
11.2. Performance optimization with recording rules
For large-scale clusters, recording rules optimize how Prometheus handles network data. Recording rules improve dashboard responsiveness and reduce the computational overhead of complex queries.
11.2.1. Optimization benefits
Recording rules pre-compute complex Prometheus Query Language (PromQL) expressions and save the results as new time series. Unlike alerting rules, recording rules do not monitor thresholds.
Using recording rules provides the following advantages:
- Improved performance
- Pre-computing Prometheus queries allows dashboards to load faster by avoiding on-demand calculations for long-term trends.
- Resource efficiency
- Calculating data at fixed intervals reduces CPU load on the Prometheus server compared to recalculating data on every dashboard refresh.
- Simplified queries
- Using short metric names, such as cluster:network_traffic:rate_5m, simplifies complex aggregate calculations in custom dashboards.
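As a sketch of what such a recording rule could look like when rendered as a PrometheusRule resource (the rule name cluster:network_traffic:rate_5m comes from the example above; the expression and resource name are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: netobserv-recording-example
  namespace: netobserv
spec:
  groups:
    - name: NetObservRecording
      rules:
        # Pre-compute the cluster-wide ingress byte rate at each evaluation
        # interval and store it as a new, cheap-to-query time series.
        - record: cluster:network_traffic:rate_5m
          expr: sum(rate(netobserv_workload_ingress_bytes_total[5m]))
```

Dashboards can then query cluster:network_traffic:rate_5m directly instead of re-evaluating the sum(rate(...)) expression on every refresh.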
11.2.2. Comparison of rule modes
The following table compares rule modes based on the expected outcome:
| Description | Alerting rules | Recording rules |
|---|---|---|
| Goal | Issue notifications. | Save a history of high-level metrics. |
| Data result | Generates an alerting state. | Creates a persistent metric. |
| Visibility | Alerting UI and Network Health view. | Metrics Explorer and Network Health view. |
| Notifications | Triggers notifications. | Does not trigger notifications. |
11.3. Network observability health rule structure and customization
Health rules in the Network Observability Operator are defined using rule templates and variants in the spec.processor.metrics.healthRules object of the FlowCollector custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.
For each template, you can define a list of variants, each with their own thresholds and grouping configurations. For more information, see "List of default alert templates".
The following example shows an alert:
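A minimal sketch, assuming the PacketDropsByKernel template from the default rule list; the threshold values and exact variant field layout are illustrative, not the authoritative schema:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      healthRules:
        - template: PacketDropsByKernel  # name of the predefined rule template
          mode: Alert                    # Alert or Recording
          variants:
            # Cluster-wide variant: no groupBy, evaluated on total cluster traffic
            - thresholds:
                critical: "10"           # triggers at 10% packet drops
            # Per-node variant: evaluated independently for each Node
            - groupBy: Node
              thresholds:
                warning: "5"
                critical: "10"
```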
where:

- spec.processor.metrics.healthRules.template: Specifies the name of the predefined rule template.
- spec.processor.metrics.healthRules.mode: Specifies whether the rule functions as an Alert or a Recording rule. This setting can be defined per variant or for the whole template.
- spec.processor.metrics.healthRules.variants.thresholds: Specifies the numerical values that trigger the rule. You can define multiple severity levels, such as critical, warning, or info, within a single variant.
- cluster-wide variant: Specifies a variant defined without a groupBy setting. In the provided example, this variant triggers when total cluster traffic reaches 10% packet drops.
- spec.processor.metrics.healthRules.variants.groupBy: Specifies the dimension used to aggregate the metric. In the provided example, the alert is evaluated independently for each Node.
Customizing a rule replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.
11.3.1. PromQL expressions and metadata for health rules
Learn about the base query for Prometheus Query Language (PromQL), and how to customize it so you can configure network observability alerts for your specific needs.
The health rule API in the network observability FlowCollector custom resource (CR) is mapped to the Prometheus Operator API, generating a PrometheusRule. You can see the PrometheusRule in the default netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -o yaml
11.3.1.1. An example query for an alert in a surge of incoming traffic
This example provides the base PromQL query pattern for an alert about a surge in incoming traffic:
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
This query calculates the byte rate coming from the openshift-ingress namespace to any of your workloads' namespaces over the past 30 minutes.
You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.
- Filtering noise
- Appending > 1000 to this query retains only the observed rates that are greater than 1 KB/s, which eliminates noise from low-bandwidth consumers:

  (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)

  The byte rate is relative to the sampling interval defined in the FlowCollector custom resource (CR) configuration. If the sampling interval is 1:100, the actual traffic might be approximately 100 times higher than the reported metrics.

- Time comparison
- You can run the same query for a particular period of time by using the offset modifier. For example, a query for one day earlier uses offset 1d, and a query for five hours ago uses offset 5h:

  sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)

  You can use the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.

- Final threshold
- You can apply a final threshold to filter out increases that are lower than the desired percentage. For example, > 100 eliminates increases that are lower than 100%.
Together, the complete expression for the PrometheusRule looks like the following:
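Assembled from the pieces described in this section (the noise filter, the offset comparison, and the final threshold), the expression could look like the following reconstruction, which is a sketch rather than the verbatim shipped rule:

```promql
100 * (
  (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
  - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
)
/ sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
> 100
```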
11.3.1.2. Alert metadata fields
The Network Observability Operator uses components from other OpenShift Container Platform features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see "Monitoring stack architecture".
Some metadata must be configured for the rule definitions. This metadata is used by Prometheus and the Alertmanager service from the monitoring stack, or by the Network Health dashboard.
The following example shows an AlertingRule resource with the configured metadata:
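A hedged sketch, assuming the OpenShift monitoring AlertingRule API (monitoring.openshift.io/v1); the rule name, expression, threshold, and annotation values are illustrative:

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: NetObservAlerts
      rules:
        - alert: NetObservIncomingBandwidth
          expr: |
            sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000
          for: 10m
          labels:
            netobserv: "true"  # lets the Network Health dashboard detect the alert
            severity: warning  # critical, warning, or info
          annotations:
            message: "High incoming traffic detected in {{ $labels.DstK8S_Namespace }}"
            # Optional rendering hints for the Network Health page (field names assumed)
            netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"1000","unit":"Bps"}'
```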
where:
- spec.groups.rules.alert.labels.netobserv: When set to true, enables the Network Health dashboard to detect the alert.
- spec.groups.rules.alert.labels.severity: Specifies the severity of the alert. The following values are valid: critical, warning, or info.
You can leverage the output labels from the defined PromQL expression in the message annotation. In the example, since results are grouped per DstK8S_Namespace, the expression {{ $labels.DstK8S_Namespace }} is used in the message text.
The netobserv_io_network_health annotation is optional, and controls how the alert is rendered on the Network Health page.
The netobserv_io_network_health annotation is a JSON string consisting of the following fields:
| Field | Type | Description |
|---|---|---|
| namespaceLabels | List of strings | One or more labels that hold namespaces. When provided, the alert appears under the Namespaces tab. |
| nodeLabels | List of strings | One or more labels that hold node names. When provided, the alert appears under the Nodes tab. |
|  | List of strings | One or more labels that hold owner/workload names. When provided alongside the labels that hold the workload kinds, the alert appears under the Workloads tab. |
|  | List of strings | One or more labels that hold owner/workload kinds. When provided alongside the labels that hold the workload names, the alert appears under the Workloads tab. |
| threshold | String | The alert threshold, expected to match the threshold defined in the rule expression. |
| unit | String | The data unit, used only for display purposes. |
| upperBound | String | An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped. |
| links | List of objects | A list of links to display contextually with the alert. |
|  | String | Information related to the link to the Network Traffic page, for URL building. Some filters are set automatically. |
The namespaceLabels and nodeLabels are mutually exclusive. If neither is provided, the alert appears under the Global tab.
| Field | Description |
|---|---|
|  | Additional filter to inject (for example, a DNS response code for DNS-related alerts). |
|  | Whether the filter should include return traffic. |
|  | Whether the filter should target the destination of the traffic instead of the source. |
11.3.2. Custom health rule configuration
Use the Prometheus Query Language (PromQL) to define a custom AlertingRule resource that triggers alerts based on specific network metrics, such as traffic surges.
Prerequisites
- Familiarity with PromQL.
- You have installed OpenShift Container Platform 4.16 or later.
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the Network Observability Operator.
Procedure
- Create a YAML file named custom-alert.yaml that contains your AlertingRule resource.
- Apply the custom alert rule by running the following command:

  $ oc apply -f custom-alert.yaml
Verification
- Verify that the PrometheusRule resource was created in the netobserv namespace by running the following command:

  $ oc get prometheusrules -n netobserv -o yaml

  The output should include the netobserv-alerts rule you just created, confirming that the resource was generated correctly.

- Confirm that the rule is active by checking the Network Health dashboard in the OpenShift Container Platform web console under Observe → Network Health.
11.4. Disable predefined rules
Rule templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of rule template names. For a list of alert template names, see "List of default rules".
If a template is disabled and overridden in the spec.processor.metrics.healthRules field, the disable setting takes precedence and the alert rule is not created.
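For example, the relevant FlowCollector fragment could look like the following sketch; the template names come from the default rule list earlier in this chapter:

```yaml
spec:
  processor:
    metrics:
      disableAlerts:
        - PacketDropsByKernel  # not created, even if overridden in healthRules
        - DNSErrors
```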