Chapter 11. Network observability health rules


The Network Observability Operator provides alerts by using built-in metrics and the OpenShift Container Platform monitoring stack to report cluster network health.

Important

Network observability health alerts require OpenShift Container Platform 4.16 or later.

Network observability includes a system for managing Prometheus-based rules. Use these rules to monitor the health and performance of OpenShift Container Platform applications and infrastructure.

The Network Observability Operator converts these rules into a PrometheusRule resource. The Network Observability Operator supports the following rule types:

  • Alerting rules: Rules managed by the Prometheus Alertmanager that provide notifications of network anomalies or infrastructure failures.
  • Recording rules: Rules that pre-compute complex Prometheus Query Language (PromQL) expressions into new time series to improve dashboard performance and visualization.

View the PrometheusRule resource in the netobserv namespace by running the following command:

$ oc get prometheusrules -n netobserv -o yaml

The Network Observability Operator includes a rule-based system to detect network anomalies and infrastructure failures. By converting configurations into alerting rules, the Operator enables automated monitoring and troubleshooting through the OpenShift Container Platform web console.

11.1.1.1. Monitoring outcomes

The Network Observability Operator surfaces network status in the following areas:

Alerting UI
Specific alerts appear in Observe → Alerting, where notifications are managed through the Prometheus Alertmanager.
Network Health dashboard
A specialized dashboard in Observe → Network Health provides a high-level summary of cluster network status.

The Network Health dashboard categorizes violations into tabs to isolate the scope of an issue:

  • Global: Aggregate health of the entire cluster.
  • Nodes: Violations specific to infrastructure nodes.
  • Namespaces: Violations specific to individual namespaces.
  • Workloads: Violations specific to resources, such as Deployments or DaemonSets.

11.1.1.2. Predefined health rules

The Network Observability Operator provides default rules for common networking scenarios. These rules are active only if the corresponding feature is enabled in the FlowCollector custom resource (CR).

The following list contains a subset of available default rules:

PacketDropsByDevice
Triggers on a high percentage of packet drops from network devices. It is based on standard node-exporter metrics and does not require the PacketDrop agent feature.
PacketDropsByKernel
Triggers on a high percentage of packet drops by the kernel. Requires the PacketDrop agent feature.
IPsecErrors
Triggers when IPsec encryption errors are detected. Requires the IPSec agent feature.
NetpolDenied
Triggers when traffic denied by network policies is detected. Requires the NetworkEvents agent feature.
LatencyHighTrend
Triggers when a significant increase in TCP latency is detected. Requires the FlowRTT agent feature.
DNSErrors
Triggers when DNS errors are detected. Requires the DNSTracking agent feature.

Operational alerts for the Network Observability Operator:

NetObservNoFlows
Triggers when the pipeline is active but no flows are observed.
NetObservLokiError
Triggers when flows are dropped because of Loki errors.

For a complete list of rules and runbooks, see the Network Observability Operator runbooks.

The Network Observability Operator creates rules based on the features enabled in the FlowCollector custom resource (CR).

For example, packet drop-related rules are created only if the PacketDrop agent feature is enabled. Rules are built on metrics; if the required metrics are missing, configuration warnings might appear. Configure metrics in the spec.processor.metrics.includeList object of the FlowCollector resource.
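For illustration, a FlowCollector sketch that enables the PacketDrop agent feature and includes related drop metrics might look like the following. This is an assumption-laden example: the metric names in includeList are illustrative, so verify them against the FlowCollector API reference for your installed Operator version.

```yaml
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  agent:
    ebpf:
      # Enable the PacketDrop feature so that drop-related rules are created
      features:
      - PacketDrop
  processor:
    metrics:
      # Include the metrics that the packet drop rules are built on
      # (metric names shown are examples; check your API reference)
      includeList:
      - node_drop_bytes_total
      - namespace_drop_packets_total
```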

For large-scale clusters, recording rules optimize how Prometheus handles network data. Recording rules improve dashboard responsiveness and reduce the computational overhead of complex queries.

11.2.1. Optimization benefits

Recording rules pre-compute complex Prometheus Query Language (PromQL) expressions and save the results as new time series. Unlike alerting rules, recording rules do not monitor thresholds.

Using recording rules provides the following advantages:

Improved performance
Pre-computing Prometheus queries allows dashboards to load faster by avoiding on-demand calculations for long-term trends.
Resource efficiency
Calculating data at fixed intervals reduces CPU load on the Prometheus server compared to recalculating data on every dashboard refresh.
Simplified queries
Using short metric names, such as cluster:network_traffic:rate_5m, simplifies complex aggregate calculations in custom dashboards.
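As an illustration of how such a short name comes about, a generated recording rule entry inside a PrometheusRule resource might look like the following sketch. The group name, metric name, and expression are hypothetical examples, not output copied from the Operator:

```yaml
groups:
- name: NetObservRecordingRules
  rules:
  # Pre-compute the cluster-wide traffic rate at each evaluation interval
  - record: cluster:network_traffic:rate_5m
    expr: sum(rate(netobserv_workload_ingress_bytes_total[5m]))
```

Dashboards can then query the short cluster:network_traffic:rate_5m series instead of re-evaluating the full expression on every refresh.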

11.2.2. Comparison of rule modes

The following table compares rule modes based on the expected outcome:

| Description   | Alerting rules                        | Recording rules                          |
|---------------|---------------------------------------|------------------------------------------|
| Goal          | Issue notification.                   | Save a history of high-level metrics.    |
| Data result   | Generates an alerting state.          | Creates a persistent metric.             |
| Visibility    | Alerting UI and Network Health view.  | Metrics Explorer and Network Health view.|
| Notifications | Triggers Alertmanager notifications.  | Does not trigger notifications.          |

Health rules in the Network Observability Operator are defined using rule templates and variants in the spec.processor.metrics.healthRules object of the FlowCollector custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.

For each template, you can define a list of variants, each with its own thresholds and grouping configuration. For more information, see "List of default alert templates".

The following example shows an alert:

apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    metrics:
      healthRules:
      - template: PacketDropsByKernel
        mode: Alert # or Recording
        variants:
        # triggered when the whole cluster traffic (no grouping) reaches 10% of drops
        - thresholds:
            critical: "10"
        # triggered when per-node traffic reaches 5% of drops, with gradual severity
        - thresholds:
            critical: "15"
            warning: "10"
            info: "5"
          groupBy: Node

where:

spec.processor.metrics.healthRules.template
Specifies the name of the predefined rule template.
spec.processor.metrics.healthRules.mode
Specifies whether the rule functions as an Alert or a Recording rule. This setting can either be defined per variant, or for the whole template.
spec.processor.metrics.healthRules.variants.thresholds
Specifies the numerical values that trigger the rule. You can define multiple severity levels, such as critical, warning, or info, within a single variant.
cluster-wide variant
Specifies a variant defined without a groupBy setting. In the provided example, this variant triggers when the total cluster traffic reaches 10% drops.
spec.processor.metrics.healthRules.variants.groupBy
Specifies the dimension used to aggregate the metric. In the provided example, the alert is evaluated independently for each Node.
Note

Customizing a rule replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.

Learn about the base Prometheus Query Language (PromQL) query and how to customize it to configure network observability alerts for your specific needs.

The health rule API in the network observability FlowCollector custom resource (CR) is mapped to the Prometheus Operator API, generating a PrometheusRule. You can see the PrometheusRule in the default netobserv namespace by running the following command:

$ oc get prometheusrules -n netobserv -o yaml

This example provides the base PromQL query pattern for an alert about a surge in incoming traffic:

sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)

This query calculates the byte rate coming from the openshift-ingress namespace to any of your workloads' namespaces over the past 30 minutes.

You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.

Filtering noise

Appending > 1000 to this query retains only observed rates greater than 1 KB/s, which eliminates noise from low-bandwidth consumers.

(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)

The byte rate is relative to the sampling interval defined in the FlowCollector custom resource (CR) configuration. If the sampling interval is 1:100, the actual traffic might be approximately 100 times higher than the reported metrics.

Time comparison

You can run the same query for a particular period of time using the offset modifier. For example, a query for one day earlier can be run using offset 1d, and a query for five hours ago can be run using offset 5h.

sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)

You can use the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.
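The formula can be checked with a short calculation. The following Python sketch (a hypothetical helper for illustration, not part of the Operator) computes the percentage of increase between two byte rates:

```python
def percent_increase(current: float, previous: float) -> float:
    """Return 100 * (current - previous) / previous.

    Mirrors the day-over-day comparison used in the PromQL expression.
    The result is negative when the current rate is lower than before.
    """
    return 100 * (current - previous) / previous

# A rate that doubled since yesterday is a 100% increase:
print(percent_increase(2000, 1000))  # 100.0
# A rate that halved is a -50% change:
print(percent_increase(500, 1000))   # -50.0
```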

Final threshold
You can apply a final threshold to filter increases that are lower than the desired percentage. For example, > 100 eliminates increases that are lower than 100%.

Together, the complete expression for the PrometheusRule looks like the following:

...
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100

11.3.1.2. Alert metadata fields

The Network Observability Operator uses components from other OpenShift Container Platform features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see "Monitoring stack architecture".

Some metadata must be configured for the rule definitions. This metadata is used by Prometheus and the Alertmanager service from the monitoring stack, or by the Network Health dashboard.

The following example shows an AlertingRule resource with the configured metadata:

apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        app: netobserv
        netobserv: "true"
        severity: warning

where:

spec.groups.rules.alert.labels.netobserv
When set to "true", marks the alert for detection by the Network Health dashboard.
spec.groups.rules.alert.labels.severity
Specifies the severity of the alert. The following values are valid: critical, warning, or info.

You can leverage the output labels from the defined PromQL expression in the message annotation. In the example, since results are grouped per DstK8S_Namespace, the expression {{ $labels.DstK8S_Namespace }} is used in the message text.
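Conceptually, Prometheus substitutes the alert's output labels into the message template at firing time. The following Python sketch is an illustration of that substitution only, not how Prometheus is actually implemented:

```python
import re

def render(template: str, labels: dict) -> str:
    """Replace {{ $labels.<name> }} placeholders with label values."""
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda m: labels[m.group(1)],
        template,
    )

msg = "current traffic to {{ $labels.DstK8S_Namespace }} has increased"
print(render(msg, {"DstK8S_Namespace": "my-app"}))
# current traffic to my-app has increased
```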

The netobserv_io_network_health annotation is optional, and controls how the alert is rendered on the Network Health page.

The netobserv_io_network_health annotation is a JSON string consisting of the following fields:

Table 11.1. Fields for the netobserv_io_network_health annotation

| Field           | Type            | Description |
|-----------------|-----------------|-------------|
| namespaceLabels | List of strings | One or more labels that hold namespaces. When provided, the alert appears under the Namespaces tab. |
| nodeLabels      | List of strings | One or more labels that hold node names. When provided, the alert appears under the Nodes tab. |
| workloadLabels  | List of strings | One or more labels that hold owner or workload names. When provided alongside kindLabels, the alert appears under the Workloads tab. |
| kindLabels      | List of strings | One or more labels that hold owner or workload kinds. When provided alongside workloadLabels, the alert appears under the Workloads tab. |
| threshold       | String          | The alert threshold, expected to match the threshold defined in the PromQL expression. |
| unit            | String          | The data unit, used only for display purposes. |
| upperBound      | String          | An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped. |
| links           | List of objects | A list of links to display contextually with the alert. Each link requires a name (display name) and a url. |
| trafficLink     | String          | Information for building the link to the Network Traffic page. Some filters are set automatically, such as the node or namespace filter. |

The namespaceLabels and nodeLabels fields are mutually exclusive. If neither is provided, the alert appears under the Global tab.
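The clamping behavior of upperBound can be sketched as follows. This is a hypothetical illustration of the scoring idea; the dashboard's actual scoring formula is not part of this API:

```python
def clamp_for_score(value: float, upper_bound: float) -> float:
    """Clamp a metric value to [0, upper_bound] before scoring."""
    return max(0.0, min(value, upper_bound))

print(clamp_for_score(750.0, 500.0))  # 500.0 (values above the bound are clamped)
print(clamp_for_score(120.0, 500.0))  # 120.0 (values within the bound are unchanged)
```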

Table 11.2. trafficLink fields

| Field             | Description |
|-------------------|-------------|
| extraFilter       | Additional filter to inject (for example, a DNS response code for DNS-related alerts). |
| backAndForth      | Whether the filter should include return traffic (true or false). |
| filterDestination | Whether the filter should target the destination of the traffic instead of the source (true or false). |

11.3.2. Custom health rule configuration

Use the Prometheus Query Language (PromQL) to define a custom AlertingRule resource that triggers alerts based on specific network metrics, such as traffic surges.

Prerequisites

  • Familiarity with PromQL.
  • You have installed OpenShift Container Platform 4.16 or later.
  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the Network Observability Operator.

Procedure

  1. Create a YAML file named custom-alert.yaml that contains your AlertingRule resource.
  2. Apply the custom alert rule by running the following command:

    $ oc apply -f custom-alert.yaml

Verification

  1. Verify that the PrometheusRule resource was created in the netobserv namespace by running the following command:

    $ oc get prometheusrules -n netobserv -o yaml

    The output should include the netobserv-alerts rule you just created, confirming that the resource was generated correctly.

  2. Confirm that the rule is active by checking the Network Health dashboard in the OpenShift Container Platform web console, under Observe → Network Health.

11.4. Disable predefined rules

Rule templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of rule template names. For a list of alert template names, see "List of default rules".

If a template is disabled and overridden in the spec.processor.metrics.healthRules field, the disable setting takes precedence and the alert rule is not created.
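For example, the following FlowCollector sketch disables two of the predefined templates. The template names are taken from the list of default rules earlier in this chapter:

```yaml
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    metrics:
      # Rule templates listed here are not created,
      # even if they are overridden in healthRules
      disableAlerts:
      - PacketDropsByKernel
      - DNSErrors
```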
