Chapter 9. Network observability alerts

Important

Network observability alerts is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Network Observability Operator provides a set of alerts for monitoring the network in your OpenShift Container Platform cluster. The alerts are based on its built-in metrics, but can include other metrics, such as ones provided by the OpenShift Container Platform monitoring stack. Alerts are designed to give you a quick indication of your cluster’s network health.

9.1. About network observability alerts

Network observability includes predefined alerts. Use these alerts to gain insight into the health and performance of your OpenShift Container Platform applications and infrastructure.

The predefined alerts provide a quick health indication of your cluster’s network in the Network Health dashboard. You can also customize alerts using Prometheus Query Language (PromQL) queries.

By default, network observability creates alerts that are contextual to the features you enable.

For example, packet drop-related alerts are created only if the PacketDrop agent feature is enabled in the FlowCollector custom resource (CR). Alerts are built on metrics, and you might see configuration warnings if enabled alerts are missing their required metrics.

You can configure these metrics in the spec.processor.metrics.includeList field of the FlowCollector CR.
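As a sketch of what this looks like, the following FlowCollector fragment lists the metric that backs the netobserv_workload_ingress_bytes_total queries used later in this chapter. It assumes that names in includeList omit the netobserv_ prefix, and that setting includeList replaces the default list, so any other metrics you rely on must stay listed; namespace_flows_total and node_ingress_bytes_total are shown only as examples of such metrics. Verify the exact metric names against the FlowCollector API reference for your version.

```yaml
# Fragment of the FlowCollector CR (sketch; metric names are assumptions to verify)
spec:
  processor:
    metrics:
      includeList:
      # Assumed to be exposed to Prometheus as netobserv_workload_ingress_bytes_total
      - workload_ingress_bytes_total
      # Other metrics you want to keep must also be listed, because this list
      # is assumed to replace the default one
      - namespace_flows_total
      - node_ingress_bytes_total
```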

9.1.1. List of default alert templates

These alert templates are installed by default:

PacketDropsByDevice: Triggers on high percentage of packet drops from devices (/proc/net/dev).
PacketDropsByKernel: Triggers on high percentage of packet drops by the kernel; it requires the PacketDrop agent feature.
IPsecErrors: Triggers when IPsec encryption errors are detected by network observability; it requires the IPSec agent feature.
NetpolDenied: Triggers when traffic denied by network policies is detected by network observability; it requires the NetworkEvents agent feature.
LatencyHighTrend: Triggers when an increase of TCP latency is detected by network observability; it requires the FlowRTT agent feature.
DNSErrors: Triggers when DNS errors are detected by network observability; it requires the DNSTracking agent feature.
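The agent features named above are enabled in the eBPF agent section of the FlowCollector CR. The following fragment is a sketch that assumes the spec.agent.ebpf.features and spec.agent.ebpf.privileged fields of the FlowCollector API; it enables every agent feature that the default templates depend on. Some features, such as PacketDrop, require the eBPF agent to run in privileged mode.

```yaml
# Fragment of the FlowCollector CR (sketch; field paths are assumptions)
spec:
  agent:
    ebpf:
      # Required by PacketDrop; assumed to be needed by other kernel-level features as well
      privileged: true
      features:
      - PacketDrop      # used by PacketDropsByKernel
      - IPSec           # used by IPsecErrors
      - NetworkEvents   # used by NetpolDenied
      - FlowRTT         # used by LatencyHighTrend
      - DNSTracking     # used by DNSErrors
```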

These are operational alerts that relate to the self-health of network observability:

NetObservNoFlows: Triggers when no flows are being observed for a certain period.
NetObservLokiError: Triggers when flows are being dropped due to Loki errors.

You can configure, extend, or disable alerts for network observability. You can view the resulting PrometheusRule resource in the default netobserv namespace by running the following command:

$ oc get prometheusrules -n netobserv -oyaml

9.1.2. Network Health dashboard

When alerts are enabled in the Network Observability Operator, two things happen:

New alerts appear in the Observe → Alerting → Alerting rules tab of the OpenShift Container Platform web console.
A new Network Health dashboard appears under Observe in the OpenShift Container Platform web console.

The Network Health dashboard provides a summary of triggered alerts and pending alerts, distinguishing between critical, warning, and minor issues. Alerts for rule violations are displayed in the following tabs:

Global: Shows alerts that are global to the cluster.
Nodes: Shows alerts for rule violations per node.
Namespaces: Shows alerts for rule violations per namespace.

Click a resource card to see more information. Next to each alert, a three-dot menu appears. From this menu, you can navigate to Network Traffic → Traffic flows to see more detailed information for the selected resource.

9.2. Enabling Technology Preview alerts in network observability

Network Observability Operator alerts are a Technology Preview feature. To use this feature, you must enable it in the FlowCollector custom resource (CR), and then configure the alerts for your specific needs.

Procedure

Edit the FlowCollector CR to set the experimental alerts flag to true:

apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    advanced:
      env:
        EXPERIMENTAL_ALERTS_HEALTH: "true"

You can still use the existing method for creating alerts. For more information, see "Creating alerts".

9.2.1. Configuring predefined alerts

Alerts in the Network Observability Operator are defined using alert templates and variants in the spec.processor.metrics.alerts object of the FlowCollector custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.

After you enable alerts, the Network Health dashboard appears in the Observe section of the OpenShift Container Platform web console.

For each template, you can define a list of variants, each with their own thresholds and grouping configurations. For more information, see the "List of default alert templates".

The following example defines two variants for the PacketDropsByKernel template:

apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      alerts:
      - template: PacketDropsByKernel
        variants:
        # triggered when the whole cluster traffic (no grouping) reaches 10% of drops
        - thresholds:
            critical: "10"
        # triggered when per-node traffic reaches 5% of drops, with gradual severity
        - thresholds:
            critical: "15"
            warning: "10"
            info: "5"
          groupBy: Node

Note

Customizing an alert replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.

9.2.2. About the PromQL expression for alerts

Learn about the base query for Prometheus Query Language (PromQL), and how to customize it so you can configure network observability alerts for your specific needs.

The alerting API in the network observability FlowCollector custom resource (CR) is mapped to the Prometheus Operator API, generating a PrometheusRule. You can see the PrometheusRule in the default netobserv namespace by running the following command:

$ oc get prometheusrules -n netobserv -oyaml

9.2.2.1. An example query for an alert in a surge of incoming traffic

This example provides the base PromQL query pattern for an alert about a surge in incoming traffic:

sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)

This query calculates the per-second byte rate from the openshift-ingress namespace to each of your workloads' namespaces, averaged over the past 30 minutes and grouped by destination namespace.

You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.

Filtering noise

Appending > 1000 to this query retains only the rates observed that are greater than 1 KB/s, which eliminates noise from low-bandwidth consumers.

(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)

The byte rate is relative to the sampling interval defined in the FlowCollector custom resource (CR) configuration. If the sampling interval is 1:100, the actual traffic might be approximately 100 times higher than the reported metrics.

Time comparison

You can run the same query for a particular period of time using the offset modifier. For example, a query for one day earlier can be run using offset 1d, and a query for five hours ago can be run using offset 5h.

sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)

You can use the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.

Final threshold

You can apply a final threshold to filter out increases below the desired percentage. For example, appending > 100 retains only increases greater than 100%.
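To make the arithmetic concrete, assume hypothetical sampled rates of 3000 B/s now and 1200 B/s at the same time on the previous day for one destination namespace. The current rate passes the > 1000 noise filter, and the percentage increase is:

```latex
100 \times \frac{3000 - 1200}{1200} = 100 \times 1.5 = 150 > 100
```

Because 150 is greater than the final threshold of 100, the expression returns a result for that namespace. If the sampling interval is the same on both days, the sampling ratio cancels out of the percentage, but the > 1000 noise filter still applies to the sampled rate rather than to the actual traffic.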

Together, the complete expression for the PrometheusRule looks like the following:

...
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100

9.2.2.2. Alert metadata fields

The Network Observability Operator uses components from other OpenShift Container Platform features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see "Monitoring stack architecture".

Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the Alertmanager service from the monitoring stack, or by the Network Health dashboard.

The following example shows an AlertingRule resource with the configured metadata:

apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        app: netobserv
        netobserv: "true"
        severity: warning

where:

spec.groups.rules.labels.netobserv: When set to "true", the Network Health dashboard detects the alert.
spec.groups.rules.labels.severity: Specifies the severity of the alert. Valid values are critical, warning, and info.

You can use the output labels of the PromQL expression in the message annotation. In the example, because results are grouped per DstK8S_Namespace, the message text uses the expression {{ $labels.DstK8S_Namespace }}.

The netobserv_io_network_health annotation is optional, and controls how the alert is rendered on the Network Health page.

The netobserv_io_network_health annotation is a JSON string consisting of the following fields:

Table 9.1. Fields for the netobserv_io_network_health annotation
Field	Type	Description
`namespaceLabels`	List of strings	One or more labels that hold namespaces. When provided, the alert appears under the Namespaces tab.
`nodeLabels`	List of strings	One or more labels that hold node names. When provided, the alert appears under the Nodes tab.
`threshold`	String	The alert threshold, expected to match the threshold defined in the `PromQL` expression.
`unit`	String	The data unit, used only for display purposes.
`upperBound`	String	An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped.
`links`	List of objects	A list of links to display contextually with the alert. Each link requires a `name` (display name) and `url`.
`trafficLinkFilter`	String	An additional filter to inject into the URL for the Network Traffic page.

The namespaceLabels and nodeLabels are mutually exclusive. If neither is provided, the alert appears under the Global tab.
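For illustration, the following rule entry (one item of spec.groups[].rules in an AlertingRule resource) uses nodeLabels instead of namespaceLabels, so the alert is listed under the Nodes tab, and it adds a contextual link. The alert name, the DstK8S_HostName label, and the runbook URL are hypothetical; nodeLabels must name a label that your PromQL expression actually returns.

```yaml
# One entry of spec.groups[].rules (sketch; names and URL are hypothetical)
- alert: NetObservNodeIncomingBandwidth
  annotations:
    # nodeLabels (not namespaceLabels) places the alert under the Nodes tab.
    # "threshold" should match the threshold in expr; "unit" is only used for display.
    netobserv_io_network_health: '{"nodeLabels":["DstK8S_HostName"],"threshold":"100","unit":"%","upperBound":"500","links":[{"name":"Runbook","url":"https://example.com/runbooks/node-incoming-bandwidth"}]}'
    summary: "Surge in incoming traffic per node"
  # expr, for, and labels follow the same pattern as the NetObservIncomingBandwidth
  # example, with the expression grouped by DstK8S_HostName instead of DstK8S_Namespace
```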

9.2.3. Creating custom alert rules

Use the Prometheus Query Language (PromQL) to define a custom AlertingRule resource that triggers alerts based on specific network metrics, such as a surge in traffic.

Prerequisites

You are familiar with PromQL.
You have installed OpenShift Container Platform 4.14 or later.
You have access to the cluster as a user with the cluster-admin role.
You have installed the Network Observability Operator.

Procedure

Create a YAML file named custom-alert.yaml that contains your AlertingRule resource.
Apply the custom alert rule by running the following command:
```
$ oc apply -f custom-alert.yaml
```
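A minimal custom-alert.yaml could look like the following sketch. Unlike the NetObservIncomingBandwidth example, it uses a static threshold: it fires when the sampled byte rate from the openshift-ingress namespace to a destination namespace stays above 1 MB/s for five minutes. The resource, group, and alert names and the threshold values are hypothetical.

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-custom-alerts          # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservCustomAlerts          # hypothetical group name
    rules:
    - alert: NetObservHighIngressBandwidth   # hypothetical alert name
      annotations:
        # Rendered under the Namespaces tab; threshold matches the one in expr
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"1000000","unit":"B/s","upperBound":"5000000"}'
        message: |-
          Sampled traffic from openshift-ingress to {{ $labels.DstK8S_Namespace }} exceeds 1 MB/s.
        summary: "High incoming bandwidth"
      # Sampled per-second byte rate per destination namespace, averaged over 5 minutes
      expr: |-
        sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[5m])) by (DstK8S_Namespace) > 1000000
      # The condition must hold for 5 minutes before the alert fires
      for: 5m
      labels:
        app: netobserv
        netobserv: "true"
        severity: warning
```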

Verification

Verify that a PrometheusRule resource was generated from your AlertingRule resource in the openshift-monitoring namespace by running the following command:
```
$ oc get prometheusrules -n openshift-monitoring -oyaml
```
The output should include the rules from the AlertingRule resource you just created, such as netobserv-alerts, confirming that the resource was generated correctly.
Confirm that the rule is active by checking the Network Health dashboard under Observe in the OpenShift Container Platform web console.

9.2.4. Disabling predefined alerts

Alert templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of alert template names. For a list of alert template names, see "List of default alert templates".

If a template is disabled and overridden in the spec.processor.metrics.alerts field, the disable setting takes precedence and the alert rule is not created.
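For example, the following FlowCollector fragment is a sketch that disables two of the default templates. The template names come from the list of default alert templates; any entry for these templates in spec.processor.metrics.alerts is ignored, because the disable setting takes precedence.

```yaml
# Fragment of the FlowCollector CR (sketch)
spec:
  processor:
    metrics:
      # No alert rules are generated for these templates, even if
      # they are also customized under spec.processor.metrics.alerts
      disableAlerts:
      - NetpolDenied
      - LatencyHighTrend
```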