Chapter 9. Network observability alerts
Network observability alerts is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Network Observability Operator provides a set of alerts for monitoring the network in your OpenShift Container Platform cluster. The alerts are based on its built-in metrics, but can include other metrics, such as ones provided by the OpenShift Container Platform monitoring stack. Alerts are designed to give you a quick indication of your cluster’s network health.
9.1. About network observability alerts
Network observability includes predefined alerts. Use these alerts to gain insight into the health and performance of your OpenShift Container Platform applications and infrastructure.
The predefined alerts provide a quick health indication of your cluster’s network in the Network Health dashboard. You can also customize alerts using Prometheus Query Language (PromQL) queries.
By default, network observability creates alerts that are contextual to the features you enable.
For example, packet drop-related alerts are created only if the PacketDrop agent feature is enabled in the FlowCollector custom resource (CR). Alerts are built on metrics, and you might see configuration warnings if enabled alerts are missing their required metrics.
You can configure these metrics in the spec.processor.metrics.includeList object of the FlowCollector CR.
9.1.1. List of default alert templates
These alert templates are installed by default:
- PacketDropsByDevice: Triggers on a high percentage of packet drops from devices (/proc/net/dev).
- PacketDropsByKernel: Triggers on a high percentage of packet drops by the kernel; it requires the PacketDrop agent feature.
- IPsecErrors: Triggers when IPsec encryption errors are detected by network observability; it requires the IPSec agent feature.
- NetpolDenied: Triggers when traffic denied by network policies is detected by network observability; it requires the NetworkEvents agent feature.
- LatencyHighTrend: Triggers when an increase of TCP latency is detected by network observability; it requires the FlowRTT agent feature.
- DNSErrors: Triggers when DNS errors are detected by network observability; it requires the DNSTracking agent feature.
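The contextual behavior described earlier can be sketched as a lookup from each template to the agent feature it requires. This is only an illustration of the rule, not the Operator's implementation: the dictionary and the function name `contextual_templates` are hypothetical, and the feature requirements are copied from the list above.

```python
# Illustrative only: maps each default alert template to the agent feature
# that the list above says it requires (None means no feature is listed).
REQUIRED_FEATURE = {
    "PacketDropsByDevice": None,
    "PacketDropsByKernel": "PacketDrop",
    "IPsecErrors": "IPSec",
    "NetpolDenied": "NetworkEvents",
    "LatencyHighTrend": "FlowRTT",
    "DNSErrors": "DNSTracking",
}


def contextual_templates(enabled_features):
    """Return the templates whose required feature is enabled (hypothetical helper)."""
    enabled = set(enabled_features)
    return [
        name
        for name, feature in REQUIRED_FEATURE.items()
        if feature is None or feature in enabled
    ]


# With only PacketDrop and DNSTracking enabled, three templates apply.
print(contextual_templates(["PacketDrop", "DNSTracking"]))
```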
These are operational alerts that relate to the self-health of network observability:
- NetObservNoFlows: Triggers when no flows are being observed for a certain period.
- NetObservLokiError: Triggers when flows are being dropped due to Loki errors.
You can configure, extend, or disable alerts for network observability. You can view the resulting PrometheusRule resource in the default netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -oyaml
9.1.2. Network Health dashboard
When alerts are enabled in the Network Observability Operator, two things happen:
- New alerts appear in the Observe → Alerting → Alerting rules tab in the OpenShift Container Platform web console.
- A new Network Health dashboard appears in the OpenShift Container Platform web console under Observe.
The Network Health dashboard provides a summary of triggered alerts and pending alerts, distinguishing between critical, warning, and minor issues. Alerts for rule violations are displayed in the following tabs:
- Global: Shows alerts that are global to the cluster.
- Nodes: Shows alerts for rule violations per node.
- Namespaces: Shows alerts for rule violations per namespace.
Click a resource card to see more information. Next to each alert, a three-dot menu appears. From this menu, you can navigate to the Network Traffic page.
9.2. Enabling Technology Preview alerts in network observability
Network Observability Operator alerts are a Technology Preview feature. To use this feature, you must enable it in the FlowCollector custom resource (CR), and then continue with configuring alerts to your specific needs.
Procedure
- Edit the FlowCollector CR to set the experimental alerts flag to true:
You can still use the existing method for creating alerts. For more information, see "Creating alerts".
9.2.1. Configuring predefined alerts
Alerts in the Network Observability Operator are defined using alert templates and variants in the spec.processor.metrics.alerts object of the FlowCollector custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.
After you enable alerts, the Network Health dashboard appears in the Observe section of the OpenShift Container Platform web console.
For each template, you can define a list of variants, each with their own thresholds and grouping configurations. For more information, see the "List of default alert templates".
Here is an example:
Customizing an alert replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.
9.2.2. About the PromQL expression for alerts
Learn about the base query for Prometheus Query Language (PromQL), and how to customize it so you can configure network observability alerts for your specific needs.
The alerting API in the network observability FlowCollector custom resource (CR) is mapped to the Prometheus Operator API, generating a PrometheusRule. You can see the PrometheusRule in the default netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -oyaml
9.2.2.1. An example query for an alert in a surge of incoming traffic
This example provides the base PromQL query pattern for an alert about a surge in incoming traffic:
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
This query calculates the byte rate coming from the openshift-ingress namespace to any of your workloads' namespaces over the past 30 minutes.
You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.
- Filtering noise: Appending > 1000 to this query retains only the observed rates that are greater than 1 KB/s, which eliminates noise from low-bandwidth consumers.
  (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
  The byte rate is relative to the sampling interval defined in the FlowCollector custom resource (CR) configuration. If the sampling interval is 1:100, the actual traffic might be approximately 100 times higher than the reported metrics.
- Time comparison: You can run the same query for a particular period of time using the offset modifier. For example, a query for one day earlier can be run using offset 1d, and a query for five hours ago can be run using offset 5h.
  sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
  You can use the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.
- Final threshold: You can apply a final threshold to filter increases that are lower than the desired percentage. For example, > 100 eliminates increases that are lower than 100%.
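The arithmetic behind the sampling note and the percentage formula can be checked with a small worked example. The byte rates below are made-up numbers, and the function names are hypothetical; the sketch only restates the two formulas given in the list above.

```python
def estimate_actual_rate(reported_bytes_per_s, sampling):
    """Scale a reported byte rate by the sampling value (100 for 1:100)."""
    return reported_bytes_per_s * sampling


def percent_increase(rate_now, rate_previous_day):
    """100 * (now - previous) / previous; negative when traffic decreased."""
    return 100 * (rate_now - rate_previous_day) / rate_previous_day


# Made-up byte rates in bytes per second, for illustration only.
now, yesterday = 3000.0, 1200.0

print(estimate_actual_rate(now, 100))          # 300000.0
print(percent_increase(now, yesterday))        # 150.0
print(percent_increase(now, yesterday) > 100)  # True: passes a "> 100" threshold
print(percent_increase(yesterday, now))        # -60.0: traffic went down
```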
Together, the complete expression for the PrometheusRule looks like the following:
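As a minimal sketch, the three customizations can be composed into one expression string in Python. The helper names `rate_query` and `surge_expression` are hypothetical, and the exact parenthesization is an assumption derived from the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> combined with the noise filter and the final threshold.

```python
# Metric selector and grouping label taken from the base query above.
METRIC = 'netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}'
GROUP = "DstK8S_Namespace"


def rate_query(window="30m", offset=None):
    """Base query, optionally shifted back in time with the offset modifier."""
    shift = f" offset {offset}" if offset else ""
    return f"sum(rate({METRIC}[{window}]{shift})) by ({GROUP})"


def surge_expression(min_rate=1000, offset="1d", min_increase=100):
    """Compose the noise filter, the time comparison, and the final threshold."""
    current = f"({rate_query()} > {min_rate})"   # drop low-bandwidth consumers
    previous = rate_query(offset=offset)          # same query, one day earlier
    return f"(100 * ({current} - {previous}) / {previous}) > {min_increase}"


print(surge_expression())
```

Building the string from one `rate_query` helper keeps the current and offset queries identical apart from the offset, which the percentage formula relies on.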
9.2.2.2. Alert metadata fields
The Network Observability Operator uses components from other OpenShift Container Platform features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see: "Monitoring stack architecture".
Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the Alertmanager service from the monitoring stack, or by the Network Health dashboard.
The following example shows an AlertingRule resource with the configured metadata:
where:
- spec.groups.rules.alert.labels.netobserv: When set to true, marks the alert so that the Network Health dashboard detects it.
- spec.groups.rules.alert.labels.severity: Specifies the severity of the alert. The following values are valid: critical, warning, or info.
You can leverage the output labels from the defined PromQL expression in the message annotation. In the example, since results are grouped per DstK8S_Namespace, the expression {{ $labels.DstK8S_Namespace }} is used in the message text.
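The labels and the message annotation described above can be assembled as in the following sketch. The helper `alert_metadata` and its `{target}` placeholder convention are hypothetical; only the netobserv and severity labels, the valid severity values, and the {{ $labels.<name> }} syntax come from the text.

```python
VALID_SEVERITIES = ("critical", "warning", "info")


def alert_metadata(severity, group_label, text):
    """Build labels and annotations for one alert rule (hypothetical helper)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {VALID_SEVERITIES}")
    # The netobserv label lets the Network Health dashboard detect the alert.
    labels = {"netobserv": "true", "severity": severity}
    # Reference the output label that the PromQL expression groups by.
    placeholder = "{{ $labels." + group_label + " }}"
    annotations = {"message": text.format(target=placeholder)}
    return labels, annotations


labels, annotations = alert_metadata(
    "warning", "DstK8S_Namespace", "Surge of incoming traffic to {target}"
)
print(labels)
print(annotations["message"])
```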
The netobserv_io_network_health annotation is optional, and controls how the alert is rendered on the Network Health page.
The netobserv_io_network_health annotation is a JSON string consisting of the following fields:
| Field | Type | Description |
|---|---|---|
| namespaceLabels | List of strings | One or more labels that hold namespaces. When provided, the alert appears under the Namespaces tab. |
| nodeLabels | List of strings | One or more labels that hold node names. When provided, the alert appears under the Nodes tab. |
|  | String | The alert threshold, expected to match the threshold defined in the PromQL expression. |
|  | String | The data unit, used only for display purposes. |
|  | String | An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped. |
|  | List of objects | A list of links to display contextually with the alert. Each link requires a |
|  | String | An additional filter to inject into the URL for the Network Traffic page. |
The namespaceLabels and nodeLabels are mutually exclusive. If neither is provided, the alert appears under the Global tab.
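The following sketch serializes the annotation as a JSON string and derives the dashboard tab from it. The helper names are hypothetical, and only the namespaceLabels and nodeLabels keys are used because they are the two field names confirmed in the text; any other fields would be passed through `other_fields` unchanged.

```python
import json


def health_annotation(namespace_labels=None, node_labels=None, **other_fields):
    """Serialize the netobserv_io_network_health value (hypothetical helper)."""
    if namespace_labels and node_labels:
        raise ValueError("namespaceLabels and nodeLabels are mutually exclusive")
    fields = dict(other_fields)
    if namespace_labels:
        fields["namespaceLabels"] = list(namespace_labels)
    if node_labels:
        fields["nodeLabels"] = list(node_labels)
    return json.dumps(fields)


def dashboard_tab(annotation):
    """Return the Network Health tab implied by the annotation."""
    fields = json.loads(annotation)
    if fields.get("namespaceLabels"):
        return "Namespaces"
    if fields.get("nodeLabels"):
        return "Nodes"
    return "Global"  # neither label list was provided


value = health_annotation(namespace_labels=["DstK8S_Namespace"])
print(value)                 # {"namespaceLabels": ["DstK8S_Namespace"]}
print(dashboard_tab(value))  # Namespaces
print(dashboard_tab("{}"))   # Global
```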
9.2.3. Creating custom alert rules
Use the Prometheus Query Language (PromQL) to define a custom AlertingRule resource that triggers alerts based on specific network metrics, such as traffic surges.
Prerequisites
- Familiarity with PromQL.
- You have installed OpenShift Container Platform 4.14 or later.
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the Network Observability Operator.
Procedure
- Create a YAML file named custom-alert.yaml that contains your AlertingRule resource.
- Apply the custom alert rule by running the following command:
  $ oc apply -f custom-alert.yaml
Verification
- Verify that the PrometheusRule resource was created in the netobserv namespace by running the following command:
  $ oc get prometheusrules -n netobserv -oyaml
  The output should include the netobserv-alerts rule you just created, confirming that the resource was generated correctly.
- Confirm the rule is active by checking the Network Health dashboard in the OpenShift Container Platform web console under Observe.
9.2.4. Disabling predefined alerts
Alert templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of alert template names. For a list of alert template names, see "List of default alert templates".
If a template is disabled and overridden in the spec.processor.metrics.alerts field, the disable setting takes precedence and the alert rule is not created.
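The interaction between defaults, customized templates, and disabled templates can be modeled as follows. This is an illustrative model of the documented precedence, not the Operator's code: the function `effective_alerts` is hypothetical, and variants are represented by placeholder strings rather than real threshold or grouping settings.

```python
def effective_alerts(defaults, overrides, disabled):
    """Resolve which variants apply per template (illustrative model only).

    defaults, overrides: dict of template name -> list of variants.
    disabled: iterable of template names listed in disableAlerts.
    """
    disabled = set(disabled)
    result = {}
    for template in sorted(defaults.keys() | overrides.keys()):
        if template in disabled:
            continue  # disabling wins, even when the template is overridden
        # A customized template replaces its defaults; nothing is merged.
        if template in overrides:
            result[template] = overrides[template]
        else:
            result[template] = defaults[template]
    return result


defaults = {"DNSErrors": ["default-a", "default-b"], "NetpolDenied": ["default-c"]}
overrides = {"DNSErrors": ["custom-a"], "NetpolDenied": ["custom-c"]}
print(effective_alerts(defaults, overrides, disabled=["NetpolDenied"]))
# {'DNSErrors': ['custom-a']}
```

To keep a default variant alongside a custom one in this model, the default has to be repeated in the override list, which mirrors the requirement to replicate default configurations manually.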