Chapter 12. Network observability health rules
The Network Observability Operator provides alerts by using built-in metrics and the OpenShift Container Platform monitoring stack to report cluster network health.
Network observability health alerts require OpenShift Container Platform 4.16 or later.
12.1. Identifying network issues with automated health rules
Network observability identifies network issues by using automated health rules to monitor metrics. These rules trigger alerts when anomalies occur, which assists in maintaining connectivity and responding to network degradation.
The Network Observability Operator manages a system of Prometheus-based rules that detect network problems, and converts these rules into PrometheusRule resources. It supports the following rule types:
- Alerting rules: Trigger notifications through the Prometheus Alertmanager when network anomalies or infrastructure failures are detected.
- Recording rules: Pre-compute complex Prometheus Query Language (PromQL) expressions into new time series to improve dashboard performance.
12.1.1. Importance of network health monitoring
Maintaining reliable and secure network connectivity is critical for cluster administrators and security teams. Unresolved network issues can result in the following consequences:
- Application downtime caused by packet drops or DNS failures.
- Security risks from undetected network policy violations.
- Performance degradation caused by latency spikes or bandwidth saturation.
- Compliance issues from unmonitored network traffic.
Early detection of these issues allows for resolution before service level objectives (SLOs) are affected.
12.1.2. Automated health monitoring
The Network Observability Operator provides automated health monitoring through the following features:
- Pre-configured health rules: Detects common network problems by using default thresholds.
- Automated alerting: Integrates with the OpenShift Container Platform monitoring stack.
- Health dashboards: Displays health status for clusters, nodes, namespaces, and workloads.
- Custom rules: Supports the creation of organization-specific monitoring rules.
Health rules monitor network flow metrics and trigger alerts when defined thresholds are exceeded. For example, the PacketDropsByKernel rule reports an alert when kernel packet drop rates exceed defined levels.
12.1.3. Network health monitoring workflow
Monitoring network health involves the following phases:
- Configuring the Network Observability Operator to collect required network health data for monitoring, such as packet drops or DNS tracking.
- Reviewing and customizing default health rules and thresholds in the FlowCollector custom resource.
- Monitoring alerts in the OpenShift Container Platform web console in the Observe → Alerting and Observe → Network Health views.
- Creating custom health rules for specific requirements.
- Configuring recording rules to optimize performance for large-scale deployments.
The PrometheusRule resource in the netobserv namespace can be viewed by running the following command:
$ oc get prometheusrules -n netobserv -o yaml
12.1.4. Detecting network issues with automated health rules
The Network Observability Operator includes a rule-based system to detect network anomalies and infrastructure failures. By converting configurations into alerting rules, the Operator provides automated monitoring and troubleshooting through the OpenShift Container Platform web console.
12.1.4.1. Monitoring outcomes
The Network Observability Operator displays network status in the following views:
- Alerting UI: Specific alerts appear in Observe → Alerting. Notifications are managed through the Prometheus Alertmanager.
- Network Health dashboard: A specialized dashboard in Observe → Network Health provides a summary of cluster network status.
The Network Health dashboard categorizes violations into tabs to isolate the scope of an issue:
- Global: Aggregate health of the cluster.
- Nodes: Violations specific to infrastructure nodes.
- Namespaces: Violations specific to individual namespaces.
- Workloads: Violations specific to resources, such as Deployments or DaemonSets.
12.1.4.2. Predefined health rules
The Network Observability Operator provides default rules for common networking scenarios. These rules are active only if the corresponding feature is enabled in the FlowCollector custom resource (CR).
The following list contains a subset of available default rules:
- PacketDropsByDevice: Reports a high percentage of packet drops from network devices. This rule is based on node-exporter metrics and does not require the PacketDrop agent feature.
- PacketDropsByKernel: Reports a high percentage of packet drops by the kernel. This rule requires the PacketDrop agent feature.
- IPsecErrors: Reports IPsec encryption errors. This rule requires the IPSec agent feature.
- NetpolDenied: Reports traffic denied by network policies. This rule requires the NetworkEvents agent feature.
- LatencyHighTrend: Reports a significant increase in TCP latency. This rule requires the FlowRTT agent feature.
- DNSErrors: Reports DNS errors. This rule requires the DNSTracking agent feature.
The following operational alerts apply to the Network Observability Operator:
- NetObservNoFlows: Reports when the pipeline is active but no flows are observed.
- NetObservLokiError: Reports when flows are dropped because of Loki errors.
For a complete list of rules and runbooks, see the Network Observability Operator runbooks.
12.1.4.3. Enabling features for health monitoring
The Network Observability Operator creates rules based on the features enabled in the FlowCollector CR.
For example, packet drop rules are created only if the PacketDrop agent feature is enabled. Rules depend on metrics; if the required metrics are unavailable, configuration warnings might appear. Configure metrics in the spec.processor.metrics.includeList field of the FlowCollector resource.
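The dependency between agent features and default rules can be expressed as a small lookup. The following Python sketch is only a planning aid built from the rule list in this chapter; the function names are hypothetical, and it does not query the cluster or reproduce the Operator's own logic.

```python
# Required agent feature per default rule template, as listed in the
# "Predefined health rules" section. None means no agent feature is needed.
REQUIRED_FEATURE = {
    "PacketDropsByDevice": None,  # based on node-exporter metrics
    "PacketDropsByKernel": "PacketDrop",
    "IPsecErrors": "IPSec",
    "NetpolDenied": "NetworkEvents",
    "LatencyHighTrend": "FlowRTT",
    "DNSErrors": "DNSTracking",
}


def active_rules(enabled_features):
    """Return the rule templates whose required feature is enabled."""
    enabled = set(enabled_features)
    return sorted(
        rule
        for rule, feature in REQUIRED_FEATURE.items()
        if feature is None or feature in enabled
    )


def missing_features(wanted_rules, enabled_features):
    """Map each wanted rule to the feature that still has to be enabled."""
    enabled = set(enabled_features)
    return {
        rule: REQUIRED_FEATURE[rule]
        for rule in wanted_rules
        if REQUIRED_FEATURE[rule] is not None
        and REQUIRED_FEATURE[rule] not in enabled
    }


print(active_rules(["PacketDrop", "DNSTracking"]))
print(missing_features(["LatencyHighTrend", "DNSErrors"], ["DNSTracking"]))
```

A check like this can be run before editing the FlowCollector CR to see which agent features a planned set of rules depends on.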
12.2. Health rule threshold and grouping customization
Health rules in the Network Observability Operator are defined by using rule templates and variants in the spec.processor.metrics.healthRules field of the FlowCollector custom resource (CR). Customizing these templates allows for flexible, fine-grained alerting tailored to specific environment needs.
For each template, a list of variants can be defined, each with distinct thresholds and grouping configurations.
The following example shows a FlowCollector configuration with custom health rules:
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    metrics:
      healthRules:
      - template: PacketDropsByKernel
        mode: Alert # or Recording
        variants:
        # Triggered when aggregate cluster traffic reaches 10% drops
        - thresholds:
            critical: "10"
        # Triggered per-node with increasing severity levels
        - thresholds:
            critical: "15"
            warning: "10"
            info: "5"
          groupBy: Node
- spec.processor.metrics.healthRules.template: Specifies the name of the predefined rule template.
- spec.processor.metrics.healthRules.mode: Specifies whether the rule functions as an Alert or a Recording rule.
- spec.processor.metrics.healthRules.variants.thresholds: Specifies the numerical values that trigger the rule. Multiple severity levels, such as critical, warning, or info, can be defined within a single variant.
- spec.processor.metrics.healthRules.variants.groupBy: Specifies the dimension used to aggregate the metric, such as Node or Namespace.
Customizing a rule replaces the default configuration for that template. To retain default configurations, the default settings must be manually included in the custom resource.
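To reason about how a multi-level variant behaves, it can help to trace a few values by hand. The following Python sketch is a simplified mental model, not the Operator's implementation: it assumes that a value maps to the most severe level whose threshold it strictly exceeds, and the per-node drop percentages are made-up illustrative numbers.

```python
SEVERITY_ORDER = ["critical", "warning", "info"]  # most to least severe


def severity_for(value, thresholds):
    """Return the most severe level whose threshold the value exceeds, or None."""
    for level in SEVERITY_ORDER:
        # Thresholds are quoted strings in the FlowCollector YAML, so convert them.
        if level in thresholds and value > float(thresholds[level]):
            return level
    return None


# Per-node variant from the FlowCollector example in this section.
per_node = {"critical": "15", "warning": "10", "info": "5"}

# Hypothetical kernel drop percentages per node (illustrative values only).
drops_by_node = {"node-a": 3.0, "node-b": 7.5, "node-c": 12.0, "node-d": 40.0}

for node, pct in drops_by_node.items():
    print(node, severity_for(pct, per_node))
```

With groupBy: Node, each node is evaluated on its own value, so one node can be at the critical level while the others stay below the info threshold.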
12.2.1. Health rule query and metadata reference
The FlowCollector health rule API maps to the Prometheus Operator to generate PrometheusRule objects. Use these base Prometheus Query Language (PromQL) patterns and metadata configurations to create custom health rules for network observability.
The PrometheusRule resource in the netobserv namespace can be viewed by running the following command:
$ oc get prometheusrules -n netobserv -o yaml
12.2.1.1. Customizing alert logic with PromQL: Incoming traffic surge
The following PromQL query calculates the byte rate from the openshift-ingress namespace to any workload namespace over a 30-minute interval:
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
Queries can be customized to filter low-bandwidth data, compare time periods, and establish thresholds.
- Data filtering: Appending > 1000 to the query removes rates lower than 1 KB/s to filter low-bandwidth traffic.
  (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
  Note: The byte rate is relative to the sampling interval in the FlowCollector CR. Normalizing byte rates with the netobserv_agent_sampling_rate metric decouples the PromQL expression from the sampling configuration.
- Time comparison: The offset modifier compares data across different time periods. For example, offset 1d retrieves data from the previous day.
  sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
- Threshold application: A final threshold filters increases below a specific percentage. For example, > 100 removes increases lower than 100%.
The following example shows a complete PromQL expression for a PrometheusRule:
expr: |-
  (100 *
    (
      (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
      - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
    )
    / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
  > 100
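The arithmetic behind this expression can be checked with plain numbers. The following Python sketch mimics the expression for a single destination namespace; the function names and the sample rates are hypothetical. A return value of None stands for "no sample returned", which is what PromQL produces when a comparison filter removes a series or when one side of a binary operation has no data.

```python
import math


def _div(a, b):
    """Float division that yields Inf or NaN for a zero divisor, as PromQL does."""
    if b != 0:
        return a / b
    if a == 0:
        return math.nan
    return math.copysign(math.inf, a)


def surge_percent(current, previous, min_rate=1000.0, min_increase=100.0):
    """Percentage increase of `current` over `previous` (both in bytes/s).

    Returns None where the PromQL expression would return no sample.
    """
    if current <= min_rate:  # "> 1000" removes low-bandwidth series
        return None
    if previous is None:  # no series for the previous day: nothing to match
        return None
    increase = 100.0 * _div(current - previous, previous)
    # The final "> 100" keeps only increases above 100%; NaN never passes.
    return increase if increase > min_increase else None


print(surge_percent(5000.0, 2000.0))  # above both thresholds
print(surge_percent(3000.0, 2000.0))  # only a 50% increase: filtered out
print(surge_percent(800.0, 100.0))    # large increase but under 1 KB/s: filtered out
```

The third call shows why the low-bandwidth filter matters: without it, a namespace going from 100 B/s to 800 B/s would count as a 700% surge.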
12.2.1.2. Alert metadata fields
Rule definitions require specific metadata for the Prometheus Alertmanager service and the Network Health dashboard. The following example shows an AlertingRule resource with configured metadata:
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
        message: |-
          Surge of incoming traffic detected: current traffic to {{ $labels.DstK8S_Namespace }} increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        # ... (PromQL expression)
      for: 1m
      labels:
        app: netobserv
        netobserv: "true"
        severity: warning
- spec.groups.rules.alert.labels.netobserv: Specifies that the Network Health dashboard must detect the alert when set to true.
- spec.groups.rules.alert.labels.severity: Specifies the alert severity. Valid values are critical, warning, or info.
12.2.1.3. netobserv_io_network_health annotation fields
The optional netobserv_io_network_health annotation is a JSON string that controls how the alert renders on the Network Health page.
| Field | Type | Description |
|---|---|---|
| namespaceLabels | List of strings | One or more labels containing namespaces. Alerts appear under the Namespaces tab. |
| nodeLabels | List of strings | One or more labels containing node names. Alerts appear under the Nodes tab. |
| | List of strings | One or more labels containing owner or workload names. Alerts appear under the Owners tab when |
| threshold | String | The alert threshold. This value should match the threshold in the PromQL expression. |
| unit | String | The data unit for display purposes. |
| upperBound | String | An upper bound value used to calculate scores on a closed scale. Metric values exceeding this bound are clamped. |
The namespaceLabels and nodeLabels fields are mutually exclusive. If neither is provided, the alert appears under the Global tab.
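Because the annotation is a JSON string embedded in YAML, generating it programmatically avoids quoting mistakes. The following Python sketch uses only the standard json module; the helper names are hypothetical, and owner or workload labels are deliberately not modelled. It enforces the mutual exclusion described above and derives the tab from the label fields.

```python
import json


def network_health_annotation(threshold, unit, upper_bound,
                              namespace_labels=None, node_labels=None):
    """Serialize the annotation value as a compact JSON string."""
    if namespace_labels and node_labels:
        raise ValueError("namespaceLabels and nodeLabels are mutually exclusive")
    fields = {}
    if namespace_labels:
        fields["namespaceLabels"] = list(namespace_labels)
    if node_labels:
        fields["nodeLabels"] = list(node_labels)
    # Scalar values are written as strings, as in the example annotation.
    fields["threshold"] = str(threshold)
    fields["unit"] = unit
    fields["upperBound"] = str(upper_bound)
    return json.dumps(fields, separators=(",", ":"))


def dashboard_tab(annotation):
    """Return the Network Health tab implied by the label fields."""
    fields = json.loads(annotation)
    if fields.get("namespaceLabels"):
        return "Namespaces"
    if fields.get("nodeLabels"):
        return "Nodes"
    return "Global"


value = network_health_annotation(100, "%", 500,
                                  namespace_labels=["DstK8S_Namespace"])
print(value)
print(dashboard_tab(value))
```

The printed string can be pasted between single quotes as the value of the netobserv_io_network_health annotation.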
12.2.2. Configuring custom health rules
Create custom health rules by using Prometheus Query Language (PromQL) to define an AlertingRule resource. These rules trigger alerts based on specific network metrics, such as traffic surges.
Prerequisites
- Access to the cluster with cluster-admin privileges.
- The Network Observability Operator is installed.
- OpenShift Container Platform 4.16 or later is installed.
- Familiarity with PromQL.
Custom PrometheusRule resources are not owned by the FlowCollector resource. Custom rules created in the netobserv namespace might be deleted if the Network Observability Operator is uninstalled. To prevent data loss, create custom rules in a different namespace, such as openshift-monitoring, and maintain a backup in version control.
Procedure
- Define an AlertingRule resource in a YAML file, for example, custom-alert.yaml.
- Apply the custom alert rule by running the following command:
  $ oc apply -f custom-alert.yaml
Verification
- Confirm the PrometheusRule resource was created in the target namespace by running the following command:
  $ oc get prometheusrules -n <namespace> -o yaml
- Confirm the rule is active in the OpenShift Container Platform web console:
  - Navigate to Observe → Alerting to see the firing status.
  - Navigate to Observe → Network Health to view the dashboard integration.
12.3. Performance optimization with recording rules
In large-scale clusters, recording rules optimize how Prometheus handles network data. Recording rules improve dashboard responsiveness and reduce the computational overhead of complex queries.
12.3.1. Optimization benefits
Recording rules pre-compute complex Prometheus Query Language (PromQL) expressions and save the results as new time series. Unlike alerting rules, recording rules do not monitor thresholds.
Using recording rules provides the following advantages:
- Improved performance: Pre-computing Prometheus queries allows dashboards to load faster by avoiding on-demand calculations for long-term trends.
- Resource efficiency: Calculating data at fixed intervals reduces CPU load on the Prometheus server compared to recalculating data on every dashboard refresh.
- Simplified queries: Using short metric names, such as cluster:network_traffic:rate_5m, simplifies complex aggregate calculations in custom dashboards.
12.3.2. Comparison of rule modes
The following table compares rule modes based on the expected outcome:
| Feature | Alerting rules | Recording rules |
|---|---|---|
| Primary goal | Issue notification. | Persistent metric history. |
| Data output | Alerting state. | New time series metric. |
| UI visibility | Alerting and Network Health views. | Metrics Explorer and Network Health views. |
| Notifications | Triggers notifications. | Does not trigger notifications. |
12.3.3. Integrating recording rules with the health dashboard
Custom recording rules that contribute to the Network Health dashboard must meet specific metadata requirements.
- Label requirements: Include the netobserv: "true" label in the labels field of the rule and the PrometheusRule metadata. The Network Observability Operator identifies PrometheusRule resources cluster-wide by using this label.
- Annotation requirements: Include the netobserv.io/network-health annotation in the PrometheusRule metadata. This annotation is required for recording rules to appear in the Network Health dashboard. The value is a JSON object where keys are the metric names (the record field of each rule). Each value consists of the following fields:
  - summary: An optional short title. This field supports Prometheus template syntax, such as {{ $labels.namespace }}.
  - description: An optional description. This field supports Prometheus template syntax.
  - netobserv_io_network_health: A required JSON string. For recording rules, use the recordingThresholds field instead of threshold. This field determines the health score and UI coloring, such as {"info":"10","warning":"25","critical":"50"}.
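The annotation value nests JSON inside JSON: the netobserv_io_network_health field is itself a JSON document encoded as a string, so its quotes must be escaped. The following Python sketch, which uses only the standard json module, shows one way to generate the value without hand-escaping. The helper name is hypothetical, and the metric name and texts are placeholders.

```python
import json


def recording_rule_annotation(record, summary, description, health):
    """Build the netobserv.io/network-health value for one recording rule.

    `health` is a dict. It is serialized to a string first, because the
    netobserv_io_network_health field holds JSON encoded as a string.
    """
    inner = json.dumps(health, separators=(",", ":"))
    outer = {
        record: {
            "summary": summary,
            "description": description,
            "netobserv_io_network_health": inner,
        }
    }
    # The second dumps call escapes the quotes of the inner string.
    return json.dumps(outer, indent=2)


annotation = recording_rule_annotation(
    record="my_metric_per_namespace",  # must match the rule's record field
    summary="Custom metric is {{ $value }} in the namespace {{ $labels.namespace }}",
    description="Custom metric is {{ $value }} in the namespace {{ $labels.namespace }}",
    health={
        "unit": "%",
        "upperBound": "100",
        "namespaceLabels": ["namespace"],
        # Recording rules use recordingThresholds instead of threshold.
        "recordingThresholds": {"info": "10", "warning": "25", "critical": "50"},
    },
)
print(annotation)
```

The printed text can be placed under a YAML literal block scalar (|) as the annotation value in the PrometheusRule metadata.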
12.3.4. Optimizing dashboard metrics with recording rules
Create custom recording rules to pre-compute metrics for the Network Health dashboard. Recording rules require specific annotations and labels to integrate with the Network Observability Operator.
Prerequisites
- Access to the cluster with cluster-admin privileges.
- The Network Observability Operator is installed.
- OpenShift Container Platform 4.16 or later is installed.
- Familiarity with PromQL.
Custom PrometheusRule resources are not owned by the FlowCollector resource. Custom rules created in the netobserv namespace might be deleted if the Network Observability Operator is uninstalled. To prevent data loss, create custom rules in a different namespace, such as openshift-monitoring, and maintain a backup in version control.
Procedure
- Define a PrometheusRule resource in a YAML file, such as custom-recording-rule.yaml, ensuring the netobserv: "true" label and netobserv.io/network-health annotation are included:
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: my-recording-rules
    namespace: openshift-monitoring
    labels:
      netobserv: "true"
    annotations:
      netobserv.io/network-health: |
        {
          "my_metric_per_namespace": {
            "summary": "Custom metric is {{ $value }} in the namespace {{ $labels.namespace }}",
            "description": "Custom metric is {{ $value }} in the namespace {{ $labels.namespace }}",
            "netobserv_io_network_health": "{\"unit\":\"%\",\"upperBound\":\"100\",\"namespaceLabels\":[\"namespace\"],\"recordingThresholds\":{\"info\":\"10\",\"warning\":\"25\",\"critical\":\"50\"}}"
          }
        }
  spec:
    groups:
      - name: MyRecordingRules
        interval: 30s
        rules:
          - record: my_metric_per_namespace
            expr: (count by (namespace) (kube_pod_info) * 0 + 20)
            labels:
              netobserv: "true"
- Apply the custom recording rule by running the following command:
  $ oc apply -f custom-recording-rule.yaml
Verification
- Confirm the PrometheusRule resource exists by running the following command:
  $ oc get prometheusrules my-recording-rules -n openshift-monitoring -o yaml
- Confirm the recording rule appears in the OpenShift Container Platform web console by navigating to Observe → Network Health.
12.4. Disabling default rules
Rule templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of rule template names. For a list of alert template names, see "List of default rules".
If a rule template is included in the disableAlerts list, it is not created, even if a custom override exists in the spec.processor.metrics.healthRules field. The disableAlerts configuration takes precedence over all other health rule settings.
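The precedence between default rules, custom overrides, and the disable list can be summarized in a few lines. The following Python sketch is a model of the behavior described in this chapter, not the Operator's code; the function name and the sample template lists are illustrative.

```python
def effective_templates(default_templates, health_rules, disable_alerts):
    """Return {template: source} for the rules that would be generated.

    default_templates: names of templates that are active by default
    health_rules: template names customized in spec.processor.metrics.healthRules
    disable_alerts: template names listed in spec.processor.metrics.disableAlerts
    """
    result = {name: "default" for name in default_templates}
    for name in health_rules:
        result[name] = "custom"  # a custom entry replaces the default configuration
    for name in disable_alerts:
        result.pop(name, None)  # disableAlerts takes precedence over everything else
    return result


defaults = ["PacketDropsByKernel", "DNSErrors"]

# A custom override alone replaces the default for that template.
print(effective_templates(defaults, ["PacketDropsByKernel"], []))

# Disabling the same template removes it despite the custom override.
print(effective_templates(defaults, ["PacketDropsByKernel"], ["PacketDropsByKernel"]))
```

In the second call, only DNSErrors remains, which matches the statement that a disabled template is not created even when a custom override exists.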