Chapter 13. Monitoring the Network Observability Operator
Use the OpenShift Container Platform web console to monitor alerts related to the Network Observability Operator’s health. This helps you maintain system stability and quickly detect operational issues.
13.1. Health dashboards
View the Network Observability Operator health dashboards in the OpenShift Container Platform web console to monitor the health status, resource usage, and internal statistics of the operator and its components.
Metrics are located in the Observe → Dashboards page of the web console. The following metrics are displayed:
- Flows per second
- Sampling
- Errors last minute
- Dropped flows per second
- Flowlogs-pipeline statistics
- Flowlogs-pipeline statistics views
- eBPF agent statistics views
- Operator statistics
- Resource usage
13.2. Health alerts
Understand the health alerts generated by the Network Observability Operator, which trigger banners when conditions like Loki ingestion errors, zero flow ingestion, or dropped eBPF flows occur.
A health alert banner that directs you to the dashboard can appear on the Network Traffic and Home pages if an alert is triggered. Alerts are generated in the following cases:
- The NetObservLokiError alert occurs if the flowlogs-pipeline workload is dropping flows because of Loki errors, such as if the Loki ingestion rate limit has been reached.
- The NetObservNoFlows alert occurs if no flows are ingested for a certain amount of time.
- The NetObservFlowsDropped alert occurs if the Network Observability eBPF agent hashmap table is full, and the eBPF agent processes flows with degraded performance, or when the capacity limiter is triggered.
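If the NetObservLokiError alert fires because the Loki ingestion rate limit has been reached, one option is to raise that limit on the LokiStack resource itself. The following is a minimal sketch, assuming a LokiStack named loki-stack in the netobserv namespace; the resource name, namespace, and limit values are illustrative assumptions, not values from this document:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki-stack      # illustrative name; use your LokiStack's actual name
  namespace: netobserv
spec:
  limits:
    global:
      ingestion:
        ingestionRate: 10        # sustained per-tenant ingestion rate, in MB/s
        ingestionBurstSize: 16   # short bursts above the sustained rate, in MB
```

Raising the ingestion limits increases the load Loki accepts, so size them against the capacity of your Loki storage and compute resources.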
13.3. Viewing health information
View the Netobserv/Health dashboard within the OpenShift Container Platform web console to monitor the health status and resource usage of the Network Observability Operator and its components.
Prerequisites
- You have the Network Observability Operator installed.
- You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
Procedure
- From the Administrator perspective in the web console, navigate to Observe → Dashboards.
- From the Dashboards dropdown, select Netobserv/Health.
- View the metrics about the health of the Operator that are displayed on the page.
13.3.1. Disabling health alerts
Disable specific health alerts, such as NetObservLokiError or NetObservNoFlows, by editing the FlowCollector resource and using the spec.processor.metrics.disableAlerts specification.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
- Add spec.processor.metrics.disableAlerts to disable health alerts, as in the following YAML sample:

  ```yaml
  apiVersion: flows.netobserv.io/v1beta2
  kind: FlowCollector
  metadata:
    name: cluster
  spec:
    processor:
      metrics:
        disableAlerts: [NetObservLokiError, NetObservNoFlows] 1
  ```

  1 You can specify one or a list with both types of alerts to disable.
13.4. Creating Loki rate limit alerts for the NetObserv dashboard
Create a custom AlertingRule resource based on Loki metrics to monitor for and trigger alerts when the Loki ingestion rate limits are reached, indicated by HTTP 429 errors.
You can create custom alerting rules for the Netobserv dashboard metrics to trigger alerts when Loki rate limits have been reached.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
- You have the Network Observability Operator installed.
Procedure
- Create a YAML file by clicking the import icon, +.
- Add an alerting rule configuration to the YAML file. In the YAML sample that follows, an alert is created for when Loki rate limits have been reached:

  ```yaml
  apiVersion: monitoring.openshift.io/v1
  kind: AlertingRule
  metadata:
    name: loki-alerts
    namespace: openshift-monitoring
  spec:
    groups:
    - name: LokiRateLimitAlerts
      rules:
      - alert: LokiTenantRateLimit
        annotations:
          message: |-
            {{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
          summary: "At any number of requests are responded with the rate limit error code."
        expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 0
        for: 10s
        labels:
          severity: warning
  ```

- Click Create to apply the configuration file to the cluster.
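The expr in the sample computes the percentage of Loki requests that return HTTP 429, and fires as soon as that percentage is above zero. To alert only on sustained throttling rather than a single 429 response, you can raise the threshold and lengthen the for duration. A sketch of the adjusted rule fragment follows; the 10% threshold and 2m window are illustrative choices, not values from the original:

```yaml
# Fire only when more than 10% of Loki requests are rate limited
# continuously for 2 minutes.
expr: |-
  sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route)
  / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route)
  * 100 > 10
for: 2m
```

A higher threshold trades alert sensitivity for fewer false positives; tune it to the level of flow loss you are willing to tolerate.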
13.5. Using the eBPF agent alert
Resolve the NetObservAgentFlowsDropped alert, which occurs when the eBPF agent hashmap is full, by increasing the spec.agent.ebpf.cacheMaxFlows value in the FlowCollector custom resource.
The NetObservAgentFlowsDropped alert is also triggered when the capacity limiter engages. If you see this alert, consider increasing the cacheMaxFlows value in the FlowCollector resource, as shown in the following example.
Note: Increasing the cacheMaxFlows value might increase the memory usage of the eBPF agent.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the Network Observability Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
- Increase the spec.agent.ebpf.cacheMaxFlows value, as shown in the following YAML sample:

  ```yaml
  apiVersion: flows.netobserv.io/v1beta2
  kind: FlowCollector
  metadata:
    name: cluster
  spec:
    namespace: netobserv
    deploymentModel: Service
    agent:
      type: eBPF
      ebpf:
        cacheMaxFlows: 200000 1
  ```

  1 Increase the cacheMaxFlows value from its value at the time of the NetObservAgentFlowsDropped alert.
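Because a larger cacheMaxFlows increases the memory usage of the eBPF agent, you might also want to raise the agent's memory limit in the same resource. The following sketch uses the spec.agent.ebpf.resources field of the FlowCollector API; the 800Mi limit is an illustrative value, not a recommendation from this document:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      cacheMaxFlows: 200000
      resources:
        limits:
          memory: 800Mi   # illustrative; size against your node capacity
```

Monitor the agent pods on the Netobserv/Health dashboard after the change to confirm the new limit is sufficient.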