Network Observability
Configuring and using the Network Observability Operator in OpenShift Container Platform
Abstract
Chapter 1. Network Observability Operator release notes
The Network Observability Operator enables administrators to observe and analyze network traffic flows for OpenShift Container Platform clusters.
These release notes track the development of the Network Observability Operator in the OpenShift Container Platform.
For an overview of the Network Observability Operator, see About Network Observability Operator.
1.1. Network Observability Operator 1.6.2
The following advisory is available for the Network Observability Operator 1.6.2:
1.1.1. CVEs
1.1.2. Bug fixes
- When secondary interface support was added, it was necessary to iterate multiple times to register each network namespace with netlink to learn about interface notifications. At the same time, unsuccessful handlers leaked file descriptors because, with the TCX hook, unlike TC, handlers needed to be explicitly removed when the interface went down. Now, file descriptors no longer leak when creating and deleting pods. (NETOBSERV-1805)
1.1.3. Known issues
There was a compatibility issue with console plugins that would have prevented Network Observability from being installed on future versions of an OpenShift Container Platform cluster. By upgrading to 1.6.2, the compatibility issue is resolved and Network Observability can be installed as expected. (NETOBSERV-1737)
1.2. Network Observability Operator 1.6.1
The following advisory is available for the Network Observability Operator 1.6.1:
1.2.1. CVEs
1.2.2. Bug fixes
- Previously, information about packet drops, such as the cause and TCP state, was only available in the Loki datastore and not in Prometheus. For that reason, the drop statistics in the OpenShift web console plugin Overview were only available with Loki. With this fix, information about packet drops is also added to metrics, so you can view drop statistics when Loki is disabled. (NETOBSERV-1649)
- When the eBPF agent PacketDrop feature was enabled, and sampling was configured to a value greater than 1, reported dropped bytes and dropped packets ignored the sampling configuration. While this was done on purpose, so as not to miss any drops, a side effect was that the reported proportion of drops compared with non-drops became biased. For example, at a very high sampling rate, such as 1:1000, it was likely that almost all the traffic appears to be dropped when observed from the console plugin. With this fix, the sampling configuration is honored with dropped bytes and packets. (NETOBSERV-1676)
- Previously, the SR-IOV secondary interface was not detected if the interface was created first and then the eBPF agent was deployed. It was only detected if the agent was deployed first and then the SR-IOV interface was created. With this fix, the SR-IOV secondary interface is detected no matter the sequence of the deployments. (NETOBSERV-1697)
- Previously, when Loki was disabled, the Topology view in the OpenShift web console displayed the Cluster and Zone aggregation options in the slider beside the network topology diagram, even when the related features were not enabled. With this fix, the slider now only displays options according to the enabled features. (NETOBSERV-1705)
- Previously, when Loki was disabled, and the OpenShift web console was first loading, an error would occur: Request failed with status code 400 Loki is disabled. With this fix, the errors no longer occur. (NETOBSERV-1706)
- Previously, in the Topology view of the OpenShift web console, when clicking on the Step into icon next to any graph node, the filters were not applied as required in order to set the focus to the selected graph node, resulting in showing a wide view of the Topology view in the OpenShift web console. With this fix, the filters are correctly set, effectively narrowing down the Topology. As part of this change, clicking the Step into icon on a Node now brings you to the Resource scope instead of the Namespaces scope. (NETOBSERV-1720)
- Previously, when Loki was disabled, in the Topology view of the OpenShift web console with the Scope set to Owner, clicking on the Step into icon next to any graph node would bring the Scope to Resource, which is not available without Loki, so an error message was shown. With this fix, the Step into icon is hidden in the Owner scope when Loki is disabled, so this scenario no longer occurs. (NETOBSERV-1721)
- Previously, when Loki was disabled, an error was displayed in the Topology view of the OpenShift web console when a group was set, but then the scope was changed so that the group becomes invalid. With this fix, the invalid group is removed, preventing the error. (NETOBSERV-1722)
- When creating a FlowCollector resource from the OpenShift web console Form view, as opposed to the YAML view, the following settings were incorrectly managed by the web console: agent.ebpf.metrics.enable and processor.subnetLabels.openShiftAutoDetect. These settings can only be disabled in the YAML view, not in the Form view. To avoid any confusion, these settings have been removed from the Form view. They are still accessible in the YAML view. (NETOBSERV-1731)
- Previously, the eBPF agent was unable to clean up traffic control flows installed before an ungraceful crash, for example a crash due to a SIGTERM signal. This led to the creation of multiple traffic control flow filters with the same name, since the older ones were not removed. With this fix, all previously installed traffic control flows are cleaned up when the agent starts, before installing new ones. (NETOBSERV-1732)
- Previously, when configuring custom subnet labels and keeping the OpenShift subnets auto-detection enabled, OpenShift subnets would take precedence over the custom ones, preventing the definition of custom labels for in-cluster subnets. With this fix, custom-defined subnets take precedence, allowing the definition of custom labels for in-cluster subnets. (NETOBSERV-1734)
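The precedence change for subnet labels (NETOBSERV-1734) can be pictured with a small lookup. This is only a sketch under stated assumptions: the function `label_for`, the label names, and the CIDR ranges are hypothetical and are not taken from the Operator; the only point shown is that custom subnets are checked before auto-detected ones.

```python
import ipaddress

def label_for(ip, custom_subnets, autodetected_subnets):
    """Return the label of the first matching subnet, custom subnets first."""
    addr = ipaddress.ip_address(ip)
    # Custom entries are listed first, so they win over auto-detected ones.
    for label, cidr in list(custom_subnets) + list(autodetected_subnets):
        if addr in ipaddress.ip_network(cidr):
            return label
    return ""  # no label: the address is outside every known subnet

# Hypothetical data for illustration only.
custom = [("payments", "10.128.4.0/24")]
auto = [("Pods", "10.128.0.0/14"), ("Services", "172.30.0.0/16")]

print(label_for("10.128.4.7", custom, auto))  # payments
print(label_for("10.129.0.9", custom, auto))  # Pods
print(repr(label_for("8.8.8.8", custom, auto)))  # ''
```

With the order reversed (auto-detected entries first), the first address would be labeled Pods, which is the behavior the fix removes.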
1.3. Network Observability Operator 1.6.0
The following advisory is available for the Network Observability Operator 1.6.0:
Before upgrading to the latest version of the Network Observability Operator, you must Migrate removed stored versions of the FlowCollector CRD. An automated solution to this workaround is planned with NETOBSERV-1747.
1.3.1. New features and enhancements
1.3.1.1. Enhanced use of Network Observability Operator without Loki
You can now use Prometheus metrics and rely less on Loki for storage when using the Network Observability Operator. For more information, see Network Observability without Loki.
1.3.1.2. Custom metrics API
You can create custom metrics out of flowlogs data by using the FlowMetrics API. Flowlogs data can be used with Prometheus labels to customize cluster information on your dashboards. You can add custom labels for any subnet that you want to identify in your flows and metrics. This enhancement can also be used to more easily identify external traffic by using the new labels SrcSubnetLabel and DstSubnetLabel, which exist both in flow logs and in metrics. Those fields are empty when there is external traffic, which gives a way to identify it. For more information, see Custom metrics and FlowMetric API reference.
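As a rough illustration of how the empty labels can be used, the sketch below filters a list of flow records for external traffic. SrcSubnetLabel and DstSubnetLabel come from the text above; the record layout, the other field names (SrcAddr, DstAddr, Bytes), and the helper `is_external` are assumptions made for the example.

```python
def is_external(flow):
    """Treat a flow as external when either subnet label is missing or empty."""
    return not flow.get("SrcSubnetLabel") or not flow.get("DstSubnetLabel")

# Hypothetical flow records for illustration only.
flows = [
    {"SrcAddr": "10.128.4.7", "DstAddr": "10.129.0.9",
     "SrcSubnetLabel": "Pods", "DstSubnetLabel": "Pods", "Bytes": 1200},
    {"SrcAddr": "10.128.4.7", "DstAddr": "8.8.8.8",
     "SrcSubnetLabel": "Pods", "DstSubnetLabel": "", "Bytes": 300},
]

external_bytes = sum(f["Bytes"] for f in flows if is_external(f))
print(external_bytes)  # 300
```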
1.3.1.3. eBPF performance enhancements
Experience improved performance of the eBPF agent, in terms of CPU and memory, with the following updates:
- The eBPF agent now uses TCX hooks instead of TC.
- The NetObserv / Health dashboard has a new section that shows eBPF metrics.
- Based on the new eBPF metrics, an alert notifies you when the eBPF agent is dropping flows.
- Loki storage demand decreases significantly now that duplicated flows are removed. Instead of having multiple, individual duplicated flows per network interface, there is one de-duplicated flow with a list of related network interfaces.
With the duplicated flows update, the Interface and Interface Direction fields in the Network Traffic table are renamed to Interfaces and Interface Directions, so any bookmarked Quick filter queries using these fields need to be updated to interfaces and ifdirections.
For more information, see Using the eBPF agent alert and Quick filters.
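The de-duplication described above can be sketched as a merge keyed on the connection 5-tuple. This is not the agent's implementation: the record layout, the field names, and the choice to keep the byte counter of the first observation are assumptions made for the example.

```python
def merge_duplicates(flows):
    """Merge flows that share a 5-tuple into one record listing every interface."""
    merged = {}
    for flow in flows:
        key = (flow["src"], flow["dst"], flow["src_port"], flow["dst_port"], flow["proto"])
        if key not in merged:
            # Keep the counters of the first observation: the same packets
            # seen on another interface must not be counted twice.
            merged[key] = {
                "src": flow["src"], "dst": flow["dst"],
                "src_port": flow["src_port"], "dst_port": flow["dst_port"],
                "proto": flow["proto"], "bytes": flow["bytes"],
                "interfaces": [], "ifdirections": [],
            }
        merged[key]["interfaces"].append(flow["interface"])
        merged[key]["ifdirections"].append(flow["ifdirection"])
    return list(merged.values())

# Hypothetical observations of the same traffic on two interfaces.
base = {"src": "10.128.0.5", "dst": "10.128.2.9", "src_port": 40000,
        "dst_port": 8080, "proto": 6, "bytes": 1500}
observed = [
    {**base, "interface": "eth0", "ifdirection": "Egress"},
    {**base, "interface": "genev_sys_6081", "ifdirection": "Egress"},
]

result = merge_duplicates(observed)
print(len(result), result[0]["interfaces"])  # 1 ['eth0', 'genev_sys_6081']
```

Storing one record with a list of interfaces, instead of one record per interface, is what reduces the storage demand mentioned above.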
1.3.1.4. eBPF collection rule-based filtering
You can use rule-based filtering to reduce the volume of created flows. When this option is enabled, the NetObserv / Health dashboard for eBPF agent statistics has the Filtered flows rate view. For more information, see eBPF flow rule filter.
1.3.2. Technology Preview features
Some features in this release are currently in Technology Preview. These experimental features are not intended for production use. Note the following scope of support on the Red Hat Customer Portal for these features:
Technology Preview Features Support Scope
1.3.2.1. Network Observability CLI
The Network Observability CLI (oc netobserv) is temporarily unavailable. This issue is expected to be resolved with OCPBUGS-36146.
You can debug and troubleshoot network traffic issues without needing to install the Network Observability Operator by using the Network Observability CLI. Capture and visualize flow and packet data in real time with no persistent storage requirement during the capture. For more information, see Network Observability CLI and Network Observability CLI 1.6.0.
1.3.3. Bug fixes
- Previously, a dead link to the OpenShift Container Platform documentation was displayed in the Operator Lifecycle Manager (OLM) form for the FlowMetrics API creation. Now the link has been updated to point to a valid page. (NETOBSERV-1607)
- Previously, the Network Observability Operator description in the Operator Hub displayed a broken link to the documentation. With this fix, this link is restored. (NETOBSERV-1544)
- Previously, if Loki was disabled and the Loki Mode was set to LokiStack, or if Loki manual TLS configuration was configured, the Network Observability Operator still tried to read the Loki CA certificates. With this fix, when Loki is disabled, the Loki certificates are not read, even if there are settings in the Loki configuration. (NETOBSERV-1647)
- Previously, the oc must-gather plugin for the Network Observability Operator was only working on the amd64 architecture and failing on all others because the plugin was using amd64 for the oc binary. Now, the Network Observability Operator oc must-gather plugin collects logs on any architecture platform.
- Previously, when filtering on IP addresses using not equal to, the Network Observability Operator would return a request error. Now, the IP filtering works in both equal and not equal to cases for IP addresses and ranges. (NETOBSERV-1630)
- Previously, when a user was not an admin, the error messages were not consistent with the selected tab of the Network Traffic view in the web console. Now, the user not admin error displays on any tab with improved display. (NETOBSERV-1621)
1.3.4. Known issues
- When the eBPF agent PacketDrop feature is enabled, and sampling is configured to a value greater than 1, reported dropped bytes and dropped packets ignore the sampling configuration. While this is done on purpose to not miss any drops, a side effect is that the reported proportion of drops compared to non-drops becomes biased. For example, at a very high sampling rate, such as 1:1000, it is likely that almost all the traffic appears to be dropped when observed from the console plugin. (NETOBSERV-1676)
- In the Manage panels pop-up window in the Overview tab, filtering on total, bar, donut, or line does not show any result. (NETOBSERV-1540)
- The SR-IOV secondary interface is not detected if the interface was created first and then the eBPF agent was deployed. It is only detected if the agent was deployed first and then the SR-IOV interface is created. (NETOBSERV-1697)
- When Loki is disabled, the Topology view in the OpenShift web console always shows the Cluster and Zone aggregation options in the slider beside the network topology diagram, even when the related features are not enabled. There is no specific workaround, besides ignoring these slider options. (NETOBSERV-1705)
- When Loki is disabled, and the OpenShift web console first loads, it might display an error: Request failed with status code 400 Loki is disabled. As a workaround, you can continue switching content on the Network Traffic page, such as clicking between the Topology and the Overview tabs. The error should disappear. (NETOBSERV-1706)
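The bias described in the PacketDrop sampling issue (NETOBSERV-1676) can be checked with simple arithmetic. The sketch below is a simplified expectation model, not the agent's logic: the function name, the traffic volumes, and the assumption that a 1:N sampling rate reports exactly 1/N of the delivered packets are all illustrative.

```python
def reported_drop_share(total_packets, dropped_packets, sampling, drops_honor_sampling):
    """Expected share of reported packets that are drops, for a 1:sampling rate."""
    delivered_reported = (total_packets - dropped_packets) / sampling
    if drops_honor_sampling:
        dropped_reported = dropped_packets / sampling
    else:
        dropped_reported = dropped_packets  # every drop is reported
    return dropped_reported / (delivered_reported + dropped_reported)

# Hypothetical traffic: 1,000,000 packets, 1% of them dropped, sampling 1:1000.
biased = reported_drop_share(1_000_000, 10_000, 1000, drops_honor_sampling=False)
honored = reported_drop_share(1_000_000, 10_000, 1000, drops_honor_sampling=True)
print(f"{biased:.1%} vs {honored:.1%}")  # 91.0% vs 1.0%
```

In this made-up case a true drop rate of 1% is reported as about 91% when drops bypass sampling, which matches the "almost all the traffic appears to be dropped" symptom.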
1.4. Network Observability Operator 1.5.0
The following advisory is available for the Network Observability Operator 1.5.0:
1.4.1. New features and enhancements
1.4.1.1. DNS tracking enhancements
In 1.5, the TCP protocol is now supported in addition to UDP. New dashboards are also added to the Overview view of the Network Traffic page. For more information, see Configuring DNS tracking and Working with DNS tracking.
1.4.1.2. Round-trip time (RTT)
You can use TCP handshake Round-Trip Time (RTT) captured from the fentry/tcp_rcv_established Extended Berkeley Packet Filter (eBPF) hookpoint to read smoothed round-trip time (SRTT) and analyze network flows. In the Overview, Network Traffic, and Topology pages in the web console, you can monitor network traffic and troubleshoot with RTT metrics, filtering, and edge labeling. For more information, see RTT Overview and Working with RTT.
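For intuition about what "smoothed" means here, the sketch below applies the exponentially weighted average that RFC 6298 defines for SRTT (weight 1/8 for each new sample). It is an illustration only: the agent reads the value that the kernel already maintains, and the function name and sample values are hypothetical.

```python
def smoothed_rtt(samples_ms, alpha=0.125):
    """Return the SRTT after feeding RTT samples through the RFC 6298 average."""
    srtt = None
    for rtt in samples_ms:
        if srtt is None:
            srtt = float(rtt)  # the first measurement initializes SRTT
        else:
            srtt = (1 - alpha) * srtt + alpha * rtt
    return srtt

# Two stable samples followed by one spike, in milliseconds.
print(smoothed_rtt([100, 100, 180]))  # 110.0
```

A single 180 ms spike moves the smoothed value only from 100 ms to 110 ms, which is why SRTT is less noisy than individual samples.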
1.4.1.3. Metrics, dashboards, and alerts enhancements
The Network Observability metrics dashboards in Observe → Dashboards → NetObserv have new metrics types you can use to create Prometheus alerts. You can now define available metrics in the includeList specification. In previous releases, these metrics were defined in the ignoreTags specification. For a complete list of these metrics, see Network Observability Metrics.
1.4.1.4. Improvements for Network Observability without Loki
You can create Prometheus alerts for the NetObserv dashboard using DNS, Packet drop, and RTT metrics, even if you do not use Loki. In the previous version of Network Observability, 1.4, these metrics were only available for querying and analysis in the Network Traffic, Overview, and Topology views, which are not available without Loki. For more information, see Network Observability Metrics.
1.4.1.5. Availability zones
You can configure the FlowCollector resource to collect information about the cluster availability zones. This configuration enriches the network flow data with the topology.kubernetes.io/zone label value applied to the nodes. For more information, see Working with availability zones.
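The enrichment step can be sketched as a lookup from node name to the topology.kubernetes.io/zone label. The label key comes from the text above; the flow field names (SrcNode, DstNode, SrcZone, DstZone), the node names, and the helper `enrich_with_zones` are hypothetical and chosen only for this example.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"

def enrich_with_zones(flow, node_labels):
    """Return a copy of the flow with the zone of its source and destination nodes."""
    enriched = dict(flow)
    for side in ("Src", "Dst"):
        node = flow.get(f"{side}Node")
        # Nodes without the label, or unknown nodes, yield an empty zone.
        enriched[f"{side}Zone"] = node_labels.get(node, {}).get(ZONE_LABEL, "")
    return enriched

# Hypothetical node labels and flow record.
node_labels = {
    "worker-a": {ZONE_LABEL: "us-east-1a"},
    "worker-b": {ZONE_LABEL: "us-east-1b"},
}
flow = {"SrcNode": "worker-a", "DstNode": "worker-b", "Bytes": 900}

enriched = enrich_with_zones(flow, node_labels)
print(enriched["SrcZone"], enriched["DstZone"])  # us-east-1a us-east-1b
```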
1.4.1.6. Notable enhancements
The 1.5 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.
Performance enhancements:
- The spec.agent.ebpf.kafkaBatchSize default is changed from 10MB to 1MB to enhance eBPF performance when using Kafka.
Important: When upgrading from an existing installation, this new value is not set automatically in the configuration. If you observe a performance regression with the eBPF agent memory consumption after upgrading, you might consider reducing the kafkaBatchSize to the new value.
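Because the new default is not applied on upgrade, an existing configuration may still carry the old value. The sketch below shows one way to detect and lower it on a FlowCollector resource loaded as a Python dictionary. The path spec.agent.ebpf.kafkaBatchSize comes from the text above; expressing 10MB and 1MB as byte counts, the resource name, and the helper `with_new_batch_size` are assumptions for this example.

```python
import copy

OLD_DEFAULT = 10 * 1024 * 1024  # assumed byte value of the former 10MB default
NEW_DEFAULT = 1 * 1024 * 1024   # assumed byte value of the new 1MB default

def with_new_batch_size(flow_collector):
    """Return a copy where an old-default kafkaBatchSize is lowered to the new default."""
    updated = copy.deepcopy(flow_collector)
    ebpf = updated.setdefault("spec", {}).setdefault("agent", {}).setdefault("ebpf", {})
    if ebpf.get("kafkaBatchSize") == OLD_DEFAULT:
        ebpf["kafkaBatchSize"] = NEW_DEFAULT
    return updated

# Hypothetical resource as it might look after an upgrade.
existing = {
    "apiVersion": "flows.netobserv.io/v1beta2",
    "kind": "FlowCollector",
    "metadata": {"name": "cluster"},
    "spec": {"agent": {"ebpf": {"kafkaBatchSize": OLD_DEFAULT}}},
}

updated = with_new_batch_size(existing)
print(updated["spec"]["agent"]["ebpf"]["kafkaBatchSize"])  # 1048576
```

A value that was deliberately set to something other than the old default is left untouched.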
Web console enhancements:
- There are new panels added to the Overview view for DNS and RTT: Min, Max, P90, P99.
- There are new panel display options added:
  - Focus on one panel while keeping others viewable but with smaller focus.
  - Switch graph type.
  - Show Top and Overall.
- A collection latency warning is shown in the Custom time range pop-up window.
- There is enhanced visibility for the contents of the Manage panels and Manage columns pop-up windows.
- The Differentiated Services Code Point (DSCP) field for egress QoS is available for filtering QoS DSCP in the web console Network Traffic page.
Configuration enhancements:
- The LokiStack mode in the spec.loki.mode specification simplifies installation by automatically setting URLs, TLS, cluster roles and a cluster role binding, as well as the authToken value. The Manual mode allows more control over configuration of these settings.
- The API version changes from flows.netobserv.io/v1beta1 to flows.netobserv.io/v1beta2.
1.4.2. Bug fixes
- Previously, it was not possible to register the console plugin manually in the web console interface if the automatic registration of the console plugin was disabled. If the spec.console.register value was set to false in the FlowCollector resource, the Operator would override and erase the plugin registration. With this fix, setting the spec.console.register value to false does not impact the console plugin registration or registration removal. As a result, the plugin can be safely registered manually. (NETOBSERV-1134)
- Previously, using the default metrics settings, the NetObserv/Health dashboard was showing an empty graph named Flows Overhead. This metric was only available by removing "namespaces-flows" and "namespaces" from the ignoreTags list. With this fix, this metric is visible when you use the default metrics setting. (NETOBSERV-1351)
- Previously, the node on which the eBPF Agent was running would not resolve with a specific cluster configuration. This resulted in cascading consequences that culminated in a failure to provide some of the traffic metrics. With this fix, the eBPF agent’s node IP is safely provided by the Operator, inferred from the pod status. Now, the missing metrics are restored. (NETOBSERV-1430)
- Previously, the Loki error 'Input size too long' for the Loki Operator did not include additional information to troubleshoot the problem. With this fix, help is directly displayed in the web console next to the error with a direct link for more guidance. (NETOBSERV-1464)
- Previously, the console plugin read timeout was forced to 30s. With the FlowCollector v1beta2 API update, you can configure the spec.loki.readTimeout specification to update this value according to the Loki Operator queryTimeout limit. (NETOBSERV-1443)
- Previously, the Operator bundle did not display some of the supported features by CSV annotations as expected, such as features.operators.openshift.io/… With this fix, these annotations are set in the CSV as expected. (NETOBSERV-1305)
- Previously, the FlowCollector status sometimes oscillated between DeploymentInProgress and Ready states during reconciliation. With this fix, the status only becomes Ready when all of the underlying components are fully ready. (NETOBSERV-1293)
1.4.3. Known issues
- When trying to access the web console, cache issues on OCP 4.14.10 prevent access to the Observe view. The web console shows the error message: Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/. The recommended workaround is to update the cluster to the latest minor version. If this does not work, you need to apply the workarounds described in this Red Hat Knowledgebase article. (NETOBSERV-1493)
- Since the 1.3.0 release of the Network Observability Operator, installing the Operator causes a warning kernel taint to appear. The reason for this error is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the BPF_F_NO_PREALLOC flag so that pre-allocation is disabled when the hashmap is too memory expensive.
1.5. Network Observability Operator 1.4.2
The following advisory is available for the Network Observability Operator 1.4.2:
1.5.1. CVEs
1.6. Network Observability Operator 1.4.1
The following advisory is available for the Network Observability Operator 1.4.1:
1.6.1. CVEs
1.6.2. Bug fixes
- In 1.4, there was a known issue when sending network flow data to Kafka. The Kafka message key was ignored, causing an error with connection tracking. Now the key is used for partitioning, so each flow from the same connection is sent to the same processor. (NETOBSERV-926)
- In 1.4, the Inner flow direction was introduced to account for flows between pods running on the same node. Flows with the Inner direction were not taken into account in the generated Prometheus metrics derived from flows, resulting in under-evaluated bytes and packets rates. Now, derived metrics include flows with the Inner direction, providing correct bytes and packets rates. (NETOBSERV-1344)
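The Kafka key fix in this list (NETOBSERV-926) relies on key-based partitioning: records with the same key always land on the same partition, and therefore on the same processor. The sketch below illustrates the idea only. The key layout, the direction-independent ordering of endpoints, and the use of CRC32 as a stand-in for the hash of a real Kafka client are all assumptions.

```python
import zlib

def connection_key(src, src_port, dst, dst_port, proto):
    """Build a key that is identical for both directions of a connection."""
    low, high = sorted([(src, src_port), (dst, dst_port)])
    return f"{low[0]}:{low[1]}|{high[0]}:{high[1]}|{proto}".encode()

def partition_for(key, num_partitions):
    """Map a key to a partition; CRC32 stands in for the client's real hash."""
    return zlib.crc32(key) % num_partitions

# Hypothetical request and response flows of one TCP connection.
request = connection_key("10.128.0.5", 40000, "10.128.2.9", 8080, 6)
response = connection_key("10.128.2.9", 8080, "10.128.0.5", 40000, 6)

print(partition_for(request, 6) == partition_for(response, 6))  # True
```

When the key is ignored, records are spread across partitions without regard to the connection, which is how flows of one connection could previously reach different processors.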
1.7. Network Observability Operator 1.4.0
The following advisory is available for the Network Observability Operator 1.4.0:
1.7.1. Channel removal
You must switch your channel from v1.0.x to stable to receive the latest Operator updates. The v1.0.x channel is now removed.
1.7.2. New features and enhancements
1.7.2.1. Notable enhancements
The 1.4 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.
Web console enhancements:
- In the Query Options, the Duplicate flows checkbox is added to choose whether or not to show duplicated flows.
- You can now filter source and destination traffic with One-way, Back-and-forth, and Swap filters.
The Network Observability metrics dashboards in Observe → Dashboards → NetObserv and NetObserv / Health are modified as follows:
- The NetObserv dashboard shows top bytes, packets sent, packets received per nodes, namespaces, and workloads. Flow graphs are removed from this dashboard.
- The NetObserv / Health dashboard shows flows overhead as well as top flow rates per nodes, namespaces, and workloads.
- Infrastructure and Application metrics are shown in a split-view for namespaces and workloads.
For more information, see Network Observability metrics and Quick filters.
Configuration enhancements:
- You now have the option to specify different namespaces for any configured ConfigMap or Secret reference, such as in certificates configuration.
- The spec.processor.clusterName parameter is added so that the name of the cluster appears in the flows data. This is useful in a multi-cluster context. When using OpenShift Container Platform, leave it empty so that it is determined automatically.
For more information, see Flow Collector sample resource and Flow Collector API Reference.
1.7.2.2. Network Observability without Loki
The Network Observability Operator is now functional and usable without Loki. If Loki is not installed, it can only export flows to KAFKA or IPFIX format and provide metrics in the Network Observability metrics dashboards. For more information, see Network Observability without Loki.
1.7.2.3. DNS tracking
In 1.4, the Network Observability Operator makes use of eBPF tracepoint hooks to enable DNS tracking. You can monitor your network, conduct security analysis, and troubleshoot DNS issues in the Network Traffic and Overview pages in the web console.
For more information, see Configuring DNS tracking and Working with DNS tracking.
1.7.2.4. SR-IOV support
You can now collect traffic from a cluster with a Single Root I/O Virtualization (SR-IOV) device. For more information, see Configuring the monitoring of SR-IOV interface traffic.
1.7.2.5. IPFIX exporter support
You can now export eBPF-enriched network flows to the IPFIX collector. For more information, see Export enriched network flow data.
1.7.2.6. Packet drops
In the 1.4 release of the Network Observability Operator, eBPF tracepoint hooks are used to enable packet drop tracking. You can now detect and analyze the cause for packet drops and make decisions to optimize network performance. In OpenShift Container Platform 4.14 and later, both host drops and OVS drops are detected. In OpenShift Container Platform 4.13, only host drops are detected. For more information, see Configuring packet drop tracking and Working with packet drops.
1.7.2.7. s390x architecture support
Network Observability Operator can now run on s390x architecture. Previously it ran on amd64, ppc64le, or arm64.
1.7.3. Bug fixes
- Previously, the Prometheus metrics exported by Network Observability were computed out of potentially duplicated network flows. In the related dashboards, from Observe → Dashboards, this could result in potentially doubled rates. Note that dashboards from the Network Traffic view were not affected. Now, network flows are filtered to eliminate duplicates before metrics calculation, which results in correct traffic rates displayed in the dashboards. (NETOBSERV-1131)
- Previously, the Network Observability Operator agents were not able to capture traffic on network interfaces when configured with Multus or SR-IOV, non-default network namespaces. Now, all available network namespaces are recognized and used for capturing flows, allowing capturing traffic for SR-IOV. There are configurations needed for the FlowCollector and SRIOVnetwork custom resource to collect traffic. (NETOBSERV-1283)
- Previously, in the Network Observability Operator details from Operators → Installed Operators, the FlowCollector Status field might have reported incorrect information about the state of the deployment. The status field now shows the proper conditions with improved messages. The history of events is kept, ordered by event date. (NETOBSERV-1224)
- Previously, during spikes of network traffic load, certain eBPF pods were OOM-killed and went into a CrashLoopBackOff state. Now, the eBPF agent memory footprint is improved, so pods are not OOM-killed and do not enter a CrashLoopBackOff state. (NETOBSERV-975)
- Previously, when processor.metrics.tls was set to PROVIDED, the insecureSkipVerify option value was forced to be true. Now you can set insecureSkipVerify to true or false, and provide a CA certificate if needed. (NETOBSERV-1087)
1.7.4. Known issues
- Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater. (NETOBSERV-980)
- Currently, when spec.agent.ebpf.features includes DNSTracking, larger DNS packets require the eBPF agent to look for DNS header outside of the 1st socket buffer (SKB) segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1304)
- Currently, when spec.agent.ebpf.features includes DNSTracking, DNS over TCP packets require the eBPF agent to look for DNS header outside of the 1st SKB segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1245)
- Currently, when using a KAFKA deployment model, if conversation tracking is configured, conversation events might be duplicated across Kafka consumers, resulting in inconsistent tracking of conversations, and incorrect volumetric data. For that reason, it is not recommended to configure conversation tracking when deploymentModel is set to KAFKA. (NETOBSERV-926)
- Currently, when the processor.metrics.server.tls.type is configured to use a PROVIDED certificate, the operator enters an unsteady state that might affect its performance and resource consumption. It is recommended to not use a PROVIDED certificate until this issue is resolved, and instead use an auto-generated certificate, setting processor.metrics.server.tls.type to AUTO. (NETOBSERV-1293)
- Since the 1.3.0 release of the Network Observability Operator, installing the Operator causes a warning kernel taint to appear. The reason for this error is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the BPF_F_NO_PREALLOC flag so that pre-allocation is disabled when the hashmap is too memory expensive.
1.8. Network Observability Operator 1.3.0
The following advisory is available for the Network Observability Operator 1.3.0:
1.8.1. Channel deprecation
You must switch your channel from v1.0.x to stable to receive future Operator updates. The v1.0.x channel is deprecated and planned for removal in the next release.
1.8.2. New features and enhancements
1.8.2.1. Multi-tenancy in Network Observability
- System administrators can allow and restrict individual user access, or group access, to the flows stored in Loki. For more information, see Multi-tenancy in Network Observability.
1.8.2.2. Flow-based metrics dashboard
- This release adds a new dashboard, which provides an overview of the network flows in your OpenShift Container Platform cluster. For more information, see Network Observability metrics.
1.8.2.3. Troubleshooting with the must-gather tool
- Information about the Network Observability Operator can now be included in the must-gather data for troubleshooting. For more information, see Network Observability must-gather.
1.8.2.4. Multiple architectures now supported
- Network Observability Operator can now run on the amd64, ppc64le, or arm64 architectures. Previously, it only ran on amd64.
1.8.3. Deprecated features
1.8.3.1. Deprecated configuration parameter setting
The release of Network Observability Operator 1.3 deprecates the spec.Loki.authToken HOST setting. When using the Loki Operator, you must now only use the FORWARD setting.
1.8.4. Bug fixes
- Previously, when the Operator was installed from the CLI, the Role and RoleBinding that are necessary for the Cluster Monitoring Operator to read the metrics were not installed as expected. The issue did not occur when the operator was installed from the web console. Now, either way of installing the Operator installs the required Role and RoleBinding. (NETOBSERV-1003)
- Since version 1.2, the Network Observability Operator can raise alerts when a problem occurs with the flows collection. Previously, due to a bug, the related configuration to disable alerts, spec.processor.metrics.disableAlerts, did not work as expected and was sometimes ineffective. Now, this configuration is fixed so that it is possible to disable the alerts. (NETOBSERV-976)
- Previously, when Network Observability was configured with spec.loki.authToken set to DISABLED, only a kubeadmin cluster administrator was able to view network flows. Other types of cluster administrators received authorization failure. Now, any cluster administrator is able to view network flows. (NETOBSERV-972)
- Previously, a bug prevented users from setting spec.consolePlugin.portNaming.enable to false. Now, this setting can be set to false to disable port-to-service name translation. (NETOBSERV-971)
- Previously, the metrics exposed by the console plugin were not collected by the Cluster Monitoring Operator (Prometheus), due to an incorrect configuration. Now the configuration has been fixed so that the console plugin metrics are correctly collected and accessible from the OpenShift Container Platform web console. (NETOBSERV-765)
- Previously, when processor.metrics.tls was set to AUTO in the FlowCollector, the flowlogs-pipeline servicemonitor did not adapt to the appropriate TLS scheme, and metrics were not visible in the web console. Now the issue is fixed for AUTO mode. (NETOBSERV-1070)
- Previously, certificate configuration, such as used for Kafka and Loki, did not allow specifying a namespace field, implying that the certificates had to be in the same namespace where Network Observability is deployed. Moreover, when using Kafka with TLS/mTLS, the user had to manually copy the certificate(s) to the privileged namespace where the eBPF agent pods are deployed and manually manage certificate updates, such as in the case of certificate rotation. Now, Network Observability setup is simplified by adding a namespace field for certificates in the FlowCollector resource. As a result, users can now install Loki or Kafka in different namespaces without needing to manually copy their certificates in the Network Observability namespace. The original certificates are watched so that the copies are automatically updated when needed. (NETOBSERV-773)
- Previously, the SCTP, ICMPv4 and ICMPv6 protocols were not covered by the Network Observability agents, resulting in a less comprehensive network flows coverage. These protocols are now recognized to improve the flows coverage. (NETOBSERV-934)
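When reading flows that include the newly covered protocols (NETOBSERV-934), the protocol appears as its IANA-assigned number. The sketch below maps those numbers to names and totals bytes per protocol. The numbers are the standard IANA assignments; the flow record layout and the helper `bytes_per_protocol` are hypothetical.

```python
from collections import Counter

# IANA-assigned IP protocol numbers for the protocols named above, plus TCP and UDP.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP", 58: "ICMPv6", 132: "SCTP"}

def bytes_per_protocol(flows):
    """Total the bytes of each flow under its protocol name."""
    totals = Counter()
    for flow in flows:
        name = PROTOCOL_NAMES.get(flow["proto"], f"proto-{flow['proto']}")
        totals[name] += flow["bytes"]
    return dict(totals)

# Hypothetical flow records.
flows_by_proto = [
    {"proto": 6, "bytes": 1500},
    {"proto": 132, "bytes": 400},
    {"proto": 58, "bytes": 64},
    {"proto": 132, "bytes": 100},
]

print(bytes_per_protocol(flows_by_proto))  # {'TCP': 1500, 'SCTP': 500, 'ICMPv6': 64}
```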
1.8.5. Known issues
- When `processor.metrics.tls` is set to `PROVIDED` in the `FlowCollector`, the `flowlogs-pipeline` `servicemonitor` is not adapted to the TLS scheme. (NETOBSERV-1087)
- Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the `flowlogs-pipeline` pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater. (NETOBSERV-980)
- When you install the Operator, a warning kernel taint can appear. The reason for this warning is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the `BPF_F_NO_PREALLOC` flag so that pre-allocation is disabled when the hashmap is too memory-intensive.
1.9. Network Observability Operator 1.2.0
The following advisory is available for the Network Observability Operator 1.2.0:
1.9.1. Preparing for the next update
The subscription of an installed Operator specifies an update channel that tracks and receives updates for the Operator. Until the 1.2 release of the Network Observability Operator, the only channel available was `v1.0.x`. The 1.2 release of the Network Observability Operator introduces the `stable` update channel for tracking and receiving updates. You must switch your channel from `v1.0.x` to `stable` to receive future Operator updates. The `v1.0.x` channel is deprecated and planned for removal in a following release.
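As a rough sketch, the channel switch can also be done from the CLI by patching the Operator's `Subscription`. The helper below only builds the JSON merge patch. The subscription name `netobserv-operator` and the namespace `openshift-netobserv-operator` in the commented usage line are assumptions, so check the actual values on your cluster (for example with `oc get subscriptions.operators.coreos.com -A`) before running it.

```shell
# Build the JSON merge patch that sets the update channel of a Subscription.
# Argument: the channel name, for example "stable".
channel_patch() {
  printf '{"spec":{"channel":"%s"}}\n' "$1"
}

# Usage against a live cluster (subscription name and namespace are assumptions):
# oc patch subscriptions.operators.coreos.com netobserv-operator \
#   -n openshift-netobserv-operator --type merge -p "$(channel_patch stable)"
```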
1.9.2. New features and enhancements
1.9.2.1. Histogram in Traffic Flows view
- You can now choose to show a histogram bar chart of flows over time. The histogram enables you to visualize the history of flows without hitting the Loki query limit. For more information, see Using the histogram.
1.9.2.2. Conversation tracking
- You can now query flows by Log Type, which enables grouping network flows that are part of the same conversation. For more information, see Working with conversations.
1.9.2.3. Network Observability health alerts
- The Network Observability Operator now creates automatic alerts if the `flowlogs-pipeline` is dropping flows because of errors at the write stage or if the Loki ingestion rate limit has been reached. For more information, see Health dashboards.
1.9.3. Bug fixes
- Previously, after changing the `namespace` value in the FlowCollector spec, `eBPF` agent pods running in the previous namespace were not appropriately deleted. Now, the pods running in the previous namespace are appropriately deleted. (NETOBSERV-774)
- Previously, after changing the `caCert.name` value in the FlowCollector spec (such as in the Loki section), FlowLogs-Pipeline pods and Console plug-in pods were not restarted, so they were unaware of the configuration change. Now, the pods are restarted, so they get the configuration change. (NETOBSERV-772)
- Previously, network flows between pods running on different nodes were sometimes not correctly identified as being duplicates because they were captured by different network interfaces. This resulted in over-estimated metrics displayed in the console plug-in. Now, flows are correctly identified as duplicates, and the console plug-in displays accurate metrics. (NETOBSERV-755)
- The "reporter" option in the console plug-in is used to filter flows based on the observation point of either source node or destination node. Previously, this option mixed the flows regardless of the node observation point. This was due to network flows being incorrectly reported as Ingress or Egress at the node level. Now, the network flow direction reporting is correct. The "reporter" option filters for source observation point, or destination observation point, as expected. (NETOBSERV-696)
- Previously, for agents configured to send flows directly to the processor as gRPC+protobuf requests, the submitted payload could be too large and was rejected by the processor's gRPC server. This occurred under very-high-load scenarios and with only some configurations of the agent. The agent logged an error message, such as: grpc: received message larger than max. As a consequence, there was information loss about those flows. Now, the gRPC payload is split into several messages when the size exceeds a threshold. As a result, the server maintains connectivity. (NETOBSERV-617)
1.9.4. Known issue
- In the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate transition periodically affects the `flowlogs-pipeline` pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate transition. (NETOBSERV-980)
1.9.5. Notable technical changes
- Previously, you could install the Network Observability Operator using a custom namespace. This release introduces the `conversion webhook`, which changes the `ClusterServiceVersion`. Because of this change, all the available namespaces are no longer listed. Additionally, to enable Operator metrics collection, namespaces that are shared with other Operators, like the `openshift-operators` namespace, cannot be used. Now, the Operator must be installed in the `openshift-netobserv-operator` namespace. You cannot automatically upgrade to the new Operator version if you previously installed the Network Observability Operator using a custom namespace. If you previously installed the Operator using a custom namespace, you must delete the instance of the Operator that was installed and reinstall the Operator in the `openshift-netobserv-operator` namespace. Custom namespaces, such as the commonly used `netobserv` namespace, are still possible for the `FlowCollector`, Loki, Kafka, and other plug-ins. (NETOBSERV-907)(NETOBSERV-956)
1.10. Network Observability Operator 1.1.0
The following advisory is available for the Network Observability Operator 1.1.0:
The Network Observability Operator is now stable and the release channel is upgraded to `v1.1.0`.
1.10.1. Bug fix
- Previously, unless the Loki `authToken` configuration was set to `FORWARD` mode, authentication was no longer enforced, allowing any user who could connect to the OpenShift Container Platform console in an OpenShift Container Platform cluster to retrieve flows without authentication. Now, regardless of the Loki `authToken` mode, only cluster administrators can retrieve flows. (BZ#2169468)
Chapter 2. About Network Observability
Red Hat offers cluster administrators the Network Observability Operator to observe the network traffic for OpenShift Container Platform clusters. The Network Observability Operator uses the eBPF technology to create network flows. The network flows are then enriched with OpenShift Container Platform information. They are available as Prometheus metrics or as logs in Loki. You can view and analyze the stored network flows information in the OpenShift Container Platform console for further insight and troubleshooting.
2.1. Optional dependencies of the Network Observability Operator
- Loki Operator: Loki is the backend that can be used to store all collected flows with a maximal level of details. You can choose to use Network Observability without Loki, but there are some considerations for doing this, as described in the linked section. If you choose to install Loki, it is recommended to use the Loki Operator, which is supported by Red Hat.
- AMQ Streams Operator: Kafka provides scalability, resiliency and high availability in the OpenShift Container Platform cluster for large scale deployments. If you choose to use Kafka, it is recommended to use the AMQ Streams Operator, because it is supported by Red Hat.
2.2. Network Observability Operator
The Network Observability Operator provides the Flow Collector API custom resource definition. A Flow Collector instance is a cluster-scoped resource that enables configuration of network flow collection. The Flow Collector instance deploys pods and services that form a monitoring pipeline where network flows are then collected and enriched with the Kubernetes metadata before storing in Loki or generating Prometheus metrics. The eBPF agent, which is deployed as a `daemonset` object, creates the network flows.
2.3. OpenShift Container Platform console integration
OpenShift Container Platform console integration offers overview, topology view and traffic flow tables.
2.3.1. Network Observability metrics dashboards
On the Overview tab in the OpenShift Container Platform console, you can view the overall aggregated metrics of the network traffic flow on the cluster. You can choose to display the information by zone, node, namespace, owner, pod, and service. Filters and display options can further refine the metrics. For more information, see Observing the network traffic from the Overview view.
In Observe → Dashboards, the Netobserv dashboards provide a quick overview of the network flows in your OpenShift Container Platform cluster. The Netobserv/Health dashboard provides metrics about the health of the Operator. For more information, see Network Observability Metrics and Viewing health information.
2.3.2. Network Observability topology views
The OpenShift Container Platform console offers the Topology tab which displays a graphical representation of the network flows and the amount of traffic. The topology view represents traffic between the OpenShift Container Platform components as a network graph. You can refine the graph by using the filters and display options. You can access the information for zone, node, namespace, owner, pod, and service.
2.3.3. Traffic flow tables
The traffic flow table view provides a view for raw flows, non-aggregated filtering options, and configurable columns. The OpenShift Container Platform console offers the Traffic flows tab, which displays the data of the network flows and the amount of traffic.
2.4. Network Observability CLI
You can quickly debug and troubleshoot networking issues with Network Observability by using the Network Observability CLI (`oc netobserv`). The Network Observability CLI is a flow and packet visualization tool that relies on eBPF agents to stream collected data to an ephemeral collector pod. It requires no persistent storage during the capture. After the run, the output is transferred to your local machine. This enables quick, live insight into packets and flow data without installing the Network Observability Operator.
Chapter 3. Installing the Network Observability Operator
Installing Loki is a recommended prerequisite for using the Network Observability Operator. You can choose to use Network Observability without Loki, but there are some considerations for doing this, described in the previously linked section.
The Loki Operator integrates a gateway that implements multi-tenancy and authentication with Loki for data flow storage. The `LokiStack` resource manages Loki, which is a scalable, highly-available, multi-tenant log aggregation system, and a web proxy with OpenShift Container Platform authentication. The `LokiStack` proxy uses OpenShift Container Platform authentication to enforce multi-tenancy and facilitate the saving and indexing of data in Loki log stores.
The Loki Operator can also be used for configuring the LokiStack log store. The Network Observability Operator requires a dedicated LokiStack, separate from the one used for logging.
3.1. Network Observability without Loki
You can use Network Observability without Loki by not performing the Loki installation steps and skipping directly to "Installing the Network Observability Operator". If you only want to export flows to a Kafka consumer or IPFIX collector, or you only need dashboard metrics, then you do not need to install Loki or provide storage for Loki. The following table compares available features with and without Loki.
| | With Loki | Without Loki |
|---|---|---|
| Exporters | | |
| Multi-tenancy | | |
| Complete filtering and aggregations capabilities [1] | | |
| Partial filtering and aggregations capabilities [2] | | |
| Flow-based metrics and dashboards | | |
| Traffic flows view overview [3] | | |
| Traffic flows view table | | |
| Topology view | | |
| OpenShift Container Platform console Network Traffic tab integration | | |

1. Such as per pod.
2. Such as per workload or namespace.
3. Statistics on packet drops are only available with Loki.
Additional resources
3.2. Installing the Loki Operator
The Loki Operator versions 5.7+ are the supported Loki Operator versions for Network Observability; these versions provide the ability to create a `LokiStack` instance using the `openshift-network` tenant configuration mode and provide fully-automatic, in-cluster authentication and authorization support for Network Observability. There are several ways you can install Loki. One way is by using the OpenShift Container Platform web console Operator Hub.
Prerequisites
- Supported Log Store (AWS S3, Google Cloud Storage, Azure, Swift, Minio, OpenShift Data Foundation)
- OpenShift Container Platform 4.10+
- Linux Kernel 4.18+
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Loki Operator from the list of available Operators, and click Install.
- Under Installation Mode, select All namespaces on the cluster.
Verification
- Verify that you installed the Loki Operator. Visit the Operators → Installed Operators page and look for Loki Operator.
- Verify that Loki Operator is listed with Status as Succeeded in all the projects.
To uninstall Loki, refer to the uninstallation process that corresponds with the method you used to install Loki. You might have remaining `ClusterRoles` and `ClusterRoleBindings`, data stored in an object store, and persistent volumes that must be removed.
3.2.1. Creating a secret for Loki storage
The Loki Operator supports a few log storage options, such as AWS S3, Google Cloud Storage, Azure, Swift, Minio, and OpenShift Data Foundation. The following example shows how to create a secret for AWS S3 storage. The secret created in this example, `loki-s3`, is referenced in "Creating a LokiStack resource". You can create this secret in the web console or CLI.
- Using the web console, navigate to the Project → All Projects dropdown and select Create Project. Name the project `netobserv` and click Create. Navigate to the Import icon, +, in the top right corner. Paste your YAML file into the editor.

The following shows an example secret YAML file for S3 storage:

    apiVersion: v1
    kind: Secret
    metadata:
      name: loki-s3
      namespace: netobserv # 1
    stringData:
      access_key_id: QUtJQUlPU0ZPRE5ON0VYQU1QTEUK
      access_key_secret: d0phbHJYVXRuRkVNSS9LN01ERU5HL2JQeFJmaUNZRVhBTVBMRUtFWQo=
      bucketnames: s3-bucket-name
      endpoint: https://s3.eu-central-1.amazonaws.com
      region: eu-central-1

1. The installation examples in this documentation use the same namespace, `netobserv`, across all components. You can optionally use a different namespace for the different components.
Verification
- Once you create the secret, you should see it listed under Workloads → Secrets in the web console.
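For the CLI route, one possible approach is to generate the same manifest from a small shell helper and pipe it to `oc apply`. This is a minimal sketch, not an official procedure: the function name `loki_s3_secret` is hypothetical, the endpoint is derived from the region using the same URL pattern as the example above, and all argument values shown in the usage line are placeholders.

```shell
# Print a Secret manifest for S3-backed Loki storage.
# Arguments: namespace, access key ID, secret access key, bucket name, region.
# Values are double-quoted in the YAML, so they must not contain '"' or '\'.
loki_s3_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: loki-s3
  namespace: $1
stringData:
  access_key_id: "$2"
  access_key_secret: "$3"
  bucketnames: "$4"
  endpoint: https://s3.$5.amazonaws.com
  region: $5
EOF
}

# Usage against a live cluster (all values are placeholders):
# loki_s3_secret netobserv "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" \
#   s3-bucket-name eu-central-1 | oc apply -f -
```

Keeping the credentials in environment variables rather than in a saved file avoids leaving them on disk.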
Additional resources
3.2.2. Creating a LokiStack custom resource
You can deploy a `LokiStack` custom resource (CR) by using the web console or OpenShift CLI (`oc`) to create a namespace, or new project.
Procedure
- Navigate to Operators → Installed Operators, viewing All projects from the Project dropdown.
- Look for Loki Operator. In the details, under Provided APIs, select LokiStack.
- Click Create LokiStack.
Ensure the following fields are specified in either Form View or YAML view:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: loki
      namespace: netobserv # 1
    spec:
      size: 1x.small # 2
      storage:
        schemas:
        - version: v12
          effectiveDate: '2022-06-01'
        secret:
          name: loki-s3
          type: s3
      storageClassName: gp3 # 3
      tenants:
        mode: openshift-network

1. The installation examples in this documentation use the same namespace, `netobserv`, across all components. You can optionally use a different namespace.
2. Specify the deployment size. In the Loki Operator 5.8 and later versions, the supported size options for production instances of Loki are `1x.extra-small`, `1x.small`, or `1x.medium`. Important: It is not possible to change the number `1x` for the deployment size.
3. Use a storage class name that is available on the cluster for `ReadWriteOnce` access mode. You can use `oc get storageclasses` to see what is available on your cluster. Important: You must not reuse the same `LokiStack` CR that is used for logging.
- Click Create.
3.2.3. Creating a new group for the cluster-admin user role
Querying application logs for multiple namespaces as a `cluster-admin` user, where the sum total of characters of all of the namespaces in the cluster is greater than 5120, results in the error `Parse error: input size too long (XXXX > 5120)`. For better control over access to logs in LokiStack, make the `cluster-admin` user a member of the `cluster-admin` group. If the `cluster-admin` group does not exist, create it and add the desired users to it.
Use the following procedure to create a new group for users with `cluster-admin` permissions.
Procedure
Enter the following command to create a new group:

    $ oc adm groups new cluster-admin

Enter the following command to add the desired user to the `cluster-admin` group:

    $ oc adm groups add-users cluster-admin <username>

Enter the following command to add `cluster-admin` user role to the group:

    $ oc adm policy add-cluster-role-to-group cluster-admin cluster-admin
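To get a rough idea of whether a cluster is near the 5120-character threshold mentioned above, you can total the lengths of all namespace names. The helper below is an approximation only: the text does not say whether separators or other query syntax count toward the limit, and the function name `total_name_length` is hypothetical.

```shell
# Sum the character lengths of the names read from stdin, one name per line.
total_name_length() {
  awk '{ total += length($0) } END { print total + 0 }'
}

# Usage against a live cluster:
# oc get namespaces -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
#   | total_name_length
```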
3.2.4. Custom admin group access
If you need to see cluster-wide logs without necessarily being an administrator, or if you already have any group defined that you want to use here, you can specify a custom group using the `adminGroups` field. Users who are members of any group specified in the `adminGroups` field of the `LokiStack` custom resource (CR) have the same read access to logs as administrators.
Administrator users have access to all application logs in all namespaces, if they also get assigned the `cluster-logging-application-view` role.
Administrator users have access to all network logs across the cluster.
Example LokiStack CR

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: loki
      namespace: netobserv
    spec:
      tenants:
        mode: openshift-network # 1
        openshift:
          adminGroups: # 2
          - cluster-admin
          - custom-admin-group # 3
3.2.5. Loki deployment sizing
Sizing for Loki follows the format of `1x.<size>`, where the value `1x` is the number of instances and `<size>` specifies performance capabilities.
It is not possible to change the number `1x` for the deployment size.
| | 1x.demo | 1x.extra-small | 1x.small | 1x.medium |
|---|---|---|---|---|
| Data transfer | Demo use only | 100GB/day | 500GB/day | 2TB/day |
| Queries per second (QPS) | Demo use only | 1-25 QPS at 200ms | 25-50 QPS at 200ms | 25-75 QPS at 200ms |
| Replication factor | None | 2 | 2 | 2 |
| Total CPU requests | None | 14 vCPUs | 34 vCPUs | 54 vCPUs |
| Total memory requests | None | 31Gi | 67Gi | 139Gi |
| Total disk requests | 40Gi | 430Gi | 430Gi | 590Gi |
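As an illustration of reading the table, the following sketch picks the smallest production size whose "Data transfer" figure covers an expected ingestion volume. It is a simplification under stated assumptions: only the data-transfer row is considered, 2TB/day is treated as 2000 GB/day, and the function name `suggest_loki_size` is hypothetical. Query load, CPU, memory, and disk requests should also be weighed before choosing a size.

```shell
# Suggest the smallest LokiStack size whose data-transfer figure covers the
# expected ingestion volume. Argument: expected volume in whole GB per day.
suggest_loki_size() {
  if [ "$1" -le 100 ]; then
    echo 1x.extra-small
  elif [ "$1" -le 500 ]; then
    echo 1x.small
  elif [ "$1" -le 2000 ]; then
    echo 1x.medium
  else
    echo "no listed size covers $1 GB/day" >&2
    return 1
  fi
}

# Example: suggest_loki_size 300
```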
3.2.6. LokiStack ingestion limits and health alerts
The LokiStack instance comes with default settings according to the configured size. It is possible to override some of these settings, such as the ingestion and query limits. You might want to update them if you get Loki errors showing up in the Console plugin, or in `flowlogs-pipeline` logs. An automatic alert in the web console notifies you when these limits are reached.
Here is an example of configured limits:

    spec:
      limits:
        global:
          ingestion:
            ingestionBurstSize: 40
            ingestionRate: 20
            maxGlobalStreamsPerTenant: 25000
          queries:
            maxChunksPerQuery: 2000000
            maxEntriesLimitPerQuery: 10000
            maxQuerySeries: 3000
For more information about these settings, see the LokiStack API reference.
3.2.7. Enabling multi-tenancy in Network Observability
Multi-tenancy in the Network Observability Operator allows and restricts individual user access, or group access, to the flows stored in Loki. Access is enabled for project admins. Project admins who have limited access to some namespaces can access flows for only those namespaces.
Prerequisite
- You have installed at least Loki Operator version 5.7
- You must be logged in as a project administrator
Procedure
Authorize reading permission to `user1` by running the following command:

    $ oc adm policy add-cluster-role-to-user netobserv-reader user1
Now, the data is restricted to only allowed user namespaces. For example, a user that has access to a single namespace can see all the flows internal to this namespace, as well as flows going from and to this namespace. Project admins have access to the Administrator perspective in the OpenShift Container Platform console to access the Network Flows Traffic page.
3.3. Installing the Network Observability Operator
You can install the Network Observability Operator using the OpenShift Container Platform web console Operator Hub. When you install the Operator, it provides the `FlowCollector` custom resource definition (CRD). You can set specifications in the web console when you create the `FlowCollector`.
The actual memory consumption of the Operator depends on your cluster size and the number of resources deployed. Memory consumption might need to be adjusted accordingly. For more information refer to "Network Observability controller manager pod runs out of memory" in the "Important Flow Collector configuration considerations" section.
Prerequisites
- If you choose to use Loki, install the Loki Operator version 5.7+.
- You must have `cluster-admin` privileges.
- One of the following supported architectures is required: `amd64`, `ppc64le`, `arm64`, or `s390x`.
- Any CPU supported by Red Hat Enterprise Linux (RHEL) 9.
- Must be configured with OVN-Kubernetes as the main network plugin, and optionally using secondary interfaces with Multus and SR-IOV.

Additionally, this installation example uses the `netobserv` namespace, which is used across all components. You can optionally use a different namespace.
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Network Observability Operator from the list of available Operators in the OperatorHub, and click Install.
- Select the checkbox `Enable Operator recommended cluster monitoring on this Namespace`.
- Navigate to Operators → Installed Operators. Under Provided APIs for Network Observability, select the Flow Collector link.
- Navigate to the Flow Collector tab, and click Create FlowCollector. Make the following selections in the form view:
  - spec.agent.ebpf.Sampling: Specify a sampling size for flows. Lower sampling sizes will have higher impact on resource utilization. For more information, see the "FlowCollector API reference", `spec.agent.ebpf`.
  - If you are not using Loki, click Loki client settings and change Enable to False. The setting is True by default.
  - If you are using Loki, set the following specifications:
    - spec.loki.mode: Set this to the `LokiStack` mode, which automatically sets URLs, TLS, cluster roles and a cluster role binding, as well as the `authToken` value. Alternatively, the `Manual` mode allows more control over configuration of these settings.
    - spec.loki.lokistack.name: Set this to the name of your `LokiStack` resource. In this documentation, `loki` is used.
  - Optional: If you are in a large-scale environment, consider configuring the `FlowCollector` with Kafka for forwarding data in a more resilient, scalable way. See "Configuring the Flow Collector resource with Kafka storage" in the "Important Flow Collector configuration considerations" section.
  - Optional: Configure other optional settings before the next step of creating the `FlowCollector`. For example, if you choose not to use Loki, then you can configure exporting flows to Kafka or IPFIX. See "Export enriched network flow data to Kafka and IPFIX" and more in the "Important Flow Collector configuration considerations" section.
- Click Create.
Verification
To confirm this was successful, when you navigate to Observe you should see Network Traffic listed in the options.
In the absence of Application Traffic within the OpenShift Container Platform cluster, default filters might show that there are "No results", which results in no visual flow. Beside the filter selections, select Clear all filters to see the flow.
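The form selections above correspond to fields of the `FlowCollector` resource, so an equivalent manifest can be sketched from the CLI. The following helper is illustrative only: the `flows.netobserv.io/v1beta2` API version and the exact field layout are assumptions based on the field paths named in the procedure, the function name `flowcollector_manifest` is hypothetical, and you should confirm the schema against the FlowCollector API reference for your Operator version.

```shell
# Print a minimal FlowCollector manifest that uses Loki in LokiStack mode.
# Arguments: sampling value, name of the LokiStack resource.
# The apiVersion and field layout are assumptions; verify them for your release.
flowcollector_manifest() {
  cat <<EOF
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  agent:
    type: eBPF
    ebpf:
      sampling: $1
  loki:
    mode: LokiStack
    lokiStack:
      name: $2
      namespace: netobserv
EOF
}

# Usage against a live cluster:
# flowcollector_manifest 50 loki | oc apply -f -
```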
3.4. Important Flow Collector configuration considerations
Once you create the `FlowCollector` instance, you can reconfigure it, but the pods are terminated and recreated, which can be disruptive. Therefore, consider configuring the following options when creating the `FlowCollector` for the first time:
Additional resources
For more general information about Flow Collector specifications and the Network Observability Operator architecture and resource use, see the following resources:
3.4.1. Migrating removed stored versions of the FlowCollector CRD
Network Observability Operator version 1.6 removes the old and deprecated `v1alpha1` version of the `FlowCollector` API. If you previously installed this version on your cluster, it might still be referenced in the `storedVersion` of the `FlowCollector` CRD, even if it is removed from the etcd store, which blocks the upgrade process. These references need to be manually removed.
There are two options to remove stored versions:
- Use the Storage Version Migrator Operator.
- Uninstall and reinstall the Network Observability Operator, ensuring that the installation is in a clean state.
Prerequisites
- You have an older version of the Operator installed, and you want to prepare your cluster to install the latest version of the Operator. Or you have attempted to install the Network Observability Operator 1.6 and run into the error: `Failed risk of data loss updating "flowcollectors.flows.netobserv.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD`.
Procedure
Verify that the old `FlowCollector` CRD version is still referenced in the `storedVersion`:

    $ oc get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'

If `v1alpha1` appears in the list of results, proceed with Step a to use the Kubernetes Storage Version Migrator or Step b to uninstall and reinstall the CRD and the Operator.

a. Option 1: Kubernetes Storage Version Migrator: Create a YAML to define the `StorageVersionMigration` object, for example `migrate-flowcollector-v1alpha1.yaml`:

    apiVersion: migration.k8s.io/v1alpha1
    kind: StorageVersionMigration
    metadata:
      name: migrate-flowcollector-v1alpha1
    spec:
      resource:
        group: flows.netobserv.io
        resource: flowcollectors
        version: v1alpha1

Save the file.

Apply the `StorageVersionMigration` by running the following command:

    $ oc apply -f migrate-flowcollector-v1alpha1.yaml

Update the `FlowCollector` CRD to manually remove `v1alpha1` from the `storedVersion`:

    $ oc edit crd flowcollectors.flows.netobserv.io

b. Option 2: Reinstall: Save the Network Observability Operator 1.5 version of the `FlowCollector` CR to a file, for example `flowcollector-1.5.yaml`.

    $ oc get flowcollector cluster -o yaml > flowcollector-1.5.yaml

- Follow the steps in "Uninstalling the Network Observability Operator", which uninstalls the Operator and removes the existing `FlowCollector` CRD.
- Install the Network Observability Operator latest version, 1.6.0.
- Create the `FlowCollector` using the backup that was saved in Step b.

Verification
Run the following command:

    $ oc get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'

The list of results should no longer show `v1alpha1` and only show the latest version, `v1beta1`.
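The stored-version check can be scripted, for example as a gate before an upgrade. The following is a minimal sketch: the function name `has_v1alpha1` is hypothetical, and it does a simple substring match on whatever the `oc get crd` command above prints.

```shell
# Succeed (exit status 0) when v1alpha1 appears in the given storedVersions value.
# Argument: the output of the `oc get crd ... -ojsonpath` command shown above.
has_v1alpha1() {
  case "$1" in
    *v1alpha1*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster:
# stored=$(oc get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}')
# if has_v1alpha1 "$stored"; then echo "v1alpha1 is still stored; migrate before upgrading"; fi
```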
Additional resources
3.5. Installing Kafka (optional)
The Kafka Operator is supported for large scale environments. Kafka provides high-throughput and low-latency data feeds for forwarding network flow data in a more resilient, scalable way. You can install the Kafka Operator as Red Hat AMQ Streams from the Operator Hub, just as the Loki Operator and Network Observability Operator were installed. Refer to "Configuring the FlowCollector resource with Kafka" to configure Kafka as a storage option.
To uninstall Kafka, refer to the uninstallation process that corresponds with the method you used to install.
Additional resources
3.6. Uninstalling the Network Observability Operator
You can uninstall the Network Observability Operator using the OpenShift Container Platform web console Operator Hub, working in the Operators → Installed Operators area.
Procedure
Remove the `FlowCollector` custom resource.
- Click Flow Collector, which is next to the Network Observability Operator in the Provided APIs column.
- Click the options menu for the cluster and select Delete FlowCollector.

Uninstall the Network Observability Operator.
- Navigate back to the Operators → Installed Operators area.
- Click the options menu next to the Network Observability Operator and select Uninstall Operator.
- Navigate to Home → Projects and select `openshift-netobserv-operator`.
- Navigate to Actions and select Delete Project.

Remove the `FlowCollector` custom resource definition (CRD).
- Navigate to Administration → CustomResourceDefinitions.
- Look for FlowCollector and click the options menu.
- Select Delete CustomResourceDefinition.

Important: The Loki Operator and Kafka remain if they were installed and must be removed separately. Additionally, you might have remaining data stored in an object store, and a persistent volume that must be removed.
Chapter 4. Network Observability Operator in OpenShift Container Platform
Network Observability is an OpenShift operator that deploys a monitoring pipeline to collect and enrich network traffic flows that are produced by the Network Observability eBPF agent.
4.1. Viewing statuses
The Network Observability Operator provides the Flow Collector API. When a Flow Collector resource is created, it deploys pods and services to create and store network flows in the Loki log store, as well as to display dashboards, metrics, and flows in the OpenShift Container Platform web console.
Procedure
Run the following command to view the state of `FlowCollector`:

    $ oc get flowcollector/cluster

Example output

    NAME      AGENT   SAMPLING (EBPF)   DEPLOYMENT MODEL   STATUS
    cluster   EBPF    50                DIRECT             Ready

Check the status of pods running in the `netobserv` namespace by entering the following command:

    $ oc get pods -n netobserv

Example output

    NAME                              READY   STATUS    RESTARTS   AGE
    flowlogs-pipeline-56hbp           1/1     Running   0          147m
    flowlogs-pipeline-9plvv           1/1     Running   0          147m
    flowlogs-pipeline-h5gkb           1/1     Running   0          147m
    flowlogs-pipeline-hh6kf           1/1     Running   0          147m
    flowlogs-pipeline-w7vv5           1/1     Running   0          147m
    netobserv-plugin-cdd7dc6c-j8ggp   1/1     Running   0          147m

`flowlogs-pipeline` pods collect flows, enrich the collected flows, then send flows to the Loki storage. `netobserv-plugin` pods create a visualization plugin for the OpenShift Container Platform Console.

Check the status of pods running in the namespace `netobserv-privileged` by entering the following command:

    $ oc get pods -n netobserv-privileged

Example output

    NAME                         READY   STATUS    RESTARTS   AGE
    netobserv-ebpf-agent-4lpp6   1/1     Running   0          151m
    netobserv-ebpf-agent-6gbrk   1/1     Running   0          151m
    netobserv-ebpf-agent-klpl9   1/1     Running   0          151m
    netobserv-ebpf-agent-vrcnf   1/1     Running   0          151m
    netobserv-ebpf-agent-xf5jh   1/1     Running   0          151m

`netobserv-ebpf-agent` pods monitor network interfaces of the nodes to get flows and send them to `flowlogs-pipeline` pods.

If you are using the Loki Operator, check the status of pods running in the `openshift-operators-redhat` namespace by entering the following command:

    $ oc get pods -n openshift-operators-redhat

Example output

    NAME                                                READY   STATUS    RESTARTS   AGE
    loki-operator-controller-manager-5f6cff4f9d-jq25h   2/2     Running   0          18h
    lokistack-compactor-0                               1/1     Running   0          18h
    lokistack-distributor-654f87c5bc-qhkhv              1/1     Running   0          18h
    lokistack-distributor-654f87c5bc-skxgm              1/1     Running   0          18h
    lokistack-gateway-796dc6ff7-c54gz                   2/2     Running   0          18h
    lokistack-index-gateway-0                           1/1     Running   0          18h
    lokistack-index-gateway-1                           1/1     Running   0          18h
    lokistack-ingester-0                                1/1     Running   0          18h
    lokistack-ingester-1                                1/1     Running   0          18h
    lokistack-ingester-2                                1/1     Running   0          18h
    lokistack-querier-66747dc666-6vh5x                  1/1     Running   0          18h
    lokistack-querier-66747dc666-cjr45                  1/1     Running   0          18h
    lokistack-querier-66747dc666-xh8rq                  1/1     Running   0          18h
    lokistack-query-frontend-85c6db4fbd-b2xfb           1/1     Running   0          18h
    lokistack-query-frontend-85c6db4fbd-jm94f           1/1     Running   0          18h
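When checking several namespaces, it can help to filter the `oc get pods` output down to pods that are not in the `Running` state. The following is a small sketch that relies only on the default column layout shown in the example outputs (NAME, READY, STATUS, ...); the function name `not_running` is hypothetical. Note that it also lists pods in states such as `Completed`, which are not necessarily failures.

```shell
# Read default `oc get pods` output on stdin (header line first) and print
# the names of pods whose STATUS column is anything other than Running.
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Usage against a live cluster:
# for ns in netobserv netobserv-privileged openshift-operators-redhat; do
#   oc get pods -n "$ns" | not_running
# done
```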
4.2. Network Observability Operator architecture
The Network Observability Operator provides the `FlowCollector` API, which is instantiated at installation and configured to reconcile the eBPF agent, the `flowlogs-pipeline`, and the `netobserv-plugin` components. Only a single `FlowCollector` per cluster is supported.
The eBPF agent runs on each cluster node with some privileges to collect network flows. The `flowlogs-pipeline` receives the network flows data and enriches the data with Kubernetes identifiers. If you choose to use Loki, the `flowlogs-pipeline` sends flow logs data to Loki for storing and indexing. The `netobserv-plugin`, which is a dynamic OpenShift Container Platform web console plugin, queries Loki to fetch network flows data. Cluster-admins can view the data in the web console.
If you do not use Loki, you can generate metrics with Prometheus. Those metrics and their related dashboards are accessible in the web console. For more information, see "Network Observability without Loki".
If you are using the Kafka option, the eBPF agent sends the network flow data to Kafka, and the `flowlogs-pipeline` reads from the Kafka topic before sending to Loki, as shown in the following diagram.
4.3. Viewing Network Observability Operator status and configuration
You can inspect the status and view the details of the `FlowCollector` by using the `oc describe` command.
Procedure
Run the following command to view the status and configuration of the Network Observability Operator:
$ oc describe flowcollector/cluster
Chapter 5. Configuring the Network Observability Operator
You can update the `FlowCollector` API resource to configure the Network Observability Operator and its managed components. The `FlowCollector` is explicitly created during installation. Since this resource operates cluster-wide, only a single `FlowCollector` is allowed, and it must be named `cluster`. For more information, see the FlowCollector API reference.
5.1. View the FlowCollector resource
You can view and edit YAML directly in the OpenShift Container Platform web console.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster then select the YAML tab. There, you can modify the `FlowCollector` resource to configure the Network Observability Operator.
The following example shows a sample `FlowCollector` resource for the OpenShift Container Platform Network Observability Operator:
Sample `FlowCollector` resource
    apiVersion: flows.netobserv.io/v1beta2
    kind: FlowCollector
    metadata:
      name: cluster
    spec:
      namespace: netobserv
      deploymentModel: Direct
      agent:
        type: eBPF                                # (1)
        ebpf:
          sampling: 50                            # (2)
          logLevel: info
          privileged: false
          resources:
            requests:
              memory: 50Mi
              cpu: 100m
            limits:
              memory: 800Mi
      processor:                                  # (3)
        logLevel: info
        resources:
          requests:
            memory: 100Mi
            cpu: 100m
          limits:
            memory: 800Mi
        logTypes: Flows
        advanced:
          conversationEndTimeout: 10s
          conversationHeartbeatInterval: 30s
      loki:                                       # (4)
        mode: LokiStack                           # (5)
      consolePlugin:
        register: true
        logLevel: info
        portNaming:
          enable: true
          portNames:
            "3100": loki
        quickFilters:                             # (6)
        - name: Applications
          filter:
            src_namespace!: 'openshift-,netobserv'
            dst_namespace!: 'openshift-,netobserv'
          default: true
        - name: Infrastructure
          filter:
            src_namespace: 'openshift-,netobserv'
            dst_namespace: 'openshift-,netobserv'
        - name: Pods network
          filter:
            src_kind: 'Pod'
            dst_kind: 'Pod'
          default: true
        - name: Services network
          filter:
            dst_kind: 'Service'
- (1) The Agent specification, `spec.agent.type`, must be `eBPF`. eBPF is the only option that OpenShift Container Platform supports.
- (2) You can set the Sampling specification, `spec.agent.ebpf.sampling`, to manage resources. Lower sampling values might consume a large amount of computational, memory, and storage resources. You can mitigate this by specifying a sampling ratio value. A value of 100 means that 1 flow in every 100 is sampled. A value of 0 or 1 means that all flows are captured. The lower the value, the greater the number of returned flows and the accuracy of derived metrics. By default, eBPF sampling is set to a value of 50, so 1 flow in every 50 is sampled. Note that more sampled flows also mean that more storage is needed. It is recommended to start with the default values and refine empirically to determine which setting your cluster can manage.
- (3) The Processor specification, `spec.processor`, can be set to enable conversation tracking. When enabled, conversation events are queryable in the web console. The `spec.processor.logTypes` value in this sample is `Flows`. The other possible `spec.processor.logTypes` values are `Conversations`, `EndedConversations`, or `All`. Storage requirements are highest for `All` and lowest for `EndedConversations`.
- (4) The Loki specification, `spec.loki`, specifies the Loki client. The default values match the Loki install paths mentioned in the Installing the Loki Operator section. If you used another installation method for Loki, specify the appropriate client information for your install.
- (5) The `LokiStack` mode automatically sets a few configurations: `querierUrl`, `ingesterUrl`, `statusUrl`, `tenantID`, and the corresponding TLS configuration. Cluster roles and a cluster role binding are created for reading and writing logs to Loki, and `authToken` is set to `Forward`. You can set these manually by using the `Manual` mode.
- (6) The `spec.consolePlugin.quickFilters` specification defines filters that show up in the web console. The `Applications` filter keys, `src_namespace` and `dst_namespace`, are negated (`!`), so the `Applications` filter shows all traffic that does not originate from, or have a destination to, any `openshift-` or `netobserv` namespaces. For more information, see Configuring quick filters below.
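The description of the `LokiStack` mode above notes that the same client settings can be supplied by hand in `Manual` mode. The following is a minimal sketch of what that might look like. The `spec.loki.manual` layout is assumed from the setting names listed above, and the URLs and tenant ID are hypothetical placeholders; check the FlowCollector API reference for your Operator version before using it.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  loki:
    mode: Manual
    manual:
      # Hypothetical endpoints for a Loki instance installed without the Loki Operator.
      ingesterUrl: "http://loki.example.svc:3100/"
      querierUrl: "http://loki.example.svc:3100/"
      statusUrl: "http://loki.example.svc:3100/"
      # Hypothetical tenant name.
      tenantID: netobserv
      # Assumed value for a Loki without token authentication;
      # LokiStack mode sets this to Forward, as described above.
      authToken: Disabled
```

TLS settings are omitted from this sketch. In `LokiStack` mode they are filled in automatically, so a manual setup that uses HTTPS needs them added explicitly.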
5.2. Configuring the Flow Collector resource with Kafka
You can configure the `FlowCollector` resource to use Kafka for high-throughput and low-latency data feeds. A Kafka instance needs to be running, and a Kafka topic dedicated to OpenShift Container Platform Network Observability must be created in that instance. For more information, see Kafka documentation with AMQ Streams.
Prerequisites
- Kafka is installed. Red Hat supports Kafka with AMQ Streams Operator.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the Network Observability Operator, select Flow Collector.
- Select the cluster and then click the YAML tab.
- Modify the `FlowCollector` resource for the OpenShift Container Platform Network Observability Operator to use Kafka, as shown in the following sample YAML:
Sample Kafka configuration in `FlowCollector` resource
    apiVersion: flows.netobserv.io/v1beta2
    kind: FlowCollector
    metadata:
      name: cluster
    spec:
      deploymentModel: Kafka                                    # (1)
      kafka:
        address: "kafka-cluster-kafka-bootstrap.netobserv"      # (2)
        topic: network-flows                                    # (3)
        tls:
          enable: false                                         # (4)
- (1) Set `spec.deploymentModel` to `Kafka` instead of `Direct` to enable the Kafka deployment model.
- (2) `spec.kafka.address` refers to the Kafka bootstrap server address. You can specify a port if needed, for instance `kafka-cluster-kafka-bootstrap.netobserv:9093` for using TLS on port 9093.
- (3) `spec.kafka.topic` should match the name of a topic created in Kafka.
- (4) `spec.kafka.tls` can be used to encrypt all communications to and from Kafka with TLS or mTLS. When enabled, the Kafka CA certificate must be available as a ConfigMap or a Secret, both in the namespace where the `flowlogs-pipeline` processor component is deployed (default: `netobserv`) and where the eBPF agents are deployed (default: `netobserv-privileged`). It must be referenced with `spec.kafka.tls.caCert`. When using mTLS, client secrets must be available in these namespaces as well (they can be generated, for instance, by using the AMQ Streams User Operator) and referenced with `spec.kafka.tls.userCert`.
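To make the TLS description above more concrete, the following sketch shows how the `caCert` and `userCert` references might be filled in when TLS is enabled. The secret names and certificate file names are hypothetical, and the reference fields (`type`, `name`, `certFile`, `certKey`) are an assumption about the certificate reference layout; confirm them in the FlowCollector API reference.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Kafka
  kafka:
    # Port 9093 is the TLS port used in the address example above.
    address: "kafka-cluster-kafka-bootstrap.netobserv:9093"
    topic: network-flows
    tls:
      enable: true
      caCert:
        type: secret                          # assumed alternative: configmap
        name: kafka-cluster-cluster-ca-cert   # hypothetical CA secret name
        certFile: ca.crt
      userCert:                               # only needed for mTLS
        type: secret
        name: flp-kafka                       # hypothetical client secret name
        certFile: user.crt
        certKey: user.key
```

As the callout explains, the referenced objects need to exist both in the `flowlogs-pipeline` namespace and in the eBPF agent namespace.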
5.3. Export enriched network flow data
You can send network flows to Kafka, IPFIX, or both at the same time. Any processor or storage that supports Kafka or IPFIX input, such as Splunk, Elasticsearch, or Fluentd, can consume the enriched network flow data.
Prerequisites
- Your Kafka or IPFIX collector endpoints are available from the Network Observability `flowlogs-pipeline` pods.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster and then select the YAML tab.
- Edit the `FlowCollector` to configure `spec.exporters` as follows:

    apiVersion: flows.netobserv.io/v1beta2
    kind: FlowCollector
    metadata:
      name: cluster
    spec:
      exporters:
      - type: Kafka                         # (1)
        kafka:
          address: "kafka-cluster-kafka-bootstrap.netobserv"
          topic: netobserv-flows-export     # (2)
          tls:
            enable: false                   # (3)
      - type: IPFIX                         # (4)
        ipfix:
          targetHost: "ipfix-collector.ipfix.svc.cluster.local"
          targetPort: 4739
          transport: tcp or udp             # (5)
- (1) (4) You can export flows to IPFIX instead of or in conjunction with exporting flows to Kafka.
- (2) The Network Observability Operator exports all flows to the configured Kafka topic.
- (3) You can encrypt all communications to and from Kafka with SSL/TLS or mTLS. When enabled, the Kafka CA certificate must be available as a ConfigMap or a Secret in the namespace where the `flowlogs-pipeline` processor component is deployed (default: `netobserv`). It must be referenced with `spec.exporters.tls.caCert`. When using mTLS, client secrets must be available in this namespace as well (they can be generated, for instance, by using the AMQ Streams User Operator) and referenced with `spec.exporters.tls.userCert`.
- (5) You have the option to specify transport. The default value is `tcp` but you can also specify `udp`.
- After configuration, network flows data can be sent to an available output in a JSON format. For more information, see Network flows format reference.
5.4. Updating the Flow Collector resource
As an alternative to editing YAML in the OpenShift Container Platform web console, you can configure specifications, such as eBPF sampling, by patching the `flowcollector` custom resource (CR).
Procedure
Run the following command to patch the `flowcollector` CR and update the `spec.agent.ebpf.sampling` value:

    $ oc patch flowcollector cluster --type=json -p '[{"op": "replace", "path": "/spec/agent/ebpf/sampling", "value": <new value>}]' -n netobserv
5.5. Configuring quick filters
You can modify the filters in the `FlowCollector` resource. Exact matches are possible by using double quotes around values. Otherwise, partial matches are used for textual values. The bang (!) character, placed at the end of a key, means negation. See the sample `FlowCollector` resource for more context about modifying the YAML.
The filter matching type, "all of" or "any of", is a UI setting that users can modify from the query options. It is not part of this resource configuration.
Here is a list of all available filter keys:
Universal* | Source | Destination | Description |
---|---|---|---|
namespace | src_namespace | dst_namespace | Filter traffic related to a specific namespace. |
name | src_name | dst_name | Filter traffic related to a given leaf resource name, such as a specific pod, service, or node (for host-network traffic). |
kind | src_kind | dst_kind | Filter traffic related to a given resource kind. The resource kinds include the leaf resource (Pod, Service, or Node), or the owner resource (Deployment and StatefulSet). |
owner_name | src_owner_name | dst_owner_name | Filter traffic related to a given resource owner; that is, a workload or a set of pods. For example, it can be a Deployment name or a StatefulSet name. |
resource | src_resource | dst_resource | Filter traffic related to a specific resource that is denoted by its canonical name, which identifies it uniquely. |
address | src_address | dst_address | Filter traffic related to an IP address. IPv4 and IPv6 are supported. CIDR ranges are also supported. |
mac | src_mac | dst_mac | Filter traffic related to a MAC address. |
port | src_port | dst_port | Filter traffic related to a specific port. |
host_address | src_host_address | dst_host_address | Filter traffic related to the host IP address where the pods are running. |
protocol | N/A | N/A | Filter traffic related to a protocol, such as TCP or UDP. |
* Universal keys filter for any of source or destination. For example, filtering `name: 'my-pod'` means all traffic from `my-pod` and all traffic to `my-pod`, regardless of the matching type used, whether Match all or Match any.
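The matching rules described in this section (partial match by default, exact match with double quotes, and negation with a trailing `!`) can be sketched as custom quick filters. The filter names, the `my-app` namespace, and the `frontend` value are hypothetical, and the placement under `spec.consolePlugin.quickFilters` follows the sample `FlowCollector` resource shown earlier.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  consolePlugin:
    quickFilters:
    - name: My app only                # hypothetical filter name
      filter:
        # Double quotes inside the value request an exact match,
        # so a namespace such as my-app-staging is not included.
        src_namespace: '"my-app"'
        dst_namespace: '"my-app"'
    - name: Frontend workloads         # hypothetical filter name
      filter:
        # No inner quotes: partial match on any name containing "frontend".
        # "name" is a universal key, so it applies to source or destination.
        name: 'frontend'
    - name: Not from openshift         # hypothetical filter name
      filter:
        # The trailing "!" on the key negates the filter.
        src_namespace!: 'openshift-'
```

Whether several keys in one filter are combined as "all of" or "any of" is still decided in the console query options, not in this resource.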
5.6. Configuring monitoring for SR-IOV interface traffic
In order to collect traffic from a cluster with a Single Root I/O Virtualization (SR-IOV) device, you must set the `FlowCollector` `spec.agent.ebpf.privileged` field to `true`. Then, the eBPF agent monitors other network namespaces in addition to the host network namespaces, which are monitored by default. When a pod with a virtual function (VF) interface is created, a new network namespace is created. With `SRIOVNetwork` policy `IPAM` configurations specified, the VF interface is migrated from the host network namespace to the pod network namespace.
Prerequisites
- Access to an OpenShift Container Platform cluster with a SR-IOV device.
- The `SRIOVNetwork` custom resource (CR) `spec.ipam` configuration must be set with an IP address from the range that the interface lists or from other plugins.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster and then select the YAML tab.
- Configure the `FlowCollector` custom resource. A sample configuration is as follows:
Configure `FlowCollector` for SR-IOV monitoring
    apiVersion: flows.netobserv.io/v1beta2
    kind: FlowCollector
    metadata:
      name: cluster
    spec:
      namespace: netobserv
      deploymentModel: Direct
      agent:
        type: eBPF
        ebpf:
          privileged: true            # (1)
- (1) The `spec.agent.ebpf.privileged` field value must be set to `true` to enable SR-IOV monitoring.
5.7. Resource management and performance considerations
The amount of resources required by Network Observability depends on the size of your cluster and your requirements for the cluster to ingest and store observability data. To manage resources and set performance criteria for your cluster, consider configuring the following settings. Tuning these settings can help you reach a setup that meets your observability needs.
The following settings can help you manage resources and performance from the outset:
- eBPF Sampling: You can set the Sampling specification, `spec.agent.ebpf.sampling`, to manage resources. Smaller sampling values might consume a large amount of computational, memory, and storage resources. You can mitigate this by specifying a sampling ratio value. A value of `100` means that 1 flow in every 100 is sampled. A value of `0` or `1` means that all flows are captured. Smaller values result in an increase in returned flows and in the accuracy of derived metrics. By default, eBPF sampling is set to a value of 50, so 1 flow in every 50 is sampled. Note that more sampled flows also mean that more storage is needed. Consider starting with the default values and refining empirically to determine which setting your cluster can manage.
- Restricting or excluding interfaces: Reduce the overall observed traffic by setting the values for `spec.agent.ebpf.interfaces` and `spec.agent.ebpf.excludeInterfaces`. By default, the agent fetches all the interfaces in the system, except the ones listed in `excludeInterfaces` and `lo` (local interface). Note that the interface names might vary according to the Container Network Interface (CNI) used.
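As a rough illustration of the two settings above, the following fragment halves the number of sampled flows compared with the default and narrows the observed interfaces. The value `100` and the interface name `eth0` are hypothetical choices for this sketch; interface names depend on your CNI, so list the interfaces on a node before adapting it.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      # 1 flow in every 100 is sampled (the default of 50 samples 1 in every 50).
      sampling: 100
      # Hypothetical interface name: only listed interfaces are observed.
      interfaces:
      - eth0
      # The local interface is left excluded, matching the default described above.
      excludeInterfaces:
      - lo
```

Starting from the defaults and changing one setting at a time makes it easier to see which change affects resource usage.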
The following settings can be used to fine-tune performance after the Network Observability has been running for a while:
- Resource requirements and limits: Adapt the resource requirements and limits to the load and memory usage you expect on your cluster by using the `spec.agent.ebpf.resources` and `spec.processor.resources` specifications. The default limits of 800MB might be sufficient for most medium-sized clusters.
- Cache max flows and timeout: Control how often flows are reported by the agents by using the eBPF agent's `spec.agent.ebpf.cacheMaxFlows` and `spec.agent.ebpf.cacheActiveTimeout` specifications. A larger value results in less traffic being generated by the agents, which correlates with a lower CPU load. However, a larger value leads to slightly higher memory consumption, and might generate more latency in the flow collection.
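The fine-tuning settings above all live in the same resource. The following sketch shows where they sit. The request and limit values are copied from the sample `FlowCollector` resource earlier in this chapter, while the cache values are illustrative assumptions rather than recommendations.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      # Illustrative values: larger numbers mean fewer, bigger reports
      # (lower CPU load, slightly more memory, more collection latency).
      cacheMaxFlows: 100000
      cacheActiveTimeout: 10s
      resources:
        requests:
          memory: 50Mi
          cpu: 100m
        limits:
          memory: 800Mi
  processor:
    resources:
      requests:
        memory: 100Mi
        cpu: 100m
      limits:
        memory: 800Mi
```

Raise the memory limits only after you have observed the actual usage of the agent and `flowlogs-pipeline` pods under a representative load.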
5.7.1. Resource considerations
The following table outlines examples of resource considerations for clusters with certain workload sizes.
The examples outlined in the table demonstrate scenarios that are tailored to specific workloads. Consider each example only as a baseline from which adjustments can be made to accommodate your workload needs.
| | Extra small (10 nodes) | Small (25 nodes) | Medium (65 nodes) [2] | Large (120 nodes) [2] |
|---|---|---|---|---|
| Worker Node vCPU and memory | 4 vCPUs, 16GiB mem [1] | 16 vCPUs, 64GiB mem [1] | 16 vCPUs, 64GiB mem [1] | 16 vCPUs, 64GiB mem [1] |
| LokiStack size | | | | |
| Network Observability controller memory limit | 400Mi (default) | 400Mi (default) | 400Mi (default) | 400Mi (default) |
| eBPF sampling rate | 50 (default) | 50 (default) | 50 (default) | 50 (default) |
| eBPF memory limit | 800Mi (default) | 800Mi (default) | 800Mi (default) | 1600Mi |
| cacheMaxSize | 50,000 | 100,000 (default) | 100,000 (default) | 100,000 (default) |
| FLP memory limit | 800Mi (default) | 800Mi (default) | 800Mi (default) | 800Mi (default) |
| FLP Kafka partitions | N/A | 48 | 48 | 48 |
| Kafka consumer replicas | N/A | 6 | 12 | 18 |
| Kafka brokers | N/A | 3 (default) | 3 (default) | 3 (default) |
[1] Tested with AWS M6i instances.
[2] In addition to this worker and its controller, 3 infra nodes (size `M6i.12xlarge`) and 1 workload node (size `M6i.8xlarge`) were tested.
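As an example of reading the table, the Large (120 nodes) column might translate into a `FlowCollector` fragment like the following. This is a sketch under assumptions: it presumes that the Kafka deployment model is in use, that the `cacheMaxSize` row corresponds to `spec.agent.ebpf.cacheMaxFlows`, and that the consumer replica count is set with `spec.processor.kafkaConsumerReplicas`. The number of Kafka partitions and brokers is configured on the Kafka side and is not shown.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Kafka
  agent:
    type: eBPF
    ebpf:
      sampling: 50              # default
      cacheMaxFlows: 100000     # assumed mapping of the cacheMaxSize row
      resources:
        limits:
          memory: 1600Mi        # eBPF memory limit for the Large column
  processor:
    kafkaConsumerReplicas: 18   # assumed field for the Kafka consumer replicas row
    resources:
      limits:
        memory: 800Mi           # FLP memory limit (default)
```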
Chapter 6. Network Policy
As a user with the `admin` role, you can create a network policy for the `netobserv` namespace to secure inbound access to the Network Observability Operator.
6.1. Creating a network policy for Network Observability
You might need to create a network policy to secure ingress traffic to the `netobserv` namespace. In the web console, you can create a network policy by using the form view.
Procedure
- Navigate to Networking → NetworkPolicies.
- Select the `netobserv` project from the Project dropdown menu.
- Name the policy. For this example, the policy name is `allow-ingress`.
- Click Add ingress rule three times to create three ingress rules.
- Specify the following in the form:
  - Make the following specifications for the first Ingress rule:
    - From the Add allowed source dropdown menu, select Allow pods from the same namespace.
  - Make the following specifications for the second Ingress rule:
    - From the Add allowed source dropdown menu, select Allow pods from inside the cluster.
    - Click + Add namespace selector.
    - Add the label, `kubernetes.io/metadata.name`, and the selector, `openshift-console`.
  - Make the following specifications for the third Ingress rule:
    - From the Add allowed source dropdown menu, select Allow pods from inside the cluster.
    - Click + Add namespace selector.
    - Add the label, `kubernetes.io/metadata.name`, and the selector, `openshift-monitoring`.
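The three form rules above correspond roughly to the following manifest. This is a sketch assembled from the steps in the procedure, with one `from` entry per ingress rule; the YAML that the console actually generates might differ in ordering or include additional fields.

```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-ingress
  namespace: netobserv
spec:
  # An empty podSelector applies the policy to every pod in the netobserv namespace.
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  # First rule: allow pods from the same namespace.
  - from:
    - podSelector: {}
  # Second rule: allow pods from the openshift-console namespace.
  - from:
    - podSelector: {}
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-console
  # Third rule: allow pods from the openshift-monitoring namespace.
  - from:
    - podSelector: {}
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-monitoring
```

Because `podSelector` and `namespaceSelector` appear in the same list entry, each of the last two rules matches all pods in the labeled namespace only.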
Verification
- Navigate to Observe → Network Traffic.
- View the Traffic Flows tab, or any tab, to verify that the data is displayed.
- Navigate to Observe → Dashboards. In the NetObserv/Health selection, verify that the flows are being ingested and sent to Loki, which is represented in the first graph.
6.2. Example network policy
The following is an annotated example `NetworkPolicy` object for the `netobserv` namespace:
Sample network policy
    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-ingress
      namespace: netobserv
    spec:
      podSelector: {}                 # (1)
      ingress:
        - from:
            - podSelector: {}         # (2)
              namespaceSelector:      # (3)
                matchLabels:
                  kubernetes.io/metadata.name: openshift-console
            - podSelector: {}
              namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: openshift-monitoring
      policyTypes:
        - Ingress
    status: {}
- (1) A selector that describes the pods to which the policy applies. The policy object can only select pods in the project that defines the `NetworkPolicy` object. In this documentation, it would be the project in which the Network Observability Operator is installed, which is the `netobserv` project.
- (2) A selector that matches the pods from which the policy object allows ingress traffic. The default is that the selector matches pods in the same namespace as the `NetworkPolicy`.
- (3) When the `namespaceSelector` is specified, the selector matches pods in the specified namespace.
Chapter 7. Observing the network traffic
As an administrator, you can observe the network traffic in the OpenShift Container Platform console for detailed troubleshooting and analysis. This feature helps you get insights from different graphical representations of traffic flow. There are several available views to observe the network traffic.
7.1. Observing the network traffic from the Overview view
The Overview view displays the overall aggregated metrics of the network traffic flow on the cluster. As an administrator, you can monitor the statistics with the available display options.
7.1.1. Working with the Overview view
As an administrator, you can navigate to the Overview view to see the graphical representation of the flow rate statistics.
Procedure
- Navigate to Observe → Network Traffic.
- In the Network Traffic page, click the Overview tab.
You can configure the scope of each flow rate data by clicking the menu icon.
7.1.2. Configuring advanced options for the Overview view
You can customize the graphical view by using advanced options. To access the advanced options, click Show advanced options. You can configure the details in the graph by using the Display options drop-down menu. The options available are as follows:
- Scope: Select to view the components that network traffic flows between. You can set the scope to Node, Namespace, Owner, Zones, Cluster or Resource. Owner is an aggregation of resources. Resource can be a pod, service, node, in case of host-network traffic, or an unknown IP address. The default value is Namespace.
- Truncate labels: Select the required width of the label from the drop-down list. The default value is M.
7.1.2.1. Managing panels and display
You can select the required panels to be displayed, reorder them, and focus on a specific panel. To add or remove panels, click Manage panels.
The following panels are shown by default:
- Top X average bytes rates
- Top X bytes rates stacked with total
Other panels can be added in Manage panels:
- Top X average packets rates
- Top X packets rates stacked with total
Query options allow you to choose whether to show the Top 5, Top 10, or Top 15 rates.
7.1.3. Packet drop tracking
You can configure graphical representation of network flow records with packet loss in the Overview view. By employing eBPF tracepoint hooks, you can gain valuable insights into packet drops for TCP, UDP, SCTP, ICMPv4, and ICMPv6 protocols, which can result in the following actions:
- Identification: Pinpoint the exact locations and network paths where packet drops are occurring. Determine whether specific devices, interfaces, or routes are more prone to drops.
- Root cause analysis: Examine the data collected by the eBPF program to understand the causes of packet drops. For example, are they a result of congestion, buffer issues, or specific network events?
- Performance optimization: With a clearer picture of packet drops, you can take steps to optimize network performance, such as adjusting buffer sizes, reconfiguring routing paths, or implementing Quality of Service (QoS) measures.
When packet drop tracking is enabled, you can see the following panels in the Overview by default:
- Top X packet dropped state stacked with total
- Top X packet dropped cause stacked with total
- Top X average dropped packets rates
- Top X dropped packets rates stacked with total
Other packet drop panels are available to add in Manage panels:
- Top X average dropped bytes rates
- Top X dropped bytes rates stacked with total
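Packet drop tracking is switched on in the `FlowCollector` resource rather than in the console. A minimal sketch follows, assuming that the feature is named `PacketDrop` in the `spec.agent.ebpf.features` list and that the eBPF agent must run privileged for it; verify both assumptions against the FlowCollector API reference for your Operator version.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      features:
      - PacketDrop       # assumed feature name for packet drop tracking
      privileged: true   # assumed requirement for the drop tracepoint hook
```

After the agents restart, the packet drop panels listed above should start to appear in the Overview view.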
7.1.3.1. Types of packet drops
Two kinds of packet drops are detected by Network Observability: host drops and OVS drops. Host drops are prefixed with `SKB_DROP` and OVS drops are prefixed with `OVS_DROP`. Dropped flows are shown in the side panel of the Traffic flows table along with a link to a description of each drop type. Examples of host drop reasons are as follows:
- `SKB_DROP_REASON_NO_SOCKET`: the packet dropped due to a missing socket.
- `SKB_DROP_REASON_TCP_CSUM`: the packet dropped due to a TCP checksum error.
Examples of OVS drop reasons are as follows:
- `OVS_DROP_LAST_ACTION`: OVS packets dropped due to an implicit drop action, for example due to a configured network policy.
- `OVS_DROP_IP_TTL`: OVS packets dropped due to an expired IP TTL.
See the Additional resources of this section for more information about enabling and working with packet drop tracking.
Additional resources
7.1.4. DNS tracking
You can configure graphical representation of Domain Name System (DNS) tracking of network flows in the Overview view. Using DNS tracking with extended Berkeley Packet Filter (eBPF) tracepoint hooks can serve various purposes:
- Network Monitoring: Gain insights into DNS queries and responses, helping network administrators identify unusual patterns, potential bottlenecks, or performance issues.
- Security Analysis: Detect suspicious DNS activities, such as domain name generation algorithms (DGA) used by malware, or identify unauthorized DNS resolutions that might indicate a security breach.
- Troubleshooting: Debug DNS-related issues by tracing DNS resolution steps, tracking latency, and identifying misconfigurations.
By default, when DNS tracking is enabled, you can see the following non-empty metrics represented in a donut or line chart in the Overview:
- Top X DNS Response Code
- Top X average DNS latencies with overall
- Top X 90th percentile DNS latencies
Other DNS tracking panels can be added in Manage panels:
- Bottom X minimum DNS latencies
- Top X maximum DNS latencies
- Top X 99th percentile DNS latencies
This feature is supported for IPv4 and IPv6 UDP and TCP protocols.
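DNS tracking is also enabled through the `FlowCollector` resource. The following is a minimal sketch, assuming that the feature is named `DNSTracking` in the `spec.agent.ebpf.features` list; confirm the name in the FlowCollector API reference.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      features:
      - DNSTracking   # assumed feature name for DNS tracking
```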
See the Additional resources in this section for more information about enabling and working with this view.
Additional resources
7.1.5. Round-Trip Time
You can use TCP smoothed Round-Trip Time (sRTT) to analyze network flow latencies. You can use RTT captured from the `fentry/tcp_rcv_established` eBPF hookpoint to read sRTT from the TCP socket to help with the following:
- Network Monitoring: Gain insights into TCP latencies, helping network administrators identify unusual patterns, potential bottlenecks, or performance issues.
- Troubleshooting: Debug TCP-related issues by tracking latency and identifying misconfigurations.
By default, when RTT is enabled, you can see the following TCP RTT metrics represented in the Overview:
- Top X 90th percentile TCP Round Trip Time with overall
- Top X average TCP Round Trip Time with overall
- Bottom X minimum TCP Round Trip Time with overall
Other RTT panels can be added in Manage panels:
- Top X maximum TCP Round Trip Time with overall
- Top X 99th percentile TCP Round Trip Time with overall
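RTT measurement follows the same pattern as the other agent features. The sketch below assumes that the feature is named `FlowRTT` in the `spec.agent.ebpf.features` list and shows it alongside a second entry, because the list can hold more than one feature; confirm the names in the FlowCollector API reference.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      features:
      - FlowRTT       # assumed feature name for TCP round-trip time
      - DNSTracking   # optional: features can be combined in one list
```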
See the Additional resources in this section for more information about enabling and working with this view.
Additional resources
7.1.6. eBPF flow rule filter
You can use rule-based filtering to control the volume of packets cached in the eBPF flow table. For example, a filter can specify that only packets coming from port 100 should be recorded. Then only the packets that match the filter are cached and the rest are not cached.
7.1.6.1. Ingress and egress traffic filtering
CIDR notation efficiently represents IP address ranges by combining the base IP address with a prefix length. For both ingress and egress traffic, the source IP address is first used to match filter rules configured with CIDR notation. If there is a match, then the filtering proceeds. If there is no match, then the destination IP is used to match filter rules configured with CIDR notation.
After matching either the source IP or the destination IP CIDR, you can pinpoint specific endpoints by using the `peerIP` to differentiate the destination IP address of the packet. Based on the provisioned action, the flow data is either cached in the eBPF flow table or not cached.
7.1.6.2. Dashboard and metrics integrations
When this option is enabled, the Netobserv/Health dashboard for eBPF agent statistics has the Filtered flows rate view. Additionally, in Observe → Metrics you can query `netobserv_agent_filtered_flows_total` to observe metrics with the reason in `FlowFilterAcceptCounter`, `FlowFilterNoMatchCounter`, or `FlowFilterRejectCounter`.
7.1.6.3. Flow filter configuration parameters
The flow filter rules consist of required and optional parameters.
Parameter | Description |
---|---|
|
Set |
|
Provides the IP address and CIDR mask for the flow filter rule. Supports both IPv4 and IPv6 address format. If you want to match against any IP, you can use |
|
Describes the action that is taken for the flow filter rule. The possible values are
|
Parameter | Description |
---|---|
|
Defines the direction of the flow filter rule. Possible values are |
|
Defines the protocol of the flow filter rule. Possible values are |
|
Defines the ports to use for filtering flows. It can be used for either source or destination ports. To filter a single port, set a single port as an integer value. For example |
|
Defines the source port to use for filtering flows. To filter a single port, set a single port as an integer value, for example |
|
DestPorts defines the destination ports to use for filtering flows. To filter a single port, set a single port as an integer value, for example |
| Defines the ICMP type to use for filtering flows. |
| Defines the ICMP code to use for filtering flows. |
|
Defines the IP address to use for filtering flows, for example: |
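The port 100 example from the start of this section might be expressed as the following flow filter rule. This is a sketch under assumptions: the rule is presumed to live under `spec.agent.ebpf.flowFilter`, and the field names `enable`, `cidr`, `action`, `protocol`, and `sourcePorts`, as well as the values `Accept` and `TCP`, are recalled from the FlowCollector API rather than taken from the tables above, so verify them in the API reference before use.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      flowFilter:          # assumed location of the flow filter rule
        enable: true       # assumed switch that turns the filter on
        cidr: 0.0.0.0/0    # match any IPv4 address
        action: Accept     # assumed value: cache flows that match the rule
        protocol: TCP      # assumed value
        sourcePorts: 100   # only packets coming from port 100
```

A narrower `cidr`, such as a single pod subnet, reduces the cached flows further, because the CIDR is matched before the other parameters are considered.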
7.2. Observing the network traffic from the Traffic flows view
The Traffic flows view displays the data of the network flows and the amount of traffic in a table. As an administrator, you can monitor the amount of traffic across the application by using the traffic flow table.
7.2.1. Working with the Traffic flows view
As an administrator, you can navigate to the Traffic flows table to see network flow information.
Procedure
- Navigate to Observe → Network Traffic.
- In the Network Traffic page, click the Traffic flows tab.
You can click on each row to get the corresponding flow information.
7.2.2. Configuring advanced options for the Traffic flows view
You can customize and export the view by using Show advanced options. You can set the row size by using the Display options drop-down menu. The default value is Normal.
7.2.2.1. Managing columns
You can select the required columns to be displayed, and reorder them. To manage columns, click Manage columns.
7.2.2.2. Exporting the traffic flow data
You can export data from the Traffic flows view.
Procedure
- Click Export data.
- In the pop-up window, you can select the Export all data checkbox to export all the data, and clear the checkbox to select the required fields to be exported.
- Click Export.
7.2.3. Working with conversation tracking
As an administrator, you can group network flows that are part of the same conversation. A conversation is defined as a grouping of peers that are identified by their IP addresses, ports, and protocols, resulting in a unique Conversation Id. You can query conversation events in the web console. These events are represented in the web console as follows:
- Conversation start: This event happens when a connection is starting or a TCP flag is intercepted.
- Conversation tick: This event happens at each specified interval defined in the `FlowCollector` `spec.processor.advanced.conversationHeartbeatInterval` parameter while the connection is active.
- Conversation end: This event happens when the `FlowCollector` `spec.processor.advanced.conversationEndTimeout` parameter is reached or the TCP flag is intercepted.
- Flow: This is the network traffic flow that occurs within the specified interval.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster then select the YAML tab.
Configure the FlowCollector custom resource so that the spec.processor.logTypes, conversationEndTimeout, and conversationHeartbeatInterval parameters are set according to your observation needs. A sample configuration is as follows:
Configure FlowCollector for conversation tracking
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    logTypes: Flows                        # 1
    advanced:
      conversationEndTimeout: 10s          # 2
      conversationHeartbeatInterval: 30s   # 3
- 1
- When logTypes is set to Flows, only the Flow event is exported. If you set the value to All, both conversation and flow events are exported and visible in the Network Traffic page. To focus only on conversation events, you can specify Conversations, which exports the Conversation start, Conversation tick, and Conversation end events, or EndedConversations, which exports only the Conversation end events. Storage requirements are highest for All and lowest for EndedConversations.
- 2
- The Conversation end event represents the point when the conversationEndTimeout is reached or the TCP flag is intercepted.
- 3
- The Conversation tick event represents each specified interval defined in the FlowCollector conversationHeartbeatInterval parameter while the network connection is active.
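The relationship between each logTypes value and the events it exports can be summarized in a small Python lookup. This is only a restatement of the description above. The names `EXPORTED_EVENTS` and `is_exported` are hypothetical helpers, not part of the Operator.

```python
# Events exported for each logTypes value, as described in the callouts above.
EXPORTED_EVENTS = {
    "Flows": {"Flow"},
    "Conversations": {"Conversation start", "Conversation tick", "Conversation end"},
    "EndedConversations": {"Conversation end"},
}
# "All" exports both flow events and every conversation event.
EXPORTED_EVENTS["All"] = EXPORTED_EVENTS["Flows"] | EXPORTED_EVENTS["Conversations"]


def is_exported(log_types, event):
    """Return True if the given event is exported for the logTypes value."""
    return event in EXPORTED_EVENTS[log_types]


print(sorted(EXPORTED_EVENTS["All"]))
```

The size of each set mirrors the storage guidance: All exports the most event types and EndedConversations the fewest.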
Note: If you update the logTypes option, the flows from the previous selection do not clear from the console plugin. For example, if you initially set logTypes to Conversations for a span of time until 10 AM and then move to EndedConversations, the console plugin shows all conversation events before 10 AM and only ended conversations after 10 AM.
- Refresh the Network Traffic page on the Traffic flows tab. Notice there are two new columns, Event/Type and Conversation Id. All the Event/Type fields are Flow when Flow is the selected query option.
- Select Query Options and choose the Log Type, Conversation. Now the Event/Type shows all of the desired conversation events.
- Next, you can filter on a specific conversation ID or switch between the Conversation and Flow log type options from the side panel.
7.2.4. Working with packet drops
Packet loss occurs when one or more packets of network flow data fail to reach their destination. You can track these drops by editing the FlowCollector to the specifications in the following YAML example.
CPU and memory usage increases when this feature is enabled.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
Configure the FlowCollector custom resource for packet drops, for example:
Example FlowCollector configuration
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  agent:
    type: eBPF
    ebpf:
      features:
        - PacketDrop    # 1
      privileged: true  # 2
Verification
When you refresh the Network Traffic page, the Overview, Traffic Flow, and Topology views display new information about packet drops:
- Select new choices in Manage panels to choose which graphical visualizations of packet drops to display in the Overview.
- Select new choices in Manage columns to choose which packet drop information to display in the Traffic flows table.
- In the Traffic Flows view, you can also expand the side panel to view more information about packet drops. Host drops are prefixed with SKB_DROP and OVS drops are prefixed with OVS_DROP.
- In the Topology view, red lines are displayed where drops are present.
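When post-processing exported flow data, the two prefixes can be used to separate host drops from OVS drops. The following Python sketch assumes the drop cause is available as a plain string. The function name `drop_origin` and the sample cause values are illustrative assumptions.

```python
def drop_origin(cause):
    """Classify a packet drop cause by its prefix."""
    if cause.startswith("SKB_DROP"):
        return "host"
    if cause.startswith("OVS_DROP"):
        return "ovs"
    return "unknown"


# Sample cause strings, used here only for illustration.
for cause in ["SKB_DROP_REASON_NO_SOCKET", "OVS_DROP_LAST_ACTION"]:
    print(cause, drop_origin(cause))
```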
7.2.5. Working with DNS tracking
Using DNS tracking, you can monitor your network, conduct security analysis, and troubleshoot DNS issues. You can track DNS by editing the FlowCollector to the specifications in the following YAML example.
CPU and memory usage increases are observed in the eBPF agent when this feature is enabled.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for Network Observability, select Flow Collector.
- Select cluster then select the YAML tab.
Configure the FlowCollector custom resource. A sample configuration is as follows:
Configure FlowCollector for DNS tracking
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  agent:
    type: eBPF
    ebpf:
      features:
        - DNSTracking  # 1
      sampling: 1      # 2
- 1
- You can set the spec.agent.ebpf.features parameter list to enable DNS tracking of each network flow in the web console.
- 2
- You can set sampling to a value of 1 for more accurate metrics and to capture DNS latency. For a sampling value greater than 1, you can observe flows with DNS Response Code and DNS Id, and it is unlikely that DNS Latency can be observed.
When you refresh the Network Traffic page, there are new DNS representations you can choose to view in the Overview and Traffic Flow views and new filters you can apply.
- Select new DNS choices in Manage panels to display graphical visualizations and DNS metrics in the Overview.
- Select new choices in Manage columns to add DNS columns to the Traffic Flows view.
- Filter on specific DNS metrics, such as DNS Id, DNS Error, DNS Latency, and DNS Response Code, and see more information from the side panel. The DNS Latency and DNS Response Code columns are shown by default.
TCP handshake packets do not have DNS headers. TCP protocol flows without DNS headers are shown in the traffic flow data with DNS Latency, ID, and Response code values of "n/a". You can filter out flow data to view only flows that have DNS headers using the Common filter "DNSError" equal to "0".
7.2.6. Working with RTT tracing
You can track round-trip time (RTT) by editing the FlowCollector to the specifications in the following YAML example.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- In the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
Configure the FlowCollector custom resource for RTT tracing, for example:
Example FlowCollector configuration
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  agent:
    type: eBPF
    ebpf:
      features:
        - FlowRTT  # 1
- 1
- You can start tracing RTT network flows by listing the FlowRTT parameter in the spec.agent.ebpf.features specification list.
Verification
When you refresh the Network Traffic page, the Overview, Traffic Flow, and Topology views display new information about RTT:
- In the Overview, select new choices in Manage panels to choose which graphical visualizations of RTT to display.
- In the Traffic flows table, the Flow RTT column can be seen, and you can manage display in Manage columns.
In the Traffic Flows view, you can also expand the side panel to view more information about RTT.
Example filtering
- Click the Common filters → Protocol.
- Filter the network flow data based on TCP, Ingress direction, and look for FlowRTT values greater than 10,000,000 nanoseconds (10ms).
- Remove the Protocol filter.
- Filter for Flow RTT values greater than 0 in the Common filters.
- In the Topology view, click the Display option dropdown. Then click RTT in the edge labels drop-down list.
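The RTT filter in the example above compares values in nanoseconds, so a 10 ms threshold corresponds to 10,000,000. The following Python sketch applies the same filter to exported records. The record keys `protocol` and `rtt_ns` are hypothetical, simplified names, and `slow_tcp_flows` is an illustrative helper rather than a console feature.

```python
NS_PER_MS = 1_000_000  # nanoseconds in one millisecond


def slow_tcp_flows(flows, threshold_ms=10):
    """Return TCP flows whose RTT is strictly greater than the threshold."""
    threshold_ns = threshold_ms * NS_PER_MS
    return [
        flow for flow in flows
        if flow.get("protocol") == "TCP" and flow.get("rtt_ns", 0) > threshold_ns
    ]


sample = [
    {"protocol": "TCP", "rtt_ns": 12_000_000},  # 12 ms, kept
    {"protocol": "TCP", "rtt_ns": 10_000_000},  # exactly 10 ms, not greater
    {"protocol": "UDP", "rtt_ns": 50_000_000},  # not TCP
]
print(slow_tcp_flows(sample))
```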
7.2.6.1. Using the histogram
You can click Show histogram to display a toolbar view for visualizing the history of flows as a bar chart. The histogram shows the number of logs over time. You can select a part of the histogram to filter the network flow data in the table that follows the toolbar.
7.2.7. Working with availability zones
You can configure the FlowCollector to collect information about the cluster availability zones. This allows you to enrich network flow data with the topology.kubernetes.io/zone label value applied to the nodes.
Procedure
- In the web console, go to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster then select the YAML tab.
Configure the FlowCollector custom resource so that the spec.processor.addZone parameter is set to true. A sample configuration is as follows:
Configure FlowCollector for availability zones collection
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
# ...
  processor:
    addZone: true
# ...
Verification
When you refresh the Network Traffic page, the Overview, Traffic Flow, and Topology views display new information about availability zones:
- In the Overview tab, you can see Zones as an available Scope.
- In Network Traffic → Traffic flows, Zones are viewable under the SrcK8S_Zone and DstK8S_Zone fields.
- In the Topology view, you can set Zones as Scope or Group.
7.2.8. Filtering eBPF flow data using a global rule
You can configure the FlowCollector to filter eBPF flows using a global rule to control the flow of packets cached in the eBPF flow table.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for Network Observability, select Flow Collector.
- Select cluster, then select the YAML tab.
Configure the FlowCollector custom resource, similar to the following sample configurations:
Example 7.1. Filter Kubernetes service traffic to a specific Pod IP endpoint
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Direct
  agent:
    type: eBPF
    ebpf:
      flowFilter:
        action: Accept          # 1
        cidr: 172.210.150.1/24  # 2
        protocol: SCTP
        direction: Ingress
        destPortRange: 80-100
        peerIP: 10.10.10.10
        enable: true            # 3
- 1
- The required action parameter describes the action that is taken for the flow filter rule. Possible values are Accept or Reject.
- 2
- The required cidr parameter provides the IP address and CIDR mask for the flow filter rule and supports IPv4 and IPv6 address formats. If you want to match against any IP address, you can use 0.0.0.0/0 for IPv4 or ::/0 for IPv6.
- 3
- You must set spec.agent.ebpf.flowFilter.enable to true to enable this feature.
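To see which addresses and ports a rule such as the one in Example 7.1 covers, you can reason about the CIDR and port range with the Python standard library. This is a minimal sketch of the range arithmetic only. The function name `in_rule_range` is hypothetical, and the sketch does not model which side of the flow the eBPF agent compares or any other filter field.

```python
import ipaddress


def in_rule_range(ip, port, cidr="172.210.150.1/24", port_range="80-100"):
    """Check an address against a CIDR and a port against an inclusive range."""
    # strict=False accepts a CIDR whose host bits are set, as in the sample value.
    network = ipaddress.ip_network(cidr, strict=False)
    low, high = (int(part) for part in port_range.split("-"))
    return ipaddress.ip_address(ip) in network and low <= port <= high


print(in_rule_range("172.210.150.77", 80))
print(in_rule_range("172.210.151.1", 80))
```

With a /24 mask, the sample CIDR covers every address from 172.210.150.0 to 172.210.150.255, regardless of the final octet written in the rule.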
Example 7.2. See flows to any addresses outside the cluster
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Direct
  agent:
    type: eBPF
    ebpf:
      flowFilter:
        action: Accept          # 1
        cidr: 0.0.0.0/0         # 2
        protocol: TCP
        direction: Egress
        sourcePort: 100
        peerIP: 192.168.127.12  # 3
        enable: true            # 4
7.3. Observing the network traffic from the Topology view
The Topology view provides a graphical representation of the network flows and the amount of traffic. As an administrator, you can monitor the traffic data across the application by using the Topology view.
7.3.1. Working with the Topology view
As an administrator, you can navigate to the Topology view to see the details and metrics of the component.
Procedure
- Navigate to Observe → Network Traffic.
- In the Network Traffic page, click the Topology tab.
You can click each component in the Topology to view the details and metrics of the component.
7.3.2. Configuring the advanced options for the Topology view
You can customize and export the view by using Show advanced options. The advanced options view has the following features:
- Find in view: To search the required components in the view.
Display options: To configure the following options:
- Edge labels: To show the specified measurements as edge labels. The default is to show the Average rate in Bytes.
- Scope: To select the scope of components between which the network traffic flows. The default value is Namespace.
- Groups: To enhance the understanding of ownership by grouping the components. The default value is None.
- Layout: To select the layout of the graphical representation. The default value is ColaNoForce.
- Show: To select the details that need to be displayed. All the options are checked by default. The options available are: Edges, Edges label, and Badges.
- Truncate labels: To select the required width of the label from the drop-down list. The default value is M.
- Collapse groups: To expand or collapse the groups. The groups are expanded by default. This option is disabled if Groups has the value of None.
7.3.2.1. Exporting the topology view
To export the view, click Export topology view. The view is downloaded in PNG format.
7.4. Filtering the network traffic
By default, the Network Traffic page displays the traffic flow data in the cluster based on the default filters configured in the FlowCollector instance. You can use the filter options to observe the required data by changing the preset filter.
- Query Options
You can use Query Options to optimize the search results, as listed below:
- Log Type: The available options Conversation and Flows provide the ability to query flows by log type, such as flow log, new conversation, completed conversation, and a heartbeat, which is a periodic record with updates for long conversations. A conversation is an aggregation of flows between the same peers.
- Match filters: You can determine the relation between different filter parameters selected in the advanced filter. The available options are Match all and Match any. Match all provides results that match all the values, and Match any provides results that match any of the values entered. The default value is Match all.
- Datasource: You can choose the datasource to use for queries: Loki, Prometheus, or Auto. Notable performance improvements can be realized when using Prometheus as a datasource rather than Loki, but Prometheus supports a limited set of filters and aggregations. The default datasource is Auto, which uses Prometheus on supported queries or uses Loki if the query does not support Prometheus.
Drops filter: You can view different levels of dropped packets with the following query options:
- Fully dropped shows flow records with fully dropped packets.
- Containing drops shows flow records that contain drops but can be sent.
- Without drops shows records that contain sent packets.
- All shows all the aforementioned records.
- Limit: The data limit for internal backend queries. Depending on the matching and filter settings, the number of traffic flow records displayed stays within the specified limit.
- Quick filters
- The default values in the Quick filters drop-down menu are defined in the FlowCollector configuration. You can modify the options from the console.
- Advanced filters
- You can set the advanced filters, Common, Source, or Destination, by selecting the parameter to be filtered from the dropdown list. The flow data is filtered based on the selection. To enable or disable the applied filter, you can click on the applied filter listed below the filter options.
You can toggle between One way and Back and forth filtering. The One way filter shows only Source and Destination traffic according to your filter selections. You can use Swap to change the directional view of the Source and Destination traffic. The Back and forth filter includes return traffic with the Source and Destination filters. The directional flow of network traffic is shown in the Direction column in the Traffic flows table as Ingress or Egress for inter-node traffic and Inner for traffic inside a single node.
You can click Reset defaults to remove the existing filters, and apply the filter defined in the FlowCollector configuration.
To understand the rules of specifying the text value, click Learn More.
Alternatively, you can access the traffic flow data in the Network Traffic tab of the Namespaces, Services, Routes, Nodes, and Workloads pages which provide the filtered data of the corresponding aggregations.
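The difference between the Match all and Match any options described under Query Options can be sketched in Python. The example assumes each filter is a simple field-equals-value test. The function name `flow_matches` and the sample record are illustrative, and the console supports richer matching than this.

```python
def flow_matches(flow, filters, mode="all"):
    """Apply field=value filters with Match all or Match any semantics."""
    checks = [flow.get(field) == value for field, value in filters.items()]
    return all(checks) if mode == "all" else any(checks)


flow = {"SrcK8S_Namespace": "frontend", "DstK8S_Namespace": "backend"}
filters = {"SrcK8S_Namespace": "frontend", "DstK8S_Namespace": "database"}
print(flow_matches(flow, filters, mode="all"))
print(flow_matches(flow, filters, mode="any"))
```

In this sample only one of the two filters matches, so the record is excluded under Match all and included under Match any.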
Additional resources
For more information about configuring quick filters in the FlowCollector, see Configuring Quick Filters and the Flow Collector sample resource.
Chapter 8. Using metrics with dashboards and alerts
The Network Observability Operator uses the flowlogs-pipeline to generate metrics from flow logs. You can utilize these metrics by setting custom alerts and viewing dashboards.
8.1. Viewing Network Observability metrics dashboards
On the Overview tab in the OpenShift Container Platform console, you can view the overall aggregated metrics of the network traffic flow on the cluster. You can choose to display the information by node, namespace, owner, pod, and service. You can also use filters and display options to further refine the metrics.
Procedure
- In the web console Observe → Dashboards, select the Netobserv dashboard.
View network traffic metrics in the following categories, with each having the subset per node, namespace, source, and destination:
- Byte rates
- Packet drops
- DNS
- RTT
- Select the Netobserv/Health dashboard.
View metrics about the health of the Operator in the following categories, with each having the subset per node, namespace, source, and destination.
- Flows
- Flows Overhead
- Flow rates
- Agents
- Processor
- Operator
Infrastructure and Application metrics are shown in a split-view for namespace and workloads.
8.2. Predefined metrics
Metrics generated by the flowlogs-pipeline are configurable in the spec.processor.metrics.includeList of the FlowCollector custom resource to add or remove metrics.
8.3. Network Observability metrics
You can also create alerts by using the includeList metrics in Prometheus rules, as shown in the example "Creating alerts".
When looking for these metrics in Prometheus, such as in the Console through Observe → Metrics, or when defining alerts, all the metrics names are prefixed with netobserv_. For example, netobserv_namespace_flows_total. Available metrics names are as follows:
- includeList metrics names
Names followed by an asterisk * are enabled by default.
  - namespace_egress_bytes_total
  - namespace_egress_packets_total
  - namespace_ingress_bytes_total
  - namespace_ingress_packets_total
  - namespace_flows_total *
  - node_egress_bytes_total
  - node_egress_packets_total
  - node_ingress_bytes_total *
  - node_ingress_packets_total
  - node_flows_total
  - workload_egress_bytes_total
  - workload_egress_packets_total
  - workload_ingress_bytes_total *
  - workload_ingress_packets_total
  - workload_flows_total
- PacketDrop metrics names
When the PacketDrop feature is enabled in spec.agent.ebpf.features (with privileged mode), the following additional metrics are available:
  - namespace_drop_bytes_total
  - namespace_drop_packets_total *
  - node_drop_bytes_total
  - node_drop_packets_total
  - workload_drop_bytes_total
  - workload_drop_packets_total
- DNS metrics names
When the DNSTracking feature is enabled in spec.agent.ebpf.features, the following additional metrics are available:
  - namespace_dns_latency_seconds *
  - node_dns_latency_seconds
  - workload_dns_latency_seconds
- FlowRTT metrics names
When the FlowRTT feature is enabled in spec.agent.ebpf.features, the following additional metrics are available:
  - namespace_rtt_seconds *
  - node_rtt_seconds
  - workload_rtt_seconds
8.4. Creating alerts
You can create custom alerting rules for the Netobserv dashboard metrics to trigger alerts when some defined conditions are met.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
- You have the Network Observability Operator installed.
Procedure
- Create a YAML file by clicking the import icon, +.
Add an alerting rule configuration to the YAML file. In the YAML sample that follows, an alert is created for when the cluster ingress traffic reaches a given threshold of 10 MBps per destination workload.
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        message: |-
          {{ $labels.job }}: incoming traffic exceeding 10 MBps for 30s on {{ $labels.DstK8S_OwnerType }} {{ $labels.DstK8S_OwnerName }} ({{ $labels.DstK8S_Namespace }}).
        summary: "High incoming traffic."
      expr: sum(rate(netobserv_workload_ingress_bytes_total {SrcK8S_Namespace="openshift-ingress"}[1m])) by (job, DstK8S_Namespace, DstK8S_OwnerName, DstK8S_OwnerType) > 10000000  # 1
      for: 30s
      labels:
        severity: warning
- 1
- The netobserv_workload_ingress_bytes_total metric is enabled by default in spec.processor.metrics.includeList.
- Click Create to apply the configuration file to the cluster.
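The threshold in the alert expression is expressed in bytes per second, so 10 MBps corresponds to 10000000. The following Python sketch shows the arithmetic behind a per-second rate over a one-minute window. It is a simplification: the Prometheus rate() function also handles counter resets and extrapolation, which this hypothetical `simple_rate` helper does not.

```python
THRESHOLD_BYTES_PER_SECOND = 10_000_000  # 10 MBps, as in the alert expression


def simple_rate(samples):
    """Average per-second increase between the first and last counter sample.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# A workload that received 900 MB over 60 seconds averages 15 MBps.
rate = simple_rate([(0, 0), (60, 900_000_000)])
print(rate > THRESHOLD_BYTES_PER_SECOND)
```

Because the rule also sets for: 30s, the condition has to hold for 30 seconds before the alert fires.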
8.5. Custom metrics
You can create custom metrics out of the flowlogs data using the FlowMetric API. Every collected flow log contains a number of labeled fields, such as source name and destination name. These fields can be leveraged as Prometheus labels to enable the customization of cluster information on your dashboard.
8.6. Configuring custom metrics by using FlowMetric API
You can configure the FlowMetric API to create custom metrics by using flowlogs data fields as Prometheus labels. You can add multiple FlowMetric resources to a project to see multiple dashboard views.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- In the Provided APIs heading for the NetObserv Operator, select FlowMetric.
- In the Project: dropdown list, select the project of the Network Observability Operator instance.
- Click Create FlowMetric.
Configure the FlowMetric resource, similar to the following sample configurations:
Example 8.1. Generate a metric that tracks ingress bytes received from cluster external sources
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
  namespace: netobserv                              # 1
spec:
  metricName: cluster_external_ingress_bytes_total  # 2
  type: Counter                                     # 3
  valueField: Bytes
  direction: Ingress                                # 4
  labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType]  # 5
  filters:                                          # 6
  - field: SrcSubnetLabel
    matchType: Absence
- 1
- The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
- 2
- The name of the Prometheus metric, which in the web console appears with the prefix netobserv_<metricName>.
- 3
- The type specifies the type of metric. The Counter type is useful for counting bytes or packets.
- 4
- The direction of traffic to capture. If not specified, both ingress and egress are captured, which can lead to duplicated counts.
- 5
- Labels define what the metrics look like and the relationship between the different entities and also define the metrics cardinality. For example, SrcK8S_Name is a high cardinality metric.
- 6
- Refines results based on the listed criteria. In this example, selecting only the cluster external traffic is done by matching only flows where SrcSubnetLabel is absent. This assumes the subnet labels feature is enabled (via spec.processor.subnetLabels), which is done by default.
Verification
- Once the pods refresh, navigate to Observe → Metrics.
- In the Expression field, type the metric name to view the corresponding result. You can also enter an expression, such as topk(5, sum(rate(netobserv_cluster_external_ingress_bytes_total{DstK8S_Namespace="my-namespace"}[2m])) by (DstK8S_HostName, DstK8S_OwnerName, DstK8S_OwnerType))
Example 8.2. Show RTT latency for cluster external ingress traffic
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-rtt
  namespace: netobserv     # 1
spec:
  metricName: cluster_external_ingress_rtt_seconds
  type: Histogram          # 2
  valueField: TimeFlowRttNs
  direction: Ingress
  labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType]
  filters:
  - field: SrcSubnetLabel
    matchType: Absence
  - field: TimeFlowRttNs
    matchType: Presence
  divider: "1000000000"    # 3
  buckets: [".001", ".005", ".01", ".02", ".03", ".04", ".05", ".075", ".1", ".25", "1"]  # 4
- 1
- The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
- 2
- The type specifies the type of metric. The Histogram type is useful for a latency value (TimeFlowRttNs).
- 3
- Because the round-trip time (RTT) is provided in nanoseconds in flows, use a divider of 1 billion to convert it into seconds, which is standard in Prometheus guidelines.
- 4
- The custom buckets specify precision on RTT, with optimal precision ranging between 5ms and 250ms.
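The effect of the divider and buckets settings can be reproduced with a few lines of Python. The sketch converts an RTT in nanoseconds to seconds and finds the smallest bucket whose upper bound is greater than or equal to the value, which is how cumulative histogram buckets are conventionally bounded. The names `BUCKETS`, `DIVIDER`, and `bucket_for` are hypothetical and only mirror the sample values.

```python
import bisect

# Upper bounds in seconds, copied from the sample buckets list.
BUCKETS = [0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.1, 0.25, 1.0]
DIVIDER = 1_000_000_000  # nanoseconds per second


def bucket_for(rtt_ns):
    """Return the smallest bucket upper bound that contains the RTT value."""
    seconds = rtt_ns / DIVIDER
    index = bisect.bisect_left(BUCKETS, seconds)
    # Values above the largest bound fall into the implicit +Inf bucket.
    return BUCKETS[index] if index < len(BUCKETS) else float("inf")


print(bucket_for(12_000_000))  # a 12 ms RTT
```

Most of the bounds sit between 5 ms and 250 ms, which is why precision is best in that range.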
Verification
- Once the pods refresh, navigate to Observe → Metrics.
- In the Expression field, you can type the metric name to view the corresponding result.
High cardinality can affect the memory usage of Prometheus. You can check whether specific labels have high cardinality in the Network Flows format reference.
8.7. Configuring custom charts using FlowMetric API
By defining the charts section of the FlowMetric resource, you can generate charts for dashboards in the OpenShift Container Platform web console. As an administrator, you can view these charts in the Dashboard menu.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- In the Provided APIs heading for the NetObserv Operator, select FlowMetric.
- In the Project: dropdown list, select the project of the Network Observability Operator instance.
- Click Create FlowMetric.
- Configure the FlowMetric resource, similar to the following sample configurations:
Example 8.3. Chart for tracking ingress bytes received from cluster external sources
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
  namespace: netobserv   # 1
# ...
  charts:
  - dashboardName: Main  # 2
    title: External ingress traffic
    unit: Bps
    type: SingleStat
    queries:
    - promQL: "sum(rate($METRIC[2m]))"
      legend: ""
  - dashboardName: Main  # 3
    sectionName: External
    title: Top external ingress traffic per workload
    unit: Bps
    type: StackArea
    queries:
    - promQL: "sum(rate($METRIC{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace, DstK8S_OwnerName)"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
# ...
- 1
- The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
Verification
- Once the pods refresh, navigate to Observe → Dashboards.
Search for the NetObserv / Main dashboard. View two panels under the NetObserv / Main dashboard, or optionally a dashboard name that you create:
- A textual single statistic showing the global external ingress rate summed across all dimensions
- A timeseries graph showing the same metric per destination workload
For more information about the query language, refer to the Prometheus documentation.
Example 8.4. Chart for RTT latency for cluster external ingress traffic
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
  namespace: netobserv   # 1
# ...
  charts:
  - dashboardName: Main  # 2
    title: External ingress TCP latency
    unit: seconds
    type: SingleStat
    queries:
    - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket[2m])) by (le)) > 0"
      legend: "p99"
  - dashboardName: Main  # 3
    sectionName: External
    title: "Top external ingress sRTT per workload, p50 (ms)"
    unit: seconds
    type: Line
    queries:
    - promQL: "histogram_quantile(0.5, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
  - dashboardName: Main  # 4
    sectionName: External
    title: "Top external ingress sRTT per workload, p99 (ms)"
    unit: seconds
    type: Line
    queries:
    - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
# ...
This example uses the histogram_quantile function to show p50 and p99.
You can show averages of histograms by dividing the metric, $METRIC_sum, by the metric, $METRIC_count, which are automatically generated when you create a histogram. With the preceding example, the Prometheus query to do this is as follows:
promQL: "(sum(rate($METRIC_sum{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName) / sum(rate($METRIC_count{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName))*1000"
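The arithmetic behind the averaging query can be checked with a short Python sketch. It assumes a list of observed RTT values in seconds and ignores the rate() windowing that the PromQL query applies. The function name `histogram_average_ms` is hypothetical.

```python
def histogram_average_ms(observations_seconds):
    """Average of the observations in milliseconds, computed as sum / count."""
    total = sum(observations_seconds)   # what the _sum series accumulates
    count = len(observations_seconds)   # what the _count series accumulates
    return (total / count) * 1000       # the *1000 converts seconds to ms


print(histogram_average_ms([0.25, 0.75]))
```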
Verification
- Once the pods refresh, navigate to Observe → Dashboards.
- Search for the NetObserv / Main dashboard. View the new panel under the NetObserv / Main dashboard, or optionally a dashboard name that you create.
For more information about the query language, refer to the Prometheus documentation.
Chapter 9. Monitoring the Network Observability Operator
You can use the web console to monitor alerts related to the health of the Network Observability Operator.
9.1. Health dashboards
Metrics about health and resource usage of the Network Observability Operator are located in the Observe → Dashboards page in the web console. You can view metrics about the health of the Operator in the following categories:
- Flows per second
- Sampling
- Errors last minute
- Dropped flows per second
- Flowlogs-pipeline statistics
- Flowlogs-pipeline statistics views
- eBPF agent statistics views
- Operator statistics
- Resource usage
9.2. Health alerts
A health alert banner that directs you to the dashboard can appear on the Network Traffic and Home pages if an alert is triggered. Alerts are generated in the following cases:
- The NetObservLokiError alert occurs if the flowlogs-pipeline workload is dropping flows because of Loki errors, such as if the Loki ingestion rate limit has been reached.
- The NetObservNoFlows alert occurs if no flows are ingested for a certain amount of time.
- The NetObservFlowsDropped alert occurs if the Network Observability eBPF agent hashmap table is full, and the eBPF agent processes flows with degraded performance, or when the capacity limiter is triggered.
9.3. Viewing health information
You can access metrics about health and resource usage of the Network Observability Operator from the Dashboards page in the web console.
Prerequisites
- You have the Network Observability Operator installed.
- You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
Procedure
- From the Administrator perspective in the web console, navigate to Observe → Dashboards.
- From the Dashboards dropdown, select Netobserv/Health.
- View the metrics about the health of the Operator that are displayed on the page.
9.3.1. Disabling health alerts
You can opt out of health alerting by editing the FlowCollector resource:
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster then select the YAML tab.
Add spec.processor.metrics.disableAlerts to disable health alerts, as in the following YAML sample:
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      disableAlerts: [NetObservLokiError, NetObservNoFlows]  # 1
- 1
- You can specify one or a list with both types of alerts to disable.
9.4. Creating Loki rate limit alerts for the NetObserv dashboard
You can create custom alerting rules for the Netobserv dashboard metrics to trigger alerts when Loki rate limits have been reached.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
- You have the Network Observability Operator installed.
Procedure
- Create a YAML file by clicking the import icon, +.
Add an alerting rule configuration to the YAML file. In the YAML sample that follows, an alert is created for when Loki rate limits have been reached:
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: loki-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: LokiRateLimitAlerts
    rules:
    - alert: LokiTenantRateLimit
      annotations:
        message: |-
          {{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
        summary: "At any number of requests are responded with the rate limit error code."
      expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 0
      for: 10s
      labels:
        severity: warning
- Click Create to apply the configuration file to the cluster.
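The arithmetic behind the alert expression can be sketched outside of Prometheus. The following Python snippet is only an illustration: it assumes you already have per-second request rates labeled by job, namespace, route, and status code, and the function names rate_limited_percent and firing are hypothetical. It groups the rates the same way as the by (job, namespace, route) clause and reports the groups where the share of 429 responses is above zero.

```python
from collections import defaultdict


def rate_limited_percent(samples):
    """Return the percentage of rate-limited (429) requests per group.

    samples is an iterable of (job, namespace, route, status_code, rate)
    tuples, where rate is a per-second request rate (hypothetical input).
    """
    limited = defaultdict(float)
    total = defaultdict(float)
    for job, namespace, route, status_code, rate in samples:
        key = (job, namespace, route)
        total[key] += rate
        if status_code == "429":
            limited[key] += rate
    # Groups without any traffic are skipped to avoid dividing by zero.
    return {key: limited[key] / total[key] * 100 for key in total if total[key] > 0}


def firing(samples):
    """Return the groups that would satisfy the '> 0' condition."""
    percents = rate_limited_percent(samples)
    return sorted(key for key, percent in percents.items() if percent > 0)
```

In Prometheus, a group with no 429 series produces no result at all rather than a value of 0, but the set of groups that satisfy the condition is the same in this sketch.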
9.5. Using the eBPF agent alert
An alert, NetObservAgentFlowsDropped, is triggered when the Network Observability eBPF agent hashmap table is full or when the capacity limiter is triggered. If you see this alert, consider increasing the cacheMaxFlows in the FlowCollector, as shown in the following example.
Increasing the cacheMaxFlows might increase the memory usage of the eBPF agent.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the Network Observability Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
- Increase the spec.agent.ebpf.cacheMaxFlows value, as shown in the following YAML sample:
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Direct
  agent:
    type: eBPF
    ebpf:
      cacheMaxFlows: 200000 1
- 1
- Increase the cacheMaxFlows value from its value at the time of the NetObservAgentFlowsDropped alert.
Additional resources
- For more information about creating alerts that you can see on the dashboard, see Creating alerting rules for user-defined projects.
Chapter 10. Scheduling resources
Taints and tolerations allow nodes to control which pods should (or should not) be scheduled on them.
A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.
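The matching rules described above can be sketched in a few lines of Python. This is a simplified illustration of the Kubernetes scheduling semantics, not the scheduler's implementation; the function names, and the taint key and value used in any example call, are hypothetical.

```python
def node_selector_matches(node_selector, node_labels):
    """A pod is eligible only if every selector key/value pair is a node label."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def toleration_matches(toleration, taint):
    """Return True if a single toleration tolerates a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty key is only meaningful with Exists, and then matches all keys.
    if not key:
        return operator == "Exists"
    if key != taint.get("key", ""):
        return False
    # Exists ignores the value; Equal requires the values to be identical.
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")
```

For example, a toleration with operator Equal tolerates a taint only when the key, value, and effect all agree, which is why the sample FlowCollector resource in the next section sets all three.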
10.1. Network Observability deployment in specific nodes
You can configure the FlowCollector to control the deployment of Network Observability components in specific nodes. The spec.agent.ebpf.advanced.scheduling, spec.processor.advanced.scheduling, and spec.consolePlugin.advanced.scheduling specifications have the following configurable settings:
- NodeSelector
- Tolerations
- Affinity
- PriorityClassName
Sample FlowCollector resource for spec.<component>.advanced.scheduling
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
# ...
  advanced:
    scheduling:
      tolerations:
      - key: "<taint key>"
        operator: "Equal"
        value: "<taint value>"
        effect: "<taint effect>"
      nodeSelector:
        <key>: <value>
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: name
                operator: In
                values:
                - app-worker-node
      priorityClassName: ""
# ...
Additional resources
- Understanding taints and tolerations
- Assign Pods to Nodes (Kubernetes documentation)
- Pod Priority and Preemption (Kubernetes documentation)
Chapter 11. Network Observability CLI
11.1. Installing the Network Observability CLI
The Network Observability CLI (oc netobserv) is temporarily unavailable and is expected to resolve with OCPBUGS-36146.
The Network Observability CLI (oc netobserv) is deployed separately from the Network Observability Operator. The CLI is available as an OpenShift CLI (oc) plugin. It provides a lightweight way to quickly debug and troubleshoot with network observability.
Network Observability CLI (oc netobserv) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
11.1.1. About the Network Observability CLI
You can quickly debug and troubleshoot networking issues by using the Network Observability CLI (oc netobserv). The Network Observability CLI is a flow and packet visualization tool that relies on eBPF agents to stream collected data to an ephemeral collector pod. It requires no persistent storage during the capture. After the run, the output is transferred to your local machine. This enables quick, live insight into packets and flow data without installing the Network Observability Operator.
CLI capture is meant to run only for short durations, such as 8-10 minutes. If it runs for too long, it can be difficult to delete the running process.
11.1.2. Installing the Network Observability CLI
Installing the Network Observability CLI (oc netobserv) is a separate procedure from the Network Observability Operator installation. This means that, even if you have the Operator installed from OperatorHub, you need to install the CLI separately.
You can optionally use Krew to install the netobserv CLI plugin. For more information, see "Installing a CLI plugin with Krew".
Prerequisites
- You must install the OpenShift CLI (oc).
- You must have a macOS or Linux operating system.
Procedure
- Download the oc netobserv CLI tar file.
- Unpack the archive:
$ tar xvf netobserv-cli.tar.gz
- Make the file executable:
$ chmod +x ./build/oc-netobserv
- Move the extracted oc-netobserv binary to a directory that is on your PATH, such as /usr/local/bin/:
$ sudo mv ./build/oc-netobserv /usr/local/bin/
Verification
- Verify that oc netobserv is available:
$ oc netobserv version
Example output
Netobserv CLI version <version>
Additional resources
11.2. Using the Network Observability CLI
You can visualize and filter the flows and packets data directly in the terminal to see specific usage, such as identifying who is using a specific port. The Network Observability CLI collects flows as JSON and database files or packets as a PCAP file, which you can use with third-party tools.
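As one illustration of post-processing a capture with your own tooling, the following Python sketch totals the bytes seen per destination port. It assumes that the flow file holds a JSON array of flow objects with the DstPort and Bytes fields shown in the example JSON file later in this chapter; the function names and the file path passed to load_flows are hypothetical.

```python
import json
from collections import Counter


def load_flows(path):
    """Load captured flows from a JSON file (assumed to hold an array of objects)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Tolerate a file that holds a single flow object instead of an array.
    return data if isinstance(data, list) else [data]


def bytes_by_destination_port(flows):
    """Sum the Bytes field of each flow, grouped by its DstPort field."""
    totals = Counter()
    for flow in flows:
        totals[flow.get("DstPort")] += flow.get("Bytes", 0)
    return totals


# Example usage, with a hypothetical capture file name:
# flows = load_flows("./output/flow/<capture_date_time>.json")
# print(bytes_by_destination_port(flows).most_common(5))
```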
11.2.1. Capturing flows
You can capture flows and filter on any resource or zone in the data to solve use cases, such as displaying Round-Trip Time (RTT) between two zones. Table visualization in the CLI provides viewing and flow search capabilities.
Prerequisites
- Install the OpenShift CLI (oc).
- Install the Network Observability CLI (oc netobserv) plugin.
Procedure
- Capture flows with filters enabled by running the following command:
$ oc netobserv flows --enable_filter=true --action=Accept --cidr=0.0.0.0/0 --protocol=TCP --port=49051
- Add filters to the live table filter prompt in the terminal to further refine the incoming flows. For example:
live table filter: [SrcK8S_Zone:us-west-1b] press enter to match multiple regular expressions at once
- To stop capturing, press Ctrl+C. The data that was captured is written to two separate files in a ./output directory located in the same path used to install the CLI.
- View the captured data in the ./output/flow/<capture_date_time>.json JSON file, which contains JSON arrays of the captured data.
Example JSON file
{
  "AgentIP": "10.0.1.76",
  "Bytes": 561,
  "DnsErrno": 0,
  "Dscp": 20,
  "DstAddr": "f904:ece9:ba63:6ac7:8018:1e5:7130:0",
  "DstMac": "0A:58:0A:80:00:37",
  "DstPort": 9999,
  "Duplicate": false,
  "Etype": 2048,
  "Flags": 16,
  "FlowDirection": 0,
  "IfDirection": 0,
  "Interface": "ens5",
  "K8S_FlowLayer": "infra",
  "Packets": 1,
  "Proto": 6,
  "SrcAddr": "3e06:6c10:6440:2:a80:37:b756:270f",
  "SrcMac": "0A:58:0A:80:00:01",
  "SrcPort": 46934,
  "TimeFlowEndMs": 1709741962111,
  "TimeFlowRttNs": 121000,
  "TimeFlowStartMs": 1709741962111,
  "TimeReceived": 1709741964
}
- You can use SQLite to inspect the ./output/flow/<capture_date_time>.db database file. For example:
- Open the file by running the following command:
$ sqlite3 ./output/flow/<capture_date_time>.db
- Query the data by running a SQLite SELECT statement, for example:
sqlite> SELECT DnsLatencyMs, DnsFlagsResponseCode, DnsId, DstAddr, DstPort, Interface, Proto, SrcAddr, SrcPort, Bytes, Packets FROM flow WHERE DnsLatencyMs >10 LIMIT 10;
Example output
12|NoError|58747|10.128.0.63|57856||17|172.30.0.10|53|284|1
11|NoError|20486|10.128.0.52|56575||17|169.254.169.254|53|225|1
11|NoError|59544|10.128.0.103|51089||17|172.30.0.10|53|307|1
13|NoError|32519|10.128.0.52|55241||17|169.254.169.254|53|254|1
12|NoError|32519|10.0.0.3|55241||17|169.254.169.254|53|254|1
15|NoError|57673|10.128.0.19|59051||17|172.30.0.10|53|313|1
13|NoError|35652|10.0.0.3|46532||17|169.254.169.254|53|183|1
32|NoError|37326|10.0.0.3|52718||17|169.254.169.254|53|169|1
14|NoError|14530|10.0.0.3|58203||17|169.254.169.254|53|246|1
15|NoError|40548|10.0.0.3|45933||17|169.254.169.254|53|174|1
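The same kind of query can also be run from a script. The following Python sketch uses the standard sqlite3 module and assumes only what the procedure above shows: a table named flow with the DnsLatencyMs, DnsFlagsResponseCode, DstAddr, DstPort, SrcAddr, and SrcPort columns. The function name slow_dns_flows and the descending sort order are illustrative choices, not part of the CLI.

```python
import sqlite3

# Parameterized query: the latency threshold and row limit are bound at run time.
QUERY = """
SELECT DnsLatencyMs, DnsFlagsResponseCode, DstAddr, DstPort, SrcAddr, SrcPort
FROM flow
WHERE DnsLatencyMs > ?
ORDER BY DnsLatencyMs DESC
LIMIT ?
"""


def slow_dns_flows(conn, min_latency_ms=10, limit=10):
    """Return the flows whose DNS latency exceeds min_latency_ms, slowest first."""
    return conn.execute(QUERY, (min_latency_ms, limit)).fetchall()


# Example usage, with a hypothetical capture file name:
# conn = sqlite3.connect("./output/flow/<capture_date_time>.db")
# for row in slow_dns_flows(conn):
#     print(row)
```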
11.2.2. Capturing packets
You can capture packets using the Network Observability CLI.
Prerequisites
- Install the OpenShift CLI (oc).
- Install the Network Observability CLI (oc netobserv) plugin.
Procedure
- Run the packet capture with filters enabled:
$ oc netobserv packets tcp,80
- Add filters to the live table filter prompt in the terminal to refine the incoming packets. An example filter is as follows:
live table filter: [SrcK8S_Zone:us-west-1b] press enter to match multiple regular expressions at once
- To stop capturing, press Ctrl+C.
- View the captured data, which is written to a single file in a ./output/pcap directory located in the same path that was used to install the CLI:
- The ./output/pcap/<capture_date_time>.pcap file can be opened with Wireshark.
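If you want a quick summary of a capture without opening Wireshark, a short script can count the packets in the file. The following Python sketch assumes that the capture is written in the classic libpcap file format (a 24-byte global header followed by 16-byte record headers) rather than pcapng; that format assumption and the function names are not confirmed by this document.

```python
import struct

# Magic numbers of the classic pcap format, mapped to the struct byte-order prefix.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def iter_pcap_records(stream):
    """Yield (timestamp_seconds, original_length, data) for each packet record."""
    header = stream.read(24)
    if len(header) < 24 or header[:4] not in PCAP_MAGICS:
        raise ValueError("not a classic pcap file")
    endian = PCAP_MAGICS[header[:4]]
    while True:
        record_header = stream.read(16)
        if len(record_header) < 16:
            return  # end of file, or a truncated trailing record
        ts_sec, _ts_frac, incl_len, orig_len = struct.unpack(endian + "IIII", record_header)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return  # truncated packet data
        yield ts_sec, orig_len, data


def summarize(stream):
    """Return (packet_count, total_original_bytes) for a pcap stream."""
    count = 0
    total = 0
    for _ts_sec, orig_len, _data in iter_pcap_records(stream):
        count += 1
        total += orig_len
    return count, total


# Example usage, with a hypothetical capture file name:
# with open("./output/pcap/<capture_date_time>.pcap", "rb") as handle:
#     print(summarize(handle))
```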
11.2.3. Cleaning the Network Observability CLI
You can manually clean the CLI workload by running oc netobserv cleanup. This command removes all the CLI components from your cluster.
When you end a capture, this command is run automatically by the client. You might be required to manually run it if you experience connectivity issues.
Procedure
Run the following command:
$ oc netobserv cleanup
Additional resources
11.3. Network Observability CLI (oc netobserv) reference
The Network Observability CLI (oc netobserv) has most features and filtering options that are available for the Network Observability Operator. You can pass command line arguments to enable features or filtering options.
11.3.1. oc netobserv CLI reference
The Network Observability CLI (oc netobserv) is a CLI tool for capturing flow data and packet data for further analysis.
oc netobserv syntax
$ oc netobserv [<command>] [<feature_option>] [<command_options>] 1
- 1
- Feature options can only be used with the oc netobserv flows command. They cannot be used with the oc netobserv packets command.
Command | Description |
---|---|
| Capture flows information. For subcommands, see the "Flow capture subcommands" table. |
|
Capture packets from a specific protocol or port pair, such as |
| Remove the Network Observability CLI components. |
| Print the software version. |
| Show help. |
11.3.1.1. Network Observability enrichment
Network Observability enrichment, which displays zone, node, owner, and resource names, together with the optional features for packet drops, DNS latencies, and round-trip time, can be enabled only when capturing flows. This information does not appear in the packet capture pcap output file.
Network Observability enrichment syntax
$ oc netobserv flows [<enrichment_options>] [<subcommands>]
Option | Description | Possible values | Default |
---|---|---|---|
| Enable packet drop. |
|
|
| Enable round trip time. |
|
|
| Enable DNS tracking. |
|
|
| Show help. | - | - |
|
Interfaces to match on the flow. For example, |
| - |
11.3.1.2. Flow capture options
Flow capture has mandatory commands as well as additional options, such as enabling extra features about packet drops, DNS latencies, Round-trip time, and filtering.
oc netobserv flows syntax
$ oc netobserv flows [<feature_option>] [<command_options>]
Option | Description | Possible values | Mandatory | Default |
---|---|---|---|---|
| Enable flow filter. |
| Yes |
|
| Action to apply on the flow. |
| Yes |
|
| CIDR to match on the flow. |
| Yes |
|
| Protocol to match on the flow |
| No | - |
| Direction to match on the flow |
| No | - |
| Destination port to match on the flow. |
| No | - |
| Source port to match on the flow. |
| No | - |
| Port to match on the flow. |
| No | - |
| Source port range to match on the flow. |
| No | - |
| Destination port range to match on the flow. |
| No | - |
| Port range to match on the flow. |
| No | - |
| ICMP type to match on the flow. |
| No | - |
| ICMP code to match on the flow. |
| No | - |
| Peer IP to match on the flow. |
| No | - |
11.3.1.3. Packet capture options
You can filter on port and protocol for packet capture data.
oc netobserv packets syntax
$ oc netobserv packets [<option>]
Option | Description | Mandatory | Default |
|
Capture packets from a specific protocol and port pair. Use a comma as delimiter. For example, | Yes | - |
Chapter 12. FlowCollector API reference
FlowCollector is the Schema for the network flows collection API, which pilots and configures the underlying deployments.
12.1. FlowCollector API specifications
- Description
- FlowCollector is the schema for the network flows collection API, which pilots and configures the underlying deployments.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and might reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources |
|
| Kind is a string value representing the REST resource this object represents. Servers might infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds |
|
| Standard object’s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata |
|
|
Defines the desired state of the FlowCollector resource. *: the mention of "unsupported" or "deprecated" for a feature throughout this document means that this feature is not officially supported by Red Hat. It might have been, for example, contributed by the community and accepted without a formal agreement for maintenance. The product maintainers might provide some support for these features as a best effort only. |
12.1.1. .metadata
- Description
- Standard object’s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
- Type
-
object
12.1.2. .spec
- Description
Defines the desired state of the FlowCollector resource.
*: the mention of "unsupported" or "deprecated" for a feature throughout this document means that this feature is not officially supported by Red Hat. It might have been, for example, contributed by the community and accepted without a formal agreement for maintenance. The product maintainers might provide some support for these features as a best effort only.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Agent configuration for flows extraction. |
|
|
|
|
|
-
- Kafka can provide better scalability, resiliency, and high availability (for more details, see https://www.redhat.com/en/topics/integration/what-is-apache-kafka). |
|
|
|
|
|
Kafka configuration, allowing to use Kafka as a broker as part of the flow collection pipeline. Available when the |
|
|
|
|
| Namespace where Network Observability pods are deployed. |
|
|
|
|
|
|
12.1.3. .spec.agent
- Description
- Agent configuration for flows extraction.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
12.1.4. .spec.agent.ebpf
- Description
- ebpf describes the settings related to the eBPF-based flow reporter when spec.agent.type is set to eBPF.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
List of additional features to enable. They are all disabled by default. Enabling additional features might have performance impacts. Possible values are:
-
-
- |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Privileged mode for the eBPF Agent container. When ignored or set to |
|
|
|
|
| Sampling rate of the flow reporter. 100 means one flow in 100 is sent. 0 or 1 means all flows are sampled. |
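Because a sampling rate of N forwards roughly one flow in N, counts observed in sampled data are often scaled back up when you reason about total traffic. The following Python sketch shows that arithmetic; the function names are hypothetical, and the result is only a rough estimate that assumes flows are sampled uniformly.

```python
def effective_sampling(sampling):
    """Values of 0 or 1 mean that all flows are sampled, so the factor is 1."""
    return sampling if sampling > 1 else 1


def estimate_total(observed, sampling):
    """Scale an observed count by the sampling rate to estimate the real total."""
    return observed * effective_sampling(sampling)
```

For example, 10 flows observed with a sampling rate of 100 suggest about 1000 flows in total, while the same count with a sampling rate of 0 or 1 is taken as is.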
12.1.5. .spec.agent.ebpf.advanced
- Description
- advanced allows setting some aspects of the internal configuration of the eBPF agent. This section is aimed mostly at debugging and fine-grained performance optimizations, such as the GOGC and GOMAXPROCS environment variables. Set these values at your own risk.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| scheduling controls how the pods are scheduled on nodes. |
12.1.6. .spec.agent.ebpf.advanced.scheduling
- Description
- scheduling controls how the pods are scheduled on nodes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling. |
|
|
|
|
| If specified, indicates the pod’s priority. For documentation, refer to https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#how-to-use-priority-and-preemption. If not specified, default priority is used, or zero if there is no default. |
|
|
|
12.1.7. .spec.agent.ebpf.advanced.scheduling.affinity
- Description
- If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
object
12.1.8. .spec.agent.ebpf.advanced.scheduling.tolerations
- Description
- tolerations is a list of tolerations that allow the pod to schedule onto nodes with matching taints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
array
12.1.9. .spec.agent.ebpf.flowFilter
- Description
- flowFilter defines the eBPF agent configuration regarding flow filtering.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Set |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
12.1.10. .spec.agent.ebpf.metrics
- Description
- metrics defines the eBPF agent configuration regarding metrics.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
Set |
|
| Metrics server endpoint configuration for the Prometheus scraper. |
12.1.11. .spec.agent.ebpf.metrics.server
- Description
- Metrics server endpoint configuration for the Prometheus scraper.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| The metrics server HTTP port. |
|
| TLS configuration. |
12.1.12. .spec.agent.ebpf.metrics.server.tls
- Description
- TLS configuration.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
TLS configuration when |
|
|
Reference to the CA file when |
|
|
Select the type of TLS configuration:
- |
12.1.13. .spec.agent.ebpf.metrics.server.tls.provided
- Description
- TLS configuration when type is set to Provided.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.14. .spec.agent.ebpf.metrics.server.tls.providedCaFile
- Description
- Reference to the CA file when type is set to Provided.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
12.1.15. .spec.agent.ebpf.resources
- Description
- resources are the compute resources required by this container. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
|
| Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
12.1.16. .spec.consolePlugin
- Description
- consolePlugin defines the settings related to the OpenShift Container Platform Console plugin, when available.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Enables the console plugin deployment. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
12.1.17. .spec.consolePlugin.advanced
- Description
- advanced allows setting some aspects of the internal configuration of the console plugin. This section is aimed mostly at debugging and fine-grained performance optimizations, such as the GOGC and GOMAXPROCS environment variables. Set these values at your own risk.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
12.1.18. .spec.consolePlugin.advanced.scheduling
- Description
- scheduling controls how the pods are scheduled on nodes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling. |
|
|
|
|
| If specified, indicates the pod’s priority. For documentation, refer to https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#how-to-use-priority-and-preemption. If not specified, default priority is used, or zero if there is no default. |
|
|
|
12.1.19. .spec.consolePlugin.advanced.scheduling.affinity
- Description
- If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
object
12.1.20. .spec.consolePlugin.advanced.scheduling.tolerations
- Description
- tolerations is a list of tolerations that allow the pod to schedule onto nodes with matching taints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
array
12.1.21. .spec.consolePlugin.autoscaler
- Description
- autoscaler is the spec of a horizontal pod autoscaler to set up for the plugin Deployment. Refer to HorizontalPodAutoscaler documentation (autoscaling/v2).
- Type
-
object
12.1.22. .spec.consolePlugin.portNaming
- Description
- portNaming defines the configuration of the port-to-service name translation.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Enable the console plugin port-to-service name translation |
|
|
|
12.1.23. .spec.consolePlugin.quickFilters
- Description
- quickFilters configures quick filter presets for the Console plugin.
- Type
-
array
12.1.24. .spec.consolePlugin.quickFilters[]
- Description
- QuickFilter defines preset configuration for Console’s quick filters.
- Type
-
object
- Required
-
filter
-
name
-
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the filter, that is displayed in the Console |
12.1.25. .spec.consolePlugin.resources
- Description
- resources describes the compute resources required by this container. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
|
| Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
12.1.26. .spec.exporters
- Description
- exporters define additional optional exporters for custom consumption or storage.
- Type
-
array
12.1.27. .spec.exporters[]
- Description
- FlowCollectorExporter defines an additional exporter to send enriched flows to.
- Type
-
object
- Required
-
type
-
Property | Type | Description |
---|---|---|
|
| IPFIX configuration, such as the IP address and port to send enriched IPFIX flows to. |
|
| Kafka configuration, such as the address and topic, to send enriched flows to. |
|
|
|
12.1.28. .spec.exporters[].ipfix
- Description
- IPFIX configuration, such as the IP address and port to send enriched IPFIX flows to.
- Type
-
object
- Required
-
targetHost
-
targetPort
-
Property | Type | Description |
---|---|---|
|
| Address of the IPFIX external receiver |
|
| Port for the IPFIX external receiver |
|
|
Transport protocol ( |
12.1.29. .spec.exporters[].kafka
- Description
- Kafka configuration, such as the address and topic, to send enriched flows to.
- Type
-
object
- Required
-
address
-
topic
-
Property | Type | Description |
---|---|---|
|
| Address of the Kafka server |
|
| SASL authentication configuration. [Unsupported (*)]. |
|
| TLS client configuration. When using TLS, verify that the address matches the Kafka port used for TLS, generally 9093. |
|
| Kafka topic to use. It must exist. Network Observability does not create it. |
12.1.30. .spec.exporters[].kafka.sasl
- Description
- SASL authentication configuration. [Unsupported (*)].
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Reference to the secret or config map containing the client ID |
|
| Reference to the secret or config map containing the client secret |
|
|
Type of SASL authentication to use, or |
12.1.31. .spec.exporters[].kafka.sasl.clientIDReference
- Description
- Reference to the secret or config map containing the client ID
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
12.1.32. .spec.exporters[].kafka.sasl.clientSecretReference
- Description
- Reference to the secret or config map containing the client secret
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
12.1.33. .spec.exporters[].kafka.tls
- Description
- TLS client configuration. When using TLS, verify that the address matches the Kafka port used for TLS, generally 9093.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.34. .spec.exporters[].kafka.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.35. .spec.exporters[].kafka.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS).
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.36. .spec.kafka
- Description
- Kafka configuration, which allows using Kafka as a broker as part of the flow collection pipeline. Available when the spec.deploymentModel is Kafka.
- Type
-
object
- Required
-
address
-
topic
-
Property | Type | Description |
---|---|---|
|
| Address of the Kafka server |
|
| SASL authentication configuration. [Unsupported (*)]. |
|
| TLS client configuration. When using TLS, verify that the address matches the Kafka port used for TLS, generally 9093. |
|
| Kafka topic to use. It must exist. Network Observability does not create it. |
12.1.37. .spec.kafka.sasl
- Description
- SASL authentication configuration. [Unsupported (*)].
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Reference to the secret or config map containing the client ID |
|
| Reference to the secret or config map containing the client secret |
|
|
Type of SASL authentication to use, or |
12.1.38. .spec.kafka.sasl.clientIDReference
- Description
- Reference to the secret or config map containing the client ID
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
12.1.39. .spec.kafka.sasl.clientSecretReference
- Description
- Reference to the secret or config map containing the client secret
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
12.1.40. .spec.kafka.tls
- Description
- TLS client configuration. When using TLS, verify that the address matches the Kafka port used for TLS, generally 9093.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.41. .spec.kafka.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.42. .spec.kafka.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS).
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
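The following fragment sketches how the Kafka TLS settings described above might fit together for an mTLS setup. It is an assumption-laden example: the property names address, topic, enable, certFile, and certKey are not visible in the tables above and are assumed here, and every resource name is hypothetical. For one-way TLS, the userCert block can be left out.

```yaml
spec:
  kafka:
    address: kafka-cluster-kafka-bootstrap.netobserv:9093   # hypothetical service; 9093 is the usual Kafka TLS port
    topic: network-flows                                    # hypothetical topic name
    tls:
      enable: true
      caCert:
        type: secret                         # "configmap" or "secret"
        name: kafka-cluster-cluster-ca-cert  # hypothetical secret holding the CA certificate
        certFile: ca.crt                     # assumed property name and key
      userCert:                              # only needed for mTLS
        type: secret
        name: flp-kafka                      # hypothetical secret holding the client certificate
        certFile: user.crt                   # assumed property name and key
        certKey: user.key                    # assumed property name and key
```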
12.1.43. .spec.loki
- Description
- loki, the flow store, client settings.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
Set |
|
|
Loki configuration for |
|
|
Loki configuration for |
|
|
Loki configuration for |
|
|
- Use
- Use
- Use
- Use |
|
|
Loki configuration for |
|
|
|
|
|
|
|
|
|
|
|
|
12.1.44. .spec.loki.advanced
- Description
- advanced allows setting some aspects of the internal configuration of the Loki clients. This section is aimed mostly for debugging and fine-grained performance optimizations.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
12.1.45. .spec.loki.lokiStack
- Description
- Loki configuration for LokiStack mode. This is useful for an easy Loki Operator configuration. It is ignored for other modes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Name of an existing LokiStack resource to use. |
|
|
Namespace where this |
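For example, a FlowCollector that relies on an existing LokiStack could look like the fragment below. This is a sketch under assumptions: the property names enable, mode, name, and namespace are assumed, and the LokiStack name loki and namespace netobserv are placeholders to replace with your own.

```yaml
spec:
  loki:
    enable: true
    mode: LokiStack          # the lokiStack block is ignored for other modes
    lokiStack:
      name: loki             # name of an existing LokiStack resource (placeholder)
      namespace: netobserv   # namespace of that LokiStack resource (placeholder)
```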
12.1.46. .spec.loki.manual
- Description
- Loki configuration for Manual mode. This is the most flexible configuration. It is ignored for other modes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
-
-
-
When using the Loki Operator, this must be set to |
|
|
|
|
|
|
|
| TLS client configuration for Loki status URL. |
|
|
|
|
|
|
|
| TLS client configuration for Loki URL. |
12.1.47. .spec.loki.manual.statusTls
- Description
- TLS client configuration for Loki status URL.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.48. .spec.loki.manual.statusTls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.49. .spec.loki.manual.statusTls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS)
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.50. .spec.loki.manual.tls
- Description
- TLS client configuration for Loki URL.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.51. .spec.loki.manual.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.52. .spec.loki.manual.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS)
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.53. .spec.loki.microservices
- Description
- Loki configuration for Microservices mode. Use this option when Loki is installed using the microservices deployment mode (https://grafana.com/docs/loki/latest/fundamentals/architecture/deployment-modes/#microservices-mode). It is ignored for other modes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
| TLS client configuration for Loki URL. |
12.1.54. .spec.loki.microservices.tls
- Description
- TLS client configuration for Loki URL.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.55. .spec.loki.microservices.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.56. .spec.loki.microservices.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS)
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.57. .spec.loki.monolithic
- Description
- Loki configuration for Monolithic mode. Use this option when Loki is installed using the monolithic deployment mode (https://grafana.com/docs/loki/latest/fundamentals/architecture/deployment-modes/#monolithic-mode). It is ignored for other modes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| TLS client configuration for Loki URL. |
|
|
|
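A Monolithic mode configuration might be sketched as follows. The property names url, tenantID, and enable are assumptions (the table above does not show them), and the Loki service URL and tenant are hypothetical values.

```yaml
spec:
  loki:
    mode: Monolithic                          # the monolithic block is ignored for other modes
    monolithic:
      url: http://loki.netobserv.svc:3100/    # hypothetical address of a monolithic Loki service
      tenantID: netobserv                     # hypothetical tenant identifier
      tls:
        enable: false                         # set to true and add caCert when Loki is served over TLS
```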
12.1.58. .spec.loki.monolithic.tls
- Description
- TLS client configuration for Loki URL.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.59. .spec.loki.monolithic.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.60. .spec.loki.monolithic.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS)
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.61. .spec.processor
- Description
- processor defines the settings of the component that receives the flows from the agent, enriches them, generates metrics, and forwards them to the Loki persistence layer and/or any available exporter.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
-
-
-
- |
|
|
|
|
|
Set |
|
|
|
|
|
|
12.1.62. .spec.processor.advanced
- Description
- advanced allows setting some aspects of the internal configuration of the flow processor. This section is aimed mostly for debugging and fine-grained performance optimizations, such as GOGC and GOMAXPROCS env vars. Set these values at your own risk.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Port of the flow collector (host port). By convention, some values are forbidden. It must be greater than 1024 and different from 4500, 4789 and 6081. |
|
|
|
|
| scheduling controls how the pods are scheduled on nodes. |
12.1.63. .spec.processor.advanced.scheduling
- Description
- scheduling controls how the pods are scheduled on nodes.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling. |
|
|
|
|
| If specified, indicates the pod’s priority. For documentation, refer to https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#how-to-use-priority-and-preemption. If not specified, default priority is used, or zero if there is no default. |
|
|
|
12.1.64. .spec.processor.advanced.scheduling.affinity
- Description
- If specified, the pod’s scheduling constraints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
object
12.1.65. .spec.processor.advanced.scheduling.tolerations
- Description
- tolerations is a list of tolerations that allow the pod to schedule onto nodes with matching taints. For documentation, refer to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling.
- Type
-
array
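As a sketch of the scheduling settings above, the following fragment would steer the processor pods toward tainted infrastructure nodes. The nodeSelector property is an assumption (it is not shown in the table above), and the node label and taint key are hypothetical; the toleration fields follow the standard Kubernetes pod specification.

```yaml
spec:
  processor:
    advanced:
      scheduling:
        nodeSelector:                            # assumed property name
          node-role.kubernetes.io/infra: ""      # hypothetical node label
        tolerations:
        - key: node-role.kubernetes.io/infra     # hypothetical taint key
          operator: Exists
          effect: NoSchedule
```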
12.1.66. .spec.processor.kafkaConsumerAutoscaler
- Description
- kafkaConsumerAutoscaler is the spec of a horizontal pod autoscaler to set up for flowlogs-pipeline-transformer, which consumes Kafka messages. This setting is ignored when Kafka is disabled. Refer to HorizontalPodAutoscaler documentation (autoscaling/v2).
- Type
-
object
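An autoscaler for the Kafka consumer might be declared roughly as shown below. Treat this as an assumption-based sketch: the status, minReplicas, and maxReplicas property names are not listed in this section and are assumed here, and the replica counts and CPU target are arbitrary. The metrics entry uses the standard autoscaling/v2 metric format.

```yaml
spec:
  processor:
    kafkaConsumerAutoscaler:
      status: Enabled              # assumed property name and value
      minReplicas: 1               # assumed property name; arbitrary value
      maxReplicas: 3               # assumed property name; arbitrary value
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50 # scale out when average CPU usage exceeds 50%
```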
12.1.67. .spec.processor.metrics
- Description
- Metrics define the processor configuration regarding metrics
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Metrics server endpoint configuration for Prometheus scraper |
12.1.68. .spec.processor.metrics.server
- Description
- Metrics server endpoint configuration for Prometheus scraper
- Type
-
object
Property | Type | Description |
---|---|---|
|
| The metrics server HTTP port. |
|
| TLS configuration. |
12.1.69. .spec.processor.metrics.server.tls
- Description
- TLS configuration.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
TLS configuration when |
|
|
Reference to the CA file when |
|
|
Select the type of TLS configuration:
- |
12.1.70. .spec.processor.metrics.server.tls.provided
- Description
- TLS configuration when type is set to Provided.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.71. .spec.processor.metrics.server.tls.providedCaFile
- Description
- Reference to the CA file when type is set to Provided.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| File name within the config map or secret. |
|
| Name of the config map or secret containing the file. |
|
| Namespace of the config map or secret containing the file. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
| Type for the file reference: "configmap" or "secret". |
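To illustrate how the Provided TLS type combines the two references above, consider this sketch. The certFile and certKey property names are assumptions, the file, name, and type names are inferred from the table descriptions, and all resource names are hypothetical.

```yaml
spec:
  processor:
    metrics:
      server:
        tls:
          type: Provided
          provided:
            type: secret               # "configmap" or "secret"
            name: flp-metrics-cert     # hypothetical secret with the server certificate
            certFile: tls.crt          # assumed property name and key
            certKey: tls.key           # assumed property name and key
          providedCaFile:
            type: configmap
            name: flp-metrics-ca       # hypothetical config map with the CA bundle
            file: ca.crt               # hypothetical file name within the config map
```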
12.1.72. .spec.processor.resources
- Description
- resources are the compute resources required by this container. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
|
| Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
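For instance, requests and limits for the processor could be set as in the fragment below. The values are purely illustrative, not recommendations; size them according to the flow volume of your cluster.

```yaml
spec:
  processor:
    resources:
      requests:
        cpu: 100m        # illustrative value
        memory: 100Mi    # illustrative value
      limits:
        memory: 800Mi    # illustrative value; requests cannot exceed limits
```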
12.1.73. .spec.processor.subnetLabels
- Description
- subnetLabels allows you to define custom labels on subnets and IPs, or to enable automatic labelling of recognized subnets in OpenShift Container Platform, which is used to identify cluster external traffic. When a subnet matches the source or destination IP of a flow, a corresponding field is added: SrcSubnetLabel or DstSubnetLabel.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
12.1.74. .spec.processor.subnetLabels.customLabels
- Description
- customLabels allows you to customize the labelling of subnets and IPs, such as to identify cluster-external workloads or web services. If you enable openShiftAutoDetect, customLabels can override the detected subnets in case they overlap.
- Type
-
array
12.1.75. .spec.processor.subnetLabels.customLabels[]
- Description
- SubnetLabel allows you to label subnets and IPs, such as to identify cluster-external workloads or web services.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
List of CIDRs, such as |
|
| Label name, used to flag matching flows. |
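A subnet labelling configuration could be sketched as follows. The cidrs and name property names are assumed from the table descriptions above, and the label and address ranges are hypothetical. Flows whose source or destination IP falls in one of these ranges would then carry the label in SrcSubnetLabel or DstSubnetLabel.

```yaml
spec:
  processor:
    subnetLabels:
      openShiftAutoDetect: true      # automatic labelling of recognized subnets
      customLabels:
      - name: corp-database          # hypothetical label used to flag matching flows
        cidrs:                       # assumed property name
        - 10.20.0.0/16               # hypothetical subnet
        - 192.0.2.10/32              # hypothetical single IP
```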
12.1.76. .spec.prometheus
- Description
- prometheus defines Prometheus settings, such as querier configuration used to fetch metrics from the Console plugin.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| Prometheus querying configuration, such as client settings, used in the Console plugin. |
12.1.77. .spec.prometheus.querier
- Description
- Prometheus querying configuration, such as client settings, used in the Console plugin.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
When |
|
|
Prometheus configuration for |
|
|
- Use
- Use |
|
|
|
12.1.78. .spec.prometheus.querier.manual
- Description
- Prometheus configuration for Manual mode.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
Set |
|
| TLS client configuration for Prometheus URL. |
|
|
|
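As a rough sketch, a Manual mode querier could be configured as below. The property names mode, url, forwardUserToken, enable, and certFile are assumptions not shown in the table above, and the Prometheus URL and CA config map are hypothetical.

```yaml
spec:
  prometheus:
    querier:
      mode: Manual                       # assumed property name and value
      manual:
        url: https://prometheus.example.svc:9091/   # hypothetical Prometheus-compatible endpoint
        forwardUserToken: true           # assumed property name
        tls:
          enable: true
          caCert:
            type: configmap
            name: prometheus-ca          # hypothetical config map holding the CA certificate
            certFile: ca.crt             # assumed property name and key
```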
12.1.79. .spec.prometheus.querier.manual.tls
- Description
- TLS client configuration for Prometheus URL.
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
| Enable TLS |
|
|
|
|
|
|
12.1.80. .spec.prometheus.querier.manual.tls.caCert
- Description
- caCert defines the reference of the certificate for the Certificate Authority
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
12.1.81. .spec.prometheus.querier.manual.tls.userCert
- Description
- userCert defines the user certificate reference and is used for mTLS (you can ignore it when using one-way TLS)
- Type
-
object
Property | Type | Description |
---|---|---|
|
|
|
|
|
|
|
| Name of the config map or secret containing certificates. |
|
| Namespace of the config map or secret containing certificates. If omitted, the default is to use the same namespace as where Network Observability is deployed. If the namespace is different, the config map or the secret is copied so that it can be mounted as required. |
|
|
Type for the certificate reference: |
Chapter 13. FlowMetric configuration parameters
FlowMetric is the API that allows you to create custom metrics from the collected flow logs.
13.1. FlowMetric [flows.netobserv.io/v1alpha1]
- Description
- FlowMetric is the API that allows you to create custom metrics from the collected flow logs.
- Type
-
object
Property | Type | Description |
---|---|---|
|
| APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and might reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources |
|
| Kind is a string value representing the REST resource this object represents. Servers might infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds |
|
| Standard object’s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata |
|
|
FlowMetricSpec defines the desired state of FlowMetric. The provided API allows you to customize these metrics according to your needs.
When adding new metrics or modifying existing labels, you must carefully monitor the memory usage of Prometheus workloads, because this could potentially have a high impact. See https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#what-is-the-cardinality-of-a-metric
To check the cardinality of all Network Observability metrics, run the following PromQL query: count({__name__=~"netobserv.*"}) by (__name__). |
13.1.1. .metadata
- Description
- Standard object’s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
- Type
-
object
13.1.2. .spec
- Description
- FlowMetricSpec defines the desired state of FlowMetric. The provided API allows you to customize these metrics according to your needs.
When adding new metrics or modifying existing labels, you must carefully monitor the memory usage of Prometheus workloads, because this could potentially have a high impact. See https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#what-is-the-cardinality-of-a-metric
To check the cardinality of all Network Observability metrics, run the following PromQL query: count({__name__=~"netobserv.*"}) by (__name__).
- Type
-
object
- Required
-
metricName
-
type
-
Property | Type | Description |
---|---|---|
|
|
A list of buckets to use when |
|
| Charts configuration, for the OpenShift Container Platform Console in the administrator view, Dashboards menu. |
|
|
Filter for ingress, egress or any direction flows. When set to |
|
| When nonzero, scale factor (divider) of the value. Metric value = Flow value / Divider. |
|
|
|
|
|
|
|
| Name of the metric. In Prometheus, it is automatically prefixed with "netobserv_". |
|
| Metric type: "Counter" or "Histogram". Use "Counter" for any value that increases over time and on which you can compute a rate, such as Bytes or Packets. Use "Histogram" for any value that must be sampled independently, such as latencies. |
|
|
|
13.1.3. .spec.charts
- Description
- Charts configuration, for the OpenShift Container Platform Console in the administrator view, Dashboards menu.
- Type
-
array
13.1.4. .spec.charts[]
- Description
- Configures charts / dashboard generation associated to a metric
- Type
-
object
- Required
-
dashboardName
-
queries
-
title
-
type
-
Property | Type | Description |
---|---|---|
|
| Name of the containing dashboard. If this name does not refer to an existing dashboard, a new dashboard is created. |
|
|
List of queries to be displayed on this chart. If |
|
|
Name of the containing dashboard section. If this name does not refer to an existing section, a new section is created. If |
|
| Title of the chart. |
|
| Type of the chart. |
|
| Unit of this chart. Only a few units are currently supported. Leave empty to use generic number. |
13.1.5. .spec.charts[].queries
- Description
- List of queries to be displayed on this chart. If type is SingleStat and multiple queries are provided, this chart is automatically expanded in several panels (one per query).
- Type
-
array
13.1.6. .spec.charts[].queries[]
- Description
- Configures PromQL queries
- Type
-
object
- Required
-
legend
-
promQL
-
top
-
Property | Type | Description |
---|---|---|
|
|
The query legend that applies to each timeseries represented in this chart. When multiple timeseries are displayed, you should set a legend that distinguishes each of them. It can be done with the following format: |
|
|
The |
|
|
Top N series to display per timestamp. Does not apply to |
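The chart and query properties above might be combined as in the following fragment of a FlowMetric spec. This is only a sketch: the sectionName and unit property names, the Bps unit, and the $METRIC placeholder (assumed to expand to the metric defined by the resource) are assumptions, and the dashboard, section, and title values are hypothetical.

```yaml
spec:
  charts:
  - dashboardName: Main                    # hypothetical; a new dashboard is created if it does not exist
    sectionName: External                  # assumed property name; hypothetical section
    title: External ingress traffic
    unit: Bps                              # assumed property name and unit
    type: SingleStat
    queries:
    - promQL: "sum(rate($METRIC[2m]))"     # $METRIC is an assumed placeholder for the metric name
      legend: ""                           # a single series needs no distinguishing legend
      top: 7                               # arbitrary value for this required field
```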
13.1.7. .spec.filters
- Description
- filters is a list of fields and values used to restrict which flows are taken into account. Oftentimes, these filters must be used to eliminate duplicates: Duplicate != "true" and FlowDirection = "0". Refer to the documentation for the list of available fields: https://docs.openshift.com/container-platform/latest/observability/network_observability/json-flows-format-reference.html.
- Type
-
array
13.1.8. .spec.filters[]
- Description
- Type
-
object
- Required
-
field
-
matchType
-
Property | Type | Description |
---|---|---|
|
| Name of the field to filter on |
|
| Type of matching to apply |
|
|
Value to filter on. When |
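Putting the pieces of this chapter together, a FlowMetric that counts bytes entering the cluster from external sources could be sketched as follows. The valueField, labels, and direction property names, the Ingress and Absence values, and the flow field names used as labels are assumptions that are not shown in the tables above, and the resource name and namespace are hypothetical.

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic   # hypothetical name
  namespace: netobserv                                # hypothetical namespace
spec:
  metricName: cluster_external_ingress_bytes_total    # exposed with the "netobserv_" prefix in Prometheus
  type: Counter                                       # bytes increase over time, so a rate can be computed
  valueField: Bytes                                   # assumed property name
  direction: Ingress                                  # assumed property name and value
  labels: [DstK8S_Namespace, DstK8S_OwnerName, DstK8S_OwnerType]   # assumed flow field names
  filters:
  - field: SrcSubnetLabel
    matchType: Absence                                # assumed value: keep flows without a source subnet label
```

Each label multiplies the number of time series, so keeping to low-cardinality fields is the safer design choice here.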
Chapter 14. Network flows format reference
These are the specifications for network flows format, used both internally and when exporting flows to Kafka.
14.1. Network flows format reference
This is the specification of the network flows format. That format is used when a Kafka exporter is configured, for Prometheus metrics labels as well as internally for the Loki store.
The "Filter ID" column shows which related name to use when defining Quick Filters (see spec.consolePlugin.quickFilters in the FlowCollector specification).
The "Loki label" column is useful when querying Loki directly: label fields need to be selected using stream selectors.
The "Cardinality" column gives information about the implied metric cardinality if this field were to be used as a Prometheus label with the FlowMetric API. For more information, see the "FlowMetric API reference".
Name | Type | Description | Filter ID | Loki label | Cardinality |
---|---|---|---|---|---|
| number | Number of bytes | n/a | no | avoid |
| number | Error number returned from DNS tracker ebpf hook function |
| no | fine |
| number | DNS flags for DNS record | n/a | no | fine |
| string | Parsed DNS header RCODEs name |
| no | fine |
| number | DNS record id |
| no | avoid |
| number | Time between a DNS request and response, in milliseconds |
| no | avoid |
| number | Differentiated Services Code Point (DSCP) value |
| no | fine |
| string | Destination IP address (ipv4 or ipv6) |
| no | avoid |
| string | Destination node IP |
| no | fine |
| string | Destination node name |
| no | fine |
| string | Name of the destination Kubernetes object, such as Pod name, Service name or Node name. |
| no | careful |
| string | Destination namespace |
| yes | fine |
| string | Name of the destination owner, such as Deployment name, StatefulSet name, etc. |
| yes | fine |
| string | Kind of the destination owner, such as Deployment, StatefulSet, etc. |
| no | fine |
| string | Kind of the destination Kubernetes object, such as Pod, Service or Node. |
| yes | fine |
| string | Destination availability zone |
| yes | fine |
| string | Destination MAC address |
| no | avoid |
| number | Destination port |
| no | careful |
| string | Destination subnet label |
| no | fine |
| boolean | Indicates if this flow was also captured from another interface on the same host | n/a | yes | fine |
| number |
Logical OR combination of unique TCP flags comprised in the flow, as per RFC-9293, with additional custom flags to represent the following per-packet combinations: | n/a | no | fine |
| number |
Flow interpreted direction from the node observation point. Can be one of: |
| yes | fine |
| number | ICMP code |
| no | fine |
| number | ICMP type |
| no | fine |
| number |
Flow directions from the network interface observation point. Can be one of: |
| no | fine |
| string | Network interfaces |
| no | careful |
| string | Cluster name or identifier |
| yes | fine |
| string | Flow layer: 'app' or 'infra' |
| no | fine |
| number | Number of packets | n/a | no | avoid |
| number | Number of bytes dropped by the kernel | n/a | no | avoid |
| string | Latest drop cause |
| no | fine |
| number | TCP flags on last dropped packet | n/a | no | fine |
| string | TCP state on last dropped packet |
| no | fine |
| number | Number of packets dropped by the kernel | n/a | no | avoid |
| number | L4 protocol |
| no | fine |
| string | Source IP address (ipv4 or ipv6) |
| no | avoid |
| string | Source node IP |
| no | fine |
| string | Source node name |
| no | fine |
| string | Name of the source Kubernetes object, such as Pod name, Service name or Node name. |
| no | careful |
| string | Source namespace |
| yes | fine |
| string | Name of the source owner, such as Deployment name, StatefulSet name, etc. |
| yes | fine |
| string | Kind of the source owner, such as Deployment, StatefulSet, etc. |
| no | fine |
| string | Kind of the source Kubernetes object, such as Pod, Service or Node. |
| yes | fine |
| string | Source availability zone |
| yes | fine |
| string | Source MAC address |
| no | avoid |
| number | Source port |
| no | careful |
| string | Source subnet label |
| no | fine |
| number | End timestamp of this flow, in milliseconds | n/a | no | avoid |
| number | TCP Smoothed Round Trip Time (SRTT), in nanoseconds |
| no | avoid |
| number | Start timestamp of this flow, in milliseconds | n/a | no | avoid |
| number | Timestamp when this flow was received and processed by the flow collector, in seconds | n/a | no | avoid |
| string | In conversation tracking, the conversation identifier |
| no | avoid |
| string | Type of record: 'flowLog' for regular flow logs, or 'newConnection', 'heartbeat', 'endConnection' for conversation tracking |
| yes | fine |
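To show how a Filter ID is meant to be used, here is a sketch of a quick filter definition. The filter IDs src_namespace and dst_namespace, the trailing ! negation syntax, and the name, default, and filter property names are assumptions, because the Filter ID values are not visible in the table above.

```yaml
spec:
  consolePlugin:
    quickFilters:
    - name: Applications                         # hypothetical display name of the quick filter
      default: true                              # assumed property: filter is active by default
      filter:
        src_namespace!: 'openshift-,netobserv'   # assumed filter ID; ! is assumed to negate the match
        dst_namespace!: 'openshift-,netobserv'   # assumed filter ID
```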
Chapter 15. Troubleshooting Network Observability
To diagnose Network Observability issues, you can perform the troubleshooting actions described in this chapter.
15.1. Using the must-gather tool
You can use the must-gather tool to collect information about the Network Observability Operator resources and cluster-wide resources, such as pod logs, FlowCollector, and webhook configurations.
Procedure
- Navigate to the directory where you want to store the must-gather data.
- Run the following command to collect cluster-wide must-gather resources:
$ oc adm must-gather --image-stream=openshift/must-gather \
    --image=quay.io/netobserv/must-gather
15.2. Configuring network traffic menu entry in the OpenShift Container Platform console
If the network traffic menu entry is not listed in the Observe menu of the OpenShift Container Platform console, configure it manually.
Prerequisites
- You have installed OpenShift Container Platform version 4.10 or newer.
Procedure
- Check if the spec.consolePlugin.register field is set to true by running the following command:
$ oc -n netobserv get flowcollector cluster -o yaml
Example output
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  consolePlugin:
    register: false
- Optional: Add the netobserv-plugin plugin by manually editing the Console Operator config:
$ oc edit console.operator.openshift.io cluster
Example output
...
spec:
  plugins:
  - netobserv-plugin
...
- Optional: Set the spec.consolePlugin.register field to true by running the following command:
$ oc -n netobserv edit flowcollector cluster -o yaml
Example output
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  consolePlugin:
    register: true
- Ensure that the status of the console pods is running by running the following command:
$ oc get pods -n openshift-console -l app=console
- Restart the console pods by running the following command:
$ oc delete pods -n openshift-console -l app=console
- Clear your browser cache and history.
- Check the status of the Network Observability plugin pods by running the following command:
$ oc get pods -n netobserv -l app=netobserv-plugin
Example output
NAME                                READY   STATUS    RESTARTS   AGE
netobserv-plugin-68c7bbb9bb-b69q6   1/1     Running   0          21s
- Check the logs of the Network Observability plugin pods by running the following command:
$ oc logs -n netobserv -l app=netobserv-plugin
Example output
time="2022-12-13T12:06:49Z" level=info msg="Starting netobserv-console-plugin [build version: , build date: 2022-10-21 15:15] at log level info" module=main
time="2022-12-13T12:06:49Z" level=info msg="listening on https://:9001" module=server
15.3. Flowlogs-Pipeline does not consume network flows after installing Kafka
If you deployed the flow collector first with deploymentModel: KAFKA and then deployed Kafka, the flow collector might not connect correctly to Kafka. Manually restart the flow-pipeline pods where Flowlogs-pipeline does not consume network flows from Kafka.
Procedure
- Delete the flow-pipeline pods to restart them by running the following command:
$ oc delete pods -n netobserv -l app=flowlogs-pipeline-transformer
15.4. Failing to see network flows from both br-int and br-ex interfaces
br-ex and br-int are virtual bridge devices operated at OSI layer 2. The eBPF agent works at the IP and TCP levels, layers 3 and 4 respectively. You can expect that the eBPF agent captures the network traffic passing through br-ex and br-int, when the network traffic is processed by other interfaces such as physical host or virtual pod interfaces. If you restrict the eBPF agent network interfaces to attach only to br-ex and br-int, you do not see any network flow.
Manually remove the part in the interfaces or excludeInterfaces that restricts the network interfaces to br-int and br-ex.
Procedure
- Remove the interfaces: [ 'br-int', 'br-ex' ] field. This allows the agent to fetch information from all the interfaces. Alternatively, you can specify the Layer-3 interface, for example, eth0. Run the following command:
$ oc edit -n netobserv flowcollector.yaml -o yaml
Example output
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: EBPF
    ebpf:
      interfaces: [ 'br-int', 'br-ex' ] # 1
- 1
- Specifies the network interfaces.
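After the edit, the relevant part of the FlowCollector could look like the following sketch. The interface names eth0 and lo are examples only, and the commented-out alternative assumes that excludeInterfaces takes the same list form as interfaces.

```yaml
spec:
  agent:
    ebpf:
      interfaces: [ 'eth0' ]          # attach to a layer 3 interface instead of the bridges; eth0 is an example
      # Alternatively, remove the interfaces field to capture on all interfaces
      # and exclude only the ones you do not need:
      # excludeInterfaces: [ 'lo' ]   # example value
```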
15.5. Network Observability controller manager pod runs out of memory
You can increase memory limits for the Network Observability Operator by editing the spec.config.resources.limits.memory specification in the Subscription object.
Procedure
- In the web console, navigate to Operators → Installed Operators
- Click Network Observability and then select Subscription.
- From the Actions menu, click Edit Subscription.
Alternatively, you can use the CLI to open the YAML configuration for the Subscription object by running the following command:
$ oc edit subscription netobserv-operator -n openshift-netobserv-operator
- Edit the Subscription object to add the config.resources.limits.memory specification and set the value to account for your memory requirements. See the Additional resources for more information about resource considerations:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: netobserv-operator
  namespace: openshift-netobserv-operator
spec:
  channel: stable
  config:
    resources:
      limits:
        memory: 800Mi # 1
      requests:
        cpu: 100m
        memory: 100Mi
  installPlanApproval: Automatic
  name: netobserv-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: <network_observability_operator_latest_version> # 2
15.6. Running custom queries to Loki
For troubleshooting, you can run custom queries to Loki. The following procedure shows two example queries, which you can adapt to your needs by replacing <api_token> with your own token.
These examples use the netobserv namespace for the Network Observability Operator and Loki deployments. Additionally, the examples assume that the LokiStack is named loki. You can optionally use a different namespace and naming by adapting the examples, specifically the -n netobserv option or the loki-gateway URL.
Prerequisites
- Installed Loki Operator for use with Network Observability Operator
Procedure
- To get all available labels, run the following command:
  $ oc exec deployment/netobserv-plugin -n netobserv -- curl -G -s -H 'X-Scope-OrgID:network' -H 'Authorization: Bearer <api_token>' -k https://loki-gateway-http.netobserv.svc:8080/api/logs/v1/network/loki/api/v1/labels | jq
- To get all flows from the source namespace, my-namespace, run the following command:
  $ oc exec deployment/netobserv-plugin -n netobserv -- curl -G -s -H 'X-Scope-OrgID:network' -H 'Authorization: Bearer <api_token>' -k https://loki-gateway-http.netobserv.svc:8080/api/logs/v1/network/loki/api/v1/query --data-urlencode 'query={SrcK8S_Namespace="my-namespace"}' | jq
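The second query can be narrowed further, for example to a single pod within the namespace. The following is a minimal sketch, not a definitive recipe. It assumes that flow records are stored as JSON lines, so that the LogQL json parser can expose SrcK8S_Name as a filterable field, and that the API token is held in a hypothetical API_TOKEN environment variable. The function names and the limit value are also illustrative.

```shell
# Hypothetical helper: builds a LogQL query that selects flows by the
# SrcK8S_Namespace label, then filters on the SrcK8S_Name field parsed
# from each JSON log line.
build_flow_query() {
  printf '{SrcK8S_Namespace="%s"} | json | SrcK8S_Name="%s"' "$1" "$2"
}

# Hypothetical helper: runs the query through the Loki gateway, using the
# same namespace, deployment, and URL as the commands in the procedure.
# Assumption: API_TOKEN is set in the calling shell.
run_flow_query() {
  oc exec deployment/netobserv-plugin -n netobserv -- curl -G -s \
    -H 'X-Scope-OrgID:network' \
    -H "Authorization: Bearer $API_TOKEN" \
    -k https://loki-gateway-http.netobserv.svc:8080/api/logs/v1/network/loki/api/v1/query \
    --data-urlencode "query=$(build_flow_query "$1" "$2")" \
    --data-urlencode 'limit=10' | jq
}
```

For example, run_flow_query my-namespace my-pod would return up to 10 matching flow records. Keeping the namespace label in the selector means that Loki only has to scan the streams for that namespace before it applies the pod filter.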
Additional resources
15.7. Troubleshooting Loki ResourceExhausted error
Loki might return a ResourceExhausted error when network flow data sent by Network Observability exceeds the configured maximum message size. If you are using the Red Hat Loki Operator, this maximum message size is configured to 100 MiB.
Procedure
- Navigate to Operators → Installed Operators, viewing All projects from the Project drop-down menu.
- In the Provided APIs list, select the Network Observability Operator.
- Click the Flow Collector then the YAML view tab.
  - If you are using the Loki Operator, check that the spec.loki.batchSize value does not exceed 98 MiB.
  - If you are using a Loki installation method that is different from the Red Hat Loki Operator, such as Grafana Loki, verify that the grpc_server_max_recv_msg_size Grafana Loki server setting is higher than the FlowCollector resource spec.loki.batchSize value. If it is not, you must either increase the grpc_server_max_recv_msg_size value, or decrease the spec.loki.batchSize value so that it is lower than the limit.
- Click Save if you edited the FlowCollector.
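The 98 MiB check can also be sketched from the command line. This is a minimal sketch, assuming that spec.loki.batchSize is expressed in bytes and that the FlowCollector resource is named cluster; the variable and function names are hypothetical.

```shell
# 98 MiB expressed in bytes (98 * 1024 * 1024 = 102760448).
max_batch_bytes=$((98 * 1024 * 1024))

# Succeeds when the given batch size, in bytes, does not exceed 98 MiB.
batch_size_within_limit() {
  [ "$1" -le "$max_batch_bytes" ]
}

# Hypothetical helper: reads spec.loki.batchSize from the FlowCollector
# resource named "cluster" and reports whether it is within the limit.
# Assumption: the value is an integer number of bytes.
check_flowcollector_batch_size() {
  batch_size="$(oc get flowcollector cluster -o jsonpath='{.spec.loki.batchSize}')"
  if [ -z "$batch_size" ]; then
    echo "spec.loki.batchSize is not set"
  elif batch_size_within_limit "$batch_size"; then
    echo "batchSize $batch_size is within the 98 MiB limit"
  else
    echo "batchSize $batch_size exceeds the 98 MiB limit"
  fi
}
```

If you use a Loki installation other than the Red Hat Loki Operator, you could set max_batch_bytes to your grpc_server_max_recv_msg_size value instead of 98 MiB.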
15.8. Loki empty ring error
The Loki "empty ring" error results in flows not being stored in Loki and not showing up in the web console. This error can happen in various situations, and no single workaround addresses them all. You can investigate the logs in your Loki pods and verify that the LokiStack is healthy and ready.
Some of the situations where this error is observed are as follows:
- After a LokiStack is uninstalled and reinstalled in the same namespace, old PVCs are not removed, which can cause this error.
  - Action: You can try removing the LokiStack again, removing the PVC, then reinstalling the LokiStack.
- After a certificate rotation, this error can prevent communication with the flowlogs-pipeline and console-plugin pods.
  - Action: You can restart the pods to restore the connectivity.
15.9. Resource troubleshooting
15.10. LokiStack rate limit errors
A rate limit placed on the Loki tenant can result in potential temporary loss of data and a 429 error: Per stream rate limit exceeded (limit:xMB/sec) while attempting to ingest for stream. You might consider having an alert set to notify you of this error. For more information, see "Creating Loki rate limit alerts for the NetObserv dashboard" in the Additional resources of this section.
You can update the LokiStack CRD with the perStreamRateLimit and perStreamRateLimitBurst specifications, as shown in the following procedure.
Procedure
- Navigate to Operators → Installed Operators, viewing All projects from the Project drop-down menu.
- Look for Loki Operator, and select the LokiStack tab.
- Create or edit an existing LokiStack instance using the YAML view to add the perStreamRateLimit and perStreamRateLimitBurst specifications:
  apiVersion: loki.grafana.com/v1
  kind: LokiStack
  metadata:
    name: loki
    namespace: netobserv
  spec:
    limits:
      global:
        ingestion:
          perStreamRateLimit: 6 # 1
          perStreamRateLimitBurst: 30 # 2
    tenants:
      mode: openshift-network
    managementState: Managed
- Click Save.
Verification
After you update the perStreamRateLimit and perStreamRateLimitBurst specifications, the pods in your cluster restart and the 429 rate-limit error no longer occurs.
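If you prefer the CLI to the web console, the same update can be sketched as a merge patch. This is a minimal sketch, assuming the LokiStack is named loki in the netobserv namespace, as in the YAML example; the helper names are hypothetical, and the values 6 and 30 are the ones from that example.

```shell
# Hypothetical helper: prints a merge patch that sets the per-stream
# rate limit and burst under spec.limits.global.ingestion.
stream_rate_limit_patch() {
  printf '{"spec":{"limits":{"global":{"ingestion":{"perStreamRateLimit":%d,"perStreamRateLimitBurst":%d}}}}}' "$1" "$2"
}

# Hypothetical helper: applies the patch to the LokiStack instance.
# Assumption: the LokiStack is named "loki" in the "netobserv" namespace.
set_stream_rate_limits() {
  oc patch lokistack loki -n netobserv --type=merge \
    -p "$(stream_rate_limit_patch "$1" "$2")"
}
```

For example, set_stream_rate_limits 6 30 would apply the values shown in the procedure.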
15.11. Running a large query results in Loki errors
When running large queries for a long time, Loki errors can occur, such as a timeout or too many outstanding requests. There is no complete fix for this issue, but there are several ways to mitigate it:
- Adapt your query to add an indexed filter
  With Loki queries, you can query on both indexed and non-indexed fields or labels. Queries that contain filters on labels perform better. For example, if you query for a particular Pod, which is not an indexed field, you can add its Namespace to the query. The list of indexed fields can be found in the "Network flows format reference", in the Loki label column.
- Consider querying Prometheus rather than Loki
  Prometheus is a better fit than Loki to query on large time ranges. However, whether or not you can use Prometheus instead of Loki depends on the use case. For example, queries on Prometheus are much faster than on Loki, and large time ranges do not impact performance. But Prometheus metrics do not contain as much information as flow logs in Loki. The Network Observability OpenShift web console automatically favors Prometheus over Loki if the query is compatible; otherwise, it defaults to Loki. If your query does not run against Prometheus, you can change some filters or aggregations to make the switch. In the OpenShift web console, you can force the use of Prometheus. An error message is displayed when incompatible queries fail, which can help you figure out which labels to change to make the query compatible. For example, you can change a filter or an aggregation from Resource or Pods to Owner.
- Consider using the FlowMetrics API to create your own metric
  If the data that you need is not available as a Prometheus metric, you can use the FlowMetrics API to create your own metric. For more information, see "FlowMetrics API Reference" and "Configuring custom metrics by using FlowMetric API".
- Configure Loki to improve the query performance
  If the problem persists, you can consider configuring Loki to improve the query performance. Some options depend on the installation mode you used for Loki, such as using the Operator and LokiStack, or Monolithic mode, or Microservices mode.
  - In LokiStack or Microservices modes, try increasing the number of querier replicas.
  - Increase the query timeout. You must also increase the Network Observability read timeout to Loki in the FlowCollector spec.loki.readTimeout.