Chapter 7. Configuring the Network Observability Operator
You can update the `FlowCollector` API resource to configure the Network Observability Operator and its managed components. The `FlowCollector` is explicitly created during installation. Since this resource operates cluster-wide, only a single `FlowCollector` is allowed, and it must be named `cluster`. For more information, see the FlowCollector API reference.
7.1. View the FlowCollector resource
You can view and edit YAML directly in the OpenShift Container Platform web console.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster, and then select the YAML tab. There, you can modify the `FlowCollector` resource to configure the Network Observability Operator.

The following example shows a sample `FlowCollector` resource for the OpenShift Container Platform Network Observability Operator:
Sample FlowCollector resource
- (1) The Agent specification, `spec.agent.type`, must be `EBPF`. eBPF is the only option that OpenShift Container Platform supports.
- (2) You can set the Sampling specification, `spec.agent.ebpf.sampling`, to manage resources. Lower sampling values might consume a large amount of computational, memory, and storage resources. You can mitigate this by specifying a higher sampling ratio value. A value of `100` means 1 flow in every 100 is sampled. A value of `0` or `1` means all flows are captured. Lower values increase the number of returned flows and the accuracy of derived metrics. By default, eBPF sampling is set to a value of 50, so 1 flow in every 50 is sampled. Note that more sampled flows also mean more storage is needed. It is recommended to start with the default values and refine them empirically to determine which setting your cluster can manage.
- (3) The Processor specification, `spec.processor`, can be set to enable conversation tracking. When enabled, conversation events are queryable in the web console. The `spec.processor.logTypes` value is `Flows`. The `spec.processor.advanced` values are `Conversations`, `EndedConversations`, or `All`. Storage requirements are highest for `All` and lowest for `EndedConversations`.
- (4) The Loki specification, `spec.loki`, specifies the Loki client. The default values match the Loki install paths mentioned in the Installing the Loki Operator section. If you used another installation method for Loki, specify the appropriate client information for your install.
- (5) The `LokiStack` mode automatically sets a few configurations: `querierUrl`, `ingesterUrl`, `statusUrl`, `tenantID`, and the corresponding TLS configuration. Cluster roles and a cluster role binding are created for reading and writing logs to Loki. In addition, `authToken` is set to `Forward`. You can set these manually by using the `Manual` mode.
- (6) The `spec.quickFilters` specification defines filters that show up in the web console. The `Application` filter keys, `src_namespace` and `dst_namespace`, are negated (`!`), so the `Application` filter shows all traffic that does not originate from, or have a destination in, any `openshift-` or `netobserv` namespaces. For more information, see Configuring quick filters below.
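
The callouts above describe fields of a sample that is not reproduced here, so the following is only a minimal sketch of how those fields might fit together. The API version, the conversation timeout fields and their values, the filter name, and the placement of `quickFilters` under `spec.consolePlugin` are assumptions; check each of them against the FlowCollector API reference for your Operator version.

```yaml
# Hedged sketch, not the official sample.
apiVersion: flows.netobserv.io/v1beta2   # assumed API version
kind: FlowCollector
metadata:
  name: cluster                  # the only allowed name
spec:
  namespace: netobserv
  deploymentModel: Direct
  agent:
    type: eBPF                   # callout 1; casing (eBPF or EBPF) depends on the API version
    ebpf:
      sampling: 50               # callout 2; default, 1 flow in every 50 is sampled
  processor:                     # callout 3
    logTypes: Flows
    advanced:
      conversationEndTimeout: 10s          # assumed field, illustrative value
      conversationHeartbeatInterval: 30s   # assumed field, illustrative value
  loki:                          # callout 4
    mode: LokiStack              # callout 5
  consolePlugin:
    quickFilters:                # callout 6; nesting under consolePlugin is an assumption
      - name: Applications       # illustrative filter name
        filter:
          src_namespace!: 'openshift-,netobserv'
          dst_namespace!: 'openshift-,netobserv'
        default: true
```

Only the fields that the callouts mention are shown; a real resource usually carries more settings, such as resource requests and limits.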
7.2. Configuring the Flow Collector resource with Kafka
				You can configure the FlowCollector resource to use Kafka for high-throughput and low-latency data feeds. A Kafka instance needs to be running, and a Kafka topic dedicated to OpenShift Container Platform Network Observability must be created in that instance. For more information, see Kafka documentation with AMQ Streams.
			
Prerequisites
- Kafka is installed. Red Hat supports Kafka with AMQ Streams Operator.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the Network Observability Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
- Modify the `FlowCollector` resource for the OpenShift Container Platform Network Observability Operator to use Kafka, as shown in the following sample YAML:
Sample Kafka configuration in FlowCollector resource
- (1) Set `spec.deploymentModel` to `Kafka` instead of `Direct` to enable the Kafka deployment model.
- (2) `spec.kafka.address` refers to the Kafka bootstrap server address. You can specify a port if needed, for instance `kafka-cluster-kafka-bootstrap.netobserv:9093` for using TLS on port 9093.
- (3) `spec.kafka.topic` should match the name of a topic created in Kafka.
- (4) `spec.kafka.tls` can be used to encrypt all communications to and from Kafka with TLS or mTLS. When enabled, the Kafka CA certificate must be available as a ConfigMap or a Secret, both in the namespace where the `flowlogs-pipeline` processor component is deployed (default: `netobserv`) and where the eBPF agents are deployed (default: `netobserv-privileged`). It must be referenced with `spec.kafka.tls.caCert`. When using mTLS, client secrets must be available in these namespaces as well (they can be generated, for instance, by using the AMQ Streams User Operator) and referenced with `spec.kafka.tls.userCert`.
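
As a rough illustration of the four callouts above, a Kafka-enabled `FlowCollector` might look like the following sketch. The API version and the topic name `network-flows` are assumptions chosen for illustration; the bootstrap address is the one used as an example in callout 2.

```yaml
# Hedged sketch: verify field names against the FlowCollector API reference.
apiVersion: flows.netobserv.io/v1beta2   # assumed API version
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Kafka                                 # callout 1
  kafka:
    address: "kafka-cluster-kafka-bootstrap.netobserv"   # callout 2
    topic: network-flows                                 # callout 3; hypothetical topic name
    tls:
      enable: false                                      # callout 4; set to true and reference caCert to use TLS
```

The topic must already exist in the Kafka instance before the Operator starts producing to it.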
7.3. Export enriched network flow data
You can send network flows to Kafka, IPFIX, the Red Hat build of OpenTelemetry, or all three at the same time. For Kafka or IPFIX, any processor or storage that supports those inputs, such as Splunk, Elasticsearch, or Fluentd, can consume the enriched network flow data. For OpenTelemetry, network flow data and metrics can be exported to a compatible OpenTelemetry endpoint, such as Red Hat build of OpenTelemetry, Jaeger, or Prometheus.
Prerequisites
- Your Kafka, IPFIX, or OpenTelemetry collector endpoints are available from Network Observability `flowlogs-pipeline` pods.
Procedure
- In the web console, navigate to Operators → Installed Operators.
- Under the Provided APIs heading for the NetObserv Operator, select Flow Collector.
- Select cluster, and then select the YAML tab.
- Edit the `FlowCollector` to configure `spec.exporters` as follows:
- (1, 4, 6) You can export flows to IPFIX, OpenTelemetry, and Kafka individually or concurrently.
- (2) The Network Observability Operator exports all flows to the configured Kafka topic.
- (3) You can encrypt all communications to and from Kafka with SSL/TLS or mTLS. When enabled, the Kafka CA certificate must be available as a ConfigMap or a Secret in the namespace where the `flowlogs-pipeline` processor component is deployed (default: `netobserv`). It must be referenced with `spec.exporters.tls.caCert`. When using mTLS, client secrets must be available in this namespace as well (they can be generated, for instance, by using the AMQ Streams User Operator) and referenced with `spec.exporters.tls.userCert`.
- (5) You have the option to specify the transport. The default value is `tcp`, but you can also specify `udp`.
- (7) The protocol of the OpenTelemetry connection. The available options are `http` and `grpc`.
- (8) OpenTelemetry configuration for exporting logs, which are the same as the logs created for Loki.
- (9) OpenTelemetry configuration for exporting metrics, which are the same as the metrics created for Prometheus. These configurations are specified in the `spec.processor.metrics.includeList` parameter of the `FlowCollector` custom resource, along with any custom metrics you defined by using the `FlowMetrics` custom resource.
- (10) The time interval at which metrics are sent to the OpenTelemetry collector.
- (11) Optional: Network Observability network flow formats are automatically renamed to an OpenTelemetry-compliant format. The `fieldsMapping` specification lets you customize the OpenTelemetry format output. For example, in the YAML sample, `SrcAddr` is the Network Observability input field, and it is renamed to `source.address` in the OpenTelemetry output. You can see both Network Observability and OpenTelemetry formats in the "Network flows format reference".
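
The callouts above can be read against a sketch such as the following, which configures all three exporter types at once. This is an illustration under assumptions, not the official sample: the host names, ports, topic name, and push interval are hypothetical, and the nested field names (`targetHost`, `targetPort`, `transport`, `protocol`, `logs`, `metrics`, `pushTimeInterval`) should be verified against the FlowCollector API reference, including the accepted casing of the transport value.

```yaml
# Hedged sketch: hosts, ports, and topic are hypothetical placeholders.
spec:
  exporters:
    - type: Kafka                          # callout 1
      kafka:
        address: "kafka-cluster-kafka-bootstrap.netobserv"
        topic: netobserv-flows-export      # callout 2; hypothetical topic name
        tls:
          enable: false                    # callout 3
    - type: IPFIX                          # callout 4
      ipfix:
        targetHost: "ipfix-collector.ipfix.svc.cluster.local"   # hypothetical
        targetPort: 4739                   # hypothetical
        transport: tcp                     # callout 5; tcp (default) or udp
    - type: OpenTelemetry                  # callout 6
      openTelemetry:
        targetHost: "my-otelcol-collector-headless.otlp.svc"    # hypothetical
        targetPort: 4317                   # hypothetical
        protocol: grpc                     # callout 7; field name is an assumption
        logs:                              # callout 8
          enable: true
        metrics:                           # callout 9
          enable: true
          pushTimeInterval: 20s            # callout 10; illustrative value
        # fieldsMapping:                   # callout 11; optional
        #   - input: SrcAddr
        #     output: source.address
```

Each list entry is independent, so you can keep only the exporter types you need.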
 
After configuration, network flows data can be sent to an available output in a JSON format. For more information, see "Network flows format reference".
7.4. Updating the Flow Collector resource
				As an alternative to editing YAML in the OpenShift Container Platform web console, you can configure specifications, such as eBPF sampling, by patching the flowcollector custom resource (CR):
			
Procedure
- Run the following command to patch the `flowcollector` CR and update the `spec.agent.ebpf.sampling` value:
  $ oc patch flowcollector cluster --type=json -p '[{"op": "replace", "path": "/spec/agent/ebpf/sampling", "value": <new value>}]' -n netobserv
7.5. Filter network flows at ingestion
You can create filters to reduce the number of generated network flows. Filtering network flows can reduce the resource usage of the network observability components.
You can configure two kinds of filters:
- eBPF agent filters
- Flowlogs-pipeline filters
7.5.1. eBPF agent filters
eBPF agent filters maximize performance because they take effect at the earliest stage of the network flows collection process.
To configure eBPF agent filters with the Network Observability Operator, see "Filtering eBPF flow data using multiple rules".
7.5.2. Flowlogs-pipeline filters
Flowlogs-pipeline filters provide greater control over traffic selection because they take effect later in the network flows collection process. They are primarily used to improve data storage.
Flowlogs-pipeline filters use a simple query language to filter network flows, as shown in the following example:
(srcnamespace="netobserv" OR (srcnamespace="ingress" AND dstnamespace="netobserv")) AND srckind!="service"
The query language uses the following syntax:
| Category | Operators |
|---|---|
| Logical boolean operators (not case-sensitive) | |
| Comparison operators | |
| Unary operations | |
					You can configure flowlogs-pipeline filters in the spec.processor.filters section of the FlowCollector resource. For example:
				
Example YAML Flowlogs-pipeline filter
- (1) Sends matching flows to a specific output, such as Loki, Prometheus, or an external system. When omitted, sends to all configured outputs.
- (2) Optional. Applies a sampling ratio to limit the number of matching flows to be stored or exported. For example, `sampling: 10` means 1/10 of the flows are kept.
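
A filter entry matching the two callouts above might be sketched as follows. The field names `query`, `outputTarget`, and `sampling` are assumptions based on the callout descriptions, and the query string is copied from the earlier example; confirm both against the FlowCollector API reference before use.

```yaml
# Hedged sketch of spec.processor.filters.
spec:
  processor:
    filters:
      - query: |
          (srcnamespace="netobserv" OR (srcnamespace="ingress" AND dstnamespace="netobserv")) AND srckind!="service"
        outputTarget: Loki   # callout 1; omit to send to all configured outputs
        sampling: 10         # callout 2; keeps 1/10 of the matching flows
```

Because these filters run after the eBPF agent, they reduce what is stored or exported rather than what is collected.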
7.6. Configuring quick filters
You can modify the filters in the `FlowCollector` resource. Exact matches are possible by using double quotes around values. Otherwise, partial matches are used for textual values. The bang (`!`) character, placed at the end of a key, means negation. See the sample `FlowCollector` resource for more context about modifying the YAML.

The filter matching type, "all of" or "any of", is a UI setting that users can modify from the query options. It is not part of this resource configuration.
Here is a list of all available filter keys:
| Universal* | Source | Destination | Description |
|---|---|---|---|
| namespace | src_namespace | dst_namespace | Filter traffic related to a specific namespace. |
| name | src_name | dst_name | Filter traffic related to a given leaf resource name, such as a specific pod, service, or node (for host-network traffic). |
| kind | src_kind | dst_kind | Filter traffic related to a given resource kind. The resource kinds include the leaf resource (Pod, Service, or Node), or the owner resource (Deployment and StatefulSet). |
| owner_name | src_owner_name | dst_owner_name | Filter traffic related to a given resource owner; that is, a workload or a set of pods. For example, it can be a Deployment name or a StatefulSet name. |
| resource | src_resource | dst_resource | Filter traffic related to a specific resource that is denoted by its canonical name, which identifies it uniquely. The canonical notation is |
| address | src_address | dst_address | Filter traffic related to an IP address. IPv4 and IPv6 are supported. CIDR ranges are also supported. |
| mac | src_mac | dst_mac | Filter traffic related to a MAC address. |
| port | src_port | dst_port | Filter traffic related to a specific port. |
| host_address | src_host_address | dst_host_address | Filter traffic related to the host IP address where the pods are running. |
| protocol | N/A | N/A | Filter traffic related to a protocol, such as TCP or UDP. |
- \* Universal keys filter on either source or destination. For example, filtering `name: 'my-pod'` means all traffic from `my-pod` and all traffic to `my-pod`, regardless of the matching type used, whether Match all or Match any.
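
To make the matching rules above concrete, the following sketch defines three quick filters: one negated, one using an exact match, and one using a universal key. The filter names are hypothetical, and the placement under `spec.consolePlugin.quickFilters` is an assumption to verify against the FlowCollector API reference.

```yaml
# Hedged sketch: filter names are illustrative only.
spec:
  consolePlugin:
    quickFilters:
      - name: Exclude platform namespaces        # hypothetical name
        filter:
          src_namespace!: 'openshift-,netobserv'   # trailing ! negates; values match partially
          dst_namespace!: 'openshift-,netobserv'
        default: true
      - name: Exact pod                          # hypothetical name
        filter:
          name: '"my-pod"'                       # universal key; inner double quotes request an exact match
      - name: TCP only                           # hypothetical name
        filter:
          protocol: 'TCP'
```

The outer single quotes are YAML string quoting; only the inner double quotes are part of the filter value.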
7.7. Resource management and performance considerations
The amount of resources required by network observability depends on the size of your cluster and your requirements for the cluster to ingest and store observability data. To manage resources and set performance criteria for your cluster, consider configuring the following settings. Adjusting them can help you reach a setup that fits your observability needs.
The following settings can help you manage resources and performance from the outset:
- eBPF Sampling
- You can set the Sampling specification, `spec.agent.ebpf.sampling`, to manage resources. Smaller sampling values might consume a large amount of computational, memory, and storage resources. You can mitigate this by specifying a higher sampling ratio value. A value of `100` means 1 flow in every 100 is sampled. A value of `0` or `1` means all flows are captured. Smaller values increase the number of returned flows and the accuracy of derived metrics. By default, eBPF sampling is set to a value of 50, so 1 flow in every 50 is sampled. Note that more sampled flows also mean more storage is needed. Consider starting with the default values and refining them empirically to determine which setting your cluster can manage.
- eBPF features
- The more features that are enabled, the more CPU and memory are impacted. See "Observing the network traffic" for a complete list of these features.
- Without Loki
- You can reduce the amount of resources that network observability requires by not using Loki and instead relying on Prometheus. For example, when network observability is configured without Loki, the total savings of memory usage are in the 20-65% range and CPU utilization is lower by 10-30%, depending upon the sampling value. See "Network observability without Loki" for more information.
- Restricting or excluding interfaces
- Reduce the overall observed traffic by setting the values for `spec.agent.ebpf.interfaces` and `spec.agent.ebpf.excludeInterfaces`. By default, the agent fetches all the interfaces in the system, except the ones listed in `excludeInterfaces` and `lo` (local interface). Note that the interface names might vary according to the Container Network Interface (CNI) used.
- Performance fine-tuning
- The following settings can be used to fine-tune performance after Network Observability has been running for a while:
  - Resource requirements and limits: Adapt the resource requirements and limits to the load and memory usage you expect on your cluster by using the `spec.agent.ebpf.resources` and `spec.processor.resources` specifications. The default limits of 800MB might be sufficient for most medium-sized clusters.
  - Cache max flows timeout: Control how often flows are reported by the agents by using the eBPF agent's `spec.agent.ebpf.cacheMaxFlows` and `spec.agent.ebpf.cacheActiveTimeout` specifications. A larger value results in less traffic being generated by the agents, which correlates with a lower CPU load. However, a larger value leads to slightly higher memory consumption, and might generate more latency in the flow collection.
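
The settings discussed in this section can be gathered into one fragment, sketched below. The values shown are illustrative starting points rather than recommendations, and the `cacheActiveTimeout` value in particular is an assumption; tune each one empirically for your cluster.

```yaml
# Hedged sketch: values are illustrative, not tuned recommendations.
spec:
  agent:
    ebpf:
      sampling: 50                 # 1 flow in every 50; lower values capture more flows
      interfaces: []               # empty list: observe all interfaces
      excludeInterfaces: ["lo"]    # interface names vary with the CNI in use
      cacheMaxFlows: 100000        # larger cache: less agent traffic, more memory
      cacheActiveTimeout: 5s       # assumed value; longer timeout: lower CPU, more latency
      resources:
        limits:
          memory: 800Mi
  processor:
    resources:
      limits:
        memory: 800Mi
```

Changing one setting at a time makes it easier to attribute any change in CPU or memory usage.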
7.7.1. Resource considerations
The following table outlines examples of resource considerations for clusters with certain workload sizes.
The examples outlined in the table demonstrate scenarios that are tailored to specific workloads. Consider each example only as a baseline from which adjustments can be made to accommodate your workload needs.
| | Extra small (10 nodes) | Small (25 nodes) | Large (250 nodes) [2] |
|---|---|---|---|
| Worker Node vCPU and memory | 4 vCPUs, 16GiB mem [1] | 16 vCPUs, 64GiB mem [1] | 16 vCPUs, 64GiB mem [1] |
| LokiStack size | | | |
| Network Observability controller memory limit | 400Mi (default) | 400Mi (default) | 400Mi (default) |
| eBPF sampling rate | 50 (default) | 50 (default) | 50 (default) |
| eBPF memory limit | 800Mi (default) | 800Mi (default) | 1600Mi |
| cacheMaxSize | 50,000 | 100,000 (default) | 100,000 (default) |
| FLP memory limit | 800Mi (default) | 800Mi (default) | 800Mi (default) |
| FLP Kafka partitions | – | 48 | 48 |
| Kafka consumer replicas | – | 6 | 18 |
| Kafka brokers | – | 3 (default) | 3 (default) |
- [1] Tested with AWS M6i instances.
- [2] In addition to this worker and its controller, 3 infra nodes (size `M6i.12xlarge`) and 1 workload node (size `M6i.8xlarge`) were tested.
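
As one way to read the table, the "Large (250 nodes)" column could translate into a `FlowCollector` fragment like the sketch below. The mapping of the `cacheMaxSize` row to `spec.agent.ebpf.cacheMaxFlows` and the `kafkaConsumerReplicas` field name are assumptions, so confirm them in the FlowCollector API reference. Kafka partitions and brokers are configured on the Kafka side and are not shown.

```yaml
# Hedged sketch of the "Large" column; field mappings are assumptions.
spec:
  deploymentModel: Kafka           # the Kafka rows in the table imply this model
  agent:
    ebpf:
      sampling: 50                 # default
      cacheMaxFlows: 100000        # assumed mapping of the cacheMaxSize row
      resources:
        limits:
          memory: 1600Mi           # raised from the 800Mi default
  processor:
    kafkaConsumerReplicas: 18      # assumed field name
    resources:
      limits:
        memory: 800Mi              # default
```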
7.7.2. Total average memory and CPU usage
The following table outlines averages of total resource usage for clusters with sampling values of 1 and 50 for two different tests: Test 1 and Test 2. The tests differ in the following ways:

- Test 1 takes into account high ingress traffic volume in addition to the total number of namespaces, pods, and services in an OpenShift Container Platform cluster, places load on the eBPF agent, and represents use cases with a high number of workloads for a given cluster size. For example, Test 1 consists of 76 Namespaces, 5153 Pods, and 2305 Services with a network traffic scale of ~350 MB/s.
- Test 2 takes into account high ingress traffic volume in addition to the total number of namespaces, pods, and services in an OpenShift Container Platform cluster and represents use cases with a high number of workloads for a given cluster size. For example, Test 2 consists of 553 Namespaces, 6998 Pods, and 2508 Services with a network traffic scale of ~950 MB/s.
Since different types of cluster use cases are exemplified in the different tests, the numbers in this table do not scale linearly when compared side-by-side. Instead, they are intended to be used as a benchmark for evaluating your personal cluster usage. The examples outlined in the table demonstrate scenarios that are tailored to specific workloads. Consider each example only as a baseline from which adjustments can be made to accommodate your workload needs.
Metrics exported to Prometheus can impact resource usage. Cardinality values for the metrics can help determine how much resource usage is affected. For more information, see "Network Flows format" in the Additional resources section.
| Sampling value | Resources used | Test 1 (25 nodes) | Test 2 (250 nodes) |
|---|---|---|---|
| Sampling = 50 | Total NetObserv CPU Usage | 1.35 | 5.39 |
| Sampling = 50 | Total NetObserv RSS (Memory) Usage | 16 GB | 63 GB |
| Sampling = 1 | Total NetObserv CPU Usage | 1.82 | 11.99 |
| Sampling = 1 | Total NetObserv RSS (Memory) Usage | 22 GB | 87 GB |
Summary: This table shows average total resource usage of Network Observability, which includes Agents, FLP, Kafka, and Loki with all features enabled. For details about what features are enabled, see the features covered in "Observing the network traffic", which comprises all the features that are enabled for this testing.