Chapter 8. Using metrics with dashboards and alerts


The Network Observability Operator uses the flowlogs-pipeline to generate metrics from flow logs. You can utilize these metrics by setting custom alerts and viewing dashboards.

8.1. Viewing Network Observability metrics dashboards

On the Overview tab in the OpenShift Container Platform console, you can view the overall aggregated metrics of the network traffic flow on the cluster. You can choose to display the information by node, namespace, owner, pod, and service. You can also use filters and display options to further refine the metrics.

Procedure

  1. In the web console, navigate to Observe → Dashboards and select the Netobserv dashboard.
  2. View network traffic metrics in the following categories, with each having the subset per node, namespace, source, and destination:

    • Byte rates
    • Packet drops
    • DNS
    • RTT
  3. Select the Netobserv/Health dashboard.
  4. View metrics about the health of the Operator in the following categories, with each having the subset per node, namespace, source, and destination:

    • Flows
    • Flows Overhead
    • Flow rates
    • Agents
    • Processor
    • Operator

Infrastructure and Application metrics are shown in a split-view for namespace and workloads.

8.2. Predefined metrics

The flowlogs-pipeline generates a set of predefined metrics. To add or remove metrics, edit spec.processor.metrics.includeList in the FlowCollector custom resource.
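As a minimal sketch, the following FlowCollector fragment sets an explicit includeList. The apiVersion and the resource name cluster are assumptions, so check them against the version installed on your cluster. The metric names are taken from the includeList metrics names in the Network Observability metrics section.

```yaml
# Sketch only: apiVersion and metadata.name are assumptions.
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      # Assumption: an explicit list replaces the defaults,
      # so the default entries are repeated here.
      includeList:
      - namespace_flows_total          # enabled by default
      - node_ingress_bytes_total       # enabled by default
      - workload_ingress_bytes_total   # enabled by default
      - workload_egress_bytes_total    # added for this example
```

Names in includeList are written without the netobserv_ prefix that appears in Prometheus.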

8.3. Network Observability metrics

You can also create alerts by using the includeList metrics in Prometheus rules, as shown in the example "Creating alerts".

When you look for these metrics in Prometheus, such as in the console through Observe → Metrics, or when you define alerts, all metric names are prefixed with netobserv_. For example, netobserv_namespace_flows_total. The available metric names are as follows:

includeList metrics names

Names followed by an asterisk * are enabled by default.

  • namespace_egress_bytes_total
  • namespace_egress_packets_total
  • namespace_ingress_bytes_total
  • namespace_ingress_packets_total
  • namespace_flows_total *
  • node_egress_bytes_total
  • node_egress_packets_total
  • node_ingress_bytes_total *
  • node_ingress_packets_total
  • node_flows_total
  • workload_egress_bytes_total
  • workload_egress_packets_total
  • workload_ingress_bytes_total *
  • workload_ingress_packets_total
  • workload_flows_total

PacketDrop metrics names

When the PacketDrop feature is enabled in spec.agent.ebpf.features (with privileged mode), the following additional metrics are available:

  • namespace_drop_bytes_total
  • namespace_drop_packets_total *
  • node_drop_bytes_total
  • node_drop_packets_total
  • workload_drop_bytes_total
  • workload_drop_packets_total

DNS metrics names

When the DNSTracking feature is enabled in spec.agent.ebpf.features, the following additional metrics are available:

  • namespace_dns_latency_seconds *
  • node_dns_latency_seconds
  • workload_dns_latency_seconds

FlowRTT metrics names

When the FlowRTT feature is enabled in spec.agent.ebpf.features, the following additional metrics are available:

  • namespace_rtt_seconds *
  • node_rtt_seconds
  • workload_rtt_seconds
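
The feature-dependent metrics above can be sketched in a single FlowCollector fragment. This is an illustration under assumptions: the apiVersion, the resource name cluster, and the agent type value are not stated in this chapter, and you should enable only the features you need.

```yaml
# Sketch only: apiVersion, metadata.name, and agent.type are assumptions.
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      privileged: true     # required by the PacketDrop feature
      features:
      - PacketDrop         # makes the *_drop_* metrics available
      - DNSTracking        # makes the *_dns_latency_seconds metrics available
      - FlowRTT            # makes the *_rtt_seconds metrics available
  processor:
    metrics:
      # Assumption: an explicit list replaces the defaults, so list
      # every metric that you want to keep.
      includeList:
      - namespace_flows_total
      - node_ingress_bytes_total
      - workload_ingress_bytes_total
      - namespace_drop_packets_total
      - namespace_dns_latency_seconds
      - namespace_rtt_seconds
      - workload_rtt_seconds          # not enabled by default
```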

8.4. Creating alerts

You can create custom alerting rules for the Netobserv dashboard metrics to trigger alerts when some defined conditions are met.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role or with view permissions for all projects.
  • You have the Network Observability Operator installed.

Procedure

  1. Create a YAML file by clicking the import icon, +.
  2. Add an alerting rule configuration to the YAML file. In the YAML sample that follows, an alert is created for when the cluster ingress traffic reaches a given threshold of 10 MBps per destination workload.

    apiVersion: monitoring.openshift.io/v1
    kind: AlertingRule
    metadata:
      name: netobserv-alerts
      namespace: openshift-monitoring
    spec:
      groups:
      - name: NetObservAlerts
        rules:
        - alert: NetObservIncomingBandwidth
          annotations:
            message: |-
              {{ $labels.job }}: incoming traffic exceeding 10 MBps for 30s on {{ $labels.DstK8S_OwnerType }} {{ $labels.DstK8S_OwnerName }} ({{ $labels.DstK8S_Namespace }}).
            summary: "High incoming traffic."
          expr: sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[1m])) by (job, DstK8S_Namespace, DstK8S_OwnerName, DstK8S_OwnerType) > 10000000      1
          for: 30s
          labels:
            severity: warning
    1
    The netobserv_workload_ingress_bytes_total metric is enabled by default in spec.processor.metrics.includeList.
  3. Click Create to apply the configuration file to the cluster.
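
The same pattern applies to the other metrics. As a hedged sketch, the following rule alerts on packet drops per destination namespace. It assumes that the PacketDrop feature is enabled and that netobserv_namespace_drop_packets_total carries a DstK8S_Namespace label, by analogy with the workload metric in the preceding sample. The resource name, alert name, and threshold of 100 packets per second are hypothetical values chosen for illustration.

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-drop-alerts          # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservDropAlerts          # hypothetical name
    rules:
    - alert: NetObservPacketDrops      # hypothetical name
      annotations:
        message: |-
          {{ $labels.job }}: more than 100 packets per second dropped for 1m towards namespace {{ $labels.DstK8S_Namespace }}.
        summary: "High packet drop rate."
      # Illustrative threshold; the DstK8S_Namespace label is an assumption.
      expr: sum(rate(netobserv_namespace_drop_packets_total[1m])) by (job, DstK8S_Namespace) > 100
      for: 1m
      labels:
        severity: warning
```

Tune the threshold and the for duration against the drop rates that you observe on the Netobserv dashboard.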

8.5. Custom metrics

You can create custom metrics from flow log data by using the FlowMetric API. Every collected flow log contains a number of fields, such as the source name and destination name. You can use these fields as Prometheus labels to customize the cluster information that is shown on your dashboards.

8.6. Configuring custom metrics by using FlowMetric API

You can configure the FlowMetric API to create custom metrics by using flowlogs data fields as Prometheus labels. You can add multiple FlowMetric resources to a project to see multiple dashboard views.

Procedure

  1. In the web console, navigate to Operators → Installed Operators.
  2. In the Provided APIs heading for the NetObserv Operator, select FlowMetric.
  3. In the Project: dropdown list, select the project of the Network Observability Operator instance.
  4. Click Create FlowMetric.
  5. Configure the FlowMetric resource, similar to the following sample configurations:

    Example 8.1. Generate a metric that tracks ingress bytes received from cluster external sources

    apiVersion: flows.netobserv.io/v1alpha1
    kind: FlowMetric
    metadata:
      name: flowmetric-cluster-external-ingress-traffic
      namespace: netobserv                              1
    spec:
      metricName: cluster_external_ingress_bytes_total  2
      type: Counter                                     3
      valueField: Bytes
      direction: Ingress                                4
      labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType] 5
      filters:                                          6
      - field: SrcSubnetLabel
        matchType: Absence
    1
    The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
    2
    The name of the Prometheus metric, which in the web console appears with the netobserv_ prefix, as netobserv_<metricName>.
    3
    The type specifies the type of metric. The Counter type is useful for counting bytes or packets.
    4
    The direction of traffic to capture. If not specified, both ingress and egress are captured, which can lead to duplicated counts.
    5
    Labels define what the metrics look like and the relationship between the different entities. They also determine the cardinality of the metric. For example, SrcK8S_Name is a high-cardinality label.
    6
    Refines results based on the listed criteria. In this example, selecting only the cluster external traffic is done by matching only flows where SrcSubnetLabel is absent. This assumes the subnet labels feature is enabled (via spec.processor.subnetLabels), which is done by default.

    Verification

    1. Once the pods refresh, navigate to Observe → Metrics.
    2. In the Expression field, type the metric name to view the corresponding result. You can also enter an expression, such as topk(5, sum(rate(netobserv_cluster_external_ingress_bytes_total{DstK8S_Namespace="my-namespace"}[2m])) by (DstK8S_HostName, DstK8S_OwnerName, DstK8S_OwnerType))

    Example 8.2. Show RTT latency for cluster external ingress traffic

    apiVersion: flows.netobserv.io/v1alpha1
    kind: FlowMetric
    metadata:
      name: flowmetric-cluster-external-ingress-rtt
      namespace: netobserv    1
    spec:
      metricName: cluster_external_ingress_rtt_seconds
      type: Histogram                 2
      valueField: TimeFlowRttNs
      direction: Ingress
      labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType]
      filters:
      - field: SrcSubnetLabel
        matchType: Absence
      - field: TimeFlowRttNs
        matchType: Presence
      divider: "1000000000"      3
      buckets: [".001", ".005", ".01", ".02", ".03", ".04", ".05", ".075", ".1", ".25", "1"]  4
    1
    The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
    2
    The type specifies the type of metric. The Histogram type is useful for a latency value (TimeFlowRttNs).
    3
    Because the round-trip time (RTT) is provided in nanoseconds in flows, use a divider of 1 billion to convert it into seconds, which is the standard unit in Prometheus guidelines.
    4
    The custom buckets specify precision on RTT, with optimal precision ranging between 5ms and 250ms.

    Verification

    1. Once the pods refresh, navigate to Observe → Metrics.
    2. In the Expression field, you can type the metric name to view the corresponding result.

Important

High cardinality can affect the memory usage of Prometheus. You can check whether specific labels have high cardinality in the Network Flows format reference.
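
As a variation on Example 8.1, the following sketch tracks egress bytes sent to cluster external destinations by mirroring the direction, labels, and filter. The resource name and metric name are hypothetical, and the Src* label names are assumed by analogy with the Dst* fields used in Example 8.1.

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-egress-traffic   # hypothetical name
  namespace: netobserv
spec:
  metricName: cluster_external_egress_bytes_total    # hypothetical name
  type: Counter
  valueField: Bytes
  direction: Egress
  # Label by the source workload, because the destination is outside the cluster.
  labels: [SrcK8S_HostName,SrcK8S_Namespace,SrcK8S_OwnerName,SrcK8S_OwnerType]
  filters:
  # Keep only flows whose destination has no subnet label, that is,
  # flows leaving the cluster (assumes subnet labels are enabled).
  - field: DstSubnetLabel
    matchType: Absence
```

The labels used here are workload-level fields, which keeps the cardinality lower than pod-level fields such as SrcK8S_Name.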

8.7. Configuring custom charts using FlowMetric API

You can generate charts for dashboards in the OpenShift Container Platform web console by defining the charts section of the FlowMetric resource. As an administrator, you can view these charts in the Dashboard menu.

Procedure

  1. In the web console, navigate to Operators → Installed Operators.
  2. In the Provided APIs heading for the NetObserv Operator, select FlowMetric.
  3. In the Project: dropdown list, select the project of the Network Observability Operator instance.
  4. Click Create FlowMetric.
  5. Configure the FlowMetric resource, similar to the following sample configurations:

Example 8.3. Chart for tracking ingress bytes received from cluster external sources

apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
  namespace: netobserv   1
# ...
  charts:
  - dashboardName: Main  2
    title: External ingress traffic
    unit: Bps
    type: SingleStat
    queries:
    - promQL: "sum(rate($METRIC[2m]))"
      legend: ""
  - dashboardName: Main  3
    sectionName: External
    title: Top external ingress traffic per workload
    unit: Bps
    type: StackArea
    queries:
    - promQL: "sum(rate($METRIC{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace, DstK8S_OwnerName)"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
# ...
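1
The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
2 3
Using a different dashboardName creates a new dashboard that is prefixed with Netobserv. For example, Netobserv / <dashboard_name>.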

Verification

  1. Once the pods refresh, navigate to Observe → Dashboards.
  2. Search for the NetObserv / Main dashboard. View two panels under the NetObserv / Main dashboard, or optionally a dashboard name that you create:

    • A textual single statistic showing the global external ingress rate summed across all dimensions
    • A timeseries graph showing the same metric per destination workload

For more information about the query language, refer to the Prometheus documentation.

Example 8.4. Chart for RTT latency for cluster external ingress traffic

apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
  namespace: netobserv   1
# ...
  charts:
  - dashboardName: Main  2
    title: External ingress TCP latency
    unit: seconds
    type: SingleStat
    queries:
    - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket[2m])) by (le)) > 0"
      legend: "p99"
  - dashboardName: Main  3
    sectionName: External
    title: "Top external ingress sRTT per workload, p50 (ms)"
    unit: seconds
    type: Line
    queries:
    - promQL: "histogram_quantile(0.5, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
  - dashboardName: Main  4
    sectionName: External
    title: "Top external ingress sRTT per workload, p99 (ms)"
    unit: seconds
    type: Line
    queries:
    - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
# ...
1
The FlowMetric resources need to be created in the namespace defined in the FlowCollector spec.namespace, which is netobserv by default.
2 3 4
Using a different dashboardName creates a new dashboard that is prefixed with Netobserv. For example, Netobserv / <dashboard_name>.

This example uses the histogram_quantile function to show p50 and p99.

You can show averages of histograms by dividing the metric, $METRIC_sum, by the metric, $METRIC_count, which are automatically generated when you create a histogram. With the preceding example, the Prometheus query to do this is as follows:

promQL: "(sum(rate($METRIC_sum{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName) / sum(rate($METRIC_count{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName))*1000"
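
For context, that query can be placed in a chart entry of its own. The following sketch reuses the fields from the preceding example. Only the title is new, and it is a hypothetical value.

```yaml
  charts:
  - dashboardName: Main
    sectionName: External
    title: "Top external ingress sRTT per workload, avg (ms)"   # hypothetical title
    unit: seconds
    type: Line
    queries:
    # Average = rate of the histogram sum divided by rate of the histogram count,
    # multiplied by 1000 to express the result in milliseconds.
    - promQL: "(sum(rate($METRIC_sum{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName) / sum(rate($METRIC_count{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName))*1000"
      legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
```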

Verification

  1. Once the pods refresh, navigate to Observe → Dashboards.
  2. Search for the NetObserv / Main dashboard. View the new panel under the NetObserv / Main dashboard, or optionally a dashboard name that you create.

For more information about the query language, refer to the Prometheus documentation.

8.8. Detecting SYN flooding using the FlowMetric API and TCP flags

You can create an AlertingRule resource to alert for SYN flooding.

Procedure

  1. In the web console, navigate to Operators → Installed Operators.
  2. In the Provided APIs heading for the NetObserv Operator, select FlowMetric.
  3. In the Project dropdown list, select the project of the Network Observability Operator instance.
  4. Click Create FlowMetric.
  5. Create FlowMetric resources to add the following configurations:

    Configuration counting flows per destination host and resource, with TCP flags

    apiVersion: flows.netobserv.io/v1alpha1
    kind: FlowMetric
    metadata:
      name: flows-with-flags-per-destination
    spec:
      metricName: flows_with_flags_per_destination_total
      type: Counter
      labels: [SrcSubnetLabel,DstSubnetLabel,DstK8S_Name,DstK8S_Type,DstK8S_HostName,DstK8S_Namespace,Flags]

    Configuration counting flows per source host and resource, with TCP flags

    apiVersion: flows.netobserv.io/v1alpha1
    kind: FlowMetric
    metadata:
      name: flows-with-flags-per-source
    spec:
      metricName: flows_with_flags_per_source_total
      type: Counter
      labels: [DstSubnetLabel,SrcSubnetLabel,SrcK8S_Name,SrcK8S_Type,SrcK8S_HostName,SrcK8S_Namespace,Flags]

  6. Deploy the following AlertingRule resource to alert for SYN flooding:

    AlertingRule for SYN flooding

    apiVersion: monitoring.openshift.io/v1
    kind: AlertingRule
    metadata:
      name: netobserv-syn-alerts
      namespace: openshift-monitoring
    # ...
    spec:
      groups:
      - name: NetObservSYNAlerts
        rules:
        - alert: NetObserv-SYNFlood-in
          annotations:
            message: |-
              {{ $labels.job }}: incoming SYN-flood attack suspected to Host={{ $labels.DstK8S_HostName}}, Namespace={{ $labels.DstK8S_Namespace }}, Resource={{ $labels.DstK8S_Name }}. This is characterized by a high volume of SYN-only flows with different source IPs and/or ports.
            summary: "Incoming SYN-flood"
          expr: sum(rate(netobserv_flows_with_flags_per_destination_total{Flags="2"}[1m])) by (job, DstK8S_HostName, DstK8S_Namespace, DstK8S_Name) > 300      1
          for: 15s
          labels:
            severity: warning
            app: netobserv
        - alert: NetObserv-SYNFlood-out
          annotations:
            message: |-
              {{ $labels.job }}: outgoing SYN-flood attack suspected from Host={{ $labels.SrcK8S_HostName}}, Namespace={{ $labels.SrcK8S_Namespace }}, Resource={{ $labels.SrcK8S_Name }}. This is characterized by a high volume of SYN-only flows with different source IPs and/or ports.
            summary: "Outgoing SYN-flood"
          expr: sum(rate(netobserv_flows_with_flags_per_source_total{Flags="2"}[1m])) by (job, SrcK8S_HostName, SrcK8S_Namespace, SrcK8S_Name) > 300       2
          for: 15s
          labels:
            severity: warning
            app: netobserv
    # ...

    1 2
    In this example, the threshold for the alert is 300; however, you can adapt this value empirically. A threshold that is too low might produce false positives, and a threshold that is too high might miss actual attacks.
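
Because the alert expressions only select series with Flags="2", one possible refinement is to restrict the metric to SYN-only flows at the source, which lowers the number of series that Prometheus stores. The following is a sketch under assumptions: it assumes that your version of the FlowMetric API supports the Equal match type with a value field, and that the Flags field is compared as the string "2". The Flags label is kept so that the preceding alert expressions continue to match.

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flows-with-flags-per-destination
spec:
  metricName: flows_with_flags_per_destination_total
  type: Counter
  labels: [SrcSubnetLabel,DstSubnetLabel,DstK8S_Name,DstK8S_Type,DstK8S_HostName,DstK8S_Namespace,Flags]
  filters:
  # Assumption: matchType Equal and value are available in your API version.
  - field: Flags
    matchType: Equal
    value: "2"        # SYN-only flows, as in the alert expression
```

With this filter in place, the metric no longer counts flows with other TCP flags, so use it only if you do not need those series elsewhere.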

Verification

  1. In the web console, click Manage Columns in the Network Traffic table view and click TCP flags.
  2. In the Network Traffic table view, filter on the TCP protocol and the SYN TCP flag. A large number of flows with the same byteSize indicates a SYN flood.
  3. Go to Observe → Alerting and select the Alerting Rules tab.
  4. Filter on the NetObserv-SYNFlood-in alert. The alert should fire when SYN flooding occurs.