Chapter 12. Troubleshooting Logging


12.1. Viewing OpenShift Logging status

You can view the status of the Red Hat OpenShift Logging Operator and for a number of OpenShift Logging components.

12.1.1. Viewing the status of the Red Hat OpenShift Logging Operator

You can view the status of your Red Hat OpenShift Logging Operator.

Prerequisites

  • OpenShift Logging and Elasticsearch must be installed.

Procedure

  1. Change to the openshift-logging project.

    $ oc project openshift-logging
    Copy to Clipboard Toggle word wrap
  2. To view the OpenShift Logging status:

    1. Get the OpenShift Logging status:

      $ oc get clusterlogging instance -o yaml
      Copy to Clipboard Toggle word wrap

      Example output

      apiVersion: logging.openshift.io/v1
      kind: ClusterLogging
      
      ....
      
      status:  
      1
      
        collection:
          logs:
            fluentdStatus:
              daemonSet: fluentd  
      2
      
              nodes:
                fluentd-2rhqp: ip-10-0-169-13.ec2.internal
                fluentd-6fgjh: ip-10-0-165-244.ec2.internal
                fluentd-6l2ff: ip-10-0-128-218.ec2.internal
                fluentd-54nx5: ip-10-0-139-30.ec2.internal
                fluentd-flpnn: ip-10-0-147-228.ec2.internal
                fluentd-n2frh: ip-10-0-157-45.ec2.internal
              pods:
                failed: []
                notReady: []
                ready:
                - fluentd-2rhqp
                - fluentd-54nx5
                - fluentd-6fgjh
                - fluentd-6l2ff
                - fluentd-flpnn
                - fluentd-n2frh
        logstore: 
      3
      
          elasticsearchStatus:
          - ShardAllocationEnabled:  all
            cluster:
              activePrimaryShards:    5
              activeShards:           5
              initializingShards:     0
              numDataNodes:           1
              numNodes:               1
              pendingTasks:           0
              relocatingShards:       0
              status:                 green
              unassignedShards:       0
            clusterName:             elasticsearch
            nodeConditions:
              elasticsearch-cdm-mkkdys93-1:
            nodeCount:  1
            pods:
              client:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
              data:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
              master:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
      visualization:  
      4
      
          kibanaStatus:
          - deployment: kibana
            pods:
              failed: []
              notReady: []
              ready:
              - kibana-7fb4fd4cc9-f2nls
            replicaSets:
            - kibana-7fb4fd4cc9
            replicas: 1
      Copy to Clipboard Toggle word wrap

      1
      In the output, the cluster status fields appear in the status stanza.
      2
      Information on the Fluentd pods.
      3
      Information on the Elasticsearch pods, including Elasticsearch cluster health, green, yellow, or red.
      4
      Information on the Kibana pods.

12.1.1.1. Example condition messages

The following are examples of some condition messages from the Status.Nodes section of the OpenShift Logging instance.

A status message similar to the following indicates a node has exceeded the configured low watermark and no shard will be allocated to this node:

Example output

  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T15:57:22Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
        be allocated on this node.
      reason: Disk Watermark Low
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-clientdatamaster-0-1
    upgradeStatus: {}
Copy to Clipboard Toggle word wrap

A status message similar to the following indicates a node has exceeded the configured high watermark and shards will be relocated to other nodes:

Example output

  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T16:04:45Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
        from this node.
      reason: Disk Watermark High
      status: "True"
      type: NodeStorage
    deploymentName: cluster-logging-operator
    upgradeStatus: {}
Copy to Clipboard Toggle word wrap

A status message similar to the following indicates the Elasticsearch node selector in the CR does not match any nodes in the cluster:

Example output

    Elasticsearch Status:
      Shard Allocation Enabled:  shard allocation unknown
      Cluster:
        Active Primary Shards:  0
        Active Shards:          0
        Initializing Shards:    0
        Num Data Nodes:         0
        Num Nodes:              0
        Pending Tasks:          0
        Relocating Shards:      0
        Status:                 cluster health unknown
        Unassigned Shards:      0
      Cluster Name:             elasticsearch
      Node Conditions:
        elasticsearch-cdm-mkkdys93-1:
          Last Transition Time:  2019-06-26T03:37:32Z
          Message:               0/5 nodes are available: 5 node(s) didn't match node selector.
          Reason:                Unschedulable
          Status:                True
          Type:                  Unschedulable
        elasticsearch-cdm-mkkdys93-2:
      Node Count:  2
      Pods:
        Client:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:
        Data:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:
        Master:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:
Copy to Clipboard Toggle word wrap

A status message similar to the following indicates that the requested PVC could not bind to PV:

Example output

      Node Conditions:
        elasticsearch-cdm-mkkdys93-1:
          Last Transition Time:  2019-06-26T03:37:32Z
          Message:               pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
          Reason:                Unschedulable
          Status:                True
          Type:                  Unschedulable
Copy to Clipboard Toggle word wrap

A status message similar to the following indicates that the Fluentd pods cannot be scheduled because the node selector did not match any nodes:

Example output

Status:
  Collection:
    Logs:
      Fluentd Status:
        Daemon Set:  fluentd
        Nodes:
        Pods:
          Failed:
          Not Ready:
          Ready:
Copy to Clipboard Toggle word wrap

12.1.2. Viewing the status of OpenShift Logging components

You can view the status for a number of OpenShift Logging components.

Prerequisites

  • OpenShift Logging and Elasticsearch must be installed.

Procedure

  1. Change to the openshift-logging project.

    $ oc project openshift-logging
    Copy to Clipboard Toggle word wrap
  2. View the status of the OpenShift Logging environment:

    $ oc describe deployment cluster-logging-operator
    Copy to Clipboard Toggle word wrap

    Example output

    Name:                   cluster-logging-operator
    
    ....
    
    Conditions:
      Type           Status  Reason
      ----           ------  ------
      Available      True    MinimumReplicasAvailable
      Progressing    True    NewReplicaSetAvailable
    
    ....
    
    Events:
      Type    Reason             Age   From                   Message
      ----    ------             ----  ----                   -------
      Normal  ScalingReplicaSet  62m   deployment-controller  Scaled up replica set cluster-logging-operator-574b8987df to 1----
    Copy to Clipboard Toggle word wrap

  3. View the status of the OpenShift Logging replica set:

    1. Get the name of a replica set:

      Example output

      $ oc get replicaset
      Copy to Clipboard Toggle word wrap

      Example output

      NAME                                      DESIRED   CURRENT   READY   AGE
      cluster-logging-operator-574b8987df       1         1         1       159m
      elasticsearch-cdm-uhr537yu-1-6869694fb    1         1         1       157m
      elasticsearch-cdm-uhr537yu-2-857b6d676f   1         1         1       156m
      elasticsearch-cdm-uhr537yu-3-5b6fdd8cfd   1         1         1       155m
      kibana-5bd5544f87                         1         1         1       157m
      Copy to Clipboard Toggle word wrap

    2. Get the status of the replica set:

      $ oc describe replicaset cluster-logging-operator-574b8987df
      Copy to Clipboard Toggle word wrap

      Example output

      Name:           cluster-logging-operator-574b8987df
      
      ....
      
      Replicas:       1 current / 1 desired
      Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
      
      ....
      
      Events:
        Type    Reason            Age   From                   Message
        ----    ------            ----  ----                   -------
        Normal  SuccessfulCreate  66m   replicaset-controller  Created pod: cluster-logging-operator-574b8987df-qjhqv----
      Copy to Clipboard Toggle word wrap

12.2. Viewing the status of the log store

You can view the status of the OpenShift Elasticsearch Operator and for a number of Elasticsearch components.

12.2.1. Viewing the status of the log store

You can view the status of your log store.

Prerequisites

  • OpenShift Logging and Elasticsearch must be installed.

Procedure

  1. Change to the openshift-logging project.

    $ oc project openshift-logging
    Copy to Clipboard Toggle word wrap
  2. To view the status:

    1. Get the name of the log store instance:

      $ oc get Elasticsearch
      Copy to Clipboard Toggle word wrap

      Example output

      NAME            AGE
      elasticsearch   5h9m
      Copy to Clipboard Toggle word wrap

    2. Get the log store status:

      $ oc get Elasticsearch <Elasticsearch-instance> -o yaml
      Copy to Clipboard Toggle word wrap

      For example:

      $ oc get Elasticsearch elasticsearch -n openshift-logging -o yaml
      Copy to Clipboard Toggle word wrap

      The output includes information similar to the following:

      Example output

      status: 
      1
      
        cluster: 
      2
      
          activePrimaryShards: 30
          activeShards: 60
          initializingShards: 0
          numDataNodes: 3
          numNodes: 3
          pendingTasks: 0
          relocatingShards: 0
          status: green
          unassignedShards: 0
        clusterHealth: ""
        conditions: [] 
      3
      
        nodes: 
      4
      
        - deploymentName: elasticsearch-cdm-zjf34ved-1
          upgradeStatus: {}
        - deploymentName: elasticsearch-cdm-zjf34ved-2
          upgradeStatus: {}
        - deploymentName: elasticsearch-cdm-zjf34ved-3
          upgradeStatus: {}
        pods: 
      5
      
          client:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
          data:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
          master:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
        shardAllocationEnabled: all
      Copy to Clipboard Toggle word wrap

      1
      In the output, the cluster status fields appear in the status stanza.
      2
      The status of the log store:
      • The number of active primary shards.
      • The number of active shards.
      • The number of shards that are initializing.
      • The number of log store data nodes.
      • The total number of log store nodes.
      • The number of pending tasks.
      • The log store status: green, red, yellow.
      • The number of unassigned shards.
      3
      Any status conditions, if present. The log store status indicates the reasons from the scheduler if a pod could not be placed. Any events related to the following conditions are shown:
      • Container Waiting for both the log store and proxy containers.
      • Container Terminated for both the log store and proxy containers.
      • Pod unschedulable. Also, a condition is shown for a number of issues; see Example condition messages.
      4
      The log store nodes in the cluster, with upgradeStatus.
      5
      The log store client, data, and master pods in the cluster, listed under 'failed`, notReady, or ready state.

12.2.1.1. Example condition messages

The following are examples of some condition messages from the Status section of the Elasticsearch instance.

The following status message indicates that a node has exceeded the configured low watermark, and no shard will be allocated to this node.

status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T15:57:22Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
        be allocated on this node.
      reason: Disk Watermark Low
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
Copy to Clipboard Toggle word wrap

The following status message indicates that a node has exceeded the configured high watermark, and shards will be relocated to other nodes.

status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T16:04:45Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
        from this node.
      reason: Disk Watermark High
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
Copy to Clipboard Toggle word wrap

The following status message indicates that the log store node selector in the CR does not match any nodes in the cluster:

status:
    nodes:
    - conditions:
      - lastTransitionTime: 2019-04-10T02:26:24Z
        message: '0/8 nodes are available: 8 node(s) didn''t match node selector.'
        reason: Unschedulable
        status: "True"
        type: Unschedulable
Copy to Clipboard Toggle word wrap

The following status message indicates that the log store CR uses a non-existent persistent volume claim (PVC).

status:
   nodes:
   - conditions:
     - last Transition Time:  2019-04-10T05:55:51Z
       message:               pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
       reason:                Unschedulable
       status:                True
       type:                  Unschedulable
Copy to Clipboard Toggle word wrap

The following status message indicates that your log store cluster does not have enough nodes to support the redundancy policy.

status:
  clusterHealth: ""
  conditions:
  - lastTransitionTime: 2019-04-17T20:01:31Z
    message: Wrong RedundancyPolicy selected. Choose different RedundancyPolicy or
      add more nodes with data roles
    reason: Invalid Settings
    status: "True"
    type: InvalidRedundancy
Copy to Clipboard Toggle word wrap

This status message indicates your cluster has too many control plane nodes (also known as the master nodes):

status:
  clusterHealth: green
  conditions:
    - lastTransitionTime: '2019-04-17T20:12:34Z'
      message: >-
        Invalid master nodes count. Please ensure there are no more than 3 total
        nodes with master roles
      reason: Invalid Settings
      status: 'True'
      type: InvalidMasters
Copy to Clipboard Toggle word wrap

The following status message indicates that Elasticsearch storage does not support the change you tried to make.

For example:

status:
  clusterHealth: green
  conditions:
    - lastTransitionTime: "2021-05-07T01:05:13Z"
      message: Changing the storage structure for a custom resource is not supported
      reason: StorageStructureChangeIgnored
      status: 'True'
      type: StorageStructureChangeIgnored
Copy to Clipboard Toggle word wrap

The reason and type fields specify the type of unsupported change:

StorageClassNameChangeIgnored
Unsupported change to the storage class name.
StorageSizeChangeIgnored
Unsupported change the storage size.
StorageStructureChangeIgnored

Unsupported change between ephemeral and persistent storage structures.

Important

If you try to configure the ClusterLogging custom resource (CR) to switch from ephemeral to persistent storage, the OpenShift Elasticsearch Operator creates a persistent volume claim (PVC) but does not create a persistent volume (PV). To clear the StorageStructureChangeIgnored status, you must revert the change to the ClusterLogging CR and delete the PVC.

12.2.2. Viewing the status of the log store components

You can view the status for a number of the log store components.

Elasticsearch indices

You can view the status of the Elasticsearch indices.

  1. Get the name of an Elasticsearch pod:

    $ oc get pods --selector component=elasticsearch -o name
    Copy to Clipboard Toggle word wrap

    Example output

    pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
    pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
    Copy to Clipboard Toggle word wrap

  2. Get the status of the indices:

    $ oc exec elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -- indices
    Copy to Clipboard Toggle word wrap

    Example output

    Defaulting container name to elasticsearch.
    Use 'oc describe pod/elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -n openshift-logging' to see all of the containers in this pod.
    
    green  open   infra-000002                                                     S4QANnf1QP6NgCegfnrnbQ   3   1     119926            0        157             78
    green  open   audit-000001                                                     8_EQx77iQCSTzFOXtxRqFw   3   1          0            0          0              0
    green  open   .security                                                        iDjscH7aSUGhIdq0LheLBQ   1   1          5            0          0              0
    green  open   .kibana_-377444158_kubeadmin                                     yBywZ9GfSrKebz5gWBZbjw   3   1          1            0          0              0
    green  open   infra-000001                                                     z6Dpe__ORgiopEpW6Yl44A   3   1     871000            0        874            436
    green  open   app-000001                                                       hIrazQCeSISewG3c2VIvsQ   3   1       2453            0          3              1
    green  open   .kibana_1                                                        JCitcBMSQxKOvIq6iQW6wg   1   1          0            0          0              0
    green  open   .kibana_-1595131456_user1                                        gIYFIEGRRe-ka0W3okS-mQ   3   1          1            0          0              0
    Copy to Clipboard Toggle word wrap

Log store pods

You can view the status of the pods that host the log store.

  1. Get the name of a pod:

    $ oc get pods --selector component=elasticsearch -o name
    Copy to Clipboard Toggle word wrap

    Example output

    pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
    pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
    Copy to Clipboard Toggle word wrap

  2. Get the status of a pod:

    $ oc describe pod elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    Copy to Clipboard Toggle word wrap

    The output includes the following status information:

    Example output

    ....
    Status:             Running
    
    ....
    
    Containers:
      elasticsearch:
        Container ID:   cri-o://b7d44e0a9ea486e27f47763f5bb4c39dfd2
        State:          Running
          Started:      Mon, 08 Jun 2020 10:17:56 -0400
        Ready:          True
        Restart Count:  0
        Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    
    ....
    
      proxy:
        Container ID:  cri-o://3f77032abaddbb1652c116278652908dc01860320b8a4e741d06894b2f8f9aa1
        State:          Running
          Started:      Mon, 08 Jun 2020 10:18:38 -0400
        Ready:          True
        Restart Count:  0
    
    ....
    
    Conditions:
      Type              Status
      Initialized       True
      Ready             True
      ContainersReady   True
      PodScheduled      True
    
    ....
    
    Events:          <none>
    Copy to Clipboard Toggle word wrap

Log storage pod deployment configuration

You can view the status of the log store deployment configuration.

  1. Get the name of a deployment configuration:

    $ oc get deployment --selector component=elasticsearch -o name
    Copy to Clipboard Toggle word wrap

    Example output

    deployment.extensions/elasticsearch-cdm-1gon-1
    deployment.extensions/elasticsearch-cdm-1gon-2
    deployment.extensions/elasticsearch-cdm-1gon-3
    Copy to Clipboard Toggle word wrap

  2. Get the deployment configuration status:

    $ oc describe deployment elasticsearch-cdm-1gon-1
    Copy to Clipboard Toggle word wrap

    The output includes the following status information:

    Example output

    ....
      Containers:
       elasticsearch:
        Image:      registry.redhat.io/openshift-logging/elasticsearch6-rhel8
        Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    
    ....
    
    Conditions:
      Type           Status   Reason
      ----           ------   ------
      Progressing    Unknown  DeploymentPaused
      Available      True     MinimumReplicasAvailable
    
    ....
    
    Events:          <none>
    Copy to Clipboard Toggle word wrap

Log store replica set

You can view the status of the log store replica set.

  1. Get the name of a replica set:

    $ oc get replicaSet --selector component=elasticsearch -o name
    
    replicaset.extensions/elasticsearch-cdm-1gon-1-6f8495
    replicaset.extensions/elasticsearch-cdm-1gon-2-5769cf
    replicaset.extensions/elasticsearch-cdm-1gon-3-f66f7d
    Copy to Clipboard Toggle word wrap
  2. Get the status of the replica set:

    $ oc describe replicaSet elasticsearch-cdm-1gon-1-6f8495
    Copy to Clipboard Toggle word wrap

    The output includes the following status information:

    Example output

    ....
      Containers:
       elasticsearch:
        Image:      registry.redhat.io/openshift-logging/elasticsearch6-rhel8@sha256:4265742c7cdd85359140e2d7d703e4311b6497eec7676957f455d6908e7b1c25
        Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    
    ....
    
    Events:          <none>
    Copy to Clipboard Toggle word wrap

12.3. Understanding OpenShift Logging alerts

All of the logging collector alerts are listed on the Alerting UI of the OpenShift Container Platform web console.

12.3.1. Viewing logging collector alerts

Alerts are shown in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI. Alerts are in one of the following states:

  • Firing. The alert condition is true for the duration of the timeout. Click the Options menu at the end of the firing alert to view more information or silence the alert.
  • Pending The alert condition is currently true, but the timeout has not been reached.
  • Not Firing. The alert is not currently triggered.

Procedure

To view OpenShift Logging and other OpenShift Container Platform alerts:

  1. In the OpenShift Container Platform console, click Monitoring Alerting.
  2. Click the Alerts tab. The alerts are listed, based on the filters selected.

12.3.2. About logging collector alerts

The following alerts are generated by the logging collector. You can view these alerts in the OpenShift Container Platform web console, on the Alerts page of the Alerting UI.

Expand
Table 12.1. Fluentd Prometheus alerts
AlertMessageDescriptionSeverity

FluentDHighErrorRate

<value> of records have resulted in an error by fluentd <instance>.

The number of FluentD output errors is high, by default more than 10 in the previous 15 minutes.

Warning

FluentdNodeDown

Prometheus could not scrape fluentd <instance> for more than 10m.

Fluentd is reporting that Prometheus could not scrape a specific Fluentd instance.

Critical

FluentdQueueLengthIncreasing

In the last 12h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.

Fluentd is reporting that the queue size is increasing.

Critical

FluentDVeryHighErrorRate

<value> of records have resulted in an error by fluentd <instance>.

The number of FluentD output errors is very high, by default more than 25 in the previous 15 minutes.

Critical

12.3.3. About Elasticsearch alerting rules

You can view these alerting rules in Prometheus.

Expand
Table 12.2. Alerting rules
AlertDescriptionSeverity

ElasticsearchClusterNotHealthy

The cluster health status has been RED for at least 2 minutes. The cluster does not accept writes, shards may be missing, or the master node hasn’t been elected yet.

Critical

ElasticsearchClusterNotHealthy

The cluster health status has been YELLOW for at least 20 minutes. Some shard replicas are not allocated.

Warning

ElasticsearchDiskSpaceRunningLow

The cluster is expected to be out of disk space within the next 6 hours.

Critical

ElasticsearchHighFileDescriptorUsage

The cluster is predicted to be out of file descriptors within the next hour.

Warning

ElasticsearchJVMHeapUseHigh

The JVM Heap usage on the specified node is high.

Alert

ElasticsearchNodeDiskWatermarkReached

The specified node has hit the low watermark due to low free disk space. Shards can not be allocated to this node anymore. You should consider adding more disk space to the node.

Info

ElasticsearchNodeDiskWatermarkReached

The specified node has hit the high watermark due to low free disk space. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node.

Warning

ElasticsearchNodeDiskWatermarkReached

The specified node has hit the flood watermark due to low free disk space. Every index that has a shard allocated on this node is enforced a read-only block. The index block must be manually released when the disk use falls below the high watermark.

Critical

ElasticsearchJVMHeapUseHigh

The JVM Heap usage on the specified node is too high.

Alert

ElasticsearchWriteRequestsRejectionJumps

Elasticsearch is experiencing an increase in write rejections on the specified node. This node might not be keeping up with the indexing speed.

Warning

AggregatedLoggingSystemCPUHigh

The CPU used by the system on the specified node is too high.

Alert

ElasticsearchProcessCPUHigh

The CPU used by Elasticsearch on the specified node is too high.

Alert

12.4. Collecting logging data for Red Hat Support

When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.

The must-gather tool enables you to collect diagnostic information for project-level resources, cluster-level resources, and each of the OpenShift Logging components.

For prompt support, supply diagnostic information for both OpenShift Container Platform and OpenShift Logging.

Note

Do not use the hack/logging-dump.sh script. The script is no longer supported and does not collect data.

12.4.1. About the must-gather tool

The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues.

For your OpenShift Logging environment, must-gather collects the following information:

  • Project-level resources, including pods, configuration maps, service accounts, roles, role bindings, and events at the project level
  • Cluster-level resources, including nodes, roles, and role bindings at the cluster level
  • OpenShift Logging resources in the openshift-logging and openshift-operators-redhat namespaces, including health status for the log collector, the log store, and the log visualizer

When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in the current working directory.

12.4.2. Prerequisites

  • OpenShift Logging and Elasticsearch must be installed.

12.4.3. Collecting OpenShift Logging data

You can use the oc adm must-gather CLI command to collect information about your OpenShift Logging environment.

Procedure

To collect OpenShift Logging information with must-gather:

  1. Navigate to the directory where you want to store the must-gather information.
  2. Run the oc adm must-gather command against the OpenShift Logging image:

    $ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')
    Copy to Clipboard Toggle word wrap

    The must-gather tool creates a new directory that starts with must-gather.local within the current directory. For example: must-gather.local.4157245944708210408.

  3. Create a compressed file from the must-gather directory that was just created. For example, on a computer that uses a Linux operating system, run the following command:

    $ tar -cvaf must-gather.tar.gz must-gather.local.4157245944708210408
    Copy to Clipboard Toggle word wrap
  4. Attach the compressed file to your support case on the Red Hat Customer Portal.

12.5. Troubleshooting for Critical Alerts

12.5.1. Elasticsearch Cluster Health is Red

At least one primary shard and its replicas are not allocated to a node.

Troubleshooting

  1. Check the Elasticsearch cluster health and verify that the cluster status is red.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health
    Copy to Clipboard Toggle word wrap
  2. List the nodes that have joined the cluster.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v
    Copy to Clipboard Toggle word wrap
  3. List the Elasticsearch pods and compare them with the nodes in the command output from the previous step.

    oc -n openshift-logging get pods -l component=elasticsearch
    Copy to Clipboard Toggle word wrap
  4. If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.

    1. Confirm that Elasticsearch has an elected control plane node.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/master?v
      Copy to Clipboard Toggle word wrap
    2. Review the pod logs of the elected control plane node for issues.

      oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
      Copy to Clipboard Toggle word wrap
    3. Review the logs of nodes that have not joined the cluster for issues.

      oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
      Copy to Clipboard Toggle word wrap
  5. If all the nodes have joined the cluster, perform the following steps, check if the cluster is in the process of recovering.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/recovery?active_only=true
    Copy to Clipboard Toggle word wrap

    If there is no command output, the recovery process might be delayed or stalled by pending tasks.

  6. Check if there are pending tasks.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health |grep  number_of_pending_tasks
    Copy to Clipboard Toggle word wrap
  7. If there are pending tasks, monitor their status.

    If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors.

    Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.

  8. If it seems like the recovery has stalled, check if cluster.routing.allocation.enable is set to none.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty
    Copy to Clipboard Toggle word wrap
  9. If cluster.routing.allocation.enable is set to none, set it to all.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
    Copy to Clipboard Toggle word wrap
  10. Check which indices are still red.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
    Copy to Clipboard Toggle word wrap
  11. If any indices are still red, try to clear them by performing the following steps.

    1. Clear the cache.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
      Copy to Clipboard Toggle word wrap
    2. Increase the max allocation retries.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.allocation.max_retries":10}'
      Copy to Clipboard Toggle word wrap
    3. Delete all the scroll items.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_search/scroll/_all -X DELETE
      Copy to Clipboard Toggle word wrap
    4. Increase the timeout.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
      Copy to Clipboard Toggle word wrap
  12. If the preceding steps do not clear the red indices, delete the indices individually.

    1. Identify the red index name.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
      Copy to Clipboard Toggle word wrap
    2. Delete the red index.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_red_index_name> -X DELETE
      Copy to Clipboard Toggle word wrap
  13. If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.

    1. Check if the Elasticsearch JVM Heap usage is high.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty
      Copy to Clipboard Toggle word wrap

      In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM Heap usage.

    2. Check for high CPU utilization.

12.5.2. Elasticsearch Cluster Health is Yellow

Replica shards for at least one primary shard are not allocated to nodes.

Troubleshooting

  1. Increase the node count by adjusting nodeCount in the ClusterLogging CR.

12.5.3. Elasticsearch Node Disk Low Watermark Reached

Elasticsearch does not allocate shards to nodes that reach the low watermark.

Troubleshooting

  1. Identify the node on which Elasticsearch is deployed.

    oc -n openshift-logging get po -o wide
    Copy to Clipboard Toggle word wrap
  2. Check if there are unassigned shards.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep unassigned_shards
    Copy to Clipboard Toggle word wrap
  3. If there are unassigned shards, check the disk space on each node.

    for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
    Copy to Clipboard Toggle word wrap
  4. Check the nodes.node_name.fs field to determine the free disk space on that node.

    If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.

  5. Try to increase the disk space on all nodes.
  6. If increasing the disk space is not possible, try adding a new data node to the cluster.
  7. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
      Note

      If you are using a ClusterLogging CR, enter:

      oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
  8. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
      Copy to Clipboard Toggle word wrap
    2. Identify an old index that can be deleted.
    3. Delete the index.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
      Copy to Clipboard Toggle word wrap

12.5.4. Elasticsearch Node Disk High Watermark Reached

Elasticsearch attempts to relocate shards away from a node that has reached the high watermark.

Troubleshooting

  1. Identify the node on which Elasticsearch is deployed.

    oc -n openshift-logging get po -o wide
    Copy to Clipboard Toggle word wrap
  2. Check the disk space on each node.

    for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
    Copy to Clipboard Toggle word wrap
  3. Check if the cluster is rebalancing.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep relocating_shards
    Copy to Clipboard Toggle word wrap

    If the command output shows relocating shards, the High Watermark has been exceeded. The default value of the High Watermark is 90%.

    The shards relocate to a node with low disk usage that has not crossed any watermark threshold limits.

  4. To allocate shards to a particular node, free up some space.
  5. Try to increase the disk space on all nodes.
  6. If increasing the disk space is not possible, try adding a new data node to the cluster.
  7. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
      Note

      If you are using a ClusterLogging CR, enter:

      oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
  8. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
      Copy to Clipboard Toggle word wrap
    2. Identify an old index that can be deleted.
    3. Delete the index.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
      Copy to Clipboard Toggle word wrap

12.5.5. Elasticsearch Node Disk Flood Watermark Reached

Elasticsearch enforces a read-only index block on every index that has both of these conditions:

  • One or more shards are allocated to the node.
  • One or more disks exceed the flood stage.

Troubleshooting

  1. Check the disk space of the Elasticsearch node.

    for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
    Copy to Clipboard Toggle word wrap

    Check the nodes.node_name.fs field to determine the free disk space on that node.

  2. If the used disk percentage is above 95%, it signifies that the node has crossed the flood watermark. Writing is blocked for shards allocated on this particular node.
  3. Try to increase the disk space on all nodes.
  4. If increasing the disk space is not possible, try adding a new data node to the cluster.
  5. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
      Note

      If you are using a ClusterLogging CR, enter:

      oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
  6. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
      Copy to Clipboard Toggle word wrap
    2. Identify an old index that can be deleted.
    3. Delete the index.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
      Copy to Clipboard Toggle word wrap
  7. Continue freeing up and monitoring the disk space until the used disk space drops below 90%. Then, unblock write to this particular node.

    oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
    Copy to Clipboard Toggle word wrap

12.5.6. Elasticsearch JVM Heap Use is High

The Elasticsearch node JVM Heap memory used is above 75%.

Troubleshooting

Consider increasing the heap size.

12.5.7. Aggregated Logging System CPU is High

System CPU usage on the node is high.

Troubleshooting

Check the CPU of the cluster node. Consider allocating more CPU resources to the node.

12.5.8. Elasticsearch Process CPU is High

Elasticsearch process CPU usage on the node is high.

Troubleshooting

Check the CPU of the cluster node. Consider allocating more CPU resources to the node.

12.5.9. Elasticsearch Disk Space is Running Low

The Elasticsearch Cluster is predicted to be out of disk space within the next 6 hours based on current disk usage.

Troubleshooting

  1. Get the disk space of the Elasticsearch node.

    for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
    Copy to Clipboard Toggle word wrap
  2. In the command output, check the nodes.node_name.fs field to determine the free disk space on that node.
  3. Try to increase the disk space on all nodes.
  4. If increasing the disk space is not possible, try adding a new data node to the cluster.
  5. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
      Note

      If you are using a ClusterLogging CR, enter:

      oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
      Copy to Clipboard Toggle word wrap
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
  6. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
      Copy to Clipboard Toggle word wrap
    2. Identify an old index that can be deleted.
    3. Delete the index.

      oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
      Copy to Clipboard Toggle word wrap

12.5.10. Elasticsearch FileDescriptor Usage is high

Based on current usage trends, the predicted number of file descriptors on the node is insufficient.

Troubleshooting

Check and, if needed, configure the value of max_file_descriptors for each node, as described in the Elasticsearch File descriptors topic.

맨 위로 이동
Red Hat logoGithubredditYoutubeTwitter

자세한 정보

평가판, 구매 및 판매

커뮤니티

Red Hat 문서 정보

Red Hat을 사용하는 고객은 신뢰할 수 있는 콘텐츠가 포함된 제품과 서비스를 통해 혁신하고 목표를 달성할 수 있습니다. 최신 업데이트를 확인하세요.

보다 포괄적 수용을 위한 오픈 소스 용어 교체

Red Hat은 코드, 문서, 웹 속성에서 문제가 있는 언어를 교체하기 위해 최선을 다하고 있습니다. 자세한 내용은 다음을 참조하세요.Red Hat 블로그.

Red Hat 소개

Red Hat은 기업이 핵심 데이터 센터에서 네트워크 에지에 이르기까지 플랫폼과 환경 전반에서 더 쉽게 작업할 수 있도록 강화된 솔루션을 제공합니다.

Theme

© 2025 Red Hat