Chapter 9. Troubleshooting cluster logging
9.1. Viewing cluster logging status
You can view the status of the Cluster Logging Operator and of a number of cluster logging components.
9.1.1. Viewing the status of the Cluster Logging Operator
You can view the status of your Cluster Logging Operator.
Prerequisites
- Cluster logging and Elasticsearch must be installed.
Procedure
- Change to the openshift-logging project:

    $ oc project openshift-logging

- View the cluster logging status:

    $ oc get clusterlogging instance -o yaml

  Example output

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    ....
    status:
      collection:
        logs:
          fluentdStatus:
            daemonSet: fluentd
            nodes:
              fluentd-2rhqp: ip-10-0-169-13.ec2.internal
              fluentd-6fgjh: ip-10-0-165-244.ec2.internal
              fluentd-6l2ff: ip-10-0-128-218.ec2.internal
              fluentd-54nx5: ip-10-0-139-30.ec2.internal
              fluentd-flpnn: ip-10-0-147-228.ec2.internal
              fluentd-n2frh: ip-10-0-157-45.ec2.internal
            pods:
              failed: []
              notReady: []
              ready:
              - fluentd-2rhqp
              - fluentd-54nx5
              - fluentd-6fgjh
              - fluentd-6l2ff
              - fluentd-flpnn
              - fluentd-n2frh
      logstore:
        elasticsearchStatus:
        - ShardAllocationEnabled: all
          cluster:
            activePrimaryShards: 5
            activeShards: 5
            initializingShards: 0
            numDataNodes: 1
            numNodes: 1
            pendingTasks: 0
            relocatingShards: 0
            status: green
            unassignedShards: 0
          clusterName: elasticsearch
          nodeConditions:
            elasticsearch-cdm-mkkdys93-1:
          nodeCount: 1
          pods:
            client:
              failed:
              notReady:
              ready:
              - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
            data:
              failed:
              notReady:
              ready:
              - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
            master:
              failed:
              notReady:
              ready:
              - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
      visualization:
        kibanaStatus:
        - deployment: kibana
          pods:
            failed: []
            notReady: []
            ready:
            - kibana-7fb4fd4cc9-f2nls
          replicaSets:
          - kibana-7fb4fd4cc9
          replicas: 1
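The status stanza above can also be checked programmatically. The following is a minimal sketch, not part of the product tooling: it assumes the same resource was fetched with `-o json` instead of `-o yaml` and parsed into a Python dictionary, and the function names `unready_pods` and `_bad_pods` are hypothetical helpers introduced only for illustration.

```python
def _bad_pods(pods):
    """Return pod names listed under failed or notReady (either key may be null)."""
    pods = pods or {}
    return list(pods.get("failed") or []) + list(pods.get("notReady") or [])


def unready_pods(cluster_logging):
    """Map each logging component to its pods that are failed or not ready.

    `cluster_logging` is the parsed ClusterLogging resource (assumed to come
    from `oc get clusterlogging instance -o json`).
    """
    status = cluster_logging.get("status") or {}
    problems = {}

    def record(name, pods):
        bad = _bad_pods(pods)
        if bad:
            problems[name] = bad

    # Collector (Fluentd) pods.
    logs = (status.get("collection") or {}).get("logs") or {}
    record("fluentd", (logs.get("fluentdStatus") or {}).get("pods"))

    # Log store pods, grouped by role (client, data, master).
    for es in (status.get("logstore") or {}).get("elasticsearchStatus") or []:
        for role, pods in (es.get("pods") or {}).items():
            record("elasticsearch/" + role, pods)

    # Visualization (Kibana) pods.
    for kibana in (status.get("visualization") or {}).get("kibanaStatus") or []:
        record("kibana", kibana.get("pods"))

    return problems
```

An empty result means that every pod reported in the status is in the ready list. To try it against a live cluster, one option is to pipe the JSON output into a small script that calls `json.load(sys.stdin)` and passes the result to `unready_pods`.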
9.1.1.1. Example condition messages
The following are examples of some condition messages from the Status.Nodes section of the cluster logging instance.
A status message similar to the following indicates a node has exceeded the configured low watermark and no shard will be allocated to this node:
Example output
nodes:
- conditions:
  - lastTransitionTime: 2019-03-15T15:57:22Z
    message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
      be allocated on this node.
    reason: Disk Watermark Low
    status: "True"
    type: NodeStorage
  deploymentName: example-elasticsearch-clientdatamaster-0-1
  upgradeStatus: {}
A status message similar to the following indicates a node has exceeded the configured high watermark and shards will be relocated to other nodes:
Example output
nodes:
- conditions:
  - lastTransitionTime: 2019-03-15T16:04:45Z
    message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
      from this node.
    reason: Disk Watermark High
    status: "True"
    type: NodeStorage
  deploymentName: cluster-logging-operator
  upgradeStatus: {}
A status message similar to the following indicates the Elasticsearch node selector in the CR does not match any nodes in the cluster:
Example output
Elasticsearch Status:
  Shard Allocation Enabled: shard allocation unknown
  Cluster:
    Active Primary Shards: 0
    Active Shards: 0
    Initializing Shards: 0
    Num Data Nodes: 0
    Num Nodes: 0
    Pending Tasks: 0
    Relocating Shards: 0
    Status: cluster health unknown
    Unassigned Shards: 0
  Cluster Name: elasticsearch
  Node Conditions:
    elasticsearch-cdm-mkkdys93-1:
      Last Transition Time: 2019-06-26T03:37:32Z
      Message: 0/5 nodes are available: 5 node(s) didn't match node selector.
      Reason: Unschedulable
      Status: True
      Type: Unschedulable
    elasticsearch-cdm-mkkdys93-2:
  Node Count: 2
  Pods:
    Client:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
    Data:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
    Master:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
A status message similar to the following indicates that the requested PVC could not bind to PV:
Example output
Node Conditions:
  elasticsearch-cdm-mkkdys93-1:
    Last Transition Time: 2019-06-26T03:37:32Z
    Message: pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
    Reason: Unschedulable
    Status: True
    Type: Unschedulable
A status message similar to the following indicates that the Fluentd pods cannot be scheduled because the node selector did not match any nodes:
Example output
Status:
  Collection:
    Logs:
      Fluentd Status:
        Daemon Set: fluentd
        Nodes:
        Pods:
          Failed:
          Not Ready:
          Ready:
9.1.2. Viewing the status of cluster logging components
You can view the status for a number of cluster logging components.
Prerequisites
- Cluster logging and Elasticsearch must be installed.
Procedure
- Change to the openshift-logging project:

    $ oc project openshift-logging

- View the status of the cluster logging environment:

    $ oc describe deployment cluster-logging-operator

  Example output

    Name:           cluster-logging-operator

    ....

    Conditions:
      Type           Status  Reason
      ----           ------  ------
      Available      True    MinimumReplicasAvailable
      Progressing    True    NewReplicaSetAvailable

    ....

    Events:
      Type    Reason             Age   From                   Message
      ----    ------             ----  ----                   -------
      Normal  ScalingReplicaSet  62m   deployment-controller  Scaled up replica set cluster-logging-operator-574b8987df to 1

- View the status of the cluster logging replica set:

  - Get the name of a replica set:

      $ oc get replicaset

    Example output

      NAME                                      DESIRED   CURRENT   READY   AGE
      cluster-logging-operator-574b8987df       1         1         1       159m
      elasticsearch-cdm-uhr537yu-1-6869694fb    1         1         1       157m
      elasticsearch-cdm-uhr537yu-2-857b6d676f   1         1         1       156m
      elasticsearch-cdm-uhr537yu-3-5b6fdd8cfd   1         1         1       155m
      kibana-5bd5544f87                         1         1         1       157m

  - Get the status of the replica set:

      $ oc describe replicaset cluster-logging-operator-574b8987df

    Example output

      Name:           cluster-logging-operator-574b8987df

      ....

      Replicas:       1 current / 1 desired
      Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed

      ....

      Events:
        Type    Reason            Age   From                   Message
        ----    ------            ----  ----                   -------
        Normal  SuccessfulCreate  66m   replicaset-controller  Created pod: cluster-logging-operator-574b8987df-qjhqv
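The DESIRED and READY columns in the replica set listing can be compared in a script. This is a minimal sketch under stated assumptions: the listing was fetched with `oc get replicaset -o json` and parsed into a dictionary, and `replica_sets_not_ready` is a hypothetical helper name. It relies on the standard `spec.replicas` and `status.readyReplicas` fields of a ReplicaSet, where `readyReplicas` may be absent when no pod is ready.

```python
def replica_sets_not_ready(replicaset_list):
    """Return (name, desired, ready) for each replica set with fewer ready pods than desired.

    `replicaset_list` is the parsed List object (assumed to come from
    `oc get replicaset -o json`).
    """
    lagging = []
    for rs in replicaset_list.get("items") or []:
        desired = (rs.get("spec") or {}).get("replicas", 0)
        # readyReplicas is omitted from the status when it is zero.
        ready = (rs.get("status") or {}).get("readyReplicas", 0)
        if ready < desired:
            lagging.append((rs["metadata"]["name"], desired, ready))
    return lagging
```

An empty list corresponds to the example output above, where every replica set shows the same value under DESIRED and READY.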
9.2. Viewing the status of the log store
You can view the status of the Elasticsearch Operator and of a number of Elasticsearch components.
9.2.1. Viewing the status of the log store
You can view the status of your log store.
Prerequisites
- Cluster logging and Elasticsearch must be installed.
Procedure
- Change to the openshift-logging project:

    $ oc project openshift-logging

- To view the status:

  - Get the name of the log store instance:

      $ oc get Elasticsearch

    Example output

      NAME            AGE
      elasticsearch   5h9m

  - Get the log store status:

      $ oc get Elasticsearch <Elasticsearch-instance> -o yaml

    For example:

      $ oc get Elasticsearch elasticsearch -n openshift-logging -o yaml

    The output includes information similar to the following:

    Example output

      status: # 1
        cluster: # 2
          activePrimaryShards: 30
          activeShards: 60
          initializingShards: 0
          numDataNodes: 3
          numNodes: 3
          pendingTasks: 0
          relocatingShards: 0
          status: green
          unassignedShards: 0
        clusterHealth: ""
        conditions: [] # 3
        nodes: # 4
        - deploymentName: elasticsearch-cdm-zjf34ved-1
          upgradeStatus: {}
        - deploymentName: elasticsearch-cdm-zjf34ved-2
          upgradeStatus: {}
        - deploymentName: elasticsearch-cdm-zjf34ved-3
          upgradeStatus: {}
        pods: # 5
          client:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
          data:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
          master:
            failed: []
            notReady: []
            ready:
            - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
            - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
            - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
        shardAllocationEnabled: all

    1  In the output, the cluster status fields appear in the status stanza.
    2  The status of the log store:
       - The number of active primary shards.
       - The number of active shards.
       - The number of shards that are initializing.
       - The number of log store data nodes.
       - The total number of log store nodes.
       - The number of pending tasks.
       - The log store status: green, red, yellow.
       - The number of unassigned shards.
    3  Any status conditions, if present. The log store status indicates the reasons from the scheduler if a pod could not be placed. Any events related to the following conditions are shown:
       - Container Waiting for both the log store and proxy containers.
       - Container Terminated for both the log store and proxy containers.
       - Pod unschedulable. Also, a condition is shown for a number of issues, see Example condition messages.
    4  The log store nodes in the cluster, with upgradeStatus.
    5  The log store client, data, and master pods in the cluster, listed under the failed, notReady, or ready state.
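The fields described in the callouts above lend themselves to a quick automated check. The sketch below is illustrative only: it assumes the Elasticsearch resource was retrieved with `-o json` and parsed into a dictionary, and `log_store_warnings` is a hypothetical function name. Treating any nonzero shard or task counter as worth reporting is a choice made for this example, not a rule stated by the product.

```python
def log_store_warnings(es_cr):
    """List human-readable warnings derived from the Elasticsearch CR status.

    `es_cr` is the parsed resource (assumed to come from
    `oc get Elasticsearch elasticsearch -n openshift-logging -o json`).
    """
    status = es_cr.get("status") or {}
    cluster = status.get("cluster") or {}
    warnings = []

    # Log store status: green, yellow, or red.
    health = cluster.get("status")
    if health != "green":
        warnings.append(f"cluster health is {health!r}")

    # Counters that are zero in a settled cluster.
    for field in ("unassignedShards", "initializingShards", "relocatingShards", "pendingTasks"):
        if cluster.get(field, 0):
            warnings.append(f"{field} = {cluster[field]}")

    # Client, data, and master pods listed as failed or notReady.
    for role, pods in (status.get("pods") or {}).items():
        for state in ("failed", "notReady"):
            for pod in (pods or {}).get(state) or []:
                warnings.append(f"{role} pod {pod} is {state}")

    return warnings
```

For the example output shown above, with a green cluster and all pods under ready, the function returns an empty list.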
9.2.1.1. Example condition messages
The following are examples of some condition messages from the Status section of the Elasticsearch instance.
This status message indicates a node has exceeded the configured low watermark and no shard will be allocated to this node.
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T15:57:22Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
        be allocated on this node.
      reason: Disk Watermark Low
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
This status message indicates a node has exceeded the configured high watermark and shards will be relocated to other nodes.
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T16:04:45Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
        from this node.
      reason: Disk Watermark High
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
This status message indicates the log store node selector in the CR does not match any nodes in the cluster:
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-04-10T02:26:24Z
      message: '0/8 nodes are available: 8 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
This status message indicates that the log store CR uses a non-existent PVC.
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-04-10T05:55:51Z
      message: pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
      reason: Unschedulable
      status: True
      type: Unschedulable
This status message indicates that your log store cluster does not have enough nodes to support your log store redundancy policy.
status:
  clusterHealth: ""
  conditions:
  - lastTransitionTime: 2019-04-17T20:01:31Z
    message: Wrong RedundancyPolicy selected. Choose different RedundancyPolicy or
      add more nodes with data roles
    reason: Invalid Settings
    status: "True"
    type: InvalidRedundancy
This status message indicates your cluster has too many master nodes:
status:
  clusterHealth: green
  conditions:
  - lastTransitionTime: '2019-04-17T20:12:34Z'
    message: >-
      Invalid master nodes count. Please ensure there are no more than 3 total
      nodes with master roles
    reason: Invalid Settings
    status: 'True'
    type: InvalidMasters
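The examples above show conditions in two places: under `status.conditions` for cluster-wide problems and under `status.nodes[].conditions` for per-node problems. The following sketch gathers both into one list. It is an illustration under assumptions: the resource is parsed from `-o json` output, `active_conditions` is a hypothetical helper name, and the label "cluster" for cluster-wide conditions is chosen only for this example.

```python
def active_conditions(es_cr):
    """Return (scope, type, reason, message) for every condition whose status is True.

    `scope` is "cluster" for entries under status.conditions, or the node's
    deploymentName for entries under status.nodes[].conditions.
    """
    status = es_cr.get("status") or {}
    found = []

    def collect(scope, conditions):
        for cond in conditions or []:
            # The status value is normally the string "True"; str() also
            # covers a boolean produced by an unquoted YAML value.
            if str(cond.get("status")) == "True":
                found.append((scope, cond.get("type"), cond.get("reason"), cond.get("message")))

    collect("cluster", status.get("conditions"))
    for node in status.get("nodes") or []:
        collect(node.get("deploymentName", "<unknown>"), node.get("conditions"))

    return found
```

Each returned tuple carries the same `type`, `reason`, and `message` fields that appear in the example condition messages, so the results can be matched against them directly.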
9.2.2. Viewing the status of the log store components
You can view the status for a number of the log store components.
- Elasticsearch indices
You can view the status of the Elasticsearch indices.
  - Get the name of an Elasticsearch pod:

      $ oc get pods --selector component=elasticsearch -o name

    Example output

      pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
      pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
      pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7

  - Get the status of the indices:

      $ oc exec elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -- indices

    Example output

      Defaulting container name to elasticsearch.
      Use 'oc describe pod/elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -n openshift-logging' to see all of the containers in this pod.

      green  open  infra-000002                  S4QANnf1QP6NgCegfnrnbQ  3  1  119926  0  157  78
      green  open  audit-000001                  8_EQx77iQCSTzFOXtxRqFw  3  1  0       0  0    0
      green  open  .security                     iDjscH7aSUGhIdq0LheLBQ  1  1  5       0  0    0
      green  open  .kibana_-377444158_kubeadmin  yBywZ9GfSrKebz5gWBZbjw  3  1  1       0  0    0
      green  open  infra-000001                  z6Dpe__ORgiopEpW6Yl44A  3  1  871000  0  874  436
      green  open  app-000001                    hIrazQCeSISewG3c2VIvsQ  3  1  2453    0  3    1
      green  open  .kibana_1                     JCitcBMSQxKOvIq6iQW6wg  1  1  0       0  0    0
      green  open  .kibana_-1595131456_user1     gIYFIEGRRe-ka0W3okS-mQ  3  1  1       0  0    0
- Log store pods
You can view the status of the pods that host the log store.
  - Get the name of a pod:

      $ oc get pods --selector component=elasticsearch -o name

    Example output

      pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
      pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
      pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7

  - Get the status of a pod:

      $ oc describe pod elasticsearch-cdm-1godmszn-1-6f8495-vp4lw

    The output includes the following status information:

    Example output

      ....
      Status:             Running

      ....

      Containers:
        elasticsearch:
          Container ID:   cri-o://b7d44e0a9ea486e27f47763f5bb4c39dfd2
          State:          Running
            Started:      Mon, 08 Jun 2020 10:17:56 -0400
          Ready:          True
          Restart Count:  0
          Readiness:      exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3

      ....

        proxy:
          Container ID:   cri-o://3f77032abaddbb1652c116278652908dc01860320b8a4e741d06894b2f8f9aa1
          State:          Running
            Started:      Mon, 08 Jun 2020 10:18:38 -0400
          Ready:          True
          Restart Count:  0

      ....

      Conditions:
        Type              Status
        Initialized       True
        Ready             True
        ContainersReady   True
        PodScheduled      True

      ....

      Events:          <none>
- Log storage pod deployment configuration
You can view the status of the log store deployment configuration.
  - Get the name of a deployment configuration:

      $ oc get deployment --selector component=elasticsearch -o name

    Example output

      deployment.extensions/elasticsearch-cdm-1gon-1
      deployment.extensions/elasticsearch-cdm-1gon-2
      deployment.extensions/elasticsearch-cdm-1gon-3

  - Get the deployment configuration status:

      $ oc describe deployment elasticsearch-cdm-1gon-1

    The output includes the following status information:

    Example output

      ....
      Containers:
        elasticsearch:
          Image:      registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.3
          Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3

      ....

      Conditions:
        Type           Status   Reason
        ----           ------   ------
        Progressing    Unknown  DeploymentPaused
        Available      True     MinimumReplicasAvailable

      ....

      Events:          <none>
- Log store replica set
You can view the status of the log store replica set.
  - Get the name of a replica set:

      $ oc get replicaSet --selector component=elasticsearch -o name

    Example output

      replicaset.extensions/elasticsearch-cdm-1gon-1-6f8495
      replicaset.extensions/elasticsearch-cdm-1gon-2-5769cf
      replicaset.extensions/elasticsearch-cdm-1gon-3-f66f7d

  - Get the status of the replica set:

      $ oc describe replicaSet elasticsearch-cdm-1gon-1-6f8495

    The output includes the following status information:

    Example output

      ....
      Containers:
        elasticsearch:
          Image:      registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:4265742c7cdd85359140e2d7d703e4311b6497eec7676957f455d6908e7b1c25
          Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3

      ....

      Events:          <none>
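The index listing in the Elasticsearch indices item of this section can be scanned for indices that are not green. The sketch below is an assumption-laden illustration: it treats the first eight whitespace-separated columns as health, status, index name, UUID, primary shard count, replica count, document count, and deleted document count, which matches the default column order of the Elasticsearch `_cat/indices` API but is not confirmed here for the `indices` utility. The remaining two columns are ignored because their units are not stated. The names `IndexRow`, `parse_indices`, and `unhealthy` are hypothetical.

```python
from typing import NamedTuple


class IndexRow(NamedTuple):
    health: str
    status: str
    name: str
    uuid: str
    primaries: int
    replicas: int
    docs_count: int
    docs_deleted: int


def parse_indices(text):
    """Parse index rows, skipping banner lines such as 'Defaulting container name ...'."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Index rows start with a health color; anything else is not a row.
        if len(fields) < 8 or fields[0] not in ("green", "yellow", "red"):
            continue
        rows.append(IndexRow(fields[0], fields[1], fields[2], fields[3],
                             int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7])))
    return rows


def unhealthy(rows):
    """Return the names of indices whose health is not green."""
    return [row.name for row in rows if row.health != "green"]
```

Applied to the example output in this section, where every row starts with green, `unhealthy` returns an empty list.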
9.3. Understanding cluster logging alerts
All of the logging collector alerts are listed on the Alerting UI of the OpenShift Container Platform web console.
9.3.1. Viewing logging collector alerts
Alerts are shown in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI. Alerts are in one of the following states:
- Firing. The alert condition is true for the duration of the timeout. Click the Options menu at the end of the firing alert to view more information or silence the alert.
- Pending. The alert condition is currently true, but the timeout has not been reached.
- Not Firing. The alert is not currently triggered.
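The three states above follow from two inputs: whether the alert condition is currently true, and how long it has been true relative to the timeout. The following is a minimal sketch of that logic, not the alerting implementation itself. The function name `alert_state` and its parameters are hypothetical, and treating a duration exactly equal to the timeout as Firing is an assumption of this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_state(condition_true_since: Optional[datetime],
                now: datetime,
                timeout: timedelta) -> str:
    """Classify an alert as Firing, Pending, or Not Firing.

    `condition_true_since` is when the alert condition became true, or None
    if the condition is not currently true.
    """
    if condition_true_since is None:
        return "Not Firing"
    if now - condition_true_since >= timeout:
        # The condition has held for the whole timeout.
        return "Firing"
    # The condition is true, but the timeout has not been reached.
    return "Pending"
```

For example, with a 10 minute timeout, a condition that became true 3 minutes ago is Pending, and the same condition is Firing once 10 minutes have passed without it clearing.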
Procedure
To view cluster logging and other OpenShift Container Platform alerts:
- In the OpenShift Container Platform console, click Monitoring → Alerting.
- Click the Alerts tab. The alerts are listed, based on the filters selected.
Additional resources
- For more information on the Alerting UI, see Managing cluster alerts.
9.3.2. About logging collector alerts
The following alerts are generated by the logging collector. You can view these alerts in the OpenShift Container Platform web console, on the Alerts page of the Alerting UI.
| Alert | Message | Description | Severity |
|---|---|---|---|
|  |  | Fluentd is reporting a higher number of issues than the specified number, default 10. | Critical |
|  |  | Fluentd is reporting that Prometheus could not scrape a specific Fluentd instance. | Critical |
|  |  | Fluentd is reporting that it is overwhelmed. | Warning |
|  |  | Fluentd is reporting queue usage issues. | Critical |
9.3.3. About Elasticsearch alerting rules
You can view these alerting rules in Prometheus.
| Alert | Description | Severity |
|---|---|---|
| ElasticsearchClusterNotHealthy | Cluster health status has been RED for at least 2m. Cluster does not accept writes, shards may be missing or master node hasn’t been elected yet. | critical |
| ElasticsearchClusterNotHealthy | Cluster health status has been YELLOW for at least 20m. Some shard replicas are not allocated. | warning |
| ElasticsearchBulkRequestsRejectionJumps | High Bulk Rejection Ratio at node in cluster. This node may not be keeping up with the indexing speed. | warning |
| ElasticsearchNodeDiskWatermarkReached | Disk Low Watermark Reached at node in cluster. Shards can not be allocated to this node anymore. You should consider adding more disk space to the node. | alert |
| ElasticsearchNodeDiskWatermarkReached | Disk High Watermark Reached at node in cluster. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node. | high |
| ElasticsearchJVMHeapUseHigh | JVM Heap usage on the node in cluster is <value> | alert |
| AggregatedLoggingSystemCPUHigh | System CPU usage on the node in cluster is <value> | alert |
| ElasticsearchProcessCPUHigh | ES process CPU usage on the node in cluster is <value> | alert |
9.4. Troubleshooting the log curator
You can use information in this section for debugging log curation. Curator is used to remove data that is in the Elasticsearch index format prior to OpenShift Container Platform 4.5, and will be removed in a later release.
9.4.1. Troubleshooting log curation
You can use information in this section for debugging log curation. For example, if curator is in a failed state, but the log messages do not provide a reason, you could increase the log level and trigger a new job, instead of waiting for another scheduled run of the cron job.
Prerequisites
- Cluster logging and Elasticsearch must be installed.
Procedure
To enable the Curator debug log and trigger the next Curator iteration manually:
- Enable the Curator debug log:

    $ oc set env cronjob/curator CURATOR_LOG_LEVEL=DEBUG CURATOR_SCRIPT_LOG_LEVEL=DEBUG

  Specify the log level:
- CRITICAL. Curator displays only critical messages.
- ERROR. Curator displays only error and critical messages.
- WARNING. Curator displays only error, warning, and critical messages.
- INFO. Curator displays only informational, error, warning, and critical messages.
- DEBUG. Curator displays debug messages, in addition to all of the above.
The default value is INFO.
Note
Cluster logging uses the OpenShift Container Platform custom environment variable CURATOR_SCRIPT_LOG_LEVEL in OpenShift Container Platform wrapper scripts (run.sh and convert.py). The environment variable takes the same values as CURATOR_LOG_LEVEL for script debugging, as needed.
- Trigger the next Curator iteration:

    $ oc create job --from=cronjob/curator <job_name>

- Use the following commands to control the cron job:

  - Suspend a cron job:

      $ oc patch cronjob curator -p '{"spec":{"suspend":true}}'

  - Resume a cron job:

      $ oc patch cronjob curator -p '{"spec":{"suspend":false}}'

  - Change a cron job schedule:

      $ oc patch cronjob curator -p '{"spec":{"schedule":"0 0 * * *"}}'

    The schedule option accepts schedules in cron format.
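A cron-format schedule has five space-separated fields: minute, hour, day of month, month, and day of week. The sketch below labels the fields of a schedule string so that a value can be sanity-checked before it is patched into the cron job. The function name `describe_schedule` is hypothetical, and the check only counts fields; it does not validate ranges or special syntax.

```python
CRON_FIELDS = ("minute", "hour", "day of month", "month", "day of week")


def describe_schedule(schedule):
    """Label the five fields of a cron-format schedule string."""
    parts = schedule.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {schedule!r}"
        )
    return dict(zip(CRON_FIELDS, parts))
```

For the schedule used in the patch command above, `describe_schedule("0 0 * * *")` gives minute 0 and hour 0 with every day, month, and weekday, which is once a day at midnight.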
9.5. Collecting logging data for Red Hat Support
When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.
The must-gather tool enables you to collect diagnostic information for project-level resources, cluster-level resources, and each of the cluster logging components.
For prompt support, supply diagnostic information for both OpenShift Container Platform and cluster logging.
Do not use the hack/logging-dump.sh script. The script is no longer supported and does not collect data.
9.5.1. About the must-gather tool
The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues.
For your cluster logging environment, must-gather collects the following information:
- project-level resources, including pods, configuration maps, service accounts, roles, role bindings, and events at the project level
- cluster-level resources, including nodes, roles, and role bindings at the cluster level
- cluster logging resources in the openshift-logging and openshift-operators-redhat namespaces, including health status for the log collector, the log store, the curator, and the log visualizer
When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in the current working directory.
9.5.2. Prerequisites
- Cluster logging and Elasticsearch must be installed.
9.5.3. Collecting cluster logging data
You can use the oc adm must-gather CLI command to collect information about your cluster logging environment.
Procedure
To collect cluster logging information with must-gather:
- Navigate to the directory where you want to store the must-gather information.
- Run the oc adm must-gather command against the cluster logging image:

    $ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')

  The must-gather tool creates a new directory that starts with must-gather.local within the current directory. For example: must-gather.local.4157245944708210408.
- Create a compressed file from the must-gather directory that was just created. For example, on a computer that uses a Linux operating system, run the following command:

    $ tar -cvaf must-gather.tar.gz must-gather.local.4157245944708210408

- Attach the compressed file to your support case on the Red Hat Customer Portal.
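On systems without the tar command, the compression step can be approximated with the Python standard library. This is a sketch under assumptions, not a supported tool: the function name `archive_latest_must_gather` and the default archive name are hypothetical, and picking the most recently modified `must-gather.local.*` directory is a convenience chosen for this example.

```python
import tarfile
from pathlib import Path


def archive_latest_must_gather(workdir=".", archive_name="must-gather.tar.gz"):
    """Compress the newest must-gather.local.* directory in `workdir`.

    Returns the path of the created .tar.gz archive.
    """
    base = Path(workdir)
    candidates = [p for p in base.glob("must-gather.local.*") if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no must-gather.local.* directory in {base}")

    # Use the most recently modified directory, on the assumption that it is
    # the one the last must-gather run created.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)

    archive = base / archive_name
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the directory name as the top-level entry in the archive.
        tar.add(newest, arcname=newest.name)
    return archive
```

Run it from the directory where `oc adm must-gather` was executed, then attach the resulting file to the support case as described in the last step.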