Chapter 5. Troubleshooting logging
5.1. Viewing Logging status
You can view the status of the Red Hat OpenShift Logging Operator and other logging components.
5.1.1. Viewing the status of the Red Hat OpenShift Logging Operator
You can view the status of the Red Hat OpenShift Logging Operator.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
- Change to the openshift-logging project by running the following command:
  $ oc project openshift-logging
- Get the ClusterLogging instance status by running the following command:
  $ oc get clusterlogging instance -o yaml
5.1.1.1. Example condition messages
						The following are examples of some condition messages from the Status.Nodes section of the ClusterLogging instance.
					
A status message similar to the following indicates a node has exceeded the configured low watermark and no shard will be allocated to this node:
Example output
A status message similar to the following indicates a node has exceeded the configured high watermark and shards will be relocated to other nodes:
Example output
A status message similar to the following indicates the Elasticsearch node selector in the CR does not match any nodes in the cluster:
Example output
A status message similar to the following indicates that the requested PVC could not bind to PV:
Example output
A status message similar to the following indicates that the Fluentd pods cannot be scheduled because the node selector did not match any nodes:
Example output
5.1.2. Viewing the status of logging components
You can view the status for a number of logging components.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
- Change to the openshift-logging project:
  $ oc project openshift-logging
- View the status of the logging environment:
  $ oc describe deployment cluster-logging-operator
- View the status of the logging replica set:
  - Get the name of a replica set:
    $ oc get replicaset
  - Get the status of the replica set:
    $ oc describe replicaset cluster-logging-operator-574b8987df
 
5.2. Troubleshooting log forwarding
5.2.1. Redeploying Fluentd pods
					When you create a ClusterLogForwarder custom resource (CR), if the Red Hat OpenShift Logging Operator does not redeploy the Fluentd pods automatically, you can delete the Fluentd pods to force them to redeploy.
				
Prerequisites
- You have created a ClusterLogForwarder custom resource (CR) object.
Procedure
- Delete the Fluentd pods to force them to redeploy by running the following command:
  $ oc delete pod --selector logging-infra=collector
5.2.2. Troubleshooting Loki rate limit errors
					If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.
				
These errors can occur during normal operation. For example, when you add logging to a cluster that already has some logs, rate limit errors might occur while logging tries to ingest all of the existing log entries. In this case, if new logs are added at a rate below the total rate limit, the historical data is eventually ingested and the rate limit errors resolve without user intervention.
					In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).
				
						The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.
					
Conditions
- The Log Forwarder API is configured to forward logs to Loki.
- Your system sends a block of messages that is larger than 2 MB to Loki.
- After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:
  429 Too Many Requests Ingestion rate limit exceeded
  Example Vector error message
  2023-08-25T16:08:49.301780Z WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true
  Example Fluentd error message
  2023-08-30 14:52:15 +0000 [warn]: [default_loki_infra] failed to flush the buffer. retry_times=2 next_retry_time=2023-08-30 14:52:19 +0000 chunk="604251225bf5378ed1567231a1c03b8b" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded for user infrastructure (limit: 4194304 bytes/sec) while attempting to ingest '4082' lines totaling '7820025' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"
  The error is also visible on the receiving end. For example, in the LokiStack ingester pod:
  Example Loki ingester error message
  level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream
Procedure
- Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR.
  - The ingestionBurstSize field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum log size expected in a single push request. Single requests that are larger than the ingestionBurstSize value are not permitted.
  - The ingestionRate field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.
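As a rough sizing aid, the following shell sketch rounds the push size reported in the Fluentd example message (7820025 bytes) up to whole MB and prints a JSON merge patch for the two fields. It assumes that 1 MB is 1048576 bytes, which is consistent with the 4194304 bytes/sec limit in that message. The ingestionRate value of 8, the LokiStack name logging-loki in the closing comment, and the spec.limits.global.ingestion field path are assumptions for illustration; check them against your own LokiStack CR before applying anything.

```shell
# Convert a byte count to whole MB, rounding up.
# Assumption: 1 MB = 1048576 bytes.
bytes_to_mb_ceil() {
  local bytes=$1
  echo $(( (bytes + 1048575) / 1048576 ))
}

# Print a JSON merge patch that sets both ingestion limits (values in MB).
# Assumption: the fields live under spec.limits.global.ingestion.
lokistack_ingestion_patch() {
  local burst_mb=$1 rate_mb=$2
  printf '{"spec":{"limits":{"global":{"ingestion":{"ingestionBurstSize":%d,"ingestionRate":%d}}}}}\n' \
    "$burst_mb" "$rate_mb"
}

# Largest single push seen in the Fluentd example message.
burst=$(bytes_to_mb_ceil 7820025)

# The rate of 8 MB is an illustrative value, not a recommendation.
lokistack_ingestion_patch "$burst" 8

# To apply the patch (hypothetical LokiStack name "logging-loki"):
# oc -n openshift-logging patch lokistack logging-loki --type merge \
#   -p "$(lokistack_ingestion_patch "$burst" 8)"
```

Rounding up matters here because ingestionBurstSize is a hard limit: a value rounded down would still reject the largest push.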
 
5.3. Troubleshooting logging alerts
You can use the following procedures to troubleshoot logging alerts on your cluster.
5.3.1. Elasticsearch cluster health status is red
At least one primary shard and its replicas are not allocated to a node. Use the following procedure to troubleshoot this alert.
					Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable. If you want to copy and paste the commands directly from this documentation, you must set this variable to a value that is valid for your Elasticsearch cluster.
				
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
					Choose one of the pods listed and set the $ES_POD_NAME variable, by running the following command:
				
$ export ES_POD_NAME=<elasticsearch_pod_name>
					You can now use the $ES_POD_NAME variable in commands.
				
Procedure
- Check the Elasticsearch cluster health and verify that the cluster status is red by running the following command:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- health
- List the nodes that have joined the cluster by running the following command:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cat/nodes?v
- List the Elasticsearch pods and compare them with the nodes in the command output from the previous step, by running the following command:
  $ oc -n openshift-logging get pods -l component=elasticsearch
- If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
  - Confirm that Elasticsearch has an elected master node by running the following command and observing the output:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=_cat/master?v
  - Review the pod logs of the elected master node for issues by running the following command and observing the output:
    $ oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
  - Review the logs of nodes that have not joined the cluster for issues by running the following command and observing the output:
    $ oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
 
- If all the nodes have joined the cluster, check if the cluster is in the process of recovering by running the following command and observing the output:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cat/recovery?active_only=true
  If there is no command output, the recovery process might be delayed or stalled by pending tasks.
- Check if there are pending tasks by running the following command and observing the output:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- health | grep number_of_pending_tasks
- If there are pending tasks, monitor their status. If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors. Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.
- If it seems like the recovery has stalled, check if the cluster.routing.allocation.enable value is set to none, by running the following command and observing the output:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cluster/settings?pretty
- If the cluster.routing.allocation.enable value is set to none, set it to all, by running the following command:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cluster/settings?pretty \
    -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
- Check if any indices are still red by running the following command and observing the output:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cat/indices?v
- If any indices are still red, try to clear them by performing the following steps.
  - Clear the cache by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
  - Increase the max allocation retries by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name>/_settings?pretty \
      -X PUT -d '{"index.allocation.max_retries":10}'
  - Delete all the scroll items by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=_search/scroll/_all -X DELETE
  - Increase the timeout by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name>/_settings?pretty \
      -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
 
- If the preceding steps do not clear the red indices, delete the indices individually.
  - Identify the red index name by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=_cat/indices?v
  - Delete the red index by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_red_index_name> -X DELETE
 
- If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
  - Check if the Elasticsearch JVM Heap usage is high by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=_nodes/stats?pretty
    In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM Heap usage.
  - Check for high CPU utilization. For more information about CPU utilization, see the OpenShift Container Platform "Reviewing monitoring dashboards" documentation.
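When the _cat/indices?v listing is long, a small filter can pull out just the red index names before you clear or delete them. The following is a minimal sketch that assumes the usual column layout of that listing, with a header row, health in the first column, and the index name in the third. The sample rows and index names are made up for illustration.

```shell
# Print the names of indices whose health column is "red".
# Reads _cat/indices?v output on stdin and skips the header row.
red_indices() {
  awk 'NR > 1 && $1 == "red" { print $3 }'
}

# Illustrative sample only; these index names and values are hypothetical.
sample='health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   app-000001   aaaaaaaaaaaaaaaaaaaaaa   3   1       1000            0      1.2mb        600kb
red    open   infra-000002 bbbbbbbbbbbbbbbbbbbbbb   3   1
red    open   audit-000001 cccccccccccccccccccccc   3   1'

printf '%s\n' "$sample" | red_indices

# Against a live cluster, the same filter could be applied like this:
# oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
#   -- es_util --query=_cat/indices?v | red_indices
```

Each name that the filter prints can then be substituted for <elasticsearch_index_name> or <elasticsearch_red_index_name> in the commands above.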
 
5.3.2. Elasticsearch cluster health status is yellow
					Replica shards for at least one primary shard are not allocated to nodes. Increase the node count by adjusting the nodeCount value in the ClusterLogging custom resource (CR).
				
5.3.3. Elasticsearch node disk low watermark reached
Elasticsearch does not allocate shards to nodes that reach the low watermark.
					Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable. If you want to copy and paste the commands directly from this documentation, you must set this variable to a value that is valid for your Elasticsearch cluster.
				
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
					Choose one of the pods listed and set the $ES_POD_NAME variable, by running the following command:
				
$ export ES_POD_NAME=<elasticsearch_pod_name>
					You can now use the $ES_POD_NAME variable in commands.
				
Procedure
- Identify the node on which Elasticsearch is deployed by running the following command:
  $ oc -n openshift-logging get po -o wide
- Check if there are unassigned shards by running the following command:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cluster/health?pretty | grep unassigned_shards
- If there are unassigned shards, check the disk space on each node, by running the following command:
  $ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
    do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
    -- df -h /elasticsearch/persistent; done
- In the command output, check the Use column to determine the used disk percentage on that node.
  If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.
- To check the current redundancyPolicy, run the following command:
  $ oc -n openshift-logging get es elasticsearch \
    -o jsonpath='{.spec.redundancyPolicy}'
  If you are using a ClusterLogging resource on your cluster, run the following command:
  $ oc -n openshift-logging get cl \
    -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
  If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
- If the preceding steps do not fix the issue, delete the old indices.
  - Check the status of all indices on Elasticsearch by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
  - Identify an old index that can be deleted.
  - Delete the index by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name> -X DELETE
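The comparison of the Use column with the watermark thresholds can be scripted. The following sketch reads df output, strips the percent sign from the fifth column, and compares the value with the 85% low watermark and the 90% default high watermark named in this chapter. The function names and the sample df line are hypothetical, and the sketch assumes that each filesystem fits on one output line.

```shell
# Print the Use% value, without the percent sign, for each data row of df output on stdin.
df_use_percent() {
  awk 'NR > 1 { sub(/%/, "", $5); print $5 }'
}

# Compare a used-disk percentage with the thresholds from this chapter:
# above 85% is past the low watermark, above 90% is past the default high watermark.
classify_disk_usage() {
  local pct=$1
  if [ "$pct" -gt 90 ]; then
    echo "above high watermark"
  elif [ "$pct" -gt 85 ]; then
    echo "above low watermark"
  else
    echo "below low watermark"
  fi
}

# Illustrative sample; the device name and sizes are made up.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     94G   82G   12G  88% /elasticsearch/persistent'

used=$(printf '%s\n' "$sample" | df_use_percent)
classify_disk_usage "$used"
```

If your cluster overrides the default watermark settings, adjust the two thresholds in classify_disk_usage to match.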
 
5.3.4. Elasticsearch node disk high watermark reached
Elasticsearch attempts to relocate shards away from a node that has reached the high watermark to a node with low disk usage that has not crossed any watermark threshold limits.
To allocate shards to a particular node, you must free up some space on that node. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
					Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable. If you want to copy and paste the commands directly from this documentation, you must set this variable to a value that is valid for your Elasticsearch cluster.
				
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
					Choose one of the pods listed and set the $ES_POD_NAME variable, by running the following command:
				
$ export ES_POD_NAME=<elasticsearch_pod_name>
					You can now use the $ES_POD_NAME variable in commands.
				
Procedure
- Identify the node on which Elasticsearch is deployed by running the following command:
  $ oc -n openshift-logging get po -o wide
- Check the disk space on each node:
  $ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
    do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
    -- df -h /elasticsearch/persistent; done
- Check if the cluster is rebalancing:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_cluster/health?pretty | grep relocating_shards
  If the command output shows relocating shards, the high watermark has been exceeded. The default value of the high watermark is 90%.
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
- To check the current redundancyPolicy, run the following command:
  $ oc -n openshift-logging get es elasticsearch \
    -o jsonpath='{.spec.redundancyPolicy}'
  If you are using a ClusterLogging resource on your cluster, run the following command:
  $ oc -n openshift-logging get cl \
    -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
  If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
- If the preceding steps do not fix the issue, delete the old indices.
  - Check the status of all indices on Elasticsearch by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
  - Identify an old index that can be deleted.
  - Delete the index by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name> -X DELETE
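The grep in the rebalancing check returns the raw JSON line, which is awkward to test in a script. As a sketch, the helper below extracts one numeric field from pretty-printed _cluster/health output so that the value can be compared directly. The helper name health_field and the sample JSON values are hypothetical.

```shell
# Extract one numeric field from pretty-printed cluster health JSON on stdin.
# $1 is the field name, for example relocating_shards.
health_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Illustrative sample; the values are made up.
sample='{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "relocating_shards" : 2,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0
}'

relocating=$(printf '%s\n' "$sample" | health_field relocating_shards)
if [ "$relocating" -gt 0 ]; then
  echo "rebalancing: $relocating shards relocating"
else
  echo "no shards relocating"
fi

# Against a live cluster, the input could come from:
# oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
#   -- es_util --query=_cluster/health?pretty | health_field relocating_shards
```

The same helper works for the other numeric fields in that output, such as unassigned_shards and number_of_pending_tasks.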
 
5.3.5. Elasticsearch node disk flood watermark reached
Elasticsearch enforces a read-only index block on every index that has both of these conditions:
- One or more shards are allocated to the node.
- One or more disks exceed the flood stage.
Use the following procedure to troubleshoot this alert.
					Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable. If you want to copy and paste the commands directly from this documentation, you must set this variable to a value that is valid for your Elasticsearch cluster.
				
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
					Choose one of the pods listed and set the $ES_POD_NAME variable, by running the following command:
				
$ export ES_POD_NAME=<elasticsearch_pod_name>
					You can now use the $ES_POD_NAME variable in commands.
				
Procedure
- Get the disk space of the Elasticsearch node:
  $ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
    do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
    -- df -h /elasticsearch/persistent; done
- In the command output, check the Avail column to determine the free disk space on that node.
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
- To check the current redundancyPolicy, run the following command:
  $ oc -n openshift-logging get es elasticsearch \
    -o jsonpath='{.spec.redundancyPolicy}'
  If you are using a ClusterLogging resource on your cluster, run the following command:
  $ oc -n openshift-logging get cl \
    -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
  If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
- If the preceding steps do not fix the issue, delete the old indices.
  - Check the status of all indices on Elasticsearch by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
  - Identify an old index that can be deleted.
  - Delete the index by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name> -X DELETE
 
- Continue freeing up and monitoring the disk space. After the used disk space drops below 90%, unblock writing to this node by running the following command:
  $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
    -- es_util --query=_all/_settings?pretty \
    -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
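Because the read-only block should be cleared only after disk usage has dropped, it can help to gate the unblock command on the readings from every pod. The following sketch takes one used-disk percentage per line and reports whether all of them are below a limit. The function name unblock_decision and the sample readings are hypothetical, and the 90% limit is the figure given in the final step of the procedure.

```shell
# Read one used-disk percentage per line on stdin.
# Print "unblock" if every value is below the limit in $1, otherwise print "wait".
unblock_decision() {
  awk -v limit="$1" '
    $1 + 0 >= limit { over = 1 }
    END { print (over ? "wait" : "unblock") }
  '
}

# Hypothetical Use% readings, one per Elasticsearch pod.
decision=$(printf '%s\n' 72 88 64 | unblock_decision 90)
echo "$decision"

# When the decision is "unblock", run the command from the final step of the procedure.
```

Checking every pod, rather than only the one that raised the alert, is a conservative choice made for this sketch, not a requirement stated in the procedure.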
5.3.6. Elasticsearch JVM heap usage is high
The Elasticsearch node Java virtual machine (JVM) heap memory used is above 75%. Consider increasing the heap size.
5.3.7. Aggregated logging system CPU is high
System CPU usage on the node is high. Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
5.3.8. Elasticsearch process CPU is high
Elasticsearch process CPU usage on the node is high. Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
5.3.9. Elasticsearch disk space is running low
Elasticsearch is predicted to run out of disk space within the next 6 hours based on current disk usage. Use the following procedure to troubleshoot this alert.
Procedure
- Get the disk space of the Elasticsearch node: - for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \ do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \ -- df -h /elasticsearch/persistent; done- $ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \ do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \ -- df -h /elasticsearch/persistent; done- Copy to Clipboard Copied! - Toggle word wrap Toggle overflow 
- In the command output, check the - Availcolumn to determine the free disk space on that node.- Example output - Copy to Clipboard Copied! - Toggle word wrap Toggle overflow 
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
- To check the current redundancyPolicy, run the following command:
  $ oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
  If you are using a ClusterLogging resource on your cluster, run the following command:
  $ oc -n openshift-logging get cl \
    -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
  If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
- If the preceding steps do not fix the issue, delete the old indices.
  - Check the status of all indices on Elasticsearch by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
  - Identify an old index that can be deleted.
  - Delete the index by running the following command:
    $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
      -- es_util --query=<elasticsearch_index_name> -X DELETE
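The redundancy policy comparison in the procedure above can be expressed as a small check. This is a sketch under assumptions: the `redundancy_above_single` helper name is hypothetical, and the policy names other than `SingleRedundancy` (`ZeroRedundancy`, `MultipleRedundancy`, `FullRedundancy`) do not appear in this section and should be verified against your installed version.

```shell
# Hypothetical helper: succeeds when the given redundancyPolicy value is
# higher than SingleRedundancy, that is, when lowering it could free space.
# Assumed ordering: ZeroRedundancy < SingleRedundancy < MultipleRedundancy
# < FullRedundancy.
redundancy_above_single() {
  case "$1" in
    MultipleRedundancy|FullRedundancy) return 0 ;;
    *) return 1 ;;   # Zero, Single, empty, or unknown values
  esac
}
```

For example, pass the value printed by the `oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'` command as the first argument, and edit the policy only if the helper succeeds.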
 
5.3.10. Elasticsearch FileDescriptor usage is high
					Based on current usage trends, the predicted number of file descriptors on the node is insufficient. Check the value of max_file_descriptors for each node as described in the Elasticsearch File Descriptors documentation.
				
5.4. Viewing the status of the Elasticsearch log store
You can view the status of the OpenShift Elasticsearch Operator and of a number of Elasticsearch components.
5.4.1. Viewing the status of the Elasticsearch log store
You can view the status of the Elasticsearch log store.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
- Change to the openshift-logging project by running the following command:
  $ oc project openshift-logging
- To view the status:
  - Get the name of the Elasticsearch log store instance by running the following command:
    $ oc get Elasticsearch
    Example output
    NAME            AGE
    elasticsearch   5h9m
  - Get the Elasticsearch log store status by running the following command:
    $ oc get Elasticsearch <Elasticsearch-instance> -o yaml
    For example:
    $ oc get Elasticsearch elasticsearch -n openshift-logging -o yaml
    The output includes information similar to the following:
    - The cluster status fields, which appear in the status stanza.
    - The status of the Elasticsearch log store:
      - The number of active primary shards.
      - The number of active shards.
      - The number of shards that are initializing.
      - The number of Elasticsearch log store data nodes.
      - The total number of Elasticsearch log store nodes.
      - The number of pending tasks.
      - The Elasticsearch log store status: green, red, or yellow.
      - The number of unassigned shards.
    - Any status conditions, if present. The Elasticsearch log store status indicates the reasons from the scheduler if a pod could not be placed. Any events related to the following conditions are shown:
      - Container Waiting for both the Elasticsearch log store and proxy containers.
      - Container Terminated for both the Elasticsearch log store and proxy containers.
      - Pod unschedulable. Also, a condition is shown for a number of issues; see Example condition messages.
    - The Elasticsearch log store nodes in the cluster, with upgradeStatus.
    - The Elasticsearch log store client, data, and master pods in the cluster, listed under the failed, notReady, or ready state.
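The green, yellow, and red status values follow the usual Elasticsearch cluster health meaning. The following sketch turns a status value into a short explanation. The `describe_es_health` helper name is hypothetical, and the jsonpath shown in the comment (`.status.cluster.status`) is an assumption about where the field sits in the CR, so confirm it against the YAML output of your own instance.

```shell
# Hypothetical helper: prints what an Elasticsearch cluster status value
# means and succeeds only for green.
# Assumed way to obtain the value (field path not confirmed by this section):
#   oc get Elasticsearch elasticsearch -n openshift-logging \
#     -o jsonpath='{.status.cluster.status}'
describe_es_health() {
  case "$1" in
    green)  echo "green: all primary and replica shards are allocated"; return 0 ;;
    yellow) echo "yellow: all primary shards are allocated, some replicas are not" ;;
    red)    echo "red: at least one primary shard is not allocated" ;;
    *)      echo "unknown status: $1" ;;
  esac
  return 1
}
```

A non-green result is a cue to look at the unassigned shard count and the status conditions described above.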
 
 
5.4.1.1. Example condition messages
						The following are examples of some condition messages from the Status section of the Elasticsearch instance.
					
The following status message indicates that a node has exceeded the configured low watermark, and no shard will be allocated to this node.
The following status message indicates that a node has exceeded the configured high watermark, and shards will be relocated to other nodes.
The following status message indicates that the Elasticsearch log store node selector in the custom resource (CR) does not match any nodes in the cluster:
The following status message indicates that the Elasticsearch log store CR uses a non-existent persistent volume claim (PVC).
The following status message indicates that your Elasticsearch log store cluster does not have enough nodes to support the redundancy policy.
This status message indicates your cluster has too many control plane nodes:
The following status message indicates that Elasticsearch storage does not support the change you tried to make.
For example:
						The reason and type fields specify the type of unsupported change:
					
- StorageClassNameChangeIgnored
- Unsupported change to the storage class name.
- StorageSizeChangeIgnored
- Unsupported change to the storage size.
- StorageStructureChangeIgnored
- Unsupported change between ephemeral and persistent storage structures.
  Important: If you try to configure the ClusterLogging CR to switch from ephemeral to persistent storage, the OpenShift Elasticsearch Operator creates a persistent volume claim (PVC) but does not create a persistent volume (PV). To clear the StorageStructureChangeIgnored status, you must revert the change to the ClusterLogging CR and delete the PVC.
5.4.2. Viewing the status of the log store components
You can view the status for a number of the log store components.
- Elasticsearch indices
- You can view the status of the Elasticsearch indices.
  - Get the name of an Elasticsearch pod:
    $ oc get pods --selector component=elasticsearch -o name
    Example output
    pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
    pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
  - Get the status of the indices:
    $ oc exec elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -- indices
 
- Log store pods
- You can view the status of the pods that host the log store.
  - Get the name of a pod:
    $ oc get pods --selector component=elasticsearch -o name
    Example output
    pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
    pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
  - Get the status of a pod:
    $ oc describe pod elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
    The output includes status information for the pod.
 
- Log storage pod deployment configuration
- You can view the status of the log store deployment configuration.
  - Get the name of a deployment configuration:
    $ oc get deployment --selector component=elasticsearch -o name
    Example output
    deployment.extensions/elasticsearch-cdm-1gon-1
    deployment.extensions/elasticsearch-cdm-1gon-2
    deployment.extensions/elasticsearch-cdm-1gon-3
  - Get the deployment configuration status:
    $ oc describe deployment elasticsearch-cdm-1gon-1
    The output includes status information for the deployment.
 
- Log store replica set
- You can view the status of the log store replica set.
  - Get the name of a replica set:
    $ oc get replicaSet --selector component=elasticsearch -o name
    replicaset.extensions/elasticsearch-cdm-1gon-1-6f8495
    replicaset.extensions/elasticsearch-cdm-1gon-2-5769cf
    replicaset.extensions/elasticsearch-cdm-1gon-3-f66f7d
  - Get the status of the replica set:
    $ oc describe replicaSet elasticsearch-cdm-1gon-1-6f8495
    The output includes status information for the replica set.
 
5.4.3. Elasticsearch cluster status
A dashboard in the Observe section of the OpenShift Container Platform web console displays the status of the Elasticsearch cluster.
To get the status, visit the dashboard at <cluster_url>/monitoring/dashboards/grafana-dashboard-cluster-logging.
				
Elasticsearch status fields
- eo_elasticsearch_cr_cluster_management_state
- Shows whether the Elasticsearch cluster is in a managed or unmanaged state. For example:
  eo_elasticsearch_cr_cluster_management_state{state="managed"} 1
  eo_elasticsearch_cr_cluster_management_state{state="unmanaged"} 0
- eo_elasticsearch_cr_restart_total
- Shows the number of times the Elasticsearch nodes have restarted for certificate restarts, rolling restarts, or scheduled restarts. For example:
  eo_elasticsearch_cr_restart_total{reason="cert_restart"} 1
  eo_elasticsearch_cr_restart_total{reason="rolling_restart"} 1
  eo_elasticsearch_cr_restart_total{reason="scheduled_restart"} 3
- es_index_namespaces_total
- Shows the total number of Elasticsearch index namespaces. For example:
  Total number of Namespaces.
  es_index_namespaces_total 5
- es_index_document_count
- Shows the number of records for each namespace. For example:
  es_index_document_count{namespace="namespace_1"} 25
  es_index_document_count{namespace="namespace_2"} 10
  es_index_document_count{namespace="namespace_3"} 5
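If you have these metrics as plain text, the per-namespace counts can be added up with a short filter. This is an illustrative sketch only: the `total_documents` helper name is hypothetical, and it assumes one `name{labels} value` sample per line, as in the examples above.

```shell
# Hypothetical helper: reads metric samples on stdin and prints the sum of
# all es_index_document_count values across namespaces.
total_documents() {
  awk '
    /^es_index_document_count[{]/ { sum += $NF }  # last field is the value
    END { print sum + 0 }                         # print 0 when nothing matched
  '
}
```

With the three sample lines shown for es_index_document_count, the helper prints 40.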
The "Secret Elasticsearch fields are either missing or empty" message
						If Elasticsearch is missing the admin-cert, admin-key, logging-es.crt, or logging-es.key files, the dashboard shows a status message similar to the following example:
					
"message": "Secret \"elasticsearch\" fields are either missing or empty: [admin-cert, admin-key, logging-es.crt, logging-es.key]",
"reason": "Missing Required Secrets",
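To find out which of the four fields is absent, you can compare the key names present in the secret against the required list. The sketch below is an assumption-laden illustration: the `missing_secret_keys` helper name is hypothetical, it checks only for presence (not for empty values), and the go-template command in the comment is one possible way to list the key names.

```shell
# Hypothetical helper: takes the key names found in the secret as arguments
# and prints the required keys that are absent, separated by spaces.
# One possible way to list the key names (assumption, verify on your cluster):
#   oc -n openshift-logging get secret elasticsearch \
#     -o go-template='{{range $k, $v := .data}}{{$k}} {{end}}'
missing_secret_keys() {
  missing=""
  for required in admin-cert admin-key logging-es.crt logging-es.key; do
    found=no
    for present in "$@"; do
      if [ "$present" = "$required" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      missing="$missing $required"
    fi
  done
  echo "${missing# }"   # strip the leading space before printing
}
```

An empty line of output means that all four required key names are present in the secret.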