Chapter 3. Troubleshooting logging
3.1. Viewing Logging status
You can view the status of the Red Hat OpenShift Logging Operator and other logging components.
3.1.1. Viewing the status of the Red Hat OpenShift Logging Operator
You can view the status of the Red Hat OpenShift Logging Operator.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
Change to the openshift-logging project by running the following command:
$ oc project openshift-logging
Get the ClusterLogging instance status by running the following command:
$ oc get clusterlogging instance -o yaml
Example output
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
# ...
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: fluentd
        nodes:
          collector-2rhqp: ip-10-0-169-13.ec2.internal
          collector-6fgjh: ip-10-0-165-244.ec2.internal
          collector-6l2ff: ip-10-0-128-218.ec2.internal
          collector-54nx5: ip-10-0-139-30.ec2.internal
          collector-flpnn: ip-10-0-147-228.ec2.internal
          collector-n2frh: ip-10-0-157-45.ec2.internal
        pods:
          failed: []
          notReady: []
          ready:
          - collector-2rhqp
          - collector-54nx5
          - collector-6fgjh
          - collector-6l2ff
          - collector-flpnn
          - collector-n2frh
  logstore:
    elasticsearchStatus:
    - ShardAllocationEnabled: all
      cluster:
        activePrimaryShards: 5
        activeShards: 5
        initializingShards: 0
        numDataNodes: 1
        numNodes: 1
        pendingTasks: 0
        relocatingShards: 0
        status: green
        unassignedShards: 0
      clusterName: elasticsearch
      nodeConditions:
        elasticsearch-cdm-mkkdys93-1:
      nodeCount: 1
      pods:
        client:
          failed:
          notReady:
          ready:
          - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
        data:
          failed:
          notReady:
          ready:
          - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
        master:
          failed:
          notReady:
          ready:
          - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
  visualization:
    kibanaStatus:
    - deployment: kibana
      pods:
        failed: []
        notReady: []
        ready:
        - kibana-7fb4fd4cc9-f2nls
      replicaSets:
      - kibana-7fb4fd4cc9
      replicas: 1
3.1.1.1. Example condition messages
The following are examples of some condition messages from the Status.Nodes section of the ClusterLogging instance.
A status message similar to the following indicates that a node has exceeded the configured low watermark and that no shards will be allocated to this node:
Example output
nodes:
- conditions:
  - lastTransitionTime: 2019-03-15T15:57:22Z
    message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
      be allocated on this node.
    reason: Disk Watermark Low
    status: "True"
    type: NodeStorage
  deploymentName: example-elasticsearch-clientdatamaster-0-1
  upgradeStatus: {}
A status message similar to the following indicates that a node has exceeded the configured high watermark and that shards will be relocated to other nodes:
Example output
nodes:
- conditions:
  - lastTransitionTime: 2019-03-15T16:04:45Z
    message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
      from this node.
    reason: Disk Watermark High
    status: "True"
    type: NodeStorage
  deploymentName: cluster-logging-operator
  upgradeStatus: {}
A status message similar to the following indicates that the Elasticsearch node selector in the custom resource (CR) does not match any nodes in the cluster:
Example output
Elasticsearch Status:
  Shard Allocation Enabled:  shard allocation unknown
  Cluster:
    Active Primary Shards:  0
    Active Shards:          0
    Initializing Shards:    0
    Num Data Nodes:         0
    Num Nodes:              0
    Pending Tasks:          0
    Relocating Shards:      0
    Status:                 cluster health unknown
    Unassigned Shards:      0
  Cluster Name:             elasticsearch
  Node Conditions:
    elasticsearch-cdm-mkkdys93-1:
      Last Transition Time:  2019-06-26T03:37:32Z
      Message:               0/5 nodes are available: 5 node(s) didn't match node selector.
      Reason:                Unschedulable
      Status:                True
      Type:                  Unschedulable
    elasticsearch-cdm-mkkdys93-2:
  Node Count:  2
  Pods:
    Client:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
    Data:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
    Master:
      Failed:
      Not Ready:
        elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
        elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
      Ready:
A status message similar to the following indicates that the requested persistent volume claim (PVC) could not bind to a persistent volume (PV):
Example output
Node Conditions:
  elasticsearch-cdm-mkkdys93-1:
    Last Transition Time:  2019-06-26T03:37:32Z
    Message:               pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
    Reason:                Unschedulable
    Status:                True
    Type:                  Unschedulable
A status message similar to the following indicates that the Fluentd pods cannot be scheduled because the node selector did not match any nodes:
Example output
Status:
  Collection:
    Logs:
      Fluentd Status:
        Daemon Set:  fluentd
        Nodes:
        Pods:
          Failed:
          Not Ready:
          Ready:
3.1.2. Viewing the status of logging components
You can view the status for a number of logging components.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
Change to the openshift-logging project:
$ oc project openshift-logging
View the status of the logging environment:
$ oc describe deployment cluster-logging-operator
Example output
Name:                   cluster-logging-operator
....
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
....
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  62m   deployment-controller  Scaled up replica set cluster-logging-operator-574b8987df to 1
View the status of the logging replica set:
Get the name of a replica set:
$ oc get replicaset
Example output
NAME                                      DESIRED   CURRENT   READY   AGE
cluster-logging-operator-574b8987df       1         1         1       159m
elasticsearch-cdm-uhr537yu-1-6869694fb    1         1         1       157m
elasticsearch-cdm-uhr537yu-2-857b6d676f   1         1         1       156m
elasticsearch-cdm-uhr537yu-3-5b6fdd8cfd   1         1         1       155m
kibana-5bd5544f87                         1         1         1       157m
Get the status of the replica set:
$ oc describe replicaset cluster-logging-operator-574b8987df
Example output
Name:           cluster-logging-operator-574b8987df
....
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
....
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  66m   replicaset-controller  Created pod: cluster-logging-operator-574b8987df-qjhqv
3.2. Troubleshooting log forwarding
3.2.1. Redeploying Fluentd pods
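When you create a ClusterLogForwarder custom resource (CR), if the Red Hat OpenShift Logging Operator does not redeploy the Fluentd pods automatically, you can delete the Fluentd pods to force them to redeploy.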
Prerequisites
- You have created a ClusterLogForwarder custom resource (CR) object.
Procedure
Delete the Fluentd pods to force them to redeploy by running the following command:
$ oc delete pod --selector logging-infra=collector
3.2.2. Troubleshooting Loki rate limit errors
If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.
These errors can occur during normal operation. For example, when you add logging to a cluster that already has some logs, rate limit errors might occur while logging tries to ingest all of the existing log entries. In this case, if the rate at which new logs are added is less than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without user intervention.
In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).
The
LokiStack
Conditions
- The Log Forwarder API is configured to forward logs to Loki.
- Your system sends a block of messages that is larger than 2 MB to Loki. For example:
"values":[["1630410392689800468","{\"kind\":\"Event\",\"apiVersion\":\ ....... ...... ...... ...... \"received_at\":\"2021-08-31T11:46:32.800278+00:00\",\"version\":\"1.7.4 1.6.0\"}},\"@timestamp\":\"2021-08-31T11:46:32.799692+00:00\",\"viaq_index_name\":\"audit-write\",\"viaq_msg_id\":\"MzFjYjJkZjItNjY0MC00YWU4LWIwMTEtNGNmM2E5ZmViMGU4\",\"log_type\":\"audit\"}"]]}]}
- After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:
429 Too Many Requests Ingestion rate limit exceeded
Example Vector error message
2023-08-25T16:08:49.301780Z WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true
Example Fluentd error message
2023-08-30 14:52:15 +0000 [warn]: [default_loki_infra] failed to flush the buffer. retry_times=2 next_retry_time=2023-08-30 14:52:19 +0000 chunk="604251225bf5378ed1567231a1c03b8b" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded for user infrastructure (limit: 4194304 bytes/sec) while attempting to ingest '4082' lines totaling '7820025' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"
The error is also visible on the receiving end. For example, in the LokiStack ingester pod:
Example Loki ingester error message
level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream
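The Fluentd message above reports both the configured limit and the size of the rejected push. As a minimal sketch, assuming the Fluentd message format shown above (the Vector message does not include these numbers), the following hypothetical shell function extracts the two byte counts from collector log lines read on standard input:

```shell
# Hypothetical helper: print "<limit_bytes_per_sec> <request_bytes>" for each
# Fluentd-style Loki rate limit error read on stdin. Lines that do not match
# the message format shown in this section are ignored.
rate_limit_summary() {
  sed -n "s/.*limit: \([0-9]*\) bytes\/sec) while attempting to ingest '[0-9]*' lines totaling '\([0-9]*\)' bytes.*/\1 \2/p"
}
```

For example, oc logs -n openshift-logging -l component=collector | rate_limit_summary | sort -u prints one line per distinct limit and request size. For the Fluentd message above, the function prints 4194304 7820025.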
Procedure
Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      ingestion:
        ingestionBurstSize: 16 # 1
        ingestionRate: 8 # 2
# ...
1. The ingestionBurstSize field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum logs size expected in a single push request. Single requests that are larger than the ingestionBurstSize value are not permitted.
2. The ingestionRate field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.
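Because ingestionBurstSize must be at least as large as the biggest single push, you can derive a lower bound from the request size reported in the collector logs. The following sketch assumes that the MB values in the LokiStack CR are binary megabytes (1 MB = 1048576 bytes); this is consistent with the Fluentd error message in this section, which reports the limit as 4194304 bytes/sec, exactly 4 × 1048576. The function name is hypothetical.

```shell
# Hypothetical helper: smallest whole-MB ingestionBurstSize that admits a
# single push request of the given size in bytes.
# ASSUMPTION: 1 MB in the LokiStack limits equals 1048576 bytes.
min_burst_mb() {
  local bytes="$1" mb=1048576
  # Integer ceiling division: any partial megabyte is rounded up.
  echo $(( (bytes + mb - 1) / mb ))
}
```

For the 7820025-byte push in the Fluentd error message, min_burst_mb 7820025 prints 8, so the example value of 16 leaves headroom for larger pushes.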
3.3. Troubleshooting logging alerts
You can use the following procedures to troubleshoot logging alerts on your cluster.
3.3.1. Elasticsearch cluster health status is red
At least one primary shard and its replicas are not allocated to a node. Use the following procedure to troubleshoot this alert.
Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable.
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
Choose one of the pods listed and set the $ES_POD_NAME variable by running the following command:
$ export ES_POD_NAME=<elasticsearch_pod_name>
You can now use the $ES_POD_NAME variable in commands.
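Instead of copying a pod name by hand, you can select the first pod whose containers are all ready. This is a minimal sketch that assumes the default table layout of oc get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name is hypothetical.

```shell
# Hypothetical helper: read default `oc get pods` table output on stdin and
# print the name of the first pod that is Running with all containers ready.
first_ready_pod() {
  awk 'NR > 1 && $3 == "Running" {
         split($2, ready, "/")          # READY column, for example "2/2"
         if (ready[1] == ready[2]) { print $1; exit }
       }'
}
```

For example: export ES_POD_NAME=$(oc -n openshift-logging get pods -l component=elasticsearch | first_ready_pod). If no pod is fully ready, the variable is left empty, so check it before running the commands that follow.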
Procedure
Check the Elasticsearch cluster health and verify that the cluster status is red by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- health
List the nodes that have joined the cluster by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/nodes?v
List the Elasticsearch pods and compare them with the nodes in the command output from the previous step by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
Confirm that Elasticsearch has an elected master node by running the following command and observing the output:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/master?v
Review the pod logs of the elected master node for issues by running the following command and observing the output:
$ oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
Review the logs of nodes that have not joined the cluster for issues by running the following command and observing the output:
$ oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
If all the nodes have joined the cluster, check if the cluster is in the process of recovering by running the following command and observing the output:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/recovery?active_only=true
If there is no command output, the recovery process might be delayed or stalled by pending tasks.
Check if there are pending tasks by running the following command and observing the output:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- health | grep number_of_pending_tasks
- If there are pending tasks, monitor their status. If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors. Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.
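To watch a single value such as number_of_pending_tasks, status, or unassigned_shards without scanning the whole response, you can extract it from the cluster health JSON. This sketch assumes the response of es_util --query=_cluster/health?pretty, in which each field appears once as "name" : value; the function name is hypothetical.

```shell
# Hypothetical helper: print the value of one top-level field from
# Elasticsearch cluster health JSON read on stdin (pretty or compact).
# String values are printed without their surrounding quotes.
health_field() {
  sed -n "s/.*\"$1\" *: *\"\{0,1\}\([^\",}]*\)\"\{0,1\}.*/\1/p" | head -n 1
}
```

To monitor pending tasks, you can poll in a loop, for example: while sleep 30; do oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_cluster/health?pretty' | health_field number_of_pending_tasks; done. A count that stays constant across several iterations suggests that the recovery has stalled.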
If it seems like the recovery has stalled, check if the cluster.routing.allocation.enable value is set to none by running the following command and observing the output:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cluster/settings?pretty
If the cluster.routing.allocation.enable value is set to none, set it to all by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cluster/settings?pretty \
  -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
Check if any indices are still red by running the following command and observing the output:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/indices?v
If any indices are still red, try to clear them by performing the following steps.
Clear the cache by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
Increase the max allocation retries by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name>/_settings?pretty \
  -X PUT -d '{"index.allocation.max_retries":10}'
Delete all the scroll items by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_search/scroll/_all -X DELETE
Increase the timeout by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name>/_settings?pretty \
  -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
If the preceding steps do not clear the red indices, delete the indices individually.
Identify the red index name by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/indices?v
Delete the red index by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_red_index_name> -X DELETE
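When several indices are red, you can list their names from the _cat/indices?v output instead of reading the table by eye. A minimal sketch, assuming the default column order of that output (health, status, index, ...); the function name is hypothetical.

```shell
# Hypothetical helper: read `_cat/indices?v` output on stdin and print the
# name (third column) of every index whose health (first column) is red.
red_indices() {
  awk 'NR > 1 && $1 == "red" { print $3 }'
}
```

For example: oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_cat/indices?v' | red_indices. Review the list before deleting anything, because deleting an index permanently removes the logs it contains.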
If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
Check if the Elasticsearch JVM heap usage is high by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_nodes/stats?pretty
In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM heap usage.
- Check for high CPU utilization. For more information about CPU utilization, see the OpenShift Container Platform "Reviewing monitoring dashboards" documentation.
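As an alternative to reading the full _nodes/stats response, the Elasticsearch _cat/nodes API can return one row per node with only the columns you request, such as heap.percent or cpu. The following sketch flags nodes whose value exceeds a threshold; the function name is hypothetical, and the 75% heap threshold comes from the "Elasticsearch JVM heap usage is high" alert in this chapter.

```shell
# Hypothetical helper: read "<node_name> <percent>" rows on stdin, as
# returned by _cat/nodes?h=name,heap.percent, and print rows above $1.
nodes_over() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { print $1, $2 "%" }'
}
```

For example: oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_cat/nodes?h=name,heap.percent' | nodes_over 75. Replace heap.percent with cpu to check CPU usage per node in the same way.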
3.3.2. Elasticsearch cluster health status is yellow
Replica shards for at least one primary shard are not allocated to nodes. Increase the node count by adjusting the nodeCount value in the ClusterLogging custom resource (CR).
3.3.3. Elasticsearch node disk low watermark reached
Elasticsearch does not allocate shards to nodes that reach the low watermark.
Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable.
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
Choose one of the pods listed and set the $ES_POD_NAME variable by running the following command:
$ export ES_POD_NAME=<elasticsearch_pod_name>
You can now use the $ES_POD_NAME variable in commands.
Procedure
Identify the node on which Elasticsearch is deployed by running the following command:
$ oc -n openshift-logging get po -o wide
Check if there are unassigned shards by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cluster/health?pretty | grep unassigned_shards
If there are unassigned shards, check the disk space on each node by running the following command:
$ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
  do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
  -- df -h /elasticsearch/persistent; done
In the command output, check the Use% column to determine the used disk percentage on that node.
Example output
elasticsearch-cdm-kcrsda6l-1-586cc95d4f-h8zq8
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-2-5b548fc7b-cwwk7
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-3-5dfc884d99-59tjw
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme3n1     19G  528M   19G   3% /elasticsearch/persistent
If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.
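To spot the affected node quickly, you can pipe the output of the preceding for loop into a filter. This sketch assumes the output layout shown in the example (a line with the pod name, followed by the df header and one data line for /elasticsearch/persistent); the function name is hypothetical.

```shell
# Hypothetical helper: read the output of the df loop on stdin and print
# "<pod> <use%>" for each node whose Use% is above the threshold $1.
disk_over_threshold() {
  awk -v limit="$1" '
    NF == 1 { pod = $1; next }               # pod-name line from the loop
    $NF == "/elasticsearch/persistent" {
      use = $(NF - 1); sub(/%$/, "", use)    # "88%" -> "88"
      if (use + 0 > limit + 0) print pod, use "%"
    }'
}
```

Append | disk_over_threshold 85 after the final done of the loop to list nodes above the low watermark; use 90 to check against the default high watermark given in this chapter.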
To check the current redundancyPolicy, run the following command:
$ oc -n openshift-logging get es elasticsearch \
  -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource on your cluster, run the following command:
$ oc -n openshift-logging get cl \
  -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
- Identify an old index that can be deleted.
Delete the index by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name> -X DELETE
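To help identify an old index, you can ask Elasticsearch for each index together with its creation time and pick the earliest one. A minimal sketch, assuming input from _cat/indices?h=index,creation.date (index name and creation time in epoch milliseconds) and skipping hidden indices whose names start with a dot; the function name is hypothetical.

```shell
# Hypothetical helper: read "<index> <creation_epoch_ms>" rows on stdin and
# print the oldest index, ignoring hidden indices such as ".security".
oldest_index() {
  sort -k2,2n | awk '$1 !~ /^\./ { print $1; exit }'
}
```

For example: oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_cat/indices?h=index,creation.date' | oldest_index. Treat the result as a candidate only, and confirm that the index is not still receiving writes before you delete it.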
3.3.4. Elasticsearch node disk high watermark reached
Elasticsearch attempts to relocate shards away from a node that has reached the high watermark to a node with low disk usage that has not crossed any watermark threshold limits.
To allocate shards to a particular node, you must free up some space on that node. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable.
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
Choose one of the pods listed and set the $ES_POD_NAME variable by running the following command:
$ export ES_POD_NAME=<elasticsearch_pod_name>
You can now use the $ES_POD_NAME variable in commands.
Procedure
Identify the node on which Elasticsearch is deployed by running the following command:
$ oc -n openshift-logging get po -o wide
Check the disk space on each node:
$ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
  do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
  -- df -h /elasticsearch/persistent; done
Check if the cluster is rebalancing:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cluster/health?pretty | grep relocating_shards
If the command output shows relocating shards, the high watermark has been exceeded. The default value of the high watermark is 90%.
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
To check the current redundancyPolicy, run the following command:
$ oc -n openshift-logging get es elasticsearch \
  -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource on your cluster, run the following command:
$ oc -n openshift-logging get cl \
  -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
- Identify an old index that can be deleted.
Delete the index by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name> -X DELETE
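The comparison with SingleRedundancy in the preceding steps follows the ordering of the redundancy policies from least to most replication: ZeroRedundancy, SingleRedundancy, MultipleRedundancy, FullRedundancy. The following sketch encodes that ordering; the function names are hypothetical.

```shell
# Hypothetical helpers: rank the redundancy policies from least to most
# replication, and test whether a policy is higher than SingleRedundancy.
redundancy_rank() {
  case "$1" in
    ZeroRedundancy)     echo 0 ;;
    SingleRedundancy)   echo 1 ;;
    MultipleRedundancy) echo 2 ;;
    FullRedundancy)     echo 3 ;;
    *) echo "unknown redundancy policy: $1" >&2; return 1 ;;
  esac
}

# Exit status: 0 if higher than SingleRedundancy, 1 if not, 2 if unknown.
above_single_redundancy() {
  local rank
  rank=$(redundancy_rank "$1") || return 2
  [ "$rank" -gt 1 ]
}
```

For example: above_single_redundancy "$(oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}')" && echo "reduce to SingleRedundancy". Assuming your ClusterLogging CR is named instance, one way to make the change is oc -n openshift-logging patch clusterlogging instance --type=merge -p '{"spec":{"logStore":{"elasticsearch":{"redundancyPolicy":"SingleRedundancy"}}}}'.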
3.3.5. Elasticsearch node disk flood watermark reached
Elasticsearch enforces a read-only index block on every index that has both of these conditions:
- One or more shards are allocated to the node.
- One or more disks exceed the flood stage.
Use the following procedure to troubleshoot this alert.
Some commands in this documentation reference an Elasticsearch pod by using a $ES_POD_NAME shell variable.
You can list the available Elasticsearch pods by running the following command:
$ oc -n openshift-logging get pods -l component=elasticsearch
Choose one of the pods listed and set the $ES_POD_NAME variable by running the following command:
$ export ES_POD_NAME=<elasticsearch_pod_name>
You can now use the $ES_POD_NAME variable in commands.
Procedure
Get the disk space of the Elasticsearch node:
$ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
  do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
  -- df -h /elasticsearch/persistent; done
In the command output, check the Avail column to determine the free disk space on that node.
Example output
elasticsearch-cdm-kcrsda6l-1-586cc95d4f-h8zq8
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-2-5b548fc7b-cwwk7
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-3-5dfc884d99-59tjw
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme3n1     19G  528M   19G   3% /elasticsearch/persistent
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
To check the current redundancyPolicy, run the following command:
$ oc -n openshift-logging get es elasticsearch \
  -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource on your cluster, run the following command:
$ oc -n openshift-logging get cl \
  -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
- Identify an old index that can be deleted.
Delete the index by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name> -X DELETE
Continue freeing up and monitoring the disk space. After the used disk space drops below 90%, unblock writing to this node by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_all/_settings?pretty \
  -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
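Before and after you remove the block, you can check which indices still carry it. This sketch assumes that you request only that setting in flat form, for example es_util '--query=_all/_settings/index.blocks.read_only_allow_delete?flat_settings=true', and that Elasticsearch reports the value as the string "true"; the function name is hypothetical.

```shell
# Hypothetical helper: read the settings JSON on stdin and print each index
# that has index.blocks.read_only_allow_delete set to "true".
blocked_indices() {
  tr -d ' \n' |    # normalize pretty-printed JSON to one compact line
    grep -o '"[^"]*":{"settings":{"index\.blocks\.read_only_allow_delete":"true"' |
    cut -d '"' -f 2
}
```

For example: oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_all/_settings/index.blocks.read_only_allow_delete?flat_settings=true' | blocked_indices. After a successful unblock, the command prints nothing.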
3.3.6. Elasticsearch JVM heap usage is high
The Elasticsearch node Java virtual machine (JVM) heap memory used is above 75%. Consider increasing the heap size.
3.3.7. Aggregated logging system CPU is high
System CPU usage on the node is high. Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
3.3.8. Elasticsearch process CPU is high
Elasticsearch process CPU usage on the node is high. Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
3.3.9. Elasticsearch disk space is running low
Elasticsearch is predicted to run out of disk space within the next 6 hours based on current disk usage. Use the following procedure to troubleshoot this alert.
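The prediction is a trend extrapolation. As a rough illustration only (an assumption, not the alert's actual rule), if disk usage grows linearly, the time left is the free space divided by the growth rate. The following hypothetical function computes that estimate in whole hours from two usage samples:

```shell
# Hypothetical helper: estimate whole hours until a disk is full, assuming
# usage keeps growing at the rate observed between two samples.
#   $1 = used space at the earlier sample   $2 = used space now
#   $3 = hours between the samples          $4 = disk capacity
# All space values must use the same unit (for example GB or bytes).
hours_until_full() {
  local before="$1" now="$2" hours="$3" capacity="$4"
  local growth=$(( now - before ))
  if [ "$growth" -le 0 ]; then
    echo "never"        # usage is flat or shrinking
    return
  fi
  # Integer arithmetic: the fractional part of the estimate is dropped.
  echo $(( (capacity - now) * hours / growth ))
}
```

For example, a 19 GB volume that grew from 10 GB to 14 GB over 2 hours has 5 GB left and grows by 2 GB per hour, so hours_until_full 10 14 2 19 prints 2 (the exact value, 2.5, is truncated), which is well within a 6-hour window.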
Procedure
Get the disk space of the Elasticsearch node:
$ for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; \
  do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod \
  -- df -h /elasticsearch/persistent; done
In the command output, check the Avail column to determine the free disk space on that node.
Example output
elasticsearch-cdm-kcrsda6l-1-586cc95d4f-h8zq8
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-2-5b548fc7b-cwwk7
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme2n1     19G  522M   19G   3% /elasticsearch/persistent
elasticsearch-cdm-kcrsda6l-3-5dfc884d99-59tjw
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme3n1     19G  528M   19G   3% /elasticsearch/persistent
- Increase the disk space on all nodes. If increasing the disk space is not possible, try adding a new data node to the cluster, or decrease the total cluster redundancy policy.
To check the current redundancyPolicy, run the following command:
$ oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource on your cluster, run the following command:
$ oc -n openshift-logging get cl \
  -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy value is higher than the SingleRedundancy value, set it to the SingleRedundancy value and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- indices
- Identify an old index that can be deleted.
Delete the index by running the following command:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=<elasticsearch_index_name> -X DELETE
3.3.10. Elasticsearch FileDescriptor usage is high
Based on current usage trends, the predicted number of file descriptors on the node is insufficient. Check the value of max_file_descriptors for each node.
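To compare the open file descriptors with the maximum on each node, you can request both counts from the Elasticsearch _cat/nodes API (columns file_desc.current and file_desc.max). A minimal sketch; the function name is hypothetical.

```shell
# Hypothetical helper: read "<node> <open_fds> <max_fds>" rows on stdin, as
# returned by _cat/nodes?h=name,file_desc.current,file_desc.max, and print
# each node's usage as a percentage of its limit.
fd_usage() {
  awk '$3 > 0 { printf "%s %d/%d (%.1f%%)\n", $1, $2, $3, 100 * $2 / $3 }'
}
```

For example: oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- es_util '--query=_cat/nodes?h=name,file_desc.current,file_desc.max' | fd_usage. A node with 500 open descriptors and a limit of 1000 is reported as 500/1000 (50.0%).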
3.4. Viewing the status of the Elasticsearch log store
You can view the status of the OpenShift Elasticsearch Operator and of a number of Elasticsearch components.
3.4.1. Viewing the status of the Elasticsearch log store
You can view the status of the Elasticsearch log store.
Prerequisites
- The Red Hat OpenShift Logging Operator and OpenShift Elasticsearch Operator are installed.
Procedure
Change to the openshift-logging project by running the following command:
$ oc project openshift-logging
To view the status:
Get the name of the Elasticsearch log store instance by running the following command:
$ oc get Elasticsearch
Example output
NAME            AGE
elasticsearch   5h9m
Get the Elasticsearch log store status by running the following command:
$ oc get Elasticsearch <Elasticsearch-instance> -o yaml
For example:
$ oc get Elasticsearch elasticsearch -n openshift-logging -o yaml
The output includes information similar to the following:
Example output
status: # 1
  cluster: # 2
    activePrimaryShards: 30
    activeShards: 60
    initializingShards: 0
    numDataNodes: 3
    numNodes: 3
    pendingTasks: 0
    relocatingShards: 0
    status: green
    unassignedShards: 0
  clusterHealth: ""
  conditions: [] # 3
  nodes: # 4
  - deploymentName: elasticsearch-cdm-zjf34ved-1
    upgradeStatus: {}
  - deploymentName: elasticsearch-cdm-zjf34ved-2
    upgradeStatus: {}
  - deploymentName: elasticsearch-cdm-zjf34ved-3
    upgradeStatus: {}
  pods: # 5
    client:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
      - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
      - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
    data:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
      - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
      - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
    master:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-zjf34ved-1-6d7fbf844f-sn422
      - elasticsearch-cdm-zjf34ved-2-dfbd988bc-qkzjz
      - elasticsearch-cdm-zjf34ved-3-c8f566f7c-t7zkt
  shardAllocationEnabled: all
1. In the output, the cluster status fields appear in the status stanza.
2. The status of the Elasticsearch log store:
   - The number of active primary shards.
   - The number of active shards.
   - The number of shards that are initializing.
   - The number of Elasticsearch log store data nodes.
   - The total number of Elasticsearch log store nodes.
   - The number of pending tasks.
   - The Elasticsearch log store status: green, red, yellow.
   - The number of unassigned shards.
3. Any status conditions, if present. The Elasticsearch log store status indicates the reasons from the scheduler if a pod could not be placed. Any events related to the following conditions are shown:
   - Container Waiting for both the Elasticsearch log store and proxy containers.
   - Container Terminated for both the Elasticsearch log store and proxy containers.
   - Pod unschedulable. Also, a condition is shown for a number of issues; see Example condition messages.
4. The Elasticsearch log store nodes in the cluster, with upgradeStatus.
5. The Elasticsearch log store client, data, and master pods in the cluster, listed under failed, notReady, or ready state.
3.4.1.1. Example condition messages
The following are examples of some condition messages from the Status section of the Elasticsearch instance.
The following status message indicates that a node has exceeded the configured low watermark, and no shard will be allocated to this node.
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T15:57:22Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
        be allocated on this node.
      reason: Disk Watermark Low
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
The following status message indicates that a node has exceeded the configured high watermark, and shards will be relocated to other nodes.
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T16:04:45Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
        from this node.
      reason: Disk Watermark High
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-cdm-0-1
    upgradeStatus: {}
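The low and high watermark conditions above compare a node's disk usage against configured thresholds. The following bash sketch classifies a usage percentage in the same way. The default thresholds (85% low, 90% high, 95% flood stage) are the upstream Elasticsearch defaults and are only an assumption for your log store; pass the values configured in your cluster as arguments. The function classify_disk_usage is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify disk usage against Elasticsearch disk watermarks.
# Usage: classify_disk_usage <used_percent> [low] [high] [flood_stage]
# Defaults are the upstream Elasticsearch defaults (an assumption for your cluster).
classify_disk_usage() {
  local used="$1" low="${2:-85}" high="${3:-90}" flood="${4:-95}"
  awk -v u="$used" -v l="$low" -v h="$high" -v f="$flood" 'BEGIN {
    # Check the most severe threshold first.
    if (u > f)      print "flood_stage: indices with a shard on this node are made read-only"
    else if (u > h) print "high: shards are relocated away from this node"
    else if (u > l) print "low: no new shards are allocated to this node"
    else            print "ok: below the low watermark"
  }'
}
```

For example, classify_disk_usage 36.74 reports ok with the default thresholds, while classify_disk_usage 36.74 30 35 reports the high watermark state.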
The following status message indicates that the Elasticsearch log store node selector in the custom resource (CR) does not match any nodes in the cluster:
status:
  nodes:
  - conditions:
    - lastTransitionTime: 2019-04-10T02:26:24Z
      message: '0/8 nodes are available: 8 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
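To confirm this condition, you can compare the log store node selector with the labels on your nodes. The following bash sketch assumes the selector is set in spec.logStore.elasticsearch.nodeSelector of the ClusterLogging CR named instance; the function names are hypothetical. It renders the selector map as key=value pairs with an oc go-template and passes the result to oc get nodes -l.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: check whether any node matches the log store node selector.

# Strip the trailing comma left by the go-template below: "a=b,c=," -> "a=b,c="
normalize_selector() {
  sed -e 's/,$//'
}

# Print the node selector from the ClusterLogging CR (assumed CR name: instance).
log_store_selector() {
  oc get clusterlogging instance -n openshift-logging \
    -o go-template='{{range $k, $v := .spec.logStore.elasticsearch.nodeSelector}}{{$k}}={{$v}},{{end}}' |
    normalize_selector
}

# List the nodes that match the selector. No output means that no node matches,
# which corresponds to the "didn't match node selector" condition.
nodes_for_log_store() {
  local selector
  selector="$(log_store_selector)"
  if [ -z "$selector" ]; then
    echo "no nodeSelector is set for the log store"
    return 0
  fi
  echo "selector: $selector"
  oc get nodes -l "$selector" -o name
}
```

If nodes_for_log_store lists no nodes, either label the intended nodes or change the node selector in the CR.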
The following status message indicates that the Elasticsearch log store CR uses a non-existent persistent volume claim (PVC).
status:
  nodes:
  - conditions:
    - last Transition Time: 2019-04-10T05:55:51Z
      message: pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
      reason: Unschedulable
      status: True
      type: Unschedulable
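An unbound claim usually shows a Pending status. The following bash sketch filters the default table output of oc get pvc for claims that are not Bound; it assumes STATUS is the second column, as in the default oc output. The function unbound_pvcs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print PVCs that are not Bound.
# Input (stdin): default table output of `oc get pvc`, for example:
#   NAME   STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
unbound_pvcs() {
  awk 'NR == 1 && $1 == "NAME" { next }   # skip the header row
       $2 != "Bound" { print $1, $2 }'    # name and status of each problem claim
}
```

For example, run oc get pvc -n openshift-logging | unbound_pvcs, then oc describe pvc <name> -n openshift-logging to see the events for a listed claim. A claim that the CR references but that does not exist at all does not appear in this list.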
The following status message indicates that your Elasticsearch log store cluster does not have enough nodes to support the redundancy policy.
status:
  clusterHealth: ""
  conditions:
  - lastTransitionTime: 2019-04-17T20:01:31Z
    message: Wrong RedundancyPolicy selected. Choose different RedundancyPolicy or
      add more nodes with data roles
    reason: Invalid Settings
    status: "True"
    type: InvalidRedundancy
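The underlying constraint is that Elasticsearch never places a replica shard on the same node as its primary, so R replicas need at least R + 1 data nodes. The following bash sketch applies that rule. It assumes that ZeroRedundancy keeps 0 replicas and SingleRedundancy keeps 1; policies whose replica count depends on the node count are not modeled. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: check whether a redundancy policy fits the data node count.
# Assumption: ZeroRedundancy keeps 0 replica shards and SingleRedundancy keeps 1.
# Elasticsearch never places a replica on the same node as its primary, so
# R replicas need at least R + 1 data nodes.
replicas_for_policy() {
  case "$1" in
    ZeroRedundancy)   echo 0 ;;
    SingleRedundancy) echo 1 ;;
    *) echo "policy not modeled in this sketch: $1" >&2; return 1 ;;
  esac
}

check_redundancy() {
  local policy="$1" data_nodes="$2" replicas needed
  replicas="$(replicas_for_policy "$policy")" || return 1
  needed=$((replicas + 1))
  if [ "$data_nodes" -ge "$needed" ]; then
    echo "ok: $policy needs $needed data node(s), found $data_nodes"
  else
    echo "invalid: $policy needs $needed data node(s), found $data_nodes"
    return 1
  fi
}
```

As an assumption, the policy could come from spec.redundancyPolicy and the data node count from status.cluster.numDataNodes of the Elasticsearch CR. For example, check_redundancy SingleRedundancy 1 reports invalid, which matches the condition shown above.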
This status message indicates your cluster has too many control plane nodes:
status:
  clusterHealth: green
  conditions:
    - lastTransitionTime: '2019-04-17T20:12:34Z'
      message: >-
        Invalid master nodes count. Please ensure there are no more than 3 total
        nodes with master roles
      reason: Invalid Settings
      status: 'True'
      type: InvalidMasters
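The limit in this message can be checked before you apply a change. The following bash sketch sums nodeCount over the node groups whose roles include master. It assumes the Elasticsearch CR lists node groups under spec.nodes with nodeCount and roles fields; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: count master-eligible Elasticsearch nodes.
# Input (stdin): one line per node group, "<nodeCount> <role>,<role>,...".
count_master_nodes() {
  awk '{
    n = split($2, roles, ",")
    for (i = 1; i <= n; i++)
      if (roles[i] == "master") { total += $1; break }
  } END { print total + 0 }'
}

# Fail when the total exceeds the limit of 3 stated in the condition message.
check_master_count() {
  local masters
  masters="$(count_master_nodes)"
  if [ "$masters" -gt 3 ]; then
    echo "invalid: $masters master nodes (no more than 3 allowed)"
    return 1
  fi
  echo "ok: $masters master node(s)"
}
```

For example, assuming those field names:
$ oc get elasticsearch elasticsearch -n openshift-logging -o go-template='{{range .spec.nodes}}{{.nodeCount}} {{range $i, $r := .roles}}{{if $i}},{{end}}{{$r}}{{end}}{{"\n"}}{{end}}' | check_master_count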
The following status message indicates that Elasticsearch storage does not support the change you tried to make.
For example:
status:
  clusterHealth: green
  conditions:
  - lastTransitionTime: "2021-05-07T01:05:13Z"
    message: Changing the storage structure for a custom resource is not supported
    reason: StorageStructureChangeIgnored
    status: 'True'
    type: StorageStructureChangeIgnored
The reason and type fields specify the type of unsupported change:
- StorageClassNameChangeIgnored: Unsupported change to the storage class name.
- StorageSizeChangeIgnored: Unsupported change to the storage size.
- StorageStructureChangeIgnored: Unsupported change between ephemeral and persistent storage structures.
Important
If you try to configure the ClusterLogging CR to switch from ephemeral to persistent storage, the OpenShift Elasticsearch Operator creates a persistent volume claim (PVC) but does not create a persistent volume (PV). To clear the StorageStructureChangeIgnored status, you must revert the change to the ClusterLogging CR and delete the PVC.
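When several conditions are present, it can help to filter for the storage-related types listed above. The following bash sketch reads condition types, one per line, and explains the storage ones using the descriptions from this section. The function name is hypothetical, and the jsonpath in the usage line assumes the conditions appear under status.conditions of the Elasticsearch CR, as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain storage-related condition types read from stdin,
# one type per line. Other condition types are skipped.
explain_storage_conditions() {
  local type
  while read -r type; do
    case "$type" in
      StorageClassNameChangeIgnored)
        echo "$type: unsupported change to the storage class name" ;;
      StorageSizeChangeIgnored)
        echo "$type: unsupported change to the storage size" ;;
      StorageStructureChangeIgnored)
        echo "$type: unsupported change between ephemeral and persistent storage; revert the ClusterLogging CR change and delete the PVC" ;;
    esac
  done
}
```

For example:
$ oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{range .status.conditions[*]}{.type}{"\n"}{end}' | explain_storage_conditions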
3.4.2. Viewing the status of the log store components
You can view the status for a number of the log store components.
- Elasticsearch indices
You can view the status of the Elasticsearch indices.
Get the name of an Elasticsearch pod:
$ oc get pods --selector component=elasticsearch -o name
Example output
pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
Get the status of the indices:
$ oc exec elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -- indices
Example output
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-4vjor49p-2-6d4d7db474-q2w7z -n openshift-logging' to see all of the containers in this pod.
green open infra-000002 S4QANnf1QP6NgCegfnrnbQ 3 1 119926 0 157 78
green open audit-000001 8_EQx77iQCSTzFOXtxRqFw 3 1 0 0 0 0
green open .security iDjscH7aSUGhIdq0LheLBQ 1 1 5 0 0 0
green open .kibana_-377444158_kubeadmin yBywZ9GfSrKebz5gWBZbjw 3 1 1 0 0 0
green open infra-000001 z6Dpe__ORgiopEpW6Yl44A 3 1 871000 0 874 436
green open app-000001 hIrazQCeSISewG3c2VIvsQ 3 1 2453 0 3 1
green open .kibana_1 JCitcBMSQxKOvIq6iQW6wg 1 1 0 0 0 0
green open .kibana_-1595131456_user1 gIYFIEGRRe-ka0W3okS-mQ 3 1 1 0 0 0
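On a cluster with many indices, it is easier to list only the problem indices. The following bash sketch prints the health and name of each index that is yellow or red, assuming the column order shown in the example output (health, status, index, uuid, and so on). The function non_green_indices is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the health and name of every index that is not green.
# Input (stdin): output of the `indices` command shown above. Lines that do not
# start with a health value (notices, header) are ignored.
non_green_indices() {
  awk '$1 ~ /^(yellow|red)$/ { print $1, $3 }'
}
```

For example, oc exec <pod> -c elasticsearch -- indices | non_green_indices. Passing -c elasticsearch selects the container explicitly, so oc does not print the "Defaulting container name" notice.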
- Log store pods
You can view the status of the pods that host the log store.
Get the name of a pod:
$ oc get pods --selector component=elasticsearch -o name
Example output
pod/elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
pod/elasticsearch-cdm-1godmszn-2-5769cf-9ms2n
pod/elasticsearch-cdm-1godmszn-3-f66f7d-zqkz7
Get the status of a pod:
$ oc describe pod elasticsearch-cdm-1godmszn-1-6f8495-vp4lw
The output includes the following status information:
Example output
....
Status:             Running
....
Containers:
  elasticsearch:
    Container ID:   cri-o://b7d44e0a9ea486e27f47763f5bb4c39dfd2
    State:          Running
      Started:      Mon, 08 Jun 2020 10:17:56 -0400
    Ready:          True
    Restart Count:  0
    Readiness:      exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
....
  proxy:
    Container ID:   cri-o://3f77032abaddbb1652c116278652908dc01860320b8a4e741d06894b2f8f9aa1
    State:          Running
      Started:      Mon, 08 Jun 2020 10:18:38 -0400
    Ready:          True
    Restart Count:  0
....
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
....
Events:             <none>
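Instead of describing each pod in turn, you can first find the pods that are not fully ready. The following bash sketch reads the default table output of oc get pods (columns NAME, READY, STATUS, RESTARTS, AGE) and prints pods whose ready count is below the container count or whose status is not Running. The function unready_pods is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list pods that are not fully ready.
# Input (stdin): default table output of `oc get pods`.
unready_pods() {
  awk 'NR == 1 && $1 == "NAME" { next }            # skip the header row
       {
         split($2, r, "/")                         # READY is "<ready>/<total>"
         if (r[1] != r[2] || $3 != "Running") print $1, $2, $3
       }'
}
```

For example, oc get pods --selector component=elasticsearch | unready_pods, then run oc describe pod on any pod it lists.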
- Log storage pod deployment configuration
You can view the status of the log store deployment configuration.
Get the name of a deployment configuration:
$ oc get deployment --selector component=elasticsearch -o name
Example output
deployment.extensions/elasticsearch-cdm-1gon-1
deployment.extensions/elasticsearch-cdm-1gon-2
deployment.extensions/elasticsearch-cdm-1gon-3
Get the deployment configuration status:
$ oc describe deployment elasticsearch-cdm-1gon-1
The output includes the following status information:
Example output
....
  Containers:
   elasticsearch:
    Image:      registry.redhat.io/openshift-logging/elasticsearch6-rhel8
    Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
....
Conditions:
  Type           Status   Reason
  ----           ------   ------
  Progressing    Unknown  DeploymentPaused
  Available      True     MinimumReplicasAvailable
....
Events:          <none>
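In the example output, Progressing is Unknown with reason DeploymentPaused. A paused Deployment does not by itself make its pods unavailable, so the Available condition and the replica counts are the more useful signals. The following bash sketch compares desired and available replicas for each deployment; the function name is hypothetical, and it assumes input lines of the form "<name> <spec.replicas> <status.availableReplicas>".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print deployments with fewer available replicas than desired.
# Input (stdin), one line per deployment: "<name> <spec.replicas> <status.availableReplicas>".
deployments_below_desired() {
  awk '{
    want = $2 + 0   # spec.replicas
    have = $3 + 0   # status.availableReplicas (empty when none are available)
    if (have < want) print $1, have "/" want
  }'
}
```

For example:
$ oc get deployment --selector component=elasticsearch -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.availableReplicas}{"\n"}{end}' | deployments_below_desired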
- Log store replica set
You can view the status of the log store replica set.
Get the name of a replica set:
$ oc get replicaSet --selector component=elasticsearch -o name
Example output
replicaset.extensions/elasticsearch-cdm-1gon-1-6f8495
replicaset.extensions/elasticsearch-cdm-1gon-2-5769cf
replicaset.extensions/elasticsearch-cdm-1gon-3-f66f7d
Get the status of the replica set:
$ oc describe replicaSet elasticsearch-cdm-1gon-1-6f8495
The output includes the following status information:
Example output
....
  Containers:
   elasticsearch:
    Image:      registry.redhat.io/openshift-logging/elasticsearch6-rhel8@sha256:4265742c7cdd85359140e2d7d703e4311b6497eec7676957f455d6908e7b1c25
    Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
....
Events:          <none>
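The image reference in each replica set shows which Elasticsearch image its pods run. Comparing the images of the replica sets that still have replicas can show whether the log store nodes run mixed versions, for example partway through an upgrade. The following bash sketch counts distinct images among active replica sets; the function name is hypothetical, and it assumes input lines of the form "<name> <spec.replicas> <image>".

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct images among replica sets that have replicas.
# Input (stdin), one line per replica set: "<name> <spec.replicas> <image>".
distinct_active_images() {
  awk '$2 + 0 > 0 && !seen[$3]++ { n++ }   # first time this image is seen
       END { print n + 0 }'
}
```

For example:
$ oc get replicaSet --selector component=elasticsearch -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{" "}{.spec.template.spec.containers[?(@.name=="elasticsearch")].image}{"\n"}{end}' | distinct_active_images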
3.4.3. Elasticsearch cluster status
A dashboard in the Observe section of the OpenShift Container Platform web console displays the status of the Elasticsearch cluster.
To get the status of the OpenShift Elasticsearch cluster, visit the dashboard at the following URL:
<cluster_url>/monitoring/dashboards/grafana-dashboard-cluster-logging
Elasticsearch status fields
eo_elasticsearch_cr_cluster_management_state
Shows whether the Elasticsearch cluster is in a managed or unmanaged state. For example:
eo_elasticsearch_cr_cluster_management_state{state="managed"} 1
eo_elasticsearch_cr_cluster_management_state{state="unmanaged"} 0
eo_elasticsearch_cr_restart_total
Shows the number of times the Elasticsearch nodes have restarted for certificate restarts, rolling restarts, or scheduled restarts. For example:
eo_elasticsearch_cr_restart_total{reason="cert_restart"} 1
eo_elasticsearch_cr_restart_total{reason="rolling_restart"} 1
eo_elasticsearch_cr_restart_total{reason="scheduled_restart"} 3
es_index_namespaces_total
Shows the total number of Elasticsearch index namespaces. For example:
Total number of Namespaces.
es_index_namespaces_total 5
es_index_document_count
Shows the number of records for each namespace. For example:
es_index_document_count{namespace="namespace_1"} 25
es_index_document_count{namespace="namespace_2"} 10
es_index_document_count{namespace="namespace_3"} 5
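The es_index_document_count samples above are per namespace, so a cluster-wide total requires summing them. The following bash sketch sums that metric from Prometheus text exposition lines, such as the samples above saved to a file; the function name is hypothetical, and it tolerates an optional trailing timestamp on each sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total es_index_document_count across all namespaces.
# Input (stdin): Prometheus text exposition lines; other metrics and comments are ignored.
total_documents() {
  awk '/^es_index_document_count[{ ]/ {
         line = $0
         if (index(line, "}")) sub(/^[^}]*[}]/, "", line)   # drop name and labels
         else sub(/^[^ ]*/, "", line)                        # drop the bare metric name
         split(line, f, " ")                                 # f[1] = value, f[2] = optional timestamp
         sum += f[1]
       }
       END { print sum + 0 }'
}
```

Applied to the three example samples above, total_documents prints 40.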
The "Secret Elasticsearch fields are either missing or empty" message
If Elasticsearch is missing the admin-cert, admin-key, logging-es.crt, or logging-es.key fields in its secret, a status message similar to the following example is shown:
"message": "Secret \"elasticsearch\" fields are either missing or empty: [admin-cert, admin-key, logging-es.crt, logging-es.key]",
"reason": "Missing Required Secrets",