Chapter 8. Troubleshooting
8.1. Review your cluster notifications
When you are trying to resolve a problem with your cluster, your cluster notifications are a good source of information.
Cluster notifications are messages about the status, health, or performance of your cluster. They are also the primary way that Red Hat Site Reliability Engineering (SRE) communicates with you about cluster health and resolving problems with your cluster.
8.1.1. Viewing cluster notifications using the Red Hat Hybrid Cloud Console
Cluster notifications provide important information about the health of your cluster. You can view notifications that have been sent to your cluster in the Cluster history tab on the Red Hat Hybrid Cloud Console.
Prerequisites
- You are logged in to the Hybrid Cloud Console.
Procedure
- Navigate to the Clusters page of the Hybrid Cloud Console.
- Click the name of your cluster to go to the cluster details page.
- Click the Cluster history tab. Cluster notifications appear under the Cluster history heading.
- Optional: Filter for relevant cluster notifications. Use the filter controls to hide cluster notifications that are not relevant to you, so that you can focus on your area of expertise or on resolving a critical issue. You can filter notifications based on text in the notification description, severity level, notification type, when the notification was received, and which system or person triggered the notification.
8.2. Troubleshooting Red Hat OpenShift Service on AWS cluster installations
For help with the installation of Red Hat OpenShift Service on AWS clusters, see the following sections.
8.2.1. Installation troubleshooting
8.2.1.1. Inspect install or uninstall logs
To display install logs:
- Run the following command, replacing <cluster_name> with the name of your cluster:

  $ rosa logs install --cluster=<cluster_name>

- To watch the logs, include the --watch flag:

  $ rosa logs install --cluster=<cluster_name> --watch

To display uninstall logs:

- Run the following command, replacing <cluster_name> with the name of your cluster:

  $ rosa logs uninstall --cluster=<cluster_name>

- To watch the logs, include the --watch flag:

  $ rosa logs uninstall --cluster=<cluster_name> --watch
8.2.1.2. Verify your AWS account and quota
Run the following command to verify you have the available quota on your AWS account:
$ rosa verify quota
AWS quotas change based on region. Be sure you are verifying your quota for the correct AWS region. If you need to increase your quota, navigate to your AWS console, and request a quota increase for the service that failed.
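If your cluster targets a specific region, you can verify quota against that region explicitly. This is a minimal sketch: the --region flag is a standard rosa option, but the region value shown is only an example.

$ rosa verify quota --region=us-east-1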
8.2.1.3. AWS notification emails
When creating a cluster, the Red Hat OpenShift Service on AWS service creates small instances in all supported regions. This check ensures the AWS account being used can deploy to each supported region.
For AWS accounts that are not using all supported regions, AWS may send one or more emails confirming that "Your Request For Accessing AWS Resources Has Been Validated". Typically the sender of this email is aws-verification@amazon.com.
This is expected behavior as the Red Hat OpenShift Service on AWS service is validating your AWS account configuration.
8.2.2. Verifying installation of Red Hat OpenShift Service on AWS clusters
If the ROSA with HCP cluster is in the installing state for over 30 minutes and has not become ready, verify that the AWS account environment is prepared for the required cluster configurations. If the AWS account environment is configured correctly but the installation still does not complete, delete the cluster and retry the installation. If the problem persists, contact support.
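Before deleting the cluster, you can check its current state and provisioning details from the ROSA CLI. These are standard rosa subcommands, although the exact output fields can vary by CLI version:

$ rosa describe cluster --cluster=<cluster_name>

$ rosa logs install --cluster=<cluster_name> --watch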
8.2.3. Troubleshooting Red Hat OpenShift Service on AWS installation error codes
The following table lists Red Hat OpenShift Service on AWS installation error codes and what you can do to troubleshoot these errors.
Error code | Description | Resolution |
---|---|---|
OCM3999 | Unknown error. | Check the cluster installation logs for more details, or delete this cluster and retry cluster installation. If this issue persists, contact support by logging in to the Customer Support page. |
OCM5001 | Red Hat OpenShift Service on AWS cluster provision has failed. | Check the cluster installation logs for more details, or delete this cluster and retry cluster installation. If this issue persists, contact support by logging in to the Customer Support page. |
OCM5002 | The maximum resource tag size of 25 has been exceeded. | Check the cluster information to determine if you can remove any unnecessary tags you have specified and retry cluster installation. |
OCM5003 | Unable to establish an AWS client to provision the cluster. | You must create several role resources on your AWS account to create and manage a Red Hat OpenShift Service on AWS cluster. Ensure that your provided AWS credentials are correct and retry cluster installation. For more information about Red Hat OpenShift Service on AWS IAM role resources, see ROSA IAM role resources in the Additional resources section. |
OCM5004 | Unable to establish a cross-account AWS client to provision the cluster. | You must create several role resources on your AWS account to create and manage a Red Hat OpenShift Service on AWS cluster. Ensure that your provided AWS credentials are correct and retry cluster installation. For more information about Red Hat OpenShift Service on AWS IAM role resources, see ROSA IAM role resources in the Additional resources section. |
OCM5005 | Failed to retrieve AWS subnets defined for the cluster. | Review the provided subnet IDs and retry cluster installation. |
OCM5006 | You must configure at least one private AWS subnet for the cluster. | Review the provided subnet IDs and retry cluster installation. |
OCM5007 | Unable to create AWS STS prerequisites for the cluster. | Verify that account and operator roles have been created and are correct. For more information, see AWS STS and ROSA with HCP explained in the Additional resources section. |
OCM5008 | The provided cluster flavour is incorrect. | Verify that the provided name or ID is correct when you are using the flavour parameter and retry cluster creation. |
OCM5009 | The cluster version could not be found. | Ensure that the configured version ID matches a valid Red Hat OpenShift Service on AWS version. |
OCM5010 | Failed to tag subnets for the cluster. | Confirm that the AWS permissions and the subnet configurations are correct. You must tag at least one private subnet and, if applicable, one public subnet. |
OCM5011 | Cluster installation has failed due to unavailable capacity in the selected region. | Try your cluster installation in another region or retry cluster installation. |
8.2.4. Troubleshooting access to Red Hat Hybrid Cloud Console
In Red Hat OpenShift Service on AWS clusters, the Red Hat OpenShift Service on AWS OAuth server is hosted in the Red Hat service’s AWS account while the web console service is published using the cluster’s default ingress controller in the cluster’s AWS account. If you can log in to your cluster using the OpenShift CLI (oc) but cannot access the Red Hat OpenShift Service on AWS web console, verify the following criteria are met:
- The console workloads are running.
- The default ingress controller’s load balancer is active.
- You are accessing the console from a machine that has network connectivity to the cluster’s VPC network.
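As a rough check of the first two criteria, you can inspect the console workloads and the default ingress controller’s load balancer service from the OpenShift CLI. The namespaces and service name shown here are the usual defaults (openshift-console, openshift-ingress, and router-default), but verify them on your cluster:

$ oc get pods -n openshift-console

$ oc -n openshift-ingress get svc router-default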
8.2.5. Verifying access to Red Hat OpenShift Service on AWS web console for Red Hat OpenShift Service on AWS cluster in ready state
Red Hat OpenShift Service on AWS clusters return a ready status when the control plane hosted in the Red Hat OpenShift Service on AWS service account becomes ready. Cluster console workloads are deployed on the cluster’s worker nodes. The Red Hat OpenShift Service on AWS web console will not be available and accessible until the worker nodes have joined the cluster and console workloads are running.

If your Red Hat OpenShift Service on AWS cluster is ready but you are unable to access the Red Hat OpenShift Service on AWS web console for the cluster, wait for the worker nodes to join the cluster and retry accessing the console.
You can either log in to the Red Hat OpenShift Service on AWS cluster or use the rosa describe machinepool command in the rosa CLI to watch the nodes.
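For example, a minimal way to watch the worker nodes join, assuming you are either logged in to the cluster with oc or have the rosa CLI configured for it:

$ oc get nodes --watch

$ rosa list machinepools --cluster=<cluster_name>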
8.2.6. Verifying access to Red Hat Hybrid Cloud Console for private Red Hat OpenShift Service on AWS clusters
The console of the private cluster is private by default. During cluster installation, the default Ingress Controller managed by OpenShift’s Ingress Operator is configured with an internal AWS Network Load Balancer (NLB).
If your private Red Hat OpenShift Service on AWS cluster shows a ready status but you cannot access the Red Hat OpenShift Service on AWS web console for the cluster, try accessing the cluster console from either within the cluster VPC or from a network that is connected to the VPC.
8.3. Troubleshooting networking
This document describes how to troubleshoot networking errors.
8.3.1. Connectivity issues on clusters with private Network Load Balancers
Red Hat OpenShift Service on AWS clusters created with version 4 deploy AWS Network Load Balancers (NLBs) by default for the default ingress controller. In the case of a private NLB, the NLB’s client IP address preservation might cause connections to be dropped where the source and destination are the same host. See the AWS documentation about how to troubleshoot your Network Load Balancer. Because of this IP address preservation, customer workloads that are located on the same node as the router pods might not be able to send traffic to the private NLB that fronts the ingress controller.

To mitigate this impact, reschedule your workloads onto nodes separate from those where the router pods are scheduled. Alternatively, rely on the internal pod and service networks to access other workloads co-located within the same cluster.
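To identify which nodes currently host the router pods, so that you can place affected workloads elsewhere, you can list the pods in the default ingress namespace with node information. The openshift-ingress namespace is the usual default, but confirm it on your cluster:

$ oc -n openshift-ingress get pods -o wide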
8.4. Verifying node health
8.4.1. Reviewing node status, resource usage, and configuration
Review cluster node health status, resource consumption statistics, and node logs. Additionally, query kubelet status on individual nodes.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
- List the name, status, and role for all nodes in the cluster:

  $ oc get nodes

- Summarize CPU and memory usage for each node within the cluster:

  $ oc adm top nodes

- Summarize CPU and memory usage for a specific node:

  $ oc adm top node my-node
8.5. Troubleshooting Operator issues
Operators are a method of packaging, deploying, and managing a Red Hat OpenShift Service on AWS application. They act like an extension of the software vendor’s engineering team, watching over a Red Hat OpenShift Service on AWS environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, such as skipping a software backup process to save time.
Red Hat OpenShift Service on AWS 4 includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).
As a cluster administrator, you can install application Operators from the OperatorHub using the Red Hat OpenShift Service on AWS web console or the CLI. You can then subscribe the Operator to one or more namespaces to make it available for developers on your cluster. Application Operators are managed by Operator Lifecycle Manager (OLM).
If you experience Operator issues, verify Operator subscription status. Check Operator pod health across the cluster and gather Operator logs for diagnosis.
8.5.1. Operator subscription condition types
Subscriptions can report the following condition types:
Condition | Description |
---|---|
CatalogSourcesUnhealthy | Some or all of the catalog sources to be used in resolution are unhealthy. |
InstallPlanMissing | An install plan for a subscription is missing. |
InstallPlanPending | An install plan for a subscription is pending installation. |
InstallPlanFailed | An install plan for a subscription has failed. |
ResolutionFailed | The dependency resolution for a subscription has failed. |
Default Red Hat OpenShift Service on AWS cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription object.
8.5.2. Viewing Operator subscription status by using the CLI
You can view Operator subscription status by using the CLI.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
- List Operator subscriptions:

  $ oc get subs -n <operator_namespace>

- Use the oc describe command to inspect a Subscription resource:

  $ oc describe sub <subscription_name> -n <operator_namespace>

  In the command output, find the Conditions section for the status of Operator subscription condition types. In the following example, the CatalogSourcesUnhealthy condition type has a status of false because all available catalog sources are healthy.

Default Red Hat OpenShift Service on AWS cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription object.
8.5.3. Viewing Operator catalog source status by using the CLI
You can view the status of an Operator catalog source by using the CLI.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
- List the catalog sources in a namespace. For example, you can check the openshift-marketplace namespace, which is used for cluster-wide catalog sources:

  $ oc get catalogsources -n openshift-marketplace

  Example output

  NAME                  DISPLAY               TYPE   PUBLISHER     AGE
  certified-operators   Certified Operators   grpc   Red Hat       55m
  community-operators   Community Operators   grpc   Red Hat       55m
  example-catalog       Example Catalog       grpc   Example Org   2m25s
  redhat-operators      Red Hat Operators     grpc   Red Hat       55m

- Use the oc describe command to get more details and status about a catalog source:

  $ oc describe catalogsource example-catalog -n openshift-marketplace

  In the command output, check the last observed state of the catalog source. In this example, the last observed state is TRANSIENT_FAILURE. This state indicates that there is a problem establishing a connection for the catalog source.

- List the pods in the namespace where your catalog source was created:

  $ oc get pods -n openshift-marketplace

  When a catalog source is created in a namespace, a pod for the catalog source is created in that namespace. In this example, the status for the example-catalog-bwt8z pod is ImagePullBackOff. This status indicates that there is an issue pulling the catalog source’s index image.

- Use the oc describe command to inspect a pod for more detailed information:

  $ oc describe pod example-catalog-bwt8z -n openshift-marketplace

  In the command output, error messages indicate that the catalog source’s index image is failing to pull successfully because of an authorization issue. For example, the index image might be stored in a registry that requires login credentials.
8.5.4. Querying Operator pod status
You can list Operator pods within a cluster and their status. You can also collect a detailed Operator pod summary.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
- List Operators running in the cluster. The output includes Operator version, availability, and up-time information:

  $ oc get clusteroperators

- List Operator pods running in the Operator’s namespace, plus pod status, restarts, and age:

  $ oc get pod -n <operator_namespace>

- Output a detailed Operator pod summary:

  $ oc describe pod <operator_pod_name> -n <operator_namespace>
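To gather Operator logs for diagnosis, as mentioned in the introduction to this section, you can read the logs of an Operator pod identified in the previous steps. Add -c <container_name> if the pod runs more than one container:

$ oc logs pod/<operator_pod_name> -n <operator_namespace>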
8.6. Investigating pod issues
Red Hat OpenShift Service on AWS leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host. A pod is the smallest compute unit that can be defined, deployed, and managed on Red Hat OpenShift Service on AWS 4.
After a pod is defined, it is assigned to run on a node until its containers exit, or until it is removed. Depending on policy and exit code, pods are either removed after exiting or retained so that their logs can be accessed.
The first thing to check when pod issues arise is the pod’s status. If an explicit pod failure has occurred, observe the pod’s error state to identify specific image, container, or pod network issues. Focus diagnostic data collection according to the error state. Review pod event messages, as well as pod and container log information. Diagnose issues dynamically by accessing running pods on the command line, or start a debug pod with root access based on a problematic pod’s deployment configuration.
8.6.1. Understanding pod error states
Pod failures return explicit error states that can be observed in the status field in the output of oc get pods. Pod error states cover image, container, and container network related failures.
The following table provides a list of pod error states along with their descriptions.
Pod error state | Description |
---|---|
ErrImagePull | Generic image retrieval error. |
ErrImagePullBackOff | Image retrieval failed and is backed off. |
ErrInvalidImageName | The specified image name was invalid. |
ErrImageInspect | Image inspection did not succeed. |
ErrImageNeverPull | The image pull policy is set to Never and the target image is not present locally on the host. |
ErrRegistryUnavailable | When attempting to retrieve an image from a registry, an HTTP error was encountered. |
ErrContainerNotFound | The specified container is either not present or not managed by the kubelet, within the declared pod. |
ErrRunInitContainer | Container initialization failed. |
ErrRunContainer | None of the pod’s containers started successfully. |
ErrKillContainer | None of the pod’s containers were killed successfully. |
ErrCrashLoopBackOff | A container has terminated. The kubelet will not attempt to restart it. |
ErrVerifyNonRoot | A container or image attempted to run with root privileges. |
ErrCreatePodSandbox | Pod sandbox creation did not succeed. |
ErrConfigPodSandbox | Pod sandbox configuration was not obtained. |
ErrKillPodSandbox | A pod sandbox did not stop successfully. |
ErrSetupNetwork | Network initialization failed. |
ErrTeardownNetwork | Network termination failed. |
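A quick way to surface pods that are currently in one of these error states is to filter the pod list for anything that is not Running or Completed. This is only a convenience sketch and assumes that you are permitted to list pods in all namespaces:

$ oc get pods -A | grep -Ev 'Running|Completed'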
8.6.2. Reviewing pod status
You can query pod status and error states. You can also query a pod’s associated deployment configuration and review base image availability.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- You have installed the OpenShift CLI (oc).
- skopeo is installed.
Procedure
- Switch into a project:

  $ oc project <project_name>

- List pods running within the namespace, as well as pod status, error states, restarts, and age:

  $ oc get pods

- Determine whether the namespace is managed by a deployment configuration:

  $ oc status

  If the namespace is managed by a deployment configuration, the output includes the deployment configuration name and a base image reference.

- Inspect the base image referenced in the preceding command’s output:

  $ skopeo inspect docker://<image_reference>

- If the base image reference is not correct, update the reference in the deployment configuration:

  $ oc edit deployment/my-deployment

- When the deployment configuration changes are saved on exit, the configuration automatically redeploys. Watch pod status as the deployment progresses to determine whether the issue has been resolved:

  $ oc get pods -w

- Review events within the namespace for diagnostic information relating to pod failures:

  $ oc get events
8.6.3. Inspecting pod and container logs
You can inspect pod and container logs for warnings and error messages related to explicit pod failures. Depending on policy and exit code, pod and container logs remain available after pods have been terminated.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
- Query logs for a specific pod:

  $ oc logs <pod_name>

- Query logs for a specific container within a pod:

  $ oc logs <pod_name> -c <container_name>

  Logs retrieved using the preceding oc logs commands are composed of messages sent to stdout within pods or containers.

- Inspect logs contained in /var/log/ within a pod.

  - List log files and subdirectories contained in /var/log within a pod:

    $ oc exec <pod_name> -- ls -alh /var/log

  - Query a specific log file contained in /var/log within a pod:

    $ oc exec <pod_name> cat /var/log/<path_to_log>

  - List log files and subdirectories contained in /var/log within a specific container:

    $ oc exec <pod_name> -c <container_name> ls /var/log

  - Query a specific log file contained in /var/log within a specific container:

    $ oc exec <pod_name> -c <container_name> cat /var/log/<path_to_log>
8.6.4. Accessing running pods
You can review running pods dynamically by opening a shell inside a pod or by gaining network access through port forwarding.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
- Switch into the project that contains the pod you would like to access. This is necessary because the oc rsh command does not accept the -n namespace option:

  $ oc project <namespace>

- Start a remote shell into a pod:

  $ oc rsh <pod_name>

  If a pod has multiple containers, oc rsh defaults to the first container unless -c <container_name> is specified.

- Start a remote shell into a specific container within a pod:

  $ oc rsh -c <container_name> pod/<pod_name>

- Create a port forwarding session to a port on a pod:

  $ oc port-forward <pod_name> <host_port>:<pod_port>

  Enter Ctrl+C to cancel the port forwarding session.
8.6.5. Starting debug pods with root access
You can start a debug pod with root access, based on a problematic pod’s deployment or deployment configuration. Pod users typically run with non-root privileges, but running troubleshooting pods with temporary root privileges can be useful during issue investigation.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
Start a debug pod with root access, based on a deployment.

- Obtain a project’s deployment name:

  $ oc get deployment -n <project_name>

- Start a debug pod with root privileges, based on the deployment:

  $ oc debug deployment/my-deployment --as-root -n <project_name>

Start a debug pod with root access, based on a deployment configuration.

- Obtain a project’s deployment configuration name:

  $ oc get deploymentconfigs -n <project_name>

- Start a debug pod with root privileges, based on the deployment configuration:

  $ oc debug deploymentconfig/my-deployment-configuration --as-root -n <project_name>

You can append -- <command> to the preceding oc debug commands to run individual commands within a debug pod, instead of running an interactive shell.
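For example, to confirm that the debug pod runs with root privileges without opening an interactive shell, you could append a single command such as id. The deployment name here is only illustrative:

$ oc debug deployment/my-deployment --as-root -n <project_name> -- id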
8.6.6. Copying files to and from pods and containers
You can copy files to and from a pod to test configuration changes or gather diagnostic information.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
- Copy a file to a pod:

  $ oc cp <local_path> <pod_name>:/<path> -c <container_name>

  The first container in a pod is selected if the -c option is not specified.

- Copy a file from a pod:

  $ oc cp <pod_name>:/<path> -c <container_name> <local_path>

  The first container in a pod is selected if the -c option is not specified.

Note: For oc cp to function, the tar binary must be available within the container.
8.7. Troubleshooting the Source-to-Image process
8.7.1. Strategies for Source-to-Image troubleshooting
Use Source-to-Image (S2I) to build reproducible, Docker-formatted container images. You can create ready-to-run images by injecting application source code into a container image and assembling a new image. The new image incorporates the base image (the builder) and built source.
To determine where in the S2I process a failure occurs, you can observe the state of the pods relating to each of the following S2I stages:
- During the build configuration stage, a build pod is used to create an application container image from a base image and application source code.
- During the deployment configuration stage, a deployment pod is used to deploy application pods from the application container image that was built in the build configuration stage. The deployment pod also deploys other resources such as services and routes. The deployment configuration begins after the build configuration succeeds.
- After the deployment pod has started the application pods, application failures can occur within the running application pods. For instance, an application might not behave as expected even though the application pods are in a Running state. In this scenario, you can access running application pods to investigate application failures within a pod.
When troubleshooting S2I issues, follow this strategy:
- Monitor build, deployment, and application pod status
- Determine the stage of the S2I process where the problem occurred
- Review logs corresponding to the failed stage
8.7.2. Gathering Source-to-Image diagnostic data
The S2I tool runs a build pod and a deployment pod in sequence. The deployment pod is responsible for deploying the application pods based on the application container image created in the build stage. Watch build, deployment and application pod status to determine where in the S2I process a failure occurs. Then, focus diagnostic data collection accordingly.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- Your API service is still functional.
- You have installed the OpenShift CLI (oc).
Procedure
- Watch the pod status throughout the S2I process to determine at which stage a failure occurs:

  $ oc get pods -w

  Use -w to monitor pods for changes until you quit the command using Ctrl+C.

- Review a failed pod’s logs for errors.

  - If the build pod fails, review the build pod’s logs:

    $ oc logs -f pod/<application_name>-<build_number>-build

    Note: Alternatively, you can review the build configuration’s logs using oc logs -f bc/<application_name>. The build configuration’s logs include the logs from the build pod.

  - If the deployment pod fails, review the deployment pod’s logs:

    $ oc logs -f pod/<application_name>-<build_number>-deploy

    Note: Alternatively, you can review the deployment configuration’s logs using oc logs -f dc/<application_name>. This outputs logs from the deployment pod until the deployment pod completes successfully. The command outputs logs from the application pods if you run it after the deployment pod has completed. After a deployment pod completes, its logs can still be accessed by running oc logs -f pod/<application_name>-<build_number>-deploy.

  - If an application pod fails, or if an application is not behaving as expected within a running application pod, review the application pod’s logs:

    $ oc logs -f pod/<application_name>-<build_number>-<random_string>
8.7.3. Gathering application diagnostic data to investigate application failures
Application failures can occur within running application pods. In these situations, you can retrieve diagnostic information with these strategies:
- Review events relating to the application pods.
- Review the logs from the application pods, including application-specific log files that are not collected by the OpenShift Logging framework.
- Test application functionality interactively and run diagnostic tools in an application container.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
- List events relating to a specific application pod. The following example retrieves events for an application pod named my-app-1-akdlg:

  $ oc describe pod/my-app-1-akdlg

- Review logs from an application pod:

  $ oc logs -f pod/my-app-1-akdlg

- Query specific logs within a running application pod. Logs that are sent to stdout are collected by the OpenShift Logging framework and are included in the output of the preceding command. The following query is only required for logs that are not sent to stdout.

  - If an application log can be accessed without root privileges within a pod, concatenate the log file as follows:

    $ oc exec my-app-1-akdlg -- cat /var/log/my-application.log

  - If root access is required to view an application log, you can start a debug container with root privileges and then view the log file from within the container. Start the debug container from the project’s DeploymentConfig object. Pod users typically run with non-root privileges, but running troubleshooting pods with temporary root privileges can be useful during issue investigation:

    $ oc debug dc/my-deployment-configuration --as-root -- cat /var/log/my-application.log

    Note: You can access an interactive shell with root access within the debug pod if you run oc debug dc/<deployment_configuration> --as-root without appending -- <command>.

- Test application functionality interactively and run diagnostic tools in an application container with an interactive shell.

  - Start an interactive shell on the application container:

    $ oc exec -it my-app-1-akdlg /bin/bash

  - Test application functionality interactively from within the shell. For example, you can run the container’s entry point command and observe the results. Then, test changes from the command line directly, before updating the source code and rebuilding the application container through the S2I process.

  - Run diagnostic binaries available within the container.

    Note: Root privileges are required to run some diagnostic binaries. In these situations you can start a debug pod with root access, based on a problematic pod’s DeploymentConfig object, by running oc debug dc/<deployment_configuration> --as-root. Then, you can run diagnostic binaries as root from within the debug pod.

- If diagnostic binaries are not available within a container, you can run a host’s diagnostic binaries within a container’s namespace by using nsenter. The following example runs ip ad within a container’s namespace, using the host’s ip binary.

  - Enter into a debug session on the target node. This step instantiates a debug pod called <node_name>-debug:

    $ oc debug node/my-cluster-node

  - Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:

    # chroot /host

    Note: Red Hat OpenShift Service on AWS 4 cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended. However, if the Red Hat OpenShift Service on AWS API is not available, or the kubelet is not properly functioning on the target node, oc operations will be impacted. In such situations, it is possible to access nodes using ssh core@<node>.<cluster_name>.<base_domain> instead.

  - Determine the target container ID:

    # crictl ps

  - Determine the container’s process ID. In this example, the target container ID is a7fe32346b120:

    # crictl inspect a7fe32346b120 --output yaml | grep 'pid:' | awk '{print $2}'

  - Run ip ad within the container’s namespace, using the host’s ip binary. This example uses 31150 as the container’s process ID. The nsenter command enters the namespace of a target process and runs a command in its namespace. Because the target process in this example is a container’s process ID, the ip ad command is run in the container’s namespace from the host:

    # nsenter -n -t 31150 -- ip ad

    Note: Running a host’s diagnostic binaries within a container’s namespace is only possible if you are using a privileged container such as a debug node.
8.8. Troubleshooting storage issues
8.8.1. Resolving multi-attach errors
When a node crashes or shuts down abruptly, the attached ReadWriteOnce (RWO) volume is expected to be unmounted from the node so that it can be used by a pod scheduled on another node.
However, mounting on a new node is not possible because the failed node is unable to unmount the attached volume.
A multi-attach error is reported:
Example output
Unable to attach or mount volumes: unmounted volumes=[sso-mysql-pvol], unattached volumes=[sso-mysql-pvol default-token-x4rzc]: timed out waiting for the condition
Multi-Attach error for volume "pvc-8837384d-69d7-40b2-b2e6-5df86943eef9" Volume is already used by pod(s) sso-mysql-1-ns6b4
Procedure
To resolve the multi-attach issue, use one of the following solutions:

- Enable multiple attachments by using RWX volumes.

  For most storage solutions, you can use ReadWriteMany (RWX) volumes to prevent multi-attach errors.

- Recover or delete the failed node when using an RWO volume.

  For storage that does not support RWX, such as VMware vSphere, RWO volumes must be used instead. However, RWO volumes cannot be mounted on multiple nodes.

  If you encounter a multi-attach error message with an RWO volume, force delete the pod on a shutdown or crashed node to avoid data loss in critical workloads, such as when dynamic persistent volumes are attached:

  $ oc delete pod <old_pod> --force=true --grace-period=0

  This command deletes the volumes stuck on shutdown or crashed nodes after six minutes.
8.9. Investigating monitoring issues
Red Hat OpenShift Service on AWS includes a preconfigured, preinstalled, and self-updating monitoring stack that provides monitoring for core platform components. In Red Hat OpenShift Service on AWS 4, cluster administrators can optionally enable monitoring for user-defined projects.
Use these procedures if the following issues occur:
- Your own metrics are unavailable.
- Prometheus is consuming a lot of disk space.
- The KubePersistentVolumeFillingUp alert is firing for Prometheus.
8.9.2. Determining why Prometheus is consuming a lot of disk space
Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.
Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
You can use the following measures when Prometheus consumes a lot of disk space:
- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
- Check the number of scrape samples that are being collected.
- Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.

  Note: Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.
Prerequisites
- You have access to the cluster as a user with the dedicated-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
- In the Red Hat OpenShift Service on AWS web console, go to Observe → Metrics.
- Enter a Prometheus Query Language (PromQL) query in the Expression field. The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
  - By running the following query, you can identify the ten jobs that have the highest number of scrape samples:

    topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))

  - By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour:

    topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core Red Hat OpenShift Service on AWS project, create a Red Hat support case on the Red Hat Customer Portal.
Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a dedicated-admin:

- Get the Prometheus API route URL by running the following command:

  $ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath='{.status.ingress[].host}')

- Extract an authentication token by running the following command:

  $ TOKEN=$(oc whoami -t)

- Query the TSDB status for Prometheus by running the following command:

  $ curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"
8.10. Diagnosing OpenShift CLI (oc) issues
8.10.1. Understanding OpenShift CLI (oc) log levels
With the OpenShift CLI (oc), you can create applications and manage Red Hat OpenShift Service on AWS projects from a terminal.
If oc command-specific issues arise, increase the oc log level to output API request, API response, and curl request details generated by the command. This provides a granular view of a particular oc command’s underlying operation, which in turn might provide insight into the nature of a failure.
oc log levels range from 1 to 10. The following table provides a list of oc log levels, along with their descriptions.
Log level | Description |
---|---|
1 to 5 | No additional logging to stderr. |
6 | Log API requests to stderr. |
7 | Log API requests and headers to stderr. |
8 | Log API requests, headers, and body, plus API response headers and body to stderr. |
9 | Log API requests, headers, and body, API response headers and body, plus curl requests to stderr. |
10 | Log API requests, headers, and body, API response headers and body, plus curl requests to stderr, in verbose detail. |
8.10.2. Specifying OpenShift CLI (oc) log levels
You can investigate OpenShift CLI (oc) issues by increasing the command’s log level.

The Red Hat OpenShift Service on AWS user’s current session token is typically included in logged curl requests where required. You can also obtain the current user’s session token manually, for use when testing aspects of an oc command’s underlying process step-by-step.
Prerequisites
- Install the OpenShift CLI (oc).
Procedure
- Specify the oc log level when running an oc command:

  $ oc <command> --loglevel <log_level>

  where:

  - <command> specifies the command you are running.
  - <log_level> specifies the log level to apply to the command.
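For example, to print the API requests and responses generated by an ordinary command, you could run it at log level 8. The command shown here is only an illustration, as any oc command accepts the flag:

$ oc get pods --loglevel 8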
- To obtain the current user’s session token, run the following command:

  $ oc whoami -t

  Example output

  sha256~RCV3Qcn7H-OEfqCGVI0CvnZ6...
8.11. Troubleshooting expired tokens
8.11.1. Troubleshooting expired offline access tokens
If you use the Red Hat OpenShift Service on AWS (ROSA) CLI, rosa, and your api.openshift.com offline access token expires, an error message appears. This happens when sso.redhat.com invalidates the token.
Example output
Can't get tokens ....
Can't get access tokens ....
Procedure
Generate a new offline access token at the following URL. A new offline access token is generated every time you visit the URL.
- Red Hat OpenShift Service on AWS (ROSA): https://console.redhat.com/openshift/token/rosa
8.12. Troubleshooting IAM roles
8.12.1. Resolving issues with ocm-roles and user-role IAM resources
You may receive an error when trying to create a cluster using the Red Hat OpenShift Service on AWS (ROSA) CLI, rosa.
Example output
E: Failed to create cluster: The sts_user_role is not linked to account '1oNl'. Please create a user role and link it to the account.
This error means that the user-role IAM role is not linked to your AWS account. The most likely cause of this error is that another user in your Red Hat organization created the ocm-role IAM role. Your user-role IAM role needs to be created.
After any user sets up an ocm-role IAM resource linked to a Red Hat account, any subsequent users wishing to create a cluster in that Red Hat organization must have a user-role IAM role to provision a cluster.
Procedure
- Assess the status of your ocm-role and user-role IAM roles with the following commands:

  $ rosa list ocm-role

  Example output

  I: Fetching ocm roles
  ROLE NAME                        ROLE ARN                                                 LINKED  ADMIN
  ManagedOpenShift-OCM-Role-1158   arn:aws:iam::2066:role/ManagedOpenShift-OCM-Role-1158   No      No

  $ rosa list user-role

  Example output

  I: Fetching user roles
  ROLE NAME                           ROLE ARN                                                    LINKED
  ManagedOpenShift-User.osdocs-Role   arn:aws:iam::2066:role/ManagedOpenShift-User.osdocs-Role   Yes
With the results of these commands, you can create and link the missing IAM resources.
8.12.1.1. Creating an ocm-role IAM role
You create your ocm-role IAM roles by using the command-line interface (CLI).

Prerequisites
- You have an AWS account.
- You have Red Hat Organization Administrator privileges in the OpenShift Cluster Manager organization.
- You have the permissions required to install AWS account-wide roles.
- You have installed and configured the latest ROSA CLI, rosa, on your installation host.
Procedure
- To create an ocm-role IAM role with basic privileges, run the following command:

  $ rosa create ocm-role

- To create an ocm-role IAM role with admin privileges, run the following command:

  $ rosa create ocm-role --admin

This command allows you to create the role by specifying specific attributes. The following example output shows the "auto mode" selected, which lets the ROSA CLI (rosa) create your Operator roles and policies. See "Methods of account-wide role creation" for more information.

The interactive prompts cover the following:

1. A prefix value for all of the created AWS resources. In this example, ManagedOpenShift prepends all of the AWS resources.
2. Choose if you want this role to have the additional admin permissions. Note: You do not see this prompt if you used the --admin option.
3. The Amazon Resource Name (ARN) of the policy to set permission boundaries.
4. Specify an IAM path for the user name.
5. Choose the method to create your AWS roles. Using auto, the ROSA CLI generates and links the roles and policies. In the auto mode, you receive some different prompts to create the AWS roles.
6. The auto method asks if you want to create a specific ocm-role using your prefix.
7. Confirm that you want to associate your IAM role with your OpenShift Cluster Manager.
8. Links the created role with your AWS organization.
8.12.1.2. Creating a user-role IAM role
You can create your user-role IAM roles by using the command-line interface (CLI).

Prerequisites
- You have an AWS account.
- You have installed and configured the latest ROSA CLI, rosa, on your installation host.
Procedure
- To create a user-role IAM role with basic privileges, run the following command:

  $ rosa create user-role

This command allows you to create the role by specifying specific attributes. The following example output shows the "auto mode" selected, which lets the ROSA CLI (rosa) create your Operator roles and policies. See "Understanding the auto and manual deployment modes" for more information.

The interactive prompts cover the following:

1. A prefix value for all of the created AWS resources. In this example, ManagedOpenShift prepends all of the AWS resources.
2. The Amazon Resource Name (ARN) of the policy to set permission boundaries.
3. Specify an IAM path for the user name.
4. Choose the method to create your AWS roles. Using auto, the ROSA CLI generates and links the roles and policies. In the auto mode, you receive some different prompts to create the AWS roles.
5. The auto method asks if you want to create a specific user-role using your prefix.
6. Links the created role with your AWS organization.
8.12.1.3. Associating your AWS account with IAM roles
You can associate or link your AWS account with existing IAM roles by using the ROSA CLI, rosa.

Prerequisites
- You have an AWS account.
- You have the permissions required to install AWS account-wide roles. See the "Additional resources" of this section for more information.
- You have installed and configured the latest AWS (aws) and ROSA (rosa) CLIs on your installation host.
- You have created the ocm-role and user-role IAM roles, but have not yet linked them to your AWS account. You can check whether your IAM roles are already linked by running the following commands:

  $ rosa list ocm-role

  $ rosa list user-role

  If Yes is displayed in the Linked column for both roles, you have already linked the roles to an AWS account.
Procedure
In the ROSA CLI, link your ocm-role resource to your Red Hat organization by using your Amazon Resource Name (ARN):

Note: You must have Red Hat Organization Administrator privileges to run the rosa link command. After you link the ocm-role resource with your AWS account, it takes effect and is visible to all users in the organization.

$ rosa link ocm-role --role-arn <arn>

Example output

I: Linking OCM role
? Link the '<AWS ACCOUNT ID>' role with organization '<ORG ID>'? Yes
I: Successfully linked role-arn '<AWS ACCOUNT ID>' with organization account '<ORG ID>'

In the ROSA CLI, link your user-role resource to your Red Hat user account by using your Amazon Resource Name (ARN):

$ rosa link user-role --role-arn <arn>

Example output

I: Linking User role
? Link the 'arn:aws:iam::<ARN>:role/ManagedOpenShift-User-Role-125' role with organization '<AWS ID>'? Yes
I: Successfully linked role-arn 'arn:aws:iam::<ARN>:role/ManagedOpenShift-User-Role-125' with organization account '<AWS ID>'
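If you do not have the role ARNs at hand, you can look them up before running the rosa link commands. The following is a minimal sketch using the AWS CLI; it assumes the roles were created with the default ManagedOpenShift naming, so adjust the name filters if you used a custom prefix:

# Hedged sketch: list candidate role ARNs, assuming the default role naming.
$ aws iam list-roles --query "Roles[?contains(RoleName, 'OCM-Role')].Arn" --output text
$ aws iam list-roles --query "Roles[?contains(RoleName, 'User-Role')].Arn" --output text

The rosa list ocm-role and rosa list user-role commands shown in the prerequisites also display the ARNs of existing roles.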
8.12.1.4. Associating multiple AWS accounts with your Red Hat organization
You can associate multiple AWS accounts with your Red Hat organization. Associating multiple accounts lets you create Red Hat OpenShift Service on AWS clusters on any of the associated AWS accounts from your Red Hat organization.
With this capability, you can create clusters on different AWS profiles according to characteristics that make sense for your business, for example, by using one AWS profile for each region to create region-bound environments.
Prerequisites
- You have an AWS account.
- You are using OpenShift Cluster Manager to create clusters.
- You have the permissions required to install AWS account-wide roles.
- You have installed and configured the latest AWS (aws) and ROSA (rosa) CLIs on your installation host.
- You have created the ocm-role and user-role IAM roles for Red Hat OpenShift Service on AWS.
Procedure
To associate an additional AWS account, first create a profile in your local AWS configuration. Then, associate the account with your Red Hat organization by creating the ocm-role, user-role, and account roles in the additional AWS account.
To create the roles in an additional region, specify the --profile <aws_profile> parameter when running the rosa create commands, and replace <aws_profile> with the additional account profile name:
To specify an AWS account profile when creating an OpenShift Cluster Manager role:

$ rosa create --profile <aws_profile> ocm-role

To specify an AWS account profile when creating a user role:

$ rosa create --profile <aws_profile> user-role

To specify an AWS account profile when creating the account roles:

$ rosa create --profile <aws_profile> account-roles
If you do not specify a profile, the default AWS profile and its associated AWS region are used.
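You define a local AWS profile with the AWS CLI before running the rosa create commands above. The following is a minimal sketch; the profile name <aws_profile> is a placeholder, and the interactive prompts store the credentials and default region for the additional account in your AWS configuration files:

# Hedged sketch: define a named profile for the additional AWS account.
$ aws configure --profile <aws_profile>
# Confirm which account the profile resolves to before creating roles with it.
$ aws sts get-caller-identity --profile <aws_profile>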
8.13. Troubleshooting Red Hat OpenShift Service on AWS cluster deployments
This document describes how to troubleshoot cluster deployment errors.
8.13.1. Obtaining information about a failed cluster
If a cluster deployment fails, the cluster is put into an "error" state.
Procedure
Run the following command to get more information:
$ rosa describe cluster -c <my_cluster_name> --debug
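If you are not sure which cluster failed, you can first list your clusters and look for one whose state indicates an error. A minimal sketch:

# List clusters and check the cluster state before describing the failed one.
$ rosa list clusters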
8.13.2. Troubleshooting cluster creation with an osdCcsAdmin error
If a cluster creation action fails, you might receive the following error message.
Example output
Failed to create cluster: Unable to create cluster spec: Failed to get access keys for user 'osdCcsAdmin': NoSuchEntity: The user with name osdCcsAdmin cannot be found.
Procedure
To fix this issue:
Delete the stack:
$ rosa init --delete

Reinitialize your account:

$ rosa init
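After reinitializing, you can optionally confirm that the osdCcsAdmin IAM user exists again before retrying the cluster creation. A minimal check with the AWS CLI, assuming your default profile points at the affected account:

# Verify that the osdCcsAdmin IAM user was recreated by rosa init.
$ aws iam get-user --user-name osdCcsAdmin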
8.13.3. Creating the Elastic Load Balancing (ELB) service-linked role
If you have not created a load balancer in your AWS account, the service-linked role for Elastic Load Balancing (ELB) might not exist yet. You may receive the following error:
Error: Error creating network Load Balancer: AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/ManagedOpenShift-Installer-Role/xxxxxxxxxxxxxxxxxxx is not authorized to perform: iam:CreateServiceLinkedRole on resource: arn:aws:iam::xxxxxxxxxxxx:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
Procedure
To resolve this issue, ensure that the role exists on your AWS account. If not, create this role with the following command:
$ aws iam get-role --role-name "AWSServiceRoleForElasticLoadBalancing" || aws iam create-service-linked-role --aws-service-name "elasticloadbalancing.amazonaws.com"

Note: This command only needs to be executed once per account.
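To confirm that the role is now present, for example in another account or profile, you can query just its ARN. A minimal check:

# Print the ARN of the ELB service-linked role; an error here means the role still does not exist.
$ aws iam get-role --role-name "AWSServiceRoleForElasticLoadBalancing" --query 'Role.Arn' --output text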
8.13.4. Repairing a cluster that cannot be deleted
In specific cases, the following error appears in OpenShift Cluster Manager if you attempt to delete your cluster.
Error deleting cluster
CLUSTERS-MGMT-400: Failed to delete cluster <hash>: sts_user_role is not linked to your account. sts_ocm_role is linked to your organization <org number> which requires sts_user_role to be linked to your Red Hat account <account ID>.Please create a user role and link it to the account: User Account <account ID> is not authorized to perform STS cluster operations
Operation ID: b0572d6e-fe54-499b-8c97-46bf6890011c
If you try to delete your cluster from the CLI, the following error appears.
E: Failed to delete cluster <hash>: sts_user_role is not linked to your account. sts_ocm_role is linked to your organization <org_number> which requires sts_user_role to be linked to your Red Hat account <account_id>.Please create a user role and link it to the account: User Account <account ID> is not authorized to perform STS cluster operations
This error occurs when the user-role is unlinked or deleted.
Procedure
Run the following command to create the user-role IAM resource:

$ rosa create user-role

After you see that the role has been created, you can delete the cluster. The following output confirms that the role was created and linked:

I: Successfully linked role ARN <user role ARN> with account <account ID>
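Once the role is created and linked, you can verify the link and retry the deletion. A minimal sketch, where <cluster_name> is a placeholder for your cluster:

# Confirm that the user-role now shows Yes in the Linked column, then retry the delete.
$ rosa list user-role
$ rosa delete cluster --cluster=<cluster_name>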