Chapter 16. Topology Aware Lifecycle Manager for cluster updates
You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple single-node OpenShift clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.
Topology Aware Lifecycle Manager is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
16.1. About the Topology Aware Lifecycle Manager configuration
The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OpenShift Container Platform clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:
- The timing of the update
- The number of RHACM-managed clusters
- The subset of managed clusters to apply the policies to
- The update order of the clusters
- The set of policies remediated to the cluster
- The order of policies remediated to the cluster
TALM supports the orchestration of the OpenShift Container Platform y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
16.2. About managed policies used with Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.
TALM can be used to manage the rollout of any policy CR where the remediationAction field is set to inform. Supported use cases include the following:
- Manual user creation of policy CRs
- Automatically generated policies from the PolicyGenTemplate custom resource definition (CRD)
For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
For more information about managed policies, see Policy Overview in the RHACM documentation.
For more information about the PolicyGenTemplate CRD, see the "About the PolicyGenTemplate CRD" section in "Configuring managed clusters with policies and PolicyGenTemplate resources".
16.3. Installing the Topology Aware Lifecycle Manager by using the web console
You can use the OpenShift Container Platform web console to install the Topology Aware Lifecycle Manager.
Prerequisites
- Install the latest version of the RHACM Operator.
- Set up a hub cluster with a disconnected registry.
- Log in as a user with cluster-admin privileges.
Procedure
- In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
- Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode ["All namespaces on the cluster (default)"] and Installed Namespace ("openshift-operators") to ensure that the Operator is installed properly.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the All Namespaces namespace and its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the logs in any containers in the cluster-group-upgrades-controller-manager pod that are reporting issues.
16.4. Installing the Topology Aware Lifecycle Manager by using the CLI
You can use the OpenShift CLI (oc) to install the Topology Aware Lifecycle Manager (TALM).
Prerequisites
- Install the OpenShift CLI (oc).
- Install the latest version of the RHACM Operator.
- Set up a hub cluster with a disconnected registry.
- Log in as a user with cluster-admin privileges.
Procedure
- Create a Subscription CR:
  - Define the Subscription CR and save the YAML file, for example, talm-subscription.yaml:
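    The following is a minimal sketch of such a Subscription CR. The channel and the Operator package name are assumptions based on a typical TALM installation; verify them against the catalog in your environment.

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-topology-aware-lifecycle-manager-subscription
        namespace: openshift-operators
      spec:
        channel: "stable"                       # assumed channel name
        name: topology-aware-lifecycle-manager  # assumed package name in the catalog
        source: redhat-operators
        sourceNamespace: openshift-marketplace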
  - Create the Subscription CR by running the following command:

      $ oc create -f talm-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
    $ oc get csv -n openshift-operators

  Example output

    NAME                                                    DISPLAY                            VERSION               REPLACES   PHASE
    topology-aware-lifecycle-manager.4.10.0-202206301927    Topology Aware Lifecycle Manager   4.10.0-202206301927              Succeeded

  Verify that the TALM is up and running:

    $ oc get deploy -n openshift-operators

  Example output

    NAMESPACE             NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
    openshift-operators   cluster-group-upgrades-controller-manager   1/1     1            1           14s
16.5. About the ClusterGroupUpgrade CR
The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR:
- Clusters in the group
- Blocking ClusterGroupUpgrade CRs
- Applicable list of managed policies
- Number of concurrent updates
- Applicable canary updates
- Actions to perform before and after the update
- Update timing
As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade CR can have the following states:
- UpgradeNotStarted
- UpgradeCannotStart
- UpgradeNotCompleted
- UpgradeTimedOut
- UpgradeCompleted
- PrecachingRequired
After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade CR. You must create a new ClusterGroupUpgrade CR in the following cases:
- When you need to update the cluster again
- When the cluster changes to non-compliant with the inform policy after being updated
16.5.1. The UpgradeNotStarted state
The initial state of the ClusterGroupUpgrade CR is UpgradeNotStarted.
TALM builds a remediation plan based on the following fields:
- The clusterSelector field specifies the labels of the clusters that you want to update.
- The clusters field specifies a list of clusters to update.
- The canaries field specifies the clusters for canary updates.
- The maxConcurrency field specifies the number of clusters to update in a batch.
You can use the clusters and the clusterSelector fields together to create a combined list of clusters.
The remediation plan starts with the clusters listed in the canaries field. Each canary cluster forms a single-cluster batch.
Any failure during the update of a canary cluster stops the update process.
The ClusterGroupUpgrade CR transitions to the UpgradeNotCompleted state after the remediation plan is successfully created and after the enable field is set to true. At this point, TALM starts to update the non-compliant clusters with the specified managed policies.
You can only make changes to the spec fields if the ClusterGroupUpgrade CR is either in the UpgradeNotStarted or the UpgradeCannotStart state.
Sample ClusterGroupUpgrade CR in the UpgradeNotStarted state
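The following is a minimal sketch of a ClusterGroupUpgrade CR in the UpgradeNotStarted state. The cluster names, policy names, and the status condition message are illustrative assumptions; the numbered comments correspond to the callouts below.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-example
      namespace: default
    spec:
      clusters:                                  # 1
      - spoke1
      - spoke2
      enable: false
      managedPolicies:                           # 2
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      remediationStrategy:                       # 3
        canaries:                                # 4
        - spoke1
        maxConcurrency: 1                        # 5
        timeout: 240
    status:                                      # 6
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled   # illustrative message
        reason: UpgradeNotStarted
        status: "False"
        type: Ready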
1. Defines the list of clusters to update.
2. Lists the user-defined set of policies to remediate.
3. Defines the specifics of the cluster updates.
4. Defines the clusters for canary updates.
5. Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of clusters, excluding the canary clusters, divided by the maxConcurrency value. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
6. Displays information about the status of the updates.
16.5.2. The UpgradeCannotStart state
In the UpgradeCannotStart state, the update cannot start because of the following reasons:
- Blocking CRs are missing from the system
- Blocking CRs have not yet finished
16.5.3. The UpgradeNotCompleted state
In the UpgradeNotCompleted state, TALM enforces the policies following the remediation plan defined in the UpgradeNotStarted state.
Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout field divided by the number of batches in the remediation plan.
The managed policies apply in the order that they are listed in the managedPolicies field in the ClusterGroupUpgrade CR. One managed policy is applied to the specified clusters at a time. After the specified clusters comply with the current policy, the next managed policy is applied to the next non-compliant cluster.
Sample ClusterGroupUpgrade CR in the UpgradeNotCompleted state
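The following is a minimal sketch of a ClusterGroupUpgrade CR in the UpgradeNotCompleted state. The cluster and policy names are placeholder assumptions, and the status fields, in particular the per-batch progress fields under status.status, are assumptions that can differ between TALM versions. The numbered comments correspond to the callouts below.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-example
      namespace: default
    spec:
      clusters:
      - spoke1
      - spoke2
      enable: true                               # 1
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:                                      # 2
      conditions:
      - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant   # illustrative message
        reason: UpgradeNotCompleted
        status: "False"
        type: Ready
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      remediationPlan:
      - - spoke1
      - - spoke2
      status:
        currentBatch: 1
        remediationPlanForBatch:                 # 3 (field name assumed)
          spoke1: 0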
1. The update starts when the value of the spec.enable field is true.
2. The status fields change accordingly when the update begins.
3. Lists the clusters in the batch and the index of the policy that is currently being applied to each cluster. The index of the policies starts with 0 and follows the order of the status.managedPoliciesForUpgrade list.
16.5.4. The UpgradeTimedOut state
In the UpgradeTimedOut state, TALM checks every hour if all the policies for the ClusterGroupUpgrade CR are compliant. The checks continue until the ClusterGroupUpgrade CR is deleted or the updates are completed. The periodic checks allow the updates to complete if they get prolonged due to network, CPU, or other issues.
TALM transitions to the UpgradeTimedOut state in two cases:
- When the current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.
- When the clusters do not comply with the managed policies within the timeout value specified in the remediationStrategy field.
If the policies are compliant, TALM transitions to the UpgradeCompleted state.
16.5.5. The UpgradeCompleted state
In the UpgradeCompleted state, the cluster updates are complete.
Sample ClusterGroupUpgrade CR in the UpgradeCompleted state
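The following is a minimal sketch of the status section that you might see in this state. The condition message, timestamps, and field layout are illustrative assumptions only.

    status:
      conditions:
      - message: The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies
        reason: UpgradeCompleted
        status: "True"
        type: Ready
      remediationPlan:
      - - spoke1
      - - spoke2
      status:
        completedAt: "2022-02-25T10:00:00Z"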
1. The value of the spec.action.afterCompletion.deleteObjects field is true by default. After the update is completed, TALM deletes the underlying RHACM objects that were created during the update. This option prevents the RHACM hub from continuously checking for compliance after a successful update.
2. The status fields show that the updates completed successfully.
3. Displays that all the policies are applied to the cluster.
In the PrecachingRequired state, the clusters need to have images pre-cached before the update can start. For more information about pre-caching, see the "Using the container image pre-cache feature" section.
16.5.6. Blocking ClusterGroupUpgrade CRs
You can create multiple ClusterGroupUpgrade CRs and control their order of application.
For example, if you create ClusterGroupUpgrade CR C that blocks the start of ClusterGroupUpgrade CR A, then ClusterGroupUpgrade CR A cannot start until the status of ClusterGroupUpgrade CR C becomes UpgradeCompleted.
One ClusterGroupUpgrade CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
- Save the content of the ClusterGroupUpgrade CRs in the cgu-a.yaml, cgu-b.yaml, and cgu-c.yaml files.
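  The following is a minimal sketch of cgu-a.yaml. The cluster and policy names are placeholder assumptions; the numbered comment corresponds to the callout below.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-a
      namespace: default
    spec:
      blockingCRs:                               # 1
      - name: cgu-c
        namespace: default
      clusters:
      - spoke1
      - spoke2
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 1
        timeout: 240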
  1. Defines the blocking CRs. The cgu-a update cannot start until cgu-c is complete.
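  A similar sketch of cgu-b.yaml, again with placeholder cluster and policy names:

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-b
      namespace: default
    spec:
      blockingCRs:                               # 1
      - name: cgu-a
        namespace: default
      clusters:
      - spoke3
      - spoke4
      enable: false
      managedPolicies:
      - policy2-common-pao-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240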
  1. The cgu-b update cannot start until cgu-a is complete.
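  A sketch of cgu-c.yaml without any blocking CRs, again with placeholder cluster and policy names:

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-c
      namespace: default
    spec:
      clusters:
      - spoke5
      - spoke6
      enable: false
      managedPolicies:
      - policy3-common-ptp-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240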
  1. The cgu-c update does not have any blocking CRs. TALM starts the cgu-c update when the enable field is set to true.
- Create the ClusterGroupUpgrade CRs by running the following command for each relevant CR:

    $ oc apply -f <name>.yaml

- Start the update process by running the following command for each relevant CR:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \
        --type merge -p '{"spec":{"enable":true}}'

  The following examples show ClusterGroupUpgrade CRs where the enable field is set to true:

  Example for cgu-a with blocking CRs

  The spec.blockingCRs field shows the list of blocking CRs.

  Example for cgu-b with blocking CRs

  The spec.blockingCRs field shows the list of blocking CRs.

  Example for cgu-c with blocking CRs

  The cgu-c update does not have any blocking CRs.
16.6. Update policies on managed clusters
The Topology Aware Lifecycle Manager (TALM) remediates a set of inform policies for the clusters specified in the ClusterGroupUpgrade CR. TALM remediates inform policies by making enforce copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.
One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.
If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:
- If a policy’s status.compliant field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’s status.status field.
- If a policy’s status.status is missing, TALM produces an error.
- If a cluster’s compliance status is missing in the policy’s status.status field, TALM considers that cluster to be non-compliant with that policy.
For more information about RHACM policies, see Policy overview.
16.6.1. Applying update policies to managed clusters
You can update your managed clusters by applying your policies.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
- Save the contents of the ClusterGroupUpgrade CR in the cgu-1.yaml file.
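  The following is a minimal sketch of cgu-1.yaml. The policy names match the example outputs later in this procedure; the cluster names are placeholder assumptions.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-1
      namespace: default
    spec:
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      enable: false
      clusters:
      - spoke1
      - spoke2
      remediationStrategy:
        maxConcurrency: 2
        timeout: 240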
- Create the ClusterGroupUpgrade CR by running the following command:

    $ oc create -f cgu-1.yaml

- Check if the ClusterGroupUpgrade CR was created in the hub cluster by running the following command:

    $ oc get cgu --all-namespaces

  Example output

    NAMESPACE   NAME    AGE
    default     cgu-1   8m55s

- Check the status of the update by running the following command:

    $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

  The update has not started yet because the spec.enable field in the ClusterGroupUpgrade CR is set to false.
- Check the status of the policies by running the following command:

    $ oc get policies -A

  The spec.remediationAction field of policies currently applied on the clusters is set to enforce. The managed policies in inform mode from the ClusterGroupUpgrade CR remain in inform mode during the update.

- Change the value of the spec.enable field to true by running the following command:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \
        --patch '{"spec":{"enable":true}}' --type=merge
Verification
Check the status of the update again by running the following command:
    $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

  The output reflects the update progress of the current batch. Run this command again to receive updated information about the progress.
If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.
- Export the KUBECONFIG file of the single-node cluster you want to check the installation progress for by running the following command:

    $ export KUBECONFIG=<cluster_kubeconfig_absolute_path>

- Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the ClusterGroupUpgrade CR by running the following command:

    $ oc get subs -A | grep -i <subscription_name>

  Example output for cluster-logging policy

    NAMESPACE           NAME              PACKAGE           SOURCE             CHANNEL
    openshift-logging   cluster-logging   cluster-logging   redhat-operators   stable
- If one of the managed policies includes a ClusterVersion CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:

    $ oc get clusterversion

  Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.9.5     True        True          43s     Working towards 4.9.7: 71 of 735 done (9% complete)

- Check the Operator subscription by running the following command:
    $ oc get subs -n <operator-namespace> <operator-subscription> -ojsonpath="{.status}"

- Check the install plans present on the single-node cluster that is associated with the desired subscription by running the following command:

    $ oc get installplan -n <subscription_namespace>

  Example output for cluster-logging Operator

    NAMESPACE           NAME            CSV                       APPROVAL   APPROVED
    openshift-logging   install-6khtw   cluster-logging.5.3.3-4   Manual     true

  The install plans have their Approval field set to Manual and their Approved field changes from false to true after TALM approves the install plan.
Note

When TALM is remediating a policy containing a subscription, it automatically approves any install plans attached to that subscription. Where multiple install plans are needed to get the operator to the latest known version, TALM might approve multiple install plans, upgrading through one or more intermediate versions to get to the final version.
- Check if the cluster service version for the Operator of the policy that the ClusterGroupUpgrade CR is installing reached the Succeeded phase by running the following command:

    $ oc get csv -n <operator_namespace>

  Example output for OpenShift Logging Operator

    NAME                    DISPLAY                     VERSION   REPLACES   PHASE
    cluster-logging.5.4.2   Red Hat OpenShift Logging   5.4.2                Succeeded
16.7. Using the container image pre-cache feature
Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.
The time of the update is not set by TALM. You can apply the ClusterGroupUpgrade CR at the beginning of the update by manual application or by external automation.
The container image pre-caching starts when the preCaching field is set to true in the ClusterGroupUpgrade CR. After a successful pre-caching process, you can start remediating policies. The remediation actions start when the enable field is set to true.
The pre-caching process can be in the following statuses:
- PrecacheNotStarted: This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the ClusterGroupUpgrade CR. In this state, TALM deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. TALM then creates a new ManagedClusterView resource for the spoke pre-caching namespace to verify its deletion in the PrecachePreparing state.
- PrecachePreparing: Cleaning up any remaining resources from previous incomplete updates is in progress.
- PrecacheStarting: Pre-caching job prerequisites and the job are created.
- PrecacheActive: The job is in "Active" state.
- PrecacheSucceeded: The pre-cache job has succeeded.
- PrecacheTimeout: The artifact pre-caching has been partially done.
- PrecacheUnrecoverableError: The job ends with a non-zero exit code.
16.7.1. Creating a ClusterGroupUpgrade CR with pre-caching
The pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
Procedure
- Save the contents of the ClusterGroupUpgrade CR with the preCaching field set to true in the clustergroupupgrades-group-du.yaml file:
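  The following is a minimal sketch of clustergroupupgrades-group-du.yaml. The CR name and namespace match the verification output later in this procedure; the cluster and policy names are placeholder assumptions, and the numbered comment corresponds to the callout below.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: du-upgrade-4918
      namespace: ztp-group-du-sno
    spec:
      preCaching: true                           # 1
      enable: false
      clusters:
      - cnfdb1
      - cnfdb2
      managedPolicies:
      - du-upgrade-platform-upgrade
      remediationStrategy:
        maxConcurrency: 2
        timeout: 240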
  1. The preCaching field is set to true, which enables TALM to pull the container images before starting the update.
- When you want to start the update, apply the ClusterGroupUpgrade CR by running the following command:

    $ oc apply -f clustergroupupgrades-group-du.yaml
Verification
- Check if the ClusterGroupUpgrade CR exists in the hub cluster by running the following command:

    $ oc get cgu -A

  Example output

    NAMESPACE          NAME              AGE
    ztp-group-du-sno   du-upgrade-4918   10s

  The output shows that the CR is created.

- Check the status of the pre-caching task by running the following command:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

- Check the status of the pre-caching job by running the following command on the spoke cluster:

    $ oc get jobs,pods -n openshift-talm-pre-cache

  Example output

    NAME                  COMPLETIONS   DURATION   AGE
    job.batch/pre-cache   0/1           3m10s      3m10s

    NAME                     READY   STATUS    RESTARTS   AGE
    pod/pre-cache--1-9bmlr   1/1     Running   0          3m10s

- Check the status of the ClusterGroupUpgrade CR by running the following command:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

  The output shows that the pre-cache tasks are done.
16.8. Troubleshooting the Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) is an OpenShift Container Platform Operator that remediates RHACM policies. When issues occur, use the oc adm must-gather command to gather details and logs and to take steps in debugging the issues.
For more information about related topics, see the following documentation:
- Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix
- Red Hat Advanced Cluster Management Troubleshooting
- The "Troubleshooting Operator issues" section
16.8.1. General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
Is the configuration that you are applying supported?
- Are the RHACM and the OpenShift Container Platform versions compatible?
- Are the TALM and RHACM versions compatible?
Which of the following components is causing the problem?
To ensure that the ClusterGroupUpgrade configuration is functional, you can do the following:
- Create the ClusterGroupUpgrade CR with the spec.enable field set to false.
- Wait for the status to be updated and go through the troubleshooting questions.
- If everything looks as expected, set the spec.enable field to true in the ClusterGroupUpgrade CR.
After you set the spec.enable field to true in the ClusterGroupUpgrade CR, the update procedure starts and you cannot edit the CR’s spec fields anymore.
16.8.2. Cannot modify the ClusterGroupUpgrade CR
- Issue
- You cannot edit the ClusterGroupUpgrade CR after enabling the update.
- Resolution
Restart the procedure by performing the following steps:
- Remove the old ClusterGroupUpgrade CR by running the following command:

    $ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>

- Check and fix the existing issues with the managed clusters and policies.
  - Ensure that all the clusters are managed clusters and available.
  - Ensure that all the policies exist and have the spec.remediationAction field set to inform.
- Create a new ClusterGroupUpgrade CR with the correct configurations by running the following command:

    $ oc apply -f <ClusterGroupUpgradeCR_YAML>
16.8.3. Managed policies
Checking managed policies on the system
- Issue
- You want to check if you have the correct managed policies on the system.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'

  Example output

    ["group-du-sno-validator-du-validator-policy", "policy2-common-pao-sub-policy", "policy3-common-ptp-sub-policy"]
Checking remediationAction mode
- Issue
- You want to check if the remediationAction field is set to inform in the spec of the managed policies.
- Resolution
Run the following command:
    $ oc get policies --all-namespaces

  Example output

    NAMESPACE   NAME                                    REMEDIATION ACTION   COMPLIANCE STATE   AGE
    default     policy1-common-cluster-version-policy   inform               NonCompliant       5d21h
    default     policy2-common-pao-sub-policy           inform               Compliant          5d21h
    default     policy3-common-ptp-sub-policy           inform               NonCompliant       5d21h
    default     policy4-common-sriov-sub-policy         inform               NonCompliant       5d21h
Checking policy compliance state
- Issue
- You want to check the compliance state of policies.
- Resolution
Run the following command:
    $ oc get policies --all-namespaces

  Example output

    NAMESPACE   NAME                                    REMEDIATION ACTION   COMPLIANCE STATE   AGE
    default     policy1-common-cluster-version-policy   inform               NonCompliant       5d21h
    default     policy2-common-pao-sub-policy           inform               Compliant          5d21h
    default     policy3-common-ptp-sub-policy           inform               NonCompliant       5d21h
    default     policy4-common-sriov-sub-policy         inform               NonCompliant       5d21h
16.8.4. Clusters
Checking if managed clusters are present
- Issue
- You want to check if the clusters in the ClusterGroupUpgrade CR are managed clusters.
- Resolution
Run the following command:
    $ oc get managedclusters

  Example output

    NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                   JOINED   AVAILABLE   AGE
    local-cluster   true           https://api.hub.example.com:6443      True     Unknown     13d
    spoke1          true           https://api.spoke1.example.com:6443   True     True        13d
    spoke3          true           https://api.spoke3.example.com:6443   True     True        27h

  Alternatively, check the TALM manager logs:
Get the name of the TALM manager by running the following command:
    $ oc get pod -n openshift-operators

  Example output

    NAME                                                          READY   STATUS    RESTARTS   AGE
    cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp   2/2     Running   0          45m

  Check the TALM manager logs by running the following command:

    $ oc logs -n openshift-operators \
        cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

  Example output

    ERROR   controller-runtime.manager.controller.clustergroupupgrade   Reconciler error   {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"}
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

  The error message shows that the cluster is not a managed cluster.
Checking if managed clusters are available
- Issue
- You want to check if the managed clusters specified in the ClusterGroupUpgrade CR are available.
- Resolution
Run the following command:
    $ oc get managedclusters

  Example output

    NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                   JOINED   AVAILABLE   AGE
    local-cluster   true           https://api.hub.testlab.com:6443      True     Unknown     13d
    spoke1          true           https://api.spoke1.testlab.com:6443   True     True        13d
    spoke3          true           https://api.spoke3.testlab.com:6443   True     True        27h
Checking clusterSelector
- Issue
- You want to check if the clusterSelector field is specified in the ClusterGroupUpgrade CR in at least one of the managed clusters.
- Resolution
Run the following command:
    $ oc get managedcluster --selector=upgrade=true

  The label for the clusters you want to update is upgrade: true.

  Example output

    NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                   JOINED   AVAILABLE   AGE
    spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
    spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
Checking if canary clusters are present
- Issue
- You want to check if the canary clusters are present in the list of clusters.

  Example ClusterGroupUpgrade CR
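  The following is a minimal sketch of such a ClusterGroupUpgrade CR that defines canary clusters. The values are placeholder assumptions chosen to match the example outputs below.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: lab-upgrade
      namespace: default
    spec:
      clusters:
      - spoke1
      - spoke3
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 2
        timeout: 240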
- Resolution

  Run the following commands:
    $ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'

  Example output

    ["spoke1", "spoke3"]

  Check if the canary clusters are present in the list of clusters that match the clusterSelector labels by running the following command:

    $ oc get managedcluster --selector=upgrade=true

  Example output

    NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                   JOINED   AVAILABLE   AGE
    spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
    spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
A cluster can be present in spec.clusters and also be matched by the spec.clusterSelector label.
Checking the pre-caching status on spoke clusters
Check the status of pre-caching by running the following command on the spoke cluster:
    $ oc get jobs,pods -n openshift-talo-pre-cache
16.8.5. Remediation Strategy
Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
- Issue
- You want to check if the remediationStrategy is present in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'

  Example output

    {"maxConcurrency":2, "timeout":240}
Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
- Issue
- You want to check if the maxConcurrency is specified in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'

  Example output

    2
16.8.6. Topology Aware Lifecycle Manager
Checking condition message and status in the ClusterGroupUpgrade CR
- Issue
- You want to check the value of the status.conditions field in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'

  Example output

    {"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"The ClusterGroupUpgrade CR has managed policies that are missing:[policyThatDoesntExist]", "reason":"UpgradeCannotStart", "status":"False", "type":"Ready"}
Checking corresponding copied policies
- Issue
- You want to check if every policy from status.managedPoliciesForUpgrade has a corresponding policy in status.copiedPolicies.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -oyaml
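  In the output, compare the two lists under status. The following is a minimal sketch of the relevant fields, assuming a single managed policy; the naming convention of the copied policy is an assumption.

    status:
      copiedPolicies:
      - lab-upgrade-policy1-common-cluster-version-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default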
Checking if status.remediationPlan was computed
- Issue
- You want to check if status.remediationPlan was computed.
- Resolution
Run the following command:
    $ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'

  Example output

    [["spoke2", "spoke3"]]
Errors in the TALM manager container
- Issue
- You want to check the logs of the manager container of TALM.
- Resolution
Run the following command:
    $ oc logs -n openshift-operators \
        cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

  Example output

    ERROR   controller-runtime.manager.controller.clustergroupupgrade   Reconciler error   {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"}
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

  The error message shows that the specified cluster is not a managed cluster.