1.13. Troubleshooting imported clusters that are offline after a certificate change
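Installing a custom apiserver certificate is supported, but one or more clusters that were imported before you changed the certificate information can have an offline status.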
1.13.1. Symptom: Clusters offline after certificate change
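After you complete the procedure for updating a certificate secret, one or more clusters that were online now show an offline status in the console.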
1.13.2. Identifying the problem: Clusters offline after certificate change
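After you update the information for a custom API server certificate, clusters that were imported and running before the new certificate was installed go into an offline state.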
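The errors that point to the certificate as the cause appear in the logs of the pods in the open-cluster-management-agent namespace of the offline managed cluster. The following examples are similar to the errors that are displayed in the logs.
See the following work-agent log: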
E0917 03:04:05.874759 1 manifestwork_controller.go:179] Reconcile work test-1-klusterlet-addon-workmgr fails with err: Failed to update work status with err Get "https://api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/namespaces/test-1/manifestworks/test-1-klusterlet-addon-workmgr": x509: certificate signed by unknown authority
E0917 03:04:05.874887 1 base_controller.go:231] "ManifestWorkAgent" controller failed to sync "test-1-klusterlet-addon-workmgr", err: Failed to update work status with err Get "api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/namespaces/test-1/manifestworks/test-1-klusterlet-addon-workmgr": x509: certificate signed by unknown authority
E0917 03:04:37.245859 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1.ManifestWork: failed to list *v1.ManifestWork: Get "api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/namespaces/test-1/manifestworks?resourceVersion=607424": x509: certificate signed by unknown authority
See the following registration-agent log:
I0917 02:27:41.525026 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"open-cluster-management-agent", Name:"open-cluster-management-agent", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ManagedClusterAvailableConditionUpdated' update managed cluster "test-1" available condition to "True", due to "Managed cluster is available"
E0917 02:58:26.315984 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1beta1.CertificateSigningRequest: Get "https://api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/managedclusters?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dtest-1&resourceVersion=607408&timeout=9m33s&timeoutSeconds=573&watch=true"": x509: certificate signed by unknown authority
E0917 02:58:26.598343 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1.ManagedCluster: Get "https://api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/managedclusters?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dtest-1&resourceVersion=607408&timeout=9m33s&timeoutSeconds=573&watch=true": x509: certificate signed by unknown authority
E0917 02:58:27.613963 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1.ManagedCluster: failed to list *v1.ManagedCluster: Get "https://api.aaa-ocp.dev02.location.com:6443/apis/cluster.management.io/v1/managedclusters?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dtest-1&resourceVersion=607408&timeout=9m33s&timeoutSeconds=573&watch=true"": x509: certificate signed by unknown authority
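To check quickly whether the agent pods are reporting this error, you can count the matching lines in each pod log. The following is a minimal sketch, not an official procedure: it assumes that `oc` is logged in to the offline managed cluster, and `count_x509_errors` and `scan_agent_pods` are hypothetical helper names introduced only for illustration.

```shell
#!/bin/sh
# Sketch: count certificate errors in the agent pod logs of a managed cluster.

AGENT_NS="open-cluster-management-agent"  # namespace named in the text above

# Reads log text on stdin and prints the number of lines that contain the
# x509 error shown in the sample logs. Prints 0 when nothing matches.
count_x509_errors() {
  grep -c "x509: certificate signed by unknown authority" || true
}

# Hypothetical helper: prints "<pod> <count>" for every pod in the namespace.
# Assumes that oc is logged in to the offline managed cluster.
scan_agent_pods() {
  for pod in $(oc get pods -n "$AGENT_NS" -o name); do
    count=$(oc logs -n "$AGENT_NS" "$pod" --all-containers | count_x509_errors)
    printf '%s %s\n' "$pod" "$count"
  done
}
```

Run `scan_agent_pods` after sourcing the script. A nonzero count for the work-agent or registration-agent pods is consistent with the certificate problem described in this section.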
1.13.3. Resolving the problem: Clusters offline after certificate change
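If your managed cluster is the local-cluster, or if your managed cluster was created by using Red Hat Advanced Cluster Management for Kubernetes, you must wait 10 minutes or longer for the managed cluster to be reimported.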
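To reimport the managed cluster immediately, you can delete the managed cluster import secret on the hub cluster and reimport the cluster by using Red Hat Advanced Cluster Management. Run the following command: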
oc delete secret -n <cluster_name> <cluster_name>-import
Replace <cluster_name> with the name of the managed cluster that you want to import.
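If you script this step, you might want to wait until the import secret exists again before you continue. The following is a minimal sketch under stated assumptions: it assumes that the hub cluster recreates the import secret after you delete it, the retry count and delay are arbitrary illustrative values, and `retry` and `refresh_import_secret` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: delete the import secret on the hub, then poll until it exists again.

# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@"; then
      return 0
    fi
    _i=$((_i + 1))
    if [ "$_i" -le "$_attempts" ]; then
      sleep "$_delay"
    fi
  done
  return 1
}

# Hypothetical helper. Assumes that oc is logged in to the hub cluster.
# $1 is the managed cluster name, which is also its namespace on the hub.
refresh_import_secret() {
  _cluster=$1
  oc delete secret -n "$_cluster" "${_cluster}-import" || return 1
  # Assumption: the hub recreates the secret; 30 tries, 10 s apart, is arbitrary.
  retry 30 10 oc get secret -n "$_cluster" "${_cluster}-import" -o name
}
```

For example, `refresh_import_secret test-1` deletes the `test-1-import` secret in the `test-1` namespace and returns success only after the secret can be read again.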
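If you want to reimport a managed cluster that was imported by using Red Hat Advanced Cluster Management, complete the following steps to import the managed cluster again: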
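1. On the hub cluster, recreate the managed cluster import secret by running the following command, which deletes the existing secret:
   oc delete secret -n <cluster_name> <cluster_name>-import
   Replace <cluster_name> with the name of the managed cluster that you want to import.
2. On the hub cluster, export the managed cluster import secret to a YAML file by running the following command:
   oc get secret -n <cluster_name> <cluster_name>-import -ojsonpath='{.data.import\.yaml}' | base64 --decode > import.yaml
   Replace <cluster_name> with the name of the managed cluster that you want to import.
3. On the managed cluster, apply the import.yaml file by running the following command:
   oc apply -f import.yaml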
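Note: The previous steps do not detach the managed cluster from the hub cluster. The steps update the required manifests with the current settings on the managed cluster, including the new certificate information.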
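The reimport steps for a cluster that was imported by using Red Hat Advanced Cluster Management can be sketched as one script. This is an illustration rather than a supported tool: `HUB_KUBECONFIG` and `MANAGED_KUBECONFIG` are hypothetical variables that hold kubeconfig paths for the two clusters, and `import_secret_name` and `reimport_cluster` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: run the reimport steps against a hub and a managed cluster.
# HUB_KUBECONFIG and MANAGED_KUBECONFIG are hypothetical variables that
# must point at kubeconfig files for the respective clusters.

# Prints the import secret name for a managed cluster name.
import_secret_name() {
  printf '%s-import\n' "$1"
}

# $1 is the managed cluster name, which is also its namespace on the hub.
reimport_cluster() {
  _cluster=$1
  _secret=$(import_secret_name "$_cluster")

  # Hub: delete the import secret so that it is recreated.
  oc --kubeconfig "$HUB_KUBECONFIG" delete secret -n "$_cluster" "$_secret" || return 1

  # Hub: decode the import.yaml key of the secret into a local file.
  oc --kubeconfig "$HUB_KUBECONFIG" get secret -n "$_cluster" "$_secret" \
    -o jsonpath='{.data.import\.yaml}' | base64 --decode > import.yaml

  # Stop if nothing was exported, for example because the secret does not
  # exist yet. In that case, run the function again after a short wait.
  [ -s import.yaml ] || return 1

  # Managed cluster: apply the exported manifests.
  oc --kubeconfig "$MANAGED_KUBECONFIG" apply -f import.yaml
}
```

For example, with both variables set, `reimport_cluster test-1` works on the `test-1-import` secret in the `test-1` namespace and writes import.yaml to the current directory.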