5.2.4. Replacing the unhealthy etcd member
Depending on the state of your unhealthy etcd member, use one of the following procedures:
5.2.4.1. Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
This procedure details the steps to replace an etcd member that is unhealthy either because the machine is not running or because the node is not ready.
Prerequisites
- You have identified the unhealthy etcd member.
- You have verified that either the machine is not running or the node is not ready.
- You have access to the cluster as a user with the cluster-admin role.
- You have taken an etcd backup.
Important: It is essential that you take an etcd backup before performing this procedure so that your cluster can be restored if you encounter any issues.
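If you still need to take a backup, the usual approach is to run the cluster backup script from a debug shell on a healthy control plane node. A minimal sketch, assuming a healthy node named ip-10-0-154-204.ec2.internal and the default backup location:
$ oc debug node/ip-10-0-154-204.ec2.internal
sh-4.2# chroot /host
sh-4.2# /usr/local/bin/cluster-backup.sh /home/core/assets/backup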
Procedure
Remove the unhealthy member.
Choose a pod that is not on the affected node:
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
Example output
etcd-ip-10-0-131-183.ec2.internal                3/3     Running     0          123m
etcd-ip-10-0-164-97.ec2.internal                 3/3     Running     0          123m
etcd-ip-10-0-154-204.ec2.internal                3/3     Running     0          124m
Connect to the running etcd container, passing in the name of a pod that is not on the affected node:
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
View the member list:
sh-4.2# etcdctl member list -w table
Example output
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID               | STATUS  | NAME                         | PEER ADDRS                | CLIENT ADDRS              |
+------------------+---------+------------------------------+---------------------------+---------------------------+
| 6fc1e7c9db35841d | started | ip-10-0-131-183.ec2.internal | https://10.0.131.183:2380 | https://10.0.131.183:2379 |
| 757b6793e2408b6c | started | ip-10-0-164-97.ec2.internal  | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
| ca8c2990a0aa29d1 | started | ip-10-0-154-204.ec2.internal | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
+------------------+---------+------------------------------+---------------------------+---------------------------+
Take note of the ID and the name of the unhealthy etcd member, because these values are needed later in the procedure.
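If you prefer to capture the member ID programmatically rather than copying it from the table, a sketch that filters the default comma-separated member list by the unhealthy member's name from this example:
sh-4.2# etcdctl member list | grep ip-10-0-131-183.ec2.internal | cut -d',' -f1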
The $ etcdctl endpoint health command will list the removed member until the replacement procedure is completed and the new member is added.
Remove the unhealthy etcd member by providing the ID to the etcdctl member remove command:
sh-4.2# etcdctl member remove 6fc1e7c9db35841d
Example output
Member 6fc1e7c9db35841d removed from cluster ead669ce1fbfb346
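You can observe the endpoint health behavior described above from inside the same etcd container; a minimal sketch (which endpoints are checked depends on how ETCDCTL_ENDPOINTS is configured in the pod):
sh-4.2# etcdctl endpoint health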
View the member list again and verify that the member was removed:
sh-4.2# etcdctl member list -w table
Example output
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID               | STATUS  | NAME                         | PEER ADDRS                | CLIENT ADDRS              |
+------------------+---------+------------------------------+---------------------------+---------------------------+
| 757b6793e2408b6c | started | ip-10-0-164-97.ec2.internal  | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
| ca8c2990a0aa29d1 | started | ip-10-0-154-204.ec2.internal | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
+------------------+---------+------------------------------+---------------------------+---------------------------+
You can now exit the node shell.
Important: After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances restart.
Remove the old secrets for the unhealthy etcd member that was removed.
List the secrets for the unhealthy etcd member that was removed.
$ oc get secrets -n openshift-etcd | grep ip-10-0-131-183.ec2.internal 1
- 1
- Pass in the name of the unhealthy etcd member that you took note of earlier in this procedure.
There is a peer, serving, and metrics secret, as shown in the following output:
Example output
etcd-peer-ip-10-0-131-183.ec2.internal              kubernetes.io/tls   2   47m
etcd-serving-ip-10-0-131-183.ec2.internal           kubernetes.io/tls   2   47m
etcd-serving-metrics-ip-10-0-131-183.ec2.internal   kubernetes.io/tls   2   47m
Delete the secrets for the unhealthy etcd member that was removed.
Delete the peer secret:
$ oc delete secret -n openshift-etcd etcd-peer-ip-10-0-131-183.ec2.internal
Delete the serving secret:
$ oc delete secret -n openshift-etcd etcd-serving-ip-10-0-131-183.ec2.internal
Delete the metrics secret:
$ oc delete secret -n openshift-etcd etcd-serving-metrics-ip-10-0-131-183.ec2.internal
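The three deletions can also be scripted as a single loop; a minimal sketch, assuming the same member name as above:
$ for prefix in etcd-peer etcd-serving etcd-serving-metrics; do \
    oc delete secret -n openshift-etcd "${prefix}-ip-10-0-131-183.ec2.internal"; \
  done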
Delete and re-create the control plane machine (also known as the master machine). After this machine is re-created, a new revision is forced and etcd scales up automatically.
If you are running installer-provisioned infrastructure, or you used the Machine API to create your machines, follow these steps. Otherwise, you must create the new master by using the same method that was used to originally create it.
Obtain the machine for the unhealthy member.
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc get machines -n openshift-machine-api -o wide
Example output
NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
clustername-8qw5l-master-0                  Running   m4.xlarge   us-east-1   us-east-1a   3h37m   ip-10-0-131-183.ec2.internal   aws:///us-east-1a/i-0ec2782f8287dfb7e   stopped 1
clustername-8qw5l-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-154-204.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
clustername-8qw5l-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-164-97.ec2.internal    aws:///us-east-1c/i-02626f1dba9ed5bba   running
clustername-8qw5l-worker-us-east-1a-wbtgd   Running   m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
clustername-8qw5l-worker-us-east-1b-lrdxb   Running   m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
clustername-8qw5l-worker-us-east-1c-pkg26   Running   m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
- 1
- This is the control plane machine for the unhealthy node, ip-10-0-131-183.ec2.internal.
Save the machine configuration to a file on your file system:
$ oc get machine clustername-8qw5l-master-0 \ 1
      -n openshift-machine-api \
      -o yaml \
      > new-master-machine.yaml
- 1
- Specify the name of the control plane machine for the unhealthy node.
Edit the new-master-machine.yaml file that was created in the previous step to assign a new name and remove unnecessary fields.
Remove the entire status section:
status:
  addresses:
  - address: 10.0.131.183
    type: InternalIP
  - address: ip-10-0-131-183.ec2.internal
    type: InternalDNS
  - address: ip-10-0-131-183.ec2.internal
    type: Hostname
  lastUpdated: "2020-04-20T17:44:29Z"
  nodeRef:
    kind: Node
    name: ip-10-0-131-183.ec2.internal
    uid: acca4411-af0d-4387-b73e-52b2484295ad
  phase: Running
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2020-04-20T16:53:50Z"
      lastTransitionTime: "2020-04-20T16:53:50Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0fdb85790d76d0c3f
    instanceState: stopped
    kind: AWSMachineProviderStatus
Change the metadata.name field to a new name. It is recommended to keep the same base name as the old machine and to change the ending number to the next available number. In this example, clustername-8qw5l-master-0 is changed to clustername-8qw5l-master-3. For example:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  ...
  name: clustername-8qw5l-master-3
  ...
Remove the spec.providerID field:
providerID: aws:///us-east-1a/i-0fdb85790d76d0c3f
Remove the metadata.annotations and metadata.generation fields:
annotations:
  machine.openshift.io/instance-state: running
...
generation: 2
Remove the metadata.resourceVersion and metadata.uid fields:
resourceVersion: "13291"
uid: a282eb70-40a2-4e89-8009-d05dd420d31a
Delete the machine of the unhealthy member:
$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 1
- 1
- Specify the name of the control plane machine for the unhealthy node.
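Machine deletion can take several minutes while the node is drained and the instance is terminated. One way to block until the deletion finishes, sketched here with an assumed 10-minute timeout:
$ oc wait --for=delete machine/clustername-8qw5l-master-0 \
    -n openshift-machine-api --timeout=10m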
Verify that the machine was deleted:
$ oc get machines -n openshift-machine-api -o wide
Example output
NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
clustername-8qw5l-master-1                  Running   m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-154-204.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
clustername-8qw5l-master-2                  Running   m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-164-97.ec2.internal    aws:///us-east-1c/i-02626f1dba9ed5bba   running
clustername-8qw5l-worker-us-east-1a-wbtgd   Running   m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
clustername-8qw5l-worker-us-east-1b-lrdxb   Running   m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
clustername-8qw5l-worker-us-east-1c-pkg26   Running   m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
Create the new machine by using the new-master-machine.yaml file:
$ oc apply -f new-master-machine.yaml
Verify that the new machine has been created:
$ oc get machines -n openshift-machine-api -o wide
Example output
NAME                                        PHASE          TYPE        REGION      ZONE         AGE     NODE                           PROVIDERID                              STATE
clustername-8qw5l-master-1                  Running        m4.xlarge   us-east-1   us-east-1b   3h37m   ip-10-0-154-204.ec2.internal   aws:///us-east-1b/i-096c349b700a19631   running
clustername-8qw5l-master-2                  Running        m4.xlarge   us-east-1   us-east-1c   3h37m   ip-10-0-164-97.ec2.internal    aws:///us-east-1c/i-02626f1dba9ed5bba   running
clustername-8qw5l-master-3                  Provisioning   m4.xlarge   us-east-1   us-east-1a   85s     ip-10-0-133-53.ec2.internal    aws:///us-east-1a/i-015b0888fe17bc2c8   running 1
clustername-8qw5l-worker-us-east-1a-wbtgd   Running        m4.large    us-east-1   us-east-1a   3h28m   ip-10-0-129-226.ec2.internal   aws:///us-east-1a/i-010ef6279b4662ced   running
clustername-8qw5l-worker-us-east-1b-lrdxb   Running        m4.large    us-east-1   us-east-1b   3h28m   ip-10-0-144-248.ec2.internal   aws:///us-east-1b/i-0cb45ac45a166173b   running
clustername-8qw5l-worker-us-east-1c-pkg26   Running        m4.large    us-east-1   us-east-1c   3h28m   ip-10-0-170-181.ec2.internal   aws:///us-east-1c/i-06861c00007751b0a   running
- 1
- The new machine, clustername-8qw5l-master-3, is being created and will be ready for use after the phase changes from Provisioning to Running.
It might take a few minutes for the new machine to be created. The etcd cluster Operator will automatically sync when the machine or node returns to a healthy state.
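One simple way to follow this progress from a terminal is to watch the machine and etcd pod lists:
$ oc get machines -n openshift-machine-api -w
$ oc get pods -n openshift-etcd -w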
Verification
Verify that all etcd pods are running properly.
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
Example output
etcd-ip-10-0-133-53.ec2.internal                 3/3     Running     0          7m49s
etcd-ip-10-0-164-97.ec2.internal                 3/3     Running     0          123m
etcd-ip-10-0-154-204.ec2.internal                3/3     Running     0          124m
If the output from the previous command lists only two pods, you can manually force an etcd redeployment. In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge 1
- 1
- The forceRedeploymentReason value must be unique, which is why a timestamp is appended.
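To follow the forced redeployment, one option is to read the NodeInstallerProgressing condition from the etcd cluster operator; treat the exact jsonpath below as a sketch that might need adjusting for your version:
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'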
Verify that there are exactly three etcd members.
Connect to the running etcd container, passing in the name of a pod that is not on the affected node:
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
View the member list:
sh-4.2# etcdctl member list -w table
Example output
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID               | STATUS  | NAME                         | PEER ADDRS                | CLIENT ADDRS              |
+------------------+---------+------------------------------+---------------------------+---------------------------+
| 5eb0d6b8ca24730c | started | ip-10-0-133-53.ec2.internal  | https://10.0.133.53:2380  | https://10.0.133.53:2379  |
| 757b6793e2408b6c | started | ip-10-0-164-97.ec2.internal  | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
| ca8c2990a0aa29d1 | started | ip-10-0-154-204.ec2.internal | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
+------------------+---------+------------------------------+---------------------------+---------------------------+
If the output from the previous command lists more than three etcd members, you must carefully remove the unwanted member.
Warning: Be sure to remove the correct etcd member; removing a healthy etcd member might lead to quorum loss.
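If you do find an extra member, removal uses the same etcdctl member remove command shown earlier in this procedure; the ID below is a placeholder that you must replace only after double-checking the member list:
sh-4.2# etcdctl member list -w table
sh-4.2# etcdctl member remove <unwanted_member_id>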