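Chapter 38. Restoring etcd quorum

If you lose etcd quorum, you can restore it.

If you run etcd on a separate host, you must back up etcd, take down your etcd cluster, and form a new one. You can use one healthy etcd node to form the new cluster, but you must remove all other healthy nodes.
If you run etcd as static pods on your master nodes, you stop the etcd pods, create a temporary cluster, and then restart the etcd pods.

Note

During etcd quorum loss, applications that run on OpenShift Container Platform are unaffected. However, the platform functionality is limited to read-only operations. You cannot take actions such as scaling an application up or down, changing deployments, or running or modifying builds.

To confirm the loss of etcd quorum, run one of the following commands and confirm that the cluster is unhealthy:

If you use the etcd v2 API, run the following command: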

# ETCDCTL_API=2 etcdctl --cert-file=/etc/origin/master/master.etcd-client.crt \
          --key-file /etc/origin/master/master.etcd-client.key \
          --ca-file /etc/origin/master/master.etcd-ca.crt \
          --endpoints="https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379" \
          cluster-health
member 165201190bf7f217 is unhealthy: got unhealthy result from https://master-0.example.com:2379
member b50b8a0acab2fa71 is unreachable: [https://master-1.example.com:2379] are all unreachable
member d40307cbca7bc2df is unreachable: [https://master-2.example.com:2379] are all unreachable
cluster is unhealthy

If you use the v3 API, run the following command:

# ETCDCTL_API=3 etcdctl --cert=/etc/origin/master/master.etcd-client.crt \
          --key=/etc/origin/master/master.etcd-client.key \
          --cacert=/etc/origin/master/master.etcd-ca.crt \
          --endpoints="https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379" \
          endpoint health
https://master-0.example.com:2379 is unhealthy: failed to connect: context deadline exceeded
https://master-1.example.com:2379 is unhealthy: failed to connect: context deadline exceeded
https://master-2.example.com:2379 is unhealthy: failed to connect: context deadline exceeded
Error:  unhealthy cluster

Record the member IDs and host names of the hosts. You can use one of these nodes to form the new cluster.
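
Recording the member IDs by hand is error-prone when the output is long. The following is a minimal sketch, not part of the product tooling, that condenses v2 cluster-health output into one "ID state URL" line per member. The function name list_members is an assumption made for illustration.

```shell
# Summarize `etcdctl cluster-health` (v2 API) output read from stdin as
# "<member ID> <state> <client URL>", one member per line.
# list_members is a hypothetical helper name, not part of etcdctl.
list_members() {
  awk '$1 == "member" && $3 == "is" {
    state = $4
    sub(/:$/, "", state)              # "unhealthy:" -> "unhealthy"
    url = ""
    for (i = 5; i <= NF; i++)         # the URL position varies by state
      if ($i ~ /https?:\/\//) url = $i
    gsub(/\[|\]/, "", url)            # unreachable members print "[url]"
    print $2, state, url
  }'
}
```

A possible use is to pipe the health check into it, for example `etcdctl2 cluster-health | list_members`, and save the result next to your backup.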

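38.1. Restoring etcd quorum for separate services

38.1.1. Backing up etcd

When you back up etcd, you must back up both the etcd configuration files and the etcd data.

38.1.1.1. Backing up etcd configuration files

The etcd configuration files to be preserved are all stored in the /etc/etcd directory of the instances where etcd is running. This includes the etcd configuration file (/etc/etcd/etcd.conf) and the required certificates for cluster communication. All those files are generated at installation time by the Ansible installer.

Procedure

For each etcd member of the cluster, back up the etcd configuration.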

$ ssh master-0
# mkdir -p /backup/etcd-config-$(date +%Y%m%d)/
# cp -R /etc/etcd/ /backup/etcd-config-$(date +%Y%m%d)/

Replace master-0 with the name of your etcd member.

Note

The certificates and configuration files on each etcd cluster member are unique.
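
The two backup commands each call date separately, so they could name different directories if they run across midnight. The sketch below, under the assumption that a small wrapper is acceptable on your hosts, computes the date once and prints the backup path. The function name backup_etcd_config and its optional arguments are hypothetical.

```shell
# Copy an etcd configuration directory into a dated backup directory and
# print the backup path. The date is computed once, so mkdir and cp cannot
# disagree if the commands run across midnight.
# backup_etcd_config is a hypothetical helper; both arguments are optional.
backup_etcd_config() (
  src=${1:-/etc/etcd}
  dest="${2:-/backup}/etcd-config-$(date +%Y%m%d)"
  mkdir -p "$dest" || exit 1
  cp -R "$src" "$dest/" || exit 1
  printf '%s\n' "$dest"
)
```

Run without arguments on each etcd member, it should copy /etc/etcd into /backup/etcd-config-<date>/etcd. Because the files are unique per member, keep the backups from different members separate.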

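38.1.1.2. Backing up etcd data

Prerequisites

Note

The OpenShift Container Platform installer creates aliases so that you do not have to type all the flags: etcdctl2 for etcd v2 tasks and etcdctl3 for etcd v3 tasks.

However, the etcdctl3 alias does not provide the full endpoint list to the etcdctl command, so you must specify the --endpoints option and list all the endpoints.

Before backing up etcd:

The etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available.
Ensure that the OpenShift Container Platform API service is running.
Ensure connectivity with the etcd cluster (port 2379/tcp).
Ensure that you have the proper certificates to connect to the etcd cluster.

Procedure

Note

While the etcdctl backup command is used to perform the backup, etcd v3 has no concept of a backup. Instead, you either take a snapshot from a live member with the etcdctl snapshot save command or copy the member/snap/db file from an etcd data directory.

The etcdctl backup command rewrites some of the metadata contained in the backup, specifically the node ID and cluster ID, which means that in the backup, the node loses its former identity. To recreate a cluster from the backup, you create a new, single-node cluster, then add the rest of the nodes to the cluster. The metadata is rewritten to prevent the new node from joining an existing cluster.

Back up the etcd data:

Important

Clusters upgraded from previous versions of OpenShift Container Platform might contain v2 data stores. Back up all etcd data stores.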

Obtain the etcd endpoint IP address from the static pod manifest:

$ export ETCD_POD_MANIFEST="/etc/origin/node/pods/etcd.yaml"

$ export ETCD_EP=$(grep https ${ETCD_POD_MANIFEST} | cut -d '/' -f3)
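
It may help to see what the grep and cut pipeline actually extracts. The following sketch runs the same pipeline against a throwaway file. The manifest line in it is a made-up sample, not copied from a real etcd.yaml, so treat the result only as an illustration of the field splitting.

```shell
# Show what the grep | cut pipeline extracts, using a throwaway file in
# place of /etc/origin/node/pods/etcd.yaml. The manifest line below is a
# made-up sample, not copied from a real manifest.
sample_manifest=$(mktemp)
printf '%s\n' '    - --endpoints=https://192.168.55.8:2379' > "$sample_manifest"

# cut splits on "/": field 1 is everything up to "https:", field 2 is the
# empty string between the two slashes, field 3 is "host:port".
ETCD_EP=$(grep https "$sample_manifest" | cut -d '/' -f3)
echo "$ETCD_EP"
rm -f "$sample_manifest"
```

With this sample the variable holds the host and port without the https:// scheme. If more than one line of the real manifest contains https, the variable holds one value per line, so it is worth checking it with echo before using it.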

Log in as an administrator:
```
$ oc login -u system:admin
```

Obtain the etcd pod name:

$ export ETCD_POD=$(oc get pods -n kube-system | grep -o -m 1 '^master-etcd\S*')

Change to the kube-system project:
```
$ oc project kube-system
```

Take a snapshot of the etcd data in the pod and store it locally:

$ oc exec ${ETCD_POD} -c etcd -- /bin/bash -c "ETCDCTL_API=3 etcdctl \
    --cert /etc/etcd/peer.crt \
    --key /etc/etcd/peer.key \
    --cacert /etc/etcd/ca.crt \
    --endpoints $ETCD_EP \
    snapshot save /var/lib/etcd/snapshot.db"

You must write the snapshot to a directory under /var/lib/etcd/.
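
Before you change the cluster, it can be worth confirming that the snapshot file was really written. The sketch below assumes that the pod's /var/lib/etcd is backed by the same directory on the host, which the word "locally" in the step suggests but which you should verify. The function name check_snapshot is hypothetical.

```shell
# Refuse to continue if the snapshot file is missing or empty, then print
# its SHA-256 checksum so a later copy can be compared against it.
# check_snapshot is a hypothetical helper name.
check_snapshot() (
  file=${1:-/var/lib/etcd/snapshot.db}
  if [ ! -s "$file" ]; then
    echo "snapshot missing or empty: $file" >&2
    exit 1
  fi
  sha256sum "$file"
)
```

A non-empty file is only a basic sanity check. It does not prove that the snapshot is restorable.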

38.1.2. Removing an etcd host

If an etcd host fails beyond restoration, remove it from the cluster. To recover from an etcd quorum loss, you must also remove all healthy etcd nodes but one from your cluster.

Steps to be performed on all master hosts

Procedure

Remove each of the other etcd hosts from the etcd cluster. Run the following command for each etcd node:

# etcdctl3 --endpoints=https://<surviving host IP>:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/peer.crt \
  --key=/etc/etcd/peer.key member remove <failed member ID>

Remove the other etcd hosts from the /etc/origin/master/master-config.yaml master configuration file on every master:

etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://master-0.example.com:2379
    - https://master-1.example.com:2379
    - https://master-2.example.com:2379

In this example, master-1.example.com and master-2.example.com are the hosts to remove.

Restart the master API and controllers on every master:
```
# master-restart api
# master-restart controllers
```

Steps to be performed in the current etcd cluster

Procedure

Remove the failed host from the cluster:

# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
failed to check the health of member 8372784203e11288 on https://192.168.55.21:2379: Get https://192.168.55.21:2379/health: dial tcp 192.168.55.21:2379: getsockopt: connection refused
member 8372784203e11288 is unreachable: [https://192.168.55.21:2379] are all unreachable
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy

# etcdctl2 member remove 8372784203e11288
Removed member 8372784203e11288 from cluster

# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy

The remove command requires the etcd ID, not the host name.
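
Because the remove command takes a member ID, you first have to map a host to its ID. The following is a small sketch of that lookup against v2 member list output. The function name member_id_for is an assumption, and the matching relies on the peerURLs and clientURLs fields having the form shown in this chapter.

```shell
# Print the ID of the member whose clientURLs or peerURLs mention the given
# host name or IP, reading `etcdctl member list` (v2 API) output from stdin.
# member_id_for is a hypothetical helper name.
member_id_for() {
  awk -v host="$1" 'index($0, "//" host ":") {
    id = $1
    sub(/:$/, "", id)    # the ID field ends with a colon
    print id
  }'
}
```

For example, `etcdctl2 member list | member_id_for 192.168.55.21` would print the ID to pass to member remove, provided that member is still listed. The "//" and ":" delimiters keep 192.168.55.2 from also matching 192.168.55.21.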

To ensure that the etcd configuration does not use the failed host when the etcd service is restarted, modify the /etc/etcd/etcd.conf file on all remaining etcd hosts and remove the failed host from the value of the ETCD_INITIAL_CLUSTER variable:

# vi /etc/etcd/etcd.conf

For example:

ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380

becomes:

ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380
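
If you prefer not to edit the value by hand on every host, the same change can be scripted. The sketch below is one possible way to do it, assuming the value is unquoted as in the example above. The function name drop_initial_cluster_member is hypothetical, and the filter only prints the result, so you still decide how to write it back to the file.

```shell
# Remove one "name=peerURL" entry from an ETCD_INITIAL_CLUSTER line read
# from stdin and print the result. Lines for other variables pass through.
# drop_initial_cluster_member is a hypothetical helper name.
drop_initial_cluster_member() {
  awk -v name="$1" '
    /^ETCD_INITIAL_CLUSTER=/ {
      list = substr($0, length("ETCD_INITIAL_CLUSTER=") + 1)
      n = split(list, entry, ",")
      out = ""
      for (i = 1; i <= n; i++) {
        if (index(entry[i], name "=") == 1) continue   # entry to drop
        out = (out == "" ? "" : out ",") entry[i]
      }
      print "ETCD_INITIAL_CLUSTER=" out
      next
    }
    { print }
  '
}
```

A cautious way to use it is to write the filtered output to a new file, compare it with the original using diff, and only then replace /etc/etcd/etcd.conf.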

Note

Restarting the etcd services is not required, because the failed host is removed using etcdctl.

Modify the Ansible inventory file to reflect the current status of the cluster and to avoid issues when re-running a playbook:

[OSEv3:children]
masters
nodes
etcd

... [OUTPUT ABBREVIATED] ...

[etcd]
master-0.example.com
master-1.example.com

If you are using Flannel, modify the flanneld service configuration located at /etc/sysconfig/flanneld on every host and remove the etcd host:

FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379

Restart the flanneld service:
```
# systemctl restart flanneld.service
```

38.1.3. Creating a single-node etcd cluster

To restore the full functionality of your OpenShift Container Platform instance, make a remaining etcd node a standalone etcd cluster.

Procedure

On the etcd node that you did not remove from the cluster, stop all etcd services by removing the etcd pod definition:

# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped/
# systemctl stop atomic-openshift-node
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/

Run the etcd service on the host, forcing a new cluster.

These commands create a custom file for the etcd service, which adds the --force-new-cluster option to the etcd start command:

# mkdir -p /etc/systemd/system/etcd.service.d/
# echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
# echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
# sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
    /usr/lib/systemd/system/etcd.service \
    >> /etc/systemd/system/etcd.service.d/temp.conf

# systemctl daemon-reload
# master-restart etcd
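
The sed expression is terse, so the following sketch applies it to a sample line rather than to the real unit file. The sample ExecStart line is simplified and hypothetical. It only assumes, as the original command does, that the real line ends with a double quotation mark.

```shell
# Apply the same sed expression to a sample ExecStart line instead of the
# real unit file. The sample line is simplified and hypothetical; the real
# /usr/lib/systemd/system/etcd.service line is longer.
sample='ExecStart=/bin/bash -c "/usr/bin/etcd --name=node1"'

# -n with /p prints only lines that match ExecStart; the substitution
# replaces the closing quote at end of line with the option plus a quote.
forced=$(printf '%s\n' "$sample" | sed -n '/ExecStart/s/"$/ --force-new-cluster"/p')
echo "$forced"
```

The empty ExecStart= line that the procedure writes first clears the command inherited from the packaged unit, so the drop-in ends up with a single start command. If the real ExecStart line does not end with a double quotation mark, the sed command prints nothing, so it is worth reading temp.conf before you reload systemd.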

List the etcd members and confirm that the member list contains only your single etcd host:

# etcdctl member list
165201190bf7f217: name=192.168.34.20 peerURLs=http://localhost:2380 clientURLs=https://master-0.example.com:2379 isLeader=true

After restoring the data and creating a new cluster, you must update the peerURLs parameter value to use the IP address where etcd listens for peer communication:
```
# etcdctl member update 165201190bf7f217 https://192.168.34.20:2380
```
165201190bf7f217 is the member ID shown in the output of the previous command, and https://192.168.34.20:2380 is its IP address.

To verify, check that the IP is in the member list:

$ etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true

38.1.4. Adding etcd nodes after restoring

After the first instance is running, you can add multiple etcd servers to your cluster.

Procedure

Get the etcd name for the instance in the ETCD_NAME variable:
```
# grep ETCD_NAME /etc/etcd/etcd.conf
```

Get the IP address where etcd listens for peer communication:

# grep ETCD_INITIAL_ADVERTISE_PEER_URLS /etc/etcd/etcd.conf

If the node was previously part of an etcd cluster, delete the previous etcd data:
```
# rm -Rf /var/lib/etcd/*
```

On the etcd host where etcd is properly running, add the new member:

# etcdctl3 member add <name> \
  --peer-urls="<advertise_peer_urls>"

The command outputs some variables. For example:

ETCD_NAME="master2"
ETCD_INITIAL_CLUSTER="master-0.example.com=https://192.168.55.8:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Add the values from the previous command to the /etc/etcd/etcd.conf file of the new host:
```
# vi /etc/etcd/etcd.conf
```

Start the etcd service in the node joining the cluster:
```
# systemctl start etcd.service
```

Check for error messages:
```
# master-logs etcd etcd
```

Once you add all the nodes, verify the cluster status and cluster health:

# etcdctl3 endpoint health --endpoints="https://<etcd_host1>:2379,https://<etcd_host2>:2379,https://<etcd_host3>:2379"
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 1.423459ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.767481ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.599694ms

# etcdctl3 endpoint status --endpoints="https://<etcd_host1>:2379,https://<etcd_host2>:2379,https://<etcd_host3>:2379"
https://master-0.example.com:2379, 40bef1f6c79b3163, 3.2.5, 28 MB, true, 9, 2878
https://master-1.example.com:2379, 1ea57201a3ff620a, 3.2.5, 28 MB, false, 9, 2878
https://master-2.example.com:2379, 59229711e4bc65c8, 3.2.5, 28 MB, false, 9, 2878

Add the remaining peers back to the cluster.
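
When you repeat the health check after adding each peer, a script-friendly pass or fail result can be convenient. The sketch below is one way to derive it from endpoint health output in the format shown in this chapter. The function name all_endpoints_healthy is an assumption.

```shell
# Succeed only if every "<endpoint> is ..." line in `etcdctl endpoint health`
# output (read from stdin) reports "is healthy", and at least one line does.
# all_endpoints_healthy is a hypothetical helper name.
all_endpoints_healthy() {
  awk '
    $2 == "is" { total++; if ($3 == "healthy:") healthy++ }
    END { exit !(total > 0 && healthy == total) }
  '
}
```

A possible invocation is `etcdctl3 endpoint health --endpoints="..." 2>&1 | all_endpoints_healthy && echo "all endpoints healthy"`. The 2>&1 redirection is included in case your etcdctl version writes unhealthy lines to standard error.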
