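Chapter 11. Restoring the monitor pods in OpenShift Data Foundation
Restore the monitor (MON) pods if all three of them are down and OpenShift Data Foundation cannot recover them automatically.
This is a disaster recovery procedure and must be performed under the guidance of the Red Hat support team. Contact the Red Hat support team.
Procedure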
Scale down the rook-ceph-operator and ocs-operator deployments.
# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
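Before continuing, it can help to confirm that both operators really are scaled to zero, presumably because a running operator would reconcile the deployments and undo the manual patches made in the following steps. The sketch below is not part of the documented procedure. The helper name `operators_scaled_down` is hypothetical, and it only reads `.spec.replicas` through the standard `-o jsonpath` output option.

```shell
# Hypothetical helper: succeeds only if both operator deployments
# report zero desired replicas in the given namespace.
operators_scaled_down() {
  local ns="${1:-openshift-storage}" d replicas
  for d in rook-ceph-operator ocs-operator; do
    replicas="$(oc get deployment "$d" -n "$ns" -o jsonpath='{.spec.replicas}')" || return 1
    if [ "$replicas" != "0" ]; then
      echo "deployment $d still has $replicas replica(s)" >&2
      return 1
    fi
  done
  echo "both operators are scaled down"
}
```

Run `operators_scaled_down openshift-storage` after the two scale commands and continue only if it succeeds.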
Create a backup of all deployments in the openshift-storage namespace.
# mkdir backup
# cd backup
# oc project openshift-storage
# for d in $(oc get deployment|awk -F' ' '{print $1}'|grep -v NAME); do echo $d;oc get deployment $d -o yaml > oc_get_deployment.${d}.yaml; done
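The backup files are what the later revert step restores from, so an empty or missing file would only be noticed when it is too late. As a hedged variant of the loop above, the following sketch uses `-o name` instead of parsing the table output and checks that every file is non-empty. The function name `backup_deployments` and its arguments are hypothetical; the file naming follows the documented command.

```shell
# Hypothetical helper: save every deployment in the namespace to
# oc_get_deployment.<name>.yaml inside the target directory.
backup_deployments() {
  local ns="${1:-openshift-storage}" dir="${2:-backup}" d name
  mkdir -p "$dir" || return 1
  for d in $(oc get deployment -n "$ns" -o name); do
    name="${d##*/}"   # strip the "deployment.apps/" resource prefix
    oc get "$d" -n "$ns" -o yaml > "$dir/oc_get_deployment.${name}.yaml" || return 1
    # An empty file would be useless when reverting the patches later.
    if [ ! -s "$dir/oc_get_deployment.${name}.yaml" ]; then
      echo "empty backup for $name" >&2
      return 1
    fi
    echo "$name"
  done
}
```

For example, `backup_deployments openshift-storage backup` prints the name of each deployment it has saved.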
Patch the Object Storage Device (OSD) deployments to remove the livenessProbe parameter, and run them with sleep as the command.
# for i in $(oc get deployment -l app=rook-ceph-osd -oname);do oc patch ${i} -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]' ; oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}' ; done
Copy tar to the OSDs.
# for i in `oc get pods -l app=rook-ceph-osd -o name | sed -e "s/pod\///g"` ; do cat /usr/bin/tar | oc exec -i ${i} -- bash -c 'cat - >/usr/bin/tar' ; oc exec -i ${i} -- bash -c 'chmod +x /usr/bin/tar' ;done
Note: When you copy the tar binary to the OSDs, make sure that the tar binary matches the operating system of the pod's container image. Copying a binary from a different operating system, such as macOS or Ubuntu, might lead to compatibility issues.
Retrieve the monstore cluster map from all the OSDs.
Create the recover_mon.sh script.
Run the recover_mon.sh script.
# chmod +x recover_mon.sh
# ./recover_mon.sh
Patch the MON deployments and run them with sleep as the command.
Edit the MON deployments.
# for i in $(oc get deployment -l app=rook-ceph-mon -oname);do oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'; done
Patch the MON deployments to increase initialDelaySeconds.
# for i in a b c ; do oc get deployment rook-ceph-mon-${i} -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 10000/g" | oc replace -f - ; done
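The sed expression above is a plain text substitution, so it only has an effect when the probe value is exactly 10. Because it is unanchored, it would also rewrite a value such as 100, and running it a second time would lengthen 10000 again. The following is a minimal sketch of an anchored variant; the filter name `bump_initial_delay` is hypothetical, and it assumes the value is the last thing on its line.

```shell
# Hypothetical filter: read deployment YAML on stdin and raise
# initialDelaySeconds from exactly 10 to 10000. The trailing "$"
# anchor leaves other values (100, 10000, ...) untouched, so the
# filter can be applied more than once without changing the result.
bump_initial_delay() {
  sed 's/initialDelaySeconds: 10$/initialDelaySeconds: 10000/'
}
```

It would slot into the documented pipeline in place of the inline sed call, for example `oc get deployment rook-ceph-mon-a -o yaml | bump_initial_delay | oc replace -f -`.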
Copy tar to the MON pods.
# for i in `oc get pods -l app=rook-ceph-mon -o name | sed -e "s/pod\///g"` ; do cat /usr/bin/tar | oc exec -i ${i} -- bash -c 'cat - >/usr/bin/tar' ; oc exec -i ${i} -- bash -c 'chmod +x /usr/bin/tar' ;done
Note: When you copy the tar binary to the MON pods, make sure that the tar binary matches the operating system of the pod's container image. Copying a binary from a different operating system, such as macOS or Ubuntu, might lead to compatibility issues.
Copy the previously retrieved monstore to the mon-a pod.
# oc cp /tmp/monstore/ $(oc get po -l app=rook-ceph-mon,mon=a -oname |sed 's/pod\///g'):/tmp/
Navigate into the MON pod and change the ownership of the retrieved monstore.
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
# chown -R ceph:ceph /tmp/monstore
Copy the keyring template file before rebuilding the mon db.
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
# cp /etc/ceph/keyring-store/keyring /tmp/keyring
Populate the keyring of all other Ceph daemons (OSD, MGR, MDS, and RGW) from their respective secrets.
To get the daemon keyrings, use the following command:
# for i in `oc get secret | grep keyring| awk '{print $1}'` ; do oc get secret ${i} -ojson | jq .data.keyring | xargs echo | base64 -d ; done
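The command above depends on jq being installed on the workstation. Where it is not available, the same field can be read with the `-o jsonpath` output option of `oc`. The following is a minimal sketch under that assumption; the helper name `secret_keyring` is hypothetical, and the secret name in the usage example is only illustrative.

```shell
# Hypothetical helper: print the decoded keyring stored in one secret.
# Assumes the secret keeps it base64-encoded under the "keyring" data
# key, as the jq expression in the documented command implies.
secret_keyring() {
  local secret="$1" ns="${2:-openshift-storage}"
  oc get secret "$secret" -n "$ns" -o jsonpath='{.data.keyring}' | base64 -d
}
```

For example, `secret_keyring rook-ceph-mgr-a-keyring` would print the keyring of one daemon, given a secret of that name exists in the cluster.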
Get the OSD keys with the following script:
# for i in `oc get pods -l app=rook-ceph-osd -o name | sed -e "s/pod\///g"` ; do oc exec -i ${i} -- bash -c 'cat /var/lib/ceph/osd/ceph-*/keyring ' ;done
Copy the mon keyring locally, edit it by adding all the daemon keys captured in the previous step, and then copy it back to one of the MON pods (mon-a):
# oc cp $(oc get po -l app=rook-ceph-mon,mon=a -oname|sed -e "s/pod\///g"):/etc/ceph/keyring-store/..data/keyring /tmp/keyring-mon-a
# vi /tmp/keyring-mon-a
For example, the keyring file should look like the following:
Note: If the caps entries are not present in the OSD keys output, make sure to add caps to all the OSD outputs, as shown in the preceding keyring file example.
# oc cp /tmp/keyring-mon-a $(oc get po -l app=rook-ceph-mon,mon=a -oname|sed -e "s/pod\///g"):/tmp/keyring
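Adding caps by hand to every OSD entry is easy to get wrong on clusters with many OSDs. The sketch below shows one way to do it with awk before the keyring is copied to the pod. The filter name `add_missing_osd_caps` is hypothetical, and the three caps values are an assumption based on the capabilities commonly granted to Ceph OSDs; check them against the keyring example provided for your cluster before relying on the output.

```shell
# Hypothetical filter: read a keyring on stdin and append OSD caps to
# every [osd.N] section that has no caps lines of its own.
# ASSUMPTION: the caps values below are the usual Ceph OSD capabilities;
# verify them against the expected keyring for your cluster.
add_missing_osd_caps() {
  awk '
    function emit_caps() {
      if (is_osd && !has_caps) {
        print "\tcaps mgr = \"allow profile osd\""
        print "\tcaps mon = \"allow profile osd\""
        print "\tcaps osd = \"allow *\""
      }
    }
    # A new section header closes the previous section.
    /^\[/ { emit_caps(); is_osd = ($0 ~ /^\[osd\.[0-9]+\]/); has_caps = 0 }
    /^[ \t]*caps[ \t]/ { has_caps = 1 }
    { print }
    END { emit_caps() }
  '
}
```

A possible use is `add_missing_osd_caps < /tmp/keyring-mon-a > /tmp/keyring-mon-a.new`, followed by a manual review of the new file.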
Navigate into the mon-a pod and verify that the monstore has a monmap.
Navigate into the mon-a pod.
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
Verify that the monstore has a monmap.
# ceph-monstore-tool /tmp/monstore get monmap -- --out /tmp/monmap
# monmaptool /tmp/monmap --print
Optional: If the monmap is missing, create a new monmap.
# monmaptool --create --add <mon-a-id> <mon-a-ip> --add <mon-b-id> <mon-b-ip> --add <mon-c-id> <mon-c-ip> --enable-all-features --clobber /root/monmap --fsid <fsid>
<mon-a-id> is the ID of the mon-a pod.
<mon-a-ip> is the IP address of the mon-a pod.
<mon-b-id> is the ID of the mon-b pod.
<mon-b-ip> is the IP address of the mon-b pod.
<mon-c-id> is the ID of the mon-c pod.
<mon-c-ip> is the IP address of the mon-c pod.
<fsid> is the file system ID (fsid).
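Typing three ID and IP pairs by hand invites transposition errors. As an illustration only, the sketch below turns a comma-separated endpoint list of the form `id=ip:port` into the `--add` arguments used above. The helper name `monmap_add_args` and the input format are assumptions; the format mirrors how Rook is understood to record MON endpoints, but confirm the actual IDs and addresses for your cluster with Red Hat support. The helper handles IPv4 addresses only.

```shell
# Hypothetical helper: turn an endpoint list such as
#   a=172.30.10.1:6789,b=172.30.10.2:6789
# into monmaptool "--add <id> <ip>" arguments, one pair per line.
monmap_add_args() {
  local endpoints="$1" entry id addr
  local IFS=','
  for entry in $endpoints; do
    id="${entry%%=*}"      # text before the first "="
    addr="${entry#*=}"     # text after the first "="
    printf '%s %s %s\n' --add "$id" "${addr%%:*}"   # drop the port
  done
}
```

For example, `monmap_add_args 'a=172.30.10.1:6789,b=172.30.10.2:6789'` prints `--add a 172.30.10.1` and `--add b 172.30.10.2` on separate lines, which can be reviewed and pasted into the monmaptool command.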
Verify the monmap.
# monmaptool /root/monmap --print
Import the monmap.
Important: Use the previously created keyring file.
# ceph-monstore-tool /tmp/monstore rebuild -- --keyring /tmp/keyring --monmap /root/monmap
# chown -R ceph:ceph /tmp/monstore
Create a backup of the old store.db file.
# mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.corrupted
# mv /var/lib/ceph/mon/ceph-b/store.db /var/lib/ceph/mon/ceph-b/store.db.corrupted
# mv /var/lib/ceph/mon/ceph-c/store.db /var/lib/ceph/mon/ceph-c/store.db.corrupted
Copy the rebuilt store.db file to the monstore directory.
# mv /tmp/monstore/store.db /var/lib/ceph/mon/ceph-a/store.db
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-a/store.db
After rebuilding the monstore directory, copy the store.db file to your local machine, and from there to the rest of the MON pods.
# oc cp $(oc get po -l app=rook-ceph-mon,mon=a -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-a/store.db /tmp/store.db
# oc cp /tmp/store.db $(oc get po -l app=rook-ceph-mon,mon=<id> -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-<id>
<id> is the ID of the MON pod.
Navigate into the rest of the MON pods and change the ownership of the copied monstore.
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=<id> -oname)
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-<id>/store.db
<id> is the ID of the MON pod.
Revert the patched changes.
For MON deployments:
# oc replace --force -f <mon-deployment.yaml>
<mon-deployment.yaml> is the MON deployment yaml file.
For OSD deployments:
# oc replace --force -f <osd-deployment.yaml>
<osd-deployment.yaml> is the OSD deployment yaml file.
For MGR deployments:
# oc replace --force -f <mgr-deployment.yaml>
<mgr-deployment.yaml> is the MGR deployment yaml file.
Important: Ensure that the MON, MGR, and OSD pods are up and running.
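One way to check this from a script is to read the phase of every pod behind a label selector. The following is a minimal sketch; the helper name `pods_all_running` is hypothetical. The `app=rook-ceph-mon` and `app=rook-ceph-osd` labels appear earlier in this procedure, while `app=rook-ceph-mgr` is assumed to follow the same pattern. Note that the Running phase does not guarantee that every container is ready, so `oc get pods` output is still worth a look.

```shell
# Hypothetical helper: succeed only if at least one pod carries the
# label and every such pod reports the Running phase.
pods_all_running() {
  local label="$1" ns="${2:-openshift-storage}" phases p
  phases="$(oc get pods -l "$label" -n "$ns" -o jsonpath='{.items[*].status.phase}')" || return 1
  [ -n "$phases" ] || return 1
  for p in $phases; do
    [ "$p" = "Running" ] || return 1
  done
}
```

For example, `pods_all_running app=rook-ceph-mon && echo "MON pods are running"` can be repeated for the OSD and MGR labels before the operators are scaled back up.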
Scale up the rook-ceph-operator and ocs-operator deployments.
# oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1
# oc -n openshift-storage scale deployment ocs-operator --replicas=1
Verification steps
Check the Ceph status to confirm that CephFS is running.
# ceph -s
Example output:
Check the Multicloud Object Gateway (MCG) status. It should be active, and the backingstore and bucketclass should be in the Ready state.
# noobaa status -n openshift-storage
Important: If the MCG is not in the active state, and the backingstore and bucketclass are not in the Ready state, you need to restart all the MCG related pods. For more information, see Section 11.1, "Restoring the Multicloud Object Gateway".
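11.1. Restoring the Multicloud Object Gateway
If the Multicloud Object Gateway (MCG) is not in the active state, and the backingstore and bucketclass are not in the Ready state, you need to restart all the MCG related pods and check the MCG status to confirm that the MCG is back up and running.
Procedure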
Restart all the pods related to the MCG.
# oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage
<noobaa-operator> is the name of the MCG operator pod.
<noobaa-core> is the name of the MCG core pod.
<noobaa-endpoint> is the name of the MCG endpoint pod.
<noobaa-db> is the name of the MCG db pod.
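The pod names used for these placeholders have to be looked up first. Assuming the MCG pods share the `noobaa-` name prefix that the placeholders suggest, the following sketch lists them. The helper name `mcg_pod_names` is hypothetical, and the prefix match is an assumption to verify against the actual `oc get pods` output.

```shell
# Hypothetical helper: list the names of pods in the namespace whose
# name starts with "noobaa-" (assumed to be the MCG related pods).
mcg_pod_names() {
  local ns="${1:-openshift-storage}"
  oc get pods -n "$ns" -o name | sed -n 's|^pod/\(noobaa-.*\)$|\1|p'
}
```

Running `mcg_pod_names openshift-storage` prints one pod name per line, which can then be substituted into the delete commands above.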
If the RADOS Object Gateway (RGW) is configured, restart the pod.
# oc delete pods <rgw-pod> -n openshift-storage
<rgw-pod> is the name of the RGW pod.
In OpenShift Container Platform 4.11, RBD PVCs fail to mount on the application pods after the recovery. Hence, you need to restart the node that hosts the application pods. To get the name of the node that hosts an application pod, run the following command:
# oc get pods <application-pod> -n <namespace> -o yaml | grep nodeName
nodeName: node_name
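When the node name is needed in a script, reading the field directly avoids parsing the YAML with grep. The following is a minimal sketch using the standard `-o jsonpath` output option; the helper name `pod_node_name` is hypothetical.

```shell
# Hypothetical helper: print only the name of the node that hosts a pod.
pod_node_name() {
  local pod="$1" ns="$2"
  oc get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}'
}
```

For example, `node="$(pod_node_name <application-pod> <namespace>)"` stores the node name for use when the node is restarted.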