第 12 章 在 OpenShift Data Foundation 中恢复监控 pod
如果所有三个 pod 都停机,并且 OpenShift Data Foundation 无法自动恢复 monitor pod,则恢复 monitor pod。
这是一个灾难恢复流程,必须在红帽支持团队的指导下执行。在 上联系红帽支持团队,请联系红帽支持团队。
流程
缩减
rook-ceph-operator
和ocs operator
部署。oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
# oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc scale deployment ocs-operator --replicas=0 -n openshift-storage
# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 为
openshift-storage
命名空间中的所有部署创建备份。mkdir backup
# mkdir backup
Copy to Clipboard Copied! Toggle word wrap Toggle overflow cd backup
# cd backup
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc project openshift-storage
# oc project openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow for d in $(oc get deployment|awk -F' ' '{print $1}'|grep -v NAME); do echo $d;oc get deployment $d -o yaml > oc_get_deployment.${d}.yaml; done
# for d in $(oc get deployment|awk -F' ' '{print $1}'|grep -v NAME); do echo $d;oc get deployment $d -o yaml > oc_get_deployment.${d}.yaml; done
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 修补 Object Storage Device (OSD)部署以删除
livenessProbe
参数,并使用命令参数作为sleep
运行它。for i in $(oc get deployment -l app=rook-ceph-osd -oname);do oc patch ${i} -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]' ; oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}' ; done
# for i in $(oc get deployment -l app=rook-ceph-osd -oname);do oc patch ${i} -n openshift-storage --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]' ; oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}' ; done
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 从所有 OSD 检索
monstore
集群映射。创建
restore_mon.sh
脚本。Copy to Clipboard Copied! Toggle word wrap Toggle overflow 运行
restore_mon.sh
脚本。chmod +x recover_mon.sh
# chmod +x recover_mon.sh
Copy to Clipboard Copied! Toggle word wrap Toggle overflow ./recover_mon.sh
# ./recover_mon.sh
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
修补 MON 部署,并使用命令参数作为
sleep
运行。编辑 MON 部署。
for i in $(oc get deployment -l app=rook-ceph-mon -oname);do oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'; done
# for i in $(oc get deployment -l app=rook-ceph-mon -oname);do oc patch ${i} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'; done
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 修补 MON 部署,以增加
initialDelaySeconds
。oc get deployment rook-ceph-mon-a -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
# oc get deployment rook-ceph-mon-a -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc get deployment rook-ceph-mon-b -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
# oc get deployment rook-ceph-mon-b -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc get deployment rook-ceph-mon-c -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
# oc get deployment rook-ceph-mon-c -o yaml | sed "s/initialDelaySeconds: 10/initialDelaySeconds: 2000/g" | oc replace -f -
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
将之前检索到的
monstore
复制到 mon-a pod。oc cp /tmp/monstore/ $(oc get po -l app=rook-ceph-mon,mon=a -oname |sed 's/pod\///g'):/tmp/
# oc cp /tmp/monstore/ $(oc get po -l app=rook-ceph-mon,mon=a -oname |sed 's/pod\///g'):/tmp/
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 导航到 MON 容器集,再更改检索到的
monstore
的所有权。oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow chown -R ceph:ceph /tmp/monstore
# chown -R ceph:ceph /tmp/monstore
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 在重新构建
mon db
之前,复制密钥环模板文件。oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow cp /etc/ceph/keyring-store/keyring /tmp/keyring
# cp /etc/ceph/keyring-store/keyring /tmp/keyring
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Copy to Clipboard Copied! Toggle word wrap Toggle overflow 从对应的机密中识别所有其他 Ceph 守护进程(MGR、MDS、RGW、Crash、CSI 和 CSI 置备程序)的密钥环。
Copy to Clipboard Copied! Toggle word wrap Toggle overflow keyring 文件示例:
/etc/ceph/ceph.client.admin.keyring
:Copy to Clipboard Copied! Toggle word wrap Toggle overflow 重要-
对于
client.csi
相关的密钥环,请参考前面的密钥环文件输出,并在从其相应的 OpenShift Data Foundation 机密获取密钥后添加默认caps
。 - OSD 密钥环会在恢复后自动添加。
-
对于
导航到 mon-a pod,再验证
monstore
是否具有monmap
。进入 mon-a pod。
oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=a -oname)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 验证
monstore
是否具有monmap
。ceph-monstore-tool /tmp/monstore get monmap -- --out /tmp/monmap
# ceph-monstore-tool /tmp/monstore get monmap -- --out /tmp/monmap
Copy to Clipboard Copied! Toggle word wrap Toggle overflow monmaptool /tmp/monmap --print
# monmaptool /tmp/monmap --print
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
可选: 如果缺少
monmap
,则创建新的monmap
。monmaptool --create --add <mon-a-id> <mon-a-ip> --add <mon-b-id> <mon-b-ip> --add <mon-c-id> <mon-c-ip> --enable-all-features --clobber /root/monmap --fsid <fsid>
# monmaptool --create --add <mon-a-id> <mon-a-ip> --add <mon-b-id> <mon-b-ip> --add <mon-c-id> <mon-c-ip> --enable-all-features --clobber /root/monmap --fsid <fsid>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <mon-a-id>
- 是 mon-a pod 的 ID。
<mon-a-ip>
- 是 mon-a pod 的 IP 地址。
<mon-b-id>
- 是 mon-b pod 的 ID。
<mon-b-ip>
- 是 mon-b pod 的 IP 地址。
<mon-c-id>
- 是 mon-c pod 的 ID。
<mon-c-ip>
- 是 mon-c pod 的 IP 地址。
<fsid>
- 是文件系统 ID。
验证
monmap
。monmaptool /root/monmap --print
# monmaptool /root/monmap --print
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 导入
monmap
。重要使用之前创建的 keyring 文件。
ceph-monstore-tool /tmp/monstore rebuild -- --keyring /tmp/keyring --monmap /root/monmap
# ceph-monstore-tool /tmp/monstore rebuild -- --keyring /tmp/keyring --monmap /root/monmap
Copy to Clipboard Copied! Toggle word wrap Toggle overflow chown -R ceph:ceph /tmp/monstore
# chown -R ceph:ceph /tmp/monstore
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 创建旧
store.db
文件的备份。mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.corrupted
# mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.corrupted
Copy to Clipboard Copied! Toggle word wrap Toggle overflow mv /var/lib/ceph/mon/ceph-b/store.db /var/lib/ceph/mon/ceph-b/store.db.corrupted
# mv /var/lib/ceph/mon/ceph-b/store.db /var/lib/ceph/mon/ceph-b/store.db.corrupted
Copy to Clipboard Copied! Toggle word wrap Toggle overflow mv /var/lib/ceph/mon/ceph-c/store.db /var/lib/ceph/mon/ceph-c/store.db.corrupted
# mv /var/lib/ceph/mon/ceph-c/store.db /var/lib/ceph/mon/ceph-c/store.db.corrupted
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 将重建
store.db
文件复制到monstore
目录。mv /tmp/monstore/store.db /var/lib/ceph/mon/ceph-a/store.db
# mv /tmp/monstore/store.db /var/lib/ceph/mon/ceph-a/store.db
Copy to Clipboard Copied! Toggle word wrap Toggle overflow chown -R ceph:ceph /var/lib/ceph/mon/ceph-a/store.db
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-a/store.db
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 在重建了
monstore
目录后,将store.db
文件从本地复制到 MON 容器集的其余部分。oc cp $(oc get po -l app=rook-ceph-mon,mon=a -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-a/store.db /tmp/store.db
# oc cp $(oc get po -l app=rook-ceph-mon,mon=a -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-a/store.db /tmp/store.db
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc cp /tmp/store.db $(oc get po -l app=rook-ceph-mon,mon=<id> -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-<id>
# oc cp /tmp/store.db $(oc get po -l app=rook-ceph-mon,mon=<id> -oname | sed 's/pod\///g'):/var/lib/ceph/mon/ceph-<id>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <id>
- 是 MON pod 的 ID
导航到 MON 容器集的其余部分,再更改复制的
monstore
的所有权。oc rsh $(oc get po -l app=rook-ceph-mon,mon=<id> -oname)
# oc rsh $(oc get po -l app=rook-ceph-mon,mon=<id> -oname)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow chown -R ceph:ceph /var/lib/ceph/mon/ceph-<id>/store.db
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-<id>/store.db
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <id>
- 是 MON pod 的 ID
恢复补丁的更改。
对于 MON 部署:
oc replace --force -f <mon-deployment.yaml>
# oc replace --force -f <mon-deployment.yaml>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <mon-deployment.yaml>
- 是 MON 部署 yaml 文件
对于 OSD 部署:
oc replace --force -f <osd-deployment.yaml>
# oc replace --force -f <osd-deployment.yaml>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <osd-deployment.yaml>
- 是 OSD 部署 yaml 文件
对于 MGR 部署:
oc replace --force -f <mgr-deployment.yaml>
# oc replace --force -f <mgr-deployment.yaml>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <mgr-deployment.yaml>
是 MGR 部署 yaml 文件
重要确保 MON、MGR 和 OSD 容器集已启动并在运行。
扩展
rook-ceph-operator
和ocs-operator
部署。oc -n openshift-storage scale deployment ocs-operator --replicas=1
# oc -n openshift-storage scale deployment ocs-operator --replicas=1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
验证步骤
检查 Ceph 状态,以确认 CephFS 正在运行。
ceph -s
# ceph -s
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 输出示例:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 检查 Multicloud Object Gateway (MCG)状态。它应当处于活动状态,并且后备存储和 bucket 类应处于
Ready
状态。noobaa status -n openshift-storage
noobaa status -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 重要如果 MCG 没有处于 active 状态,且后备存储和存储桶类没有处于
Ready
状态,则需要重启所有 MCG 相关的 pod。如需更多信息,请参阅 第 12.1 节 “恢复 Multicloud 对象网关”。
12.1. 恢复 Multicloud 对象网关 复制链接链接已复制到粘贴板!
如果 Multicloud Object Gateway (MCG)没有处于 active 状态,且后备存储和存储桶类没有处于 Ready
状态,您需要重启所有 MCG 相关的 pod,并检查 MCG 状态以确认 MCG 已备份并在运行。
流程
重启与 MCG 相关的所有 pod。
oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-operator> -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc delete pods <noobaa-db> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <noobaa-operator>
- 是 MCG operator 的名称
<noobaa-core>
- 是 MCG 核心 pod 的名称
<noobaa-endpoint>
- 是 MCG 端点的名称
<noobaa-db>
- 是 MCG db pod 的名称
如果配置了 RADOS 对象网关(RGW),请重新启动容器集。
oc delete pods <rgw-pod> -n openshift-storage
# oc delete pods <rgw-pod> -n openshift-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow <rgw-pod>
- 是 RGW pod 的名称
在 OpenShift Container Platform 4.11 中,在恢复后,RBD PVC 无法挂载到应用程序 pod 上。因此,您需要重启托管应用容器集的节点。要获取托管应用程序 pod 的节点名称,请运行以下命令:
oc get pods <application-pod> -n <namespace> -o yaml | grep nodeName
# oc get pods <application-pod> -n <namespace> -o yaml | grep nodeName
nodeName: node_name