Chapter 2. Ceph migration


2.1. Migrating Ceph RBD

In this scenario, assuming Ceph is already >= 5, whether you use HCI or dedicated storage nodes, the daemons living in the OpenStack control plane should be moved/migrated to the existing external RHEL nodes (typically the HCI nodes, or the dedicated storage nodes in all the remaining use cases).

2.1.1. Requirements

  • Ceph is >= 5 and managed by cephadm/orchestrator (see the verification sketch after this list).
  • Ceph NFS (ganesha) has been migrated from a TripleO-based deployment to cephadm.
  • Both the Ceph public and cluster networks are propagated, through TripleO, to the target nodes.
  • The Ceph mons need to keep their IP addresses (to avoid a cold migration).
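
A minimal way to verify these requirements before you start, assuming you have access to a node that holds the Ceph admin keyring, is the following sketch:

# Enter a cephadm shell on a node with the admin keyring (for example, a controller)
sudo cephadm shell

# All daemons should report a release consistent with Ceph >= 5
ceph versions

# The orchestrator backend should be cephadm and available
ceph orch status

# Record the current mon IP addresses: these are the addresses to preserve
ceph mon dump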

2.1.2. Scenario: Migrate mon and mgr from the controller nodes

The goal of the first POC is to prove that you can successfully drain a controller node of its ceph daemons and move them to a different node. The initial target of the POC is RBD only, which means you are going to move only the mon and mgr daemons. For the purposes of this POC, you deploy a ceph cluster with mons, mgrs, and osds to simulate the environment a customer would be in before starting the migration. The goals of the first POC are to make sure that:

  • You can keep the mon IP addresses while moving them to the Ceph Storage nodes.
  • You can drain the existing controller nodes and shut them down.
  • You can deploy additional monitors to the existing nodes and promote them as _admin nodes that administrators can use to manage the Ceph cluster and perform day2 operations against it.
  • You can keep the cluster operational during the migration (see the sketch after this list).
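
A simple way to keep an eye on the last point throughout the procedure is to run a periodic health check from an _admin node while the daemons are being moved; a minimal sketch, assuming it is run from within a cephadm shell:

# Re-check quorum, mgr placement, and overall health while daemons are moved;
# the cluster should stay HEALTH_OK, apart from transient warnings
while true; do ceph -s | grep -E 'health:|mon:|mgr:'; echo '---'; sleep 10; done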

2.1.2.1. Prerequisites

The Storage nodes should be configured with both the storage and storage_mgmt networks, so that you can use both the Ceph public and cluster networks.

This step is the only one that requires interaction with TripleO. From 17+ you do not have to run any stack update. However, you should run the commands below to execute os-net-config on the bare-metal nodes and configure the additional networks.

Make sure the networks are defined in metalsmith.yaml for the CephStorage nodes:

  - name: CephStorage
    count: 2
    instances:
      - hostname: oc0-ceph-0
        name: oc0-ceph-0
      - hostname: oc0-ceph-1
        name: oc0-ceph-1
    defaults:
      networks:
        - network: ctlplane
          vif: true
        - network: storage_cloud_0
          subnet: storage_cloud_0_subnet
        - network: storage_mgmt_cloud_0
          subnet: storage_mgmt_cloud_0_subnet
      network_config:
        template: templates/single_nic_vlans/single_nic_vlans_storage.j2

Then run:

openstack overcloud node provision \
  -o overcloud-baremetal-deployed-0.yaml --stack overcloud-0 \
  --network-config -y --concurrency 2 /home/stack/metalsmith-0.yam

Verify that the storage networks are running on the node:

(undercloud) [CentOS-9 - stack@undercloud ~]$ ssh heat-admin@192.168.24.14 ip -o -4 a
Warning: Permanently added '192.168.24.14' (ED25519) to the list of known hosts.
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
5: br-storage    inet 192.168.24.14/24 brd 192.168.24.255 scope global br-storage\       valid_lft forever preferred_lft forever
6: vlan1    inet 192.168.24.14/24 brd 192.168.24.255 scope global vlan1\       valid_lft forever preferred_lft forever
7: vlan11    inet 172.16.11.172/24 brd 172.16.11.255 scope global vlan11\       valid_lft forever preferred_lft forever
8: vlan12    inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12\       valid_lft forever preferred_lft forever

Create a ceph spec based on the default roles, with the mon/mgr on the controller nodes:

openstack overcloud ceph spec -o ceph_spec.yaml -y \
    --stack overcloud-0 overcloud-baremetal-deployed-0.yaml

Deploy the Ceph cluster:

 openstack overcloud ceph deploy overcloud-baremetal-deployed-0.yaml \
    --stack overcloud-0 -o deployed_ceph.yaml \
    --network-data ~/oc0-network-data.yaml \
    --ceph-spec ~/ceph_spec.yaml

Note:

The ceph_spec.yaml file is the OSP-generated description of the ceph cluster. It is used later in the process as the basic template that cephadm needs to update the status/info of the daemons.
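
If you want to compare the generated file with what the cluster is actually running, you can also export the live specification from the orchestrator, for example:

# From within the cephadm shell: dump the service specs currently known to the orchestrator
ceph orch ls --export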

Check the status of the cluster:

[ceph: root@oc0-controller-0 /]# ceph -s
  cluster:
    id:     f6ec3ebe-26f7-56c8-985d-eb974e8e08e3
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum oc0-controller-0,oc0-controller-1,oc0-controller-2 (age 19m)
    mgr: oc0-controller-0.xzgtvo(active, since 32m), standbys: oc0-controller-1.mtxohd, oc0-controller-2.ahrgsk
    osd: 8 osds: 8 up (since 12m), 8 in (since 18m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   43 MiB used, 400 GiB / 400 GiB avail
    pgs:     1 active+clean
[ceph: root@oc0-controller-0 /]# ceph orch host ls
HOST              ADDR           LABELS          STATUS
oc0-ceph-0        192.168.24.14  osd
oc0-ceph-1        192.168.24.7   osd
oc0-controller-0  192.168.24.15  _admin mgr mon
oc0-controller-1  192.168.24.23  _admin mgr mon
oc0-controller-2  192.168.24.13  _admin mgr mon

The goal of the next section is to migrate the daemons of oc0-controller-{1,2} to oc0-ceph-{0,1}; it demonstrates how you can use cephadm for this kind of migration.

2.1.2.3. Migrate oc0-controller-1 into oc0-ceph-0

SSH into controller-0, then:

cephadm shell -v /home/ceph-admin/specs:/specs

SSH into ceph-0, then:

sudo watch podman ps  # watch the new mon/mgr being deployed here

(Optional) If the mgr is active on the source node, then:

ceph mgr fail <mgr instance>
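
If you are not sure which mgr instance is currently active, you can check it first, for example:

# Shows the currently active mgr instance and the number of standbys
ceph mgr stat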

From the cephadm shell, remove the labels from oc0-controller-1:

for label in mon mgr _admin; do
    ceph orch host label rm oc0-controller-1 $label;
done

Add the missing labels to oc0-ceph-0:

[ceph: root@oc0-controller-0 /]#
> for label in mon mgr _admin; do ceph orch host label add oc0-ceph-0 $label; done
Added label mon to host oc0-ceph-0
Added label mgr to host oc0-ceph-0
Added label _admin to host oc0-ceph-0

Drain and force-remove the oc0-controller-1 node:

[ceph: root@oc0-controller-0 /]# ceph orch host drain oc0-controller-1
Scheduled to remove the following daemons from host 'oc0-controller-1'
type                 id
-------------------- ---------------
mon                  oc0-controller-1
mgr                  oc0-controller-1.mtxohd
crash                oc0-controller-1
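
Before the force-remove, you can optionally confirm that the drain has completed and that cephadm no longer schedules daemons on the host, for example:

# The list of daemons on the drained host should be empty
# (or only show daemons that are being deleted)
ceph orch ps oc0-controller-1
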
[ceph: root@oc0-controller-0 /]# ceph orch host rm oc0-controller-1 --force
Removed  host 'oc0-controller-1'

[ceph: root@oc0-controller-0 /]# ceph orch host ls
HOST              ADDR           LABELS          STATUS
oc0-ceph-0        192.168.24.14  osd
oc0-ceph-1        192.168.24.7   osd
oc0-controller-0  192.168.24.15  mgr mon _admin
oc0-controller-2  192.168.24.13  _admin mgr mon

If you have only 3 mon nodes and the drain of the node does not work as expected (the containers are still there), SSH into controller-1 and force-purge the containers on the node:

[root@oc0-controller-1 ~]# sudo podman ps
CONTAINER ID  IMAGE                                                                                        COMMAND               CREATED         STATUS             PORTS       NAMES
5c1ad36472bc  registry.redhat.io/ceph/rhceph@sha256:320c364dcc8fc8120e2a42f54eb39ecdba12401a2546763b7bef15b02ce93bc4  -n mon.oc0-contro...  35 minutes ago  Up 35 minutes ago              ceph-f6ec3ebe-26f7-56c8-985d-eb974e8e08e3-mon-oc0-controller-1
3b14cc7bf4dd  registry.redhat.io/ceph/rhceph@sha256:320c364dcc8fc8120e2a42f54eb39ecdba12401a2546763b7bef15b02ce93bc4  -n mgr.oc0-contro...  35 minutes ago  Up 35 minutes ago              ceph-f6ec3ebe-26f7-56c8-985d-eb974e8e08e3-mgr-oc0-controller-1-mtxohd

[root@oc0-controller-1 ~]# cephadm rm-cluster --fsid f6ec3ebe-26f7-56c8-985d-eb974e8e08e3 --force

[root@oc0-controller-1 ~]# sudo podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
Note

Running cephadm rm-cluster on a node that is no longer part of the cluster removes all the containers and performs some cleanup on the filesystem.

Before shutting down oc0-controller-1, move the IP address (on the same network) to the oc0-ceph-0 node:

mon_host = [v2:172.16.11.54:3300/0,v1:172.16.11.54:6789/0] [v2:172.16.11.121:3300/0,v1:172.16.11.121:6789/0] [v2:172.16.11.205:3300/0,v1:172.16.11.205:6789/0]

[root@oc0-controller-1 ~]# ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
5: br-ex    inet 192.168.24.23/24 brd 192.168.24.255 scope global br-ex\       valid_lft forever preferred_lft forever
6: vlan100    inet 192.168.100.96/24 brd 192.168.100.255 scope global vlan100\       valid_lft forever preferred_lft forever
7: vlan12    inet 172.16.12.154/24 brd 172.16.12.255 scope global vlan12\       valid_lft forever preferred_lft forever
8: vlan11    inet 172.16.11.121/24 brd 172.16.11.255 scope global vlan11\       valid_lft forever preferred_lft forever
9: vlan13    inet 172.16.13.178/24 brd 172.16.13.255 scope global vlan13\       valid_lft forever preferred_lft forever
10: vlan70    inet 172.17.0.23/20 brd 172.17.15.255 scope global vlan70\       valid_lft forever preferred_lft forever
11: vlan1    inet 192.168.24.23/24 brd 192.168.24.255 scope global vlan1\       valid_lft forever preferred_lft forever
12: vlan14    inet 172.16.14.223/24 brd 172.16.14.255 scope global vlan14\       valid_lft forever preferred_lft forever

On oc0-ceph-0:

[heat-admin@oc0-ceph-0 ~]$ ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
5: br-storage    inet 192.168.24.14/24 brd 192.168.24.255 scope global br-storage\       valid_lft forever preferred_lft forever
6: vlan1    inet 192.168.24.14/24 brd 192.168.24.255 scope global vlan1\       valid_lft forever preferred_lft forever
7: vlan11    inet 172.16.11.172/24 brd 172.16.11.255 scope global vlan11\       valid_lft forever preferred_lft forever
8: vlan12    inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12\       valid_lft forever preferred_lft forever
[heat-admin@oc0-ceph-0 ~]$ sudo ip a add 172.16.11.121 dev vlan11
[heat-admin@oc0-ceph-0 ~]$ ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
5: br-storage    inet 192.168.24.14/24 brd 192.168.24.255 scope global br-storage\       valid_lft forever preferred_lft forever
6: vlan1    inet 192.168.24.14/24 brd 192.168.24.255 scope global vlan1\       valid_lft forever preferred_lft forever
7: vlan11    inet 172.16.11.172/24 brd 172.16.11.255 scope global vlan11\       valid_lft forever preferred_lft forever
7: vlan11    inet 172.16.11.121/32 scope global vlan11\       valid_lft forever preferred_lft forever
8: vlan12    inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12\       valid_lft forever preferred_lft forever

Power off oc0-controller-1.

Add the new mon to oc0-ceph-0 using the old IP address:

[ceph: root@oc0-controller-0 /]# ceph orch daemon add mon oc0-ceph-0:172.16.11.121
Deployed mon.oc0-ceph-0 on host 'oc0-ceph-0'

Check the new container on the oc0-ceph-0 node:

b581dc8bbb78  registry.redhat.io/ceph/rhceph@sha256:320c364dcc8fc8120e2a42f54eb39ecdba12401a2546763b7bef15b02ce93bc4  -n mon.oc0-ceph-0...  24 seconds ago  Up 24 seconds ago              ceph-f6ec3ebe-26f7-56c8-985d-eb974e8e08e3-mon-oc0-ceph-0
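
You can also confirm from the cephadm shell that the new mon has joined the quorum, for example:

# The monmap should now include oc0-ceph-0 together with the remaining controllers
ceph mon stat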

On the cephadm shell, back up the existing ceph_spec.yaml, edit the spec to remove any oc0-controller-1 entries, and replace them with oc0-ceph-0:

cp ceph_spec.yaml ceph_spec.yaml.bkp # backup the ceph_spec.yaml file

[ceph: root@oc0-controller-0 specs]# diff -u ceph_spec.yaml.bkp ceph_spec.yaml

--- ceph_spec.yaml.bkp  2022-07-29 15:41:34.516329643 +0000
+++ ceph_spec.yaml      2022-07-29 15:28:26.455329643 +0000
@@ -7,14 +7,6 @@
 - mgr
 service_type: host
 ---
-addr: 192.168.24.12
-hostname: oc0-controller-1
-labels:
-- _admin
-- mon
-- mgr
-service_type: host
----
 addr: 192.168.24.19
 hostname: oc0-controller-2
 labels:
@@ -38,7 +30,7 @@
 placement:
   hosts:
   - oc0-controller-0
-  - oc0-controller-1
+  - oc0-ceph-0
   - oc0-controller-2
 service_id: mon
 service_name: mon
@@ -47,8 +39,8 @@
 placement:
   hosts:
   - oc0-controller-0
-  - oc0-controller-1
   - oc0-controller-2
+  - oc0-ceph-0
 service_id: mgr
 service_name: mgr
 service_type: mgr

Apply the resulting spec:

ceph orch apply -i ceph_spec.yaml

The result of applying the spec is a new mgr deployed on the oc0-ceph-0 node, and the spec reconciled within cephadm:

[ceph: root@oc0-controller-0 specs]# ceph orch ls
NAME                     PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
crash                               4/4  5m ago     61m  *
mgr                                 3/3  5m ago     69s  oc0-controller-0;oc0-ceph-0;oc0-controller-2
mon                                 3/3  5m ago     70s  oc0-controller-0;oc0-ceph-0;oc0-controller-2
osd.default_drive_group               8  2m ago     69s  oc0-ceph-0;oc0-ceph-1

[ceph: root@oc0-controller-0 specs]# ceph -s
  cluster:
    id:     f6ec3ebe-26f7-56c8-985d-eb974e8e08e3
    health: HEALTH_WARN
            1 stray host(s) with 1 daemon(s) not managed by cephadm

  services:
    mon: 3 daemons, quorum oc0-controller-0,oc0-controller-2,oc0-ceph-0 (age 5m)
    mgr: oc0-controller-0.xzgtvo(active, since 62m), standbys: oc0-controller-2.ahrgsk, oc0-ceph-0.hccsbb
    osd: 8 osds: 8 up (since 42m), 8 in (since 49m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   43 MiB used, 400 GiB / 400 GiB avail
    pgs:     1 active+clean

Fix the warning by refreshing the mgr:

ceph mgr fail oc0-controller-0.xzgtvo

At this point the cluster is clean:

[ceph: root@oc0-controller-0 specs]# ceph -s
  cluster:
    id:     f6ec3ebe-26f7-56c8-985d-eb974e8e08e3
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum oc0-controller-0,oc0-controller-2,oc0-ceph-0 (age 7m)
    mgr: oc0-controller-2.ahrgsk(active, since 25s), standbys: oc0-controller-0.xzgtvo, oc0-ceph-0.hccsbb
    osd: 8 osds: 8 up (since 44m), 8 in (since 50m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   43 MiB used, 400 GiB / 400 GiB avail
    pgs:     1 active+clean

oc0-controller-1 has been removed and powered off without leaving any trace on the ceph cluster.

The same approach and the same steps can be applied to migrate oc0-controller-2 to oc0-ceph-1.
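
As a condensed sketch, assuming the mon IP address currently bound to oc0-controller-2 on the storage network is the one to preserve, the sequence mirrors the steps above:

# 1. If the active mgr runs on oc0-controller-2, fail it over first
ceph mgr fail <active mgr instance>

# 2. Move the mon/mgr/_admin labels from oc0-controller-2 to oc0-ceph-1
for label in mon mgr _admin; do ceph orch host label rm oc0-controller-2 $label; done
for label in mon mgr _admin; do ceph orch host label add oc0-ceph-1 $label; done

# 3. Drain and remove the host, then move its mon IP to oc0-ceph-1 and power the node off
ceph orch host drain oc0-controller-2
ceph orch host rm oc0-controller-2 --force

# 4. Add the mon back on oc0-ceph-1 with the preserved IP,
#    then update ceph_spec.yaml accordingly and re-apply it
ceph orch daemon add mon oc0-ceph-1:<preserved mon IP>
ceph orch apply -i ceph_spec.yaml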

2.1.2.4. Screen recording:

2.1.3. Useful resources
