Chapter 3. Metro-DR solution for OpenShift Data Foundation
This section of the guide provides details of the Metro Disaster Recovery (Metro-DR) steps and commands necessary to fail over an application from one OpenShift Container Platform cluster to another and then fail back the same application to the original primary cluster. In this case the OpenShift Container Platform clusters are created or imported using Red Hat Advanced Cluster Management (RHACM) and must be separated by less than 10 ms RTT latency.
The persistent storage for applications is provided by an external Red Hat Ceph Storage (RHCS) cluster stretched between the two locations with the OpenShift Container Platform instances connected to this storage cluster. An arbiter node with a storage monitor service is required at a third location (different location than where OpenShift Container Platform instances are deployed) to establish quorum for the RHCS cluster in the case of a site outage. This third location can be in the range of ~100ms RTT from the storage cluster connected to the OpenShift Container Platform instances.
This is a general overview of the Metro-DR steps required to configure and execute OpenShift Disaster Recovery (ODR) capabilities using OpenShift Data Foundation and RHACM across two distinct OpenShift Container Platform clusters separated by distance. In addition to these two clusters, called managed clusters, a third OpenShift Container Platform cluster is required that serves as the Red Hat Advanced Cluster Management (RHACM) hub cluster.
You can now easily set up Metropolitan disaster recovery solutions for workloads based on OpenShift virtualization technology using OpenShift Data Foundation. For more information, see the knowledgebase article.
3.1. Components of Metro-DR solution
Metro-DR is composed of Red Hat Advanced Cluster Management for Kubernetes, Red Hat Ceph Storage and OpenShift Data Foundation components to provide application and data mobility across OpenShift Container Platform clusters.
Red Hat Advanced Cluster Management for Kubernetes
Red Hat Advanced Cluster Management (RHACM) provides the ability to manage multiple clusters and application lifecycles. Hence, it serves as a control plane in a multi-cluster environment.
RHACM is split into two parts:
- RHACM Hub: components that run on the multi-cluster control plane.
- Managed clusters: components that run on the clusters that are managed.
For more information about this product, see RHACM documentation and the RHACM “Manage Applications” documentation.
Red Hat Ceph Storage
Red Hat Ceph Storage is a massively scalable, open, software-defined storage platform that combines the most stable version of the Ceph storage system with a Ceph management platform, deployment utilities, and support services. It significantly lowers the cost of storing enterprise data and helps organizations manage exponential data growth. The software is a robust and modern petabyte-scale storage platform for public or private cloud deployments.
For more product information, see Red Hat Ceph Storage.
OpenShift Data Foundation
OpenShift Data Foundation provides the ability to provision and manage storage for stateful applications in an OpenShift Container Platform cluster. It is backed by Ceph as the storage provider, whose lifecycle is managed by Rook in the OpenShift Data Foundation component stack, while Ceph-CSI provides the provisioning and management of Persistent Volumes for stateful applications.
OpenShift DR
OpenShift DR is a disaster recovery orchestrator for stateful applications across a set of peer OpenShift clusters which are deployed and managed using RHACM and provides cloud-native interfaces to orchestrate the life-cycle of an application’s state on Persistent Volumes. These include:
- Protecting an application and its state relationship across OpenShift clusters
- Failing over an application and its state to a peer cluster
- Relocating an application and its state to the previously deployed cluster
OpenShift DR is split into three components:
- ODF Multicluster Orchestrator: Installed on the multi-cluster control plane (RHACM Hub), it orchestrates configuration and peering of OpenShift Data Foundation clusters for Metro and Regional DR relationships.
- OpenShift DR Hub Operator: Automatically installed as part of ODF Multicluster Orchestrator installation on the hub cluster to orchestrate failover or relocation of DR enabled applications.
- OpenShift DR Cluster Operator: Automatically installed on each managed cluster that is part of a Metro and Regional DR relationship to manage the lifecycle of all PVCs of an application.
3.2. Metro-DR deployment workflow
This section provides an overview of the steps required to configure and deploy Metro-DR capabilities using the latest versions of Red Hat OpenShift Data Foundation, Red Hat Ceph Storage (RHCS) and Red Hat Advanced Cluster Management for Kubernetes (RHACM) version 2.10 or later, across two distinct OpenShift Container Platform clusters. In addition to two managed clusters, a third OpenShift Container Platform cluster will be required to deploy the Advanced Cluster Management.
To configure your infrastructure, perform the below steps in the order given:
- Ensure requirements across the Hub, Primary and Secondary OpenShift Container Platform clusters that are part of the DR solution are met. See Requirements for enabling Metro-DR.
- Ensure you meet the requirements for deploying Red Hat Ceph Storage stretch cluster with arbiter. See Requirements for deploying Red Hat Ceph Storage.
- Deploy and configure Red Hat Ceph Storage stretch mode. For instructions on enabling Ceph cluster on two different data centers using stretched mode functionality, see Deploying Red Hat Ceph Storage.
- Install OpenShift Data Foundation operator and create a storage system on Primary and Secondary managed clusters. See Installing OpenShift Data Foundation on managed clusters.
- Install the ODF Multicluster Orchestrator on the Hub cluster. See Installing ODF Multicluster Orchestrator on Hub cluster.
- Configure SSL access between the Hub, Primary and Secondary clusters. See Configuring SSL access across clusters.
- Create a DRPolicy resource for use with applications requiring DR protection across the Primary and Secondary clusters. See Creating Disaster Recovery Policy on Hub cluster.
Note: The Metro-DR solution can only have one DRPolicy.
Testing your disaster recovery solution with:
Subscription-based application:
- Create sample applications. See Creating sample application.
- Test failover and relocate operations using the sample application between managed clusters. See Subscription-based application failover and relocating subscription-based application.
ApplicationSet-based application:
- Create sample applications. See Creating ApplicationSet-based applications.
- Test failover and relocate operations using the sample application between managed clusters. See ApplicationSet-based application failover and relocating ApplicationSet-based application.
Discovered applications:
- Ensure all requirements mentioned in the prerequisites are addressed. See Prerequisites for disaster recovery protection of discovered applications.
- Create a sample discovered application. See Creating a sample discovered application.
- Enroll the discovered application. See Enrolling a sample discovered application for disaster recovery protection.
- Test failover and relocate. See Discovered application failover and relocate.
3.3. Requirements for enabling Metro-DR
The prerequisites to installing a disaster recovery solution supported by Red Hat OpenShift Data Foundation are as follows:
You must have the following OpenShift clusters that have network reachability between them:
- Hub cluster where the Red Hat Advanced Cluster Management (RHACM) for Kubernetes operator is installed.
- Primary managed cluster where OpenShift Data Foundation is running.
- Secondary managed cluster where OpenShift Data Foundation is running.
Note: For configuring a hub recovery setup, you need a fourth cluster which acts as the passive hub. The primary managed cluster (Site-1) can be co-situated with the active RHACM hub cluster while the passive hub cluster is situated along with the secondary managed cluster (Site-2). Alternatively, the active RHACM hub cluster can be placed in a neutral site (Site-3) that is not impacted by the failures of either the primary managed cluster at Site-1 or the secondary cluster at Site-2. In this situation, if a passive hub cluster is used it can be placed with the secondary cluster at Site-2. For more information, see Configuring passive hub cluster for hub recovery.
Hub recovery is a Technology Preview feature and is subject to Technology Preview support limitations.
Ensure that the RHACM operator and MultiClusterHub are installed on the Hub cluster. See the RHACM installation guide for instructions.
After the operator is successfully installed, a popover with a message that the Web console update is available appears on the user interface. Click Refresh web console from this popover for the console changes to reflect.
Ensure that application traffic routing and redirection are configured appropriately.
On the Hub cluster:
- Navigate to All Clusters → Infrastructure → Clusters.
- Import or create the Primary managed cluster and the Secondary managed cluster using the RHACM console.
- Choose the appropriate options for your environment.
After the managed clusters are successfully created or imported, you can see the list of clusters that were imported or created on the console. For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster.
The OpenShift Container Platform managed clusters and the Red Hat Ceph Storage (RHCS) nodes have distance limitations. The network latency between the sites must be below 10 milliseconds round-trip time (RTT).
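A quick way to confirm the inter-site latency before proceeding is a ping test between a node in each site; the address below is only an illustration, so substitute a reachable host in your own peer site:

$ ping -c 20 -q 10.0.40.185
# In the "rtt min/avg/max/mdev" summary line, the avg value must stay below 10 ms.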
3.4. Requirements for deploying Red Hat Ceph Storage stretch cluster with arbiter
Red Hat Ceph Storage is an open-source enterprise platform that provides unified software-defined storage on standard, economical servers and disks. With block, object, and file storage combined into one platform, Red Hat Ceph Storage efficiently and automatically manages all your data, so you can focus on the applications and workloads that use it.
This section provides a basic overview of the Red Hat Ceph Storage deployment. For more complex deployments, refer to the official documentation guide for Red Hat Ceph Storage 7.
Only flash media is supported since stretch mode runs with min_size=1 when degraded. Use stretch mode only with all-flash OSDs. Using all-flash OSDs minimizes the time needed to recover once connectivity is restored, thus minimizing the potential for data loss.
Erasure coded pools cannot be used with stretch mode.
3.4.1. Hardware requirements
For information on minimum hardware requirements for deploying Red Hat Ceph Storage, see Minimum hardware recommendations for containerized Ceph.
Table 3.1. Red Hat Ceph Storage cluster host layout

| Node name | Datacenter | Ceph components |
|---|---|---|
| ceph1 | DC1 | OSD+MON+MGR |
| ceph2 | DC1 | OSD+MON |
| ceph3 | DC1 | OSD+MDS+RGW |
| ceph4 | DC2 | OSD+MON+MGR |
| ceph5 | DC2 | OSD+MON |
| ceph6 | DC2 | OSD+MDS+RGW |
| ceph7 | DC3 | MON |
3.4.2. Software requirements
Use the latest software version of Red Hat Ceph Storage 7.
For more information on the supported Operating System versions for Red Hat Ceph Storage, see knowledgebase article on Red Hat Ceph Storage: Supported configurations.
3.4.3. Network configuration requirements
The recommended Red Hat Ceph Storage configuration is as follows:
- You must have two separate networks, one public network and one private network.
- You must have three different datacenters that support VLANs and subnets for Ceph's private and public networks for all datacenters.
Note: You can use different subnets for each of the datacenters.
- The latencies between the two datacenters running the Red Hat Ceph Storage Object Storage Devices (OSDs) cannot exceed 10 ms RTT. For the arbiter datacenter, this was tested with values as high as 100 ms RTT to the other two OSD datacenters.
Here is an example of a basic network configuration that we have used in this guide:
- DC1: Ceph public/private network: 10.0.40.0/24
- DC2: Ceph public/private network: 10.0.40.0/24
- DC3: Ceph public/private network: 10.0.40.0/24
For more information on the required network environment, see Ceph network configuration.
3.5. Deploying Red Hat Ceph Storage
3.5.1. Node pre-deployment steps
Before installing the Red Hat Ceph Storage Ceph cluster, perform the following steps to fulfill all the requirements needed.
Register all the nodes to the Red Hat Network or Red Hat Satellite and subscribe to a valid pool:
subscription-manager register
subscription-manager subscribe --pool=8a8XXXXXX9e0
Enable access for all the nodes in the Ceph cluster to the following repositories:
- rhel9-for-x86_64-baseos-rpms
- rhel9-for-x86_64-appstream-rpms

subscription-manager repos --disable="*" --enable="rhel9-for-x86_64-baseos-rpms" --enable="rhel9-for-x86_64-appstream-rpms"
Update the operating system RPMs to the latest version and reboot if needed:

dnf update -y
reboot
Select a node from the cluster to be your bootstrap node. ceph1 is our bootstrap node in this example going forward.

Only on the bootstrap node ceph1, enable the ansible-2.9-for-rhel-9-x86_64-rpms and rhceph-6-tools-for-rhel-9-x86_64-rpms repositories:

subscription-manager repos --enable="ansible-2.9-for-rhel-9-x86_64-rpms" --enable="rhceph-6-tools-for-rhel-9-x86_64-rpms"
Configure the hostname using the bare/short hostname in all the hosts.

hostnamectl set-hostname <short_name>
Verify the hostname configuration for deploying Red Hat Ceph Storage with cephadm.
$ hostname
Example output:
ceph1
Modify the /etc/hosts file and add the fqdn entry to the 127.0.0.1 IP by setting the DOMAIN variable with your DNS domain name.

DOMAIN="example.domain.com"

cat <<EOF >/etc/hosts
127.0.0.1 $(hostname).${DOMAIN} $(hostname) localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 $(hostname).${DOMAIN} $(hostname) localhost6 localhost6.localdomain6
EOF
Check the long hostname with the fqdn using the hostname -f option.

$ hostname -f

Example output:

ceph1.example.domain.com

Note: To know more about why these changes are required, see Fully Qualified Domain Names vs Bare Host Names.
Run the following steps on the bootstrap node. In our example, the bootstrap node is ceph1.

Install the cephadm-ansible RPM package:

$ sudo dnf install -y cephadm-ansible
Important: To run the ansible playbooks, you must have ssh passwordless access to all the nodes that are configured to the Red Hat Ceph Storage cluster. Ensure that the configured user (for example, deployment-user) has root privileges to invoke the sudo command without needing a password.

To use a custom key, configure the selected user (for example, deployment-user) ssh config file to specify the id/key that will be used for connecting to the nodes via ssh:

cat <<EOF > ~/.ssh/config
Host ceph*
   User deployment-user
   IdentityFile ~/.ssh/ceph.pem
EOF
Build the ansible inventory:

cat <<EOF > /usr/share/cephadm-ansible/inventory
ceph1
ceph2
ceph3
ceph4
ceph5
ceph6
ceph7

[admin]
ceph1
ceph4
EOF
Note: Here, the hosts (ceph1 and ceph4) belonging to two different data centers are configured as part of the [admin] group in the inventory file and are tagged as _admin by cephadm. Each of these admin nodes receives the admin ceph keyring during the bootstrap process so that when one data center is down, we can check using the other available admin node.

Verify that ansible can access all nodes using the ping module before running the pre-flight playbook.

$ ansible -i /usr/share/cephadm-ansible/inventory -m ping all -b
Example output:
ceph6 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph4 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph3 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph2 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph5 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph1 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
ceph7 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "ping": "pong"
}
Navigate to the /usr/share/cephadm-ansible directory. Run ansible-playbook with relative file paths.

$ ansible-playbook -i /usr/share/cephadm-ansible/inventory /usr/share/cephadm-ansible/cephadm-preflight.yml --extra-vars "ceph_origin=rhcs"
The preflight playbook configures the RHCS dnf repository and prepares the storage cluster for bootstrapping. It also installs podman, lvm2, chronyd, and cephadm. The default location for cephadm-ansible and cephadm-preflight.yml is /usr/share/cephadm-ansible. For additional information, see Running the preflight playbook.
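If you later add or rebuild a single host, the preflight playbook does not have to run against the whole inventory; the standard ansible-playbook --limit option restricts it to selected nodes. For example, to prepare only the arbiter node (shown purely as an illustration):

$ ansible-playbook -i /usr/share/cephadm-ansible/inventory /usr/share/cephadm-ansible/cephadm-preflight.yml --extra-vars "ceph_origin=rhcs" --limit ceph7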
3.5.2. Cluster bootstrapping and service deployment with cephadm utility
The cephadm utility installs and starts a single Ceph Monitor daemon and a Ceph Manager daemon for a new Red Hat Ceph Storage cluster on the local node where the cephadm bootstrap command is run.
In this guide we are going to bootstrap the cluster and deploy all the needed Red Hat Ceph Storage services in one step using a cluster specification yaml file.
If you find issues during the deployment, it may be easier to troubleshoot the errors by dividing the deployment into two steps:
- Bootstrap
- Service deployment
For additional information on the bootstrapping process, see Bootstrapping a new storage cluster.
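If you choose the two-step approach, a minimal sketch under the same assumptions as this guide looks like the following; the mount path inside the cephadm shell is an assumption for illustration:

# Step 1: bootstrap only, omitting --apply-spec
$ cephadm bootstrap --ssh-user=deployment-user --mon-ip 10.0.40.78 --registry-json /root/registry.json

# Step 2: apply the same service specification afterwards from a cephadm shell
$ cephadm shell --mount /root/cluster-spec.yaml
[ceph: root@ceph1 /]# ceph orch apply -i /mnt/cluster-spec.yaml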
Procedure
Create a JSON file to authenticate against the container registry as follows:

$ cat <<EOF > /root/registry.json
{
 "url":"registry.redhat.io",
 "username":"User",
 "password":"Pass"
}
EOF
Create a cluster-spec.yaml file that adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services should run, following Table 3.1.

cat <<EOF > /root/cluster-spec.yaml
service_type: host
addr: 10.0.40.78  ## <XXX.XXX.XXX.XXX>
hostname: ceph1   ## <ceph-hostname-1>
location:
  root: default
  datacenter: DC1
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: 10.0.40.35
hostname: ceph2
location:
  datacenter: DC1
labels:
  - osd
  - mon
---
service_type: host
addr: 10.0.40.24
hostname: ceph3
location:
  datacenter: DC1
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: 10.0.40.185
hostname: ceph4
location:
  root: default
  datacenter: DC2
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: 10.0.40.88
hostname: ceph5
location:
  datacenter: DC2
labels:
  - osd
  - mon
---
service_type: host
addr: 10.0.40.66
hostname: ceph6
location:
  datacenter: DC2
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: 10.0.40.221
hostname: ceph7
labels:
  - mon
---
service_type: mon
placement:
  label: "mon"
---
service_type: mds
service_id: cephfs
placement:
  label: "mds"
---
service_type: mgr
service_name: mgr
placement:
  label: "mgr"
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  label: "osd"
spec:
  data_devices:
    all: true
---
service_type: rgw
service_id: objectgw
service_name: rgw.objectgw
placement:
  count: 2
  label: "rgw"
spec:
  rgw_frontend_port: 8080
EOF
Retrieve the IP for the NIC with the Red Hat Ceph Storage public network configured from the bootstrap node. After substituting 10.0.40.0 with the subnet that you have defined in your ceph public network, execute the following command.

$ ip a | grep 10.0.40
Example output:
10.0.40.78
Run the cephadm bootstrap command as the root user on the node that will be the initial Monitor node in the cluster. The IP_ADDRESS option is the node's IP address that you are using to run the cephadm bootstrap command.

Note: If you have configured a different user instead of root for passwordless SSH access, then use the --ssh-user= flag with the cephadm bootstrap command.
If you are using non-default (that is, not id_rsa) ssh key names, then use the --ssh-private-key and --ssh-public-key options with the cephadm command.

$ cephadm bootstrap --ssh-user=deployment-user --mon-ip 10.0.40.78 --apply-spec /root/cluster-spec.yaml --registry-json /root/registry.json
Important: If the local node uses fully-qualified domain names (FQDN), then add the --allow-fqdn-hostname option to cephadm bootstrap on the command line.

Once the bootstrap finishes, you will see the following output from the previous cephadm bootstrap command:

You can access the Ceph CLI with:

	sudo /usr/sbin/cephadm shell --fsid dd77f050-9afe-11ec-a56c-029f8148ea14 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring

Consider enabling telemetry to help improve Ceph:

	ceph telemetry on

For more information see:

	https://docs.ceph.com/docs/pacific/mgr/telemetry/
Verify the status of Red Hat Ceph Storage cluster deployment using the Ceph CLI client from ceph1:
$ ceph -s
Example output:
  cluster:
    id:     3a801754-e01f-11ec-b7ab-005056838602
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph1,ceph2,ceph4,ceph5,ceph7 (age 4m)
    mgr: ceph1.khuuot(active, since 5m), standbys: ceph4.zotfsp
    osd: 12 osds: 12 up (since 3m), 12 in (since 4m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   5 pools, 107 pgs
    objects: 191 objects, 5.3 KiB
    usage:   105 MiB used, 600 GiB / 600 GiB avail
    pgs:     105 active+clean
Note: It may take several minutes for all the services to start.
It is normal to get a global recovery event while you do not have any OSDs configured.
You can use ceph orch ps and ceph orch ls to further check the status of the services.

Verify whether all the nodes are part of the cephadm cluster.

$ ceph orch host ls
Example output:
HOST   ADDR          LABELS               STATUS
ceph1  10.0.40.78    _admin osd mon mgr
ceph2  10.0.40.35    osd mon
ceph3  10.0.40.24    osd mds rgw
ceph4  10.0.40.185   osd mon mgr
ceph5  10.0.40.88    osd mon
ceph6  10.0.40.66    osd mds rgw
ceph7  10.0.40.221   mon
Note: You can run Ceph commands directly from the host because ceph1 was configured in the cephadm-ansible inventory as part of the [admin] group. The Ceph admin keys were copied to the host during the cephadm bootstrap process.

Check the current placement of the Ceph monitor services on the datacenters.

$ ceph orch ps | grep mon | awk '{print $1 " " $2}'
Example output:
mon.ceph1 ceph1
mon.ceph2 ceph2
mon.ceph4 ceph4
mon.ceph5 ceph5
mon.ceph7 ceph7
Check the current placement of the Ceph manager services on the datacenters.
$ ceph orch ps | grep mgr | awk '{print $1 " " $2}'
Example output:
mgr.ceph2.ycgwyz ceph2
mgr.ceph5.kremtt ceph5
Check the ceph osd crush map layout to ensure that each host has one OSD configured and its status is UP. Also, double-check that each node is under the right datacenter bucket as specified in Table 3.1.

$ ceph osd tree
Example output:
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.87900  root default
-16         0.43950      datacenter DC1
-11         0.14650          host ceph1
  2    ssd  0.14650              osd.2       up   1.00000  1.00000
 -3         0.14650          host ceph2
  3    ssd  0.14650              osd.3       up   1.00000  1.00000
-13         0.14650          host ceph3
  4    ssd  0.14650              osd.4       up   1.00000  1.00000
-17         0.43950      datacenter DC2
 -5         0.14650          host ceph4
  0    ssd  0.14650              osd.0       up   1.00000  1.00000
 -9         0.14650          host ceph5
  1    ssd  0.14650              osd.1       up   1.00000  1.00000
 -7         0.14650          host ceph6
  5    ssd  0.14650              osd.5       up   1.00000  1.00000
Create and enable a new RBD block pool.

$ ceph osd pool create rbdpool 32 32
$ ceph osd pool application enable rbdpool rbd
Note: The number 32 at the end of the command is the number of PGs assigned to this pool. The number of PGs can vary depending on several factors like the number of OSDs in the cluster, the expected percentage used of the pool, and so on. You can use the following calculator to determine the number of PGs needed: Ceph Placement Groups (PGs) per Pool Calculator.
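Alternatively, if you prefer not to size placement groups by hand, Ceph's PG autoscaler can manage the PG count for the pool; this optional sketch assumes the pool is named rbdpool as in the previous step:

$ ceph osd pool set rbdpool pg_autoscale_mode on
$ ceph osd pool autoscale-status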
Verify that the RBD pool has been created.
$ ceph osd lspools | grep rbdpool
Example output:
3 rbdpool
Verify that MDS services are active and have located one service on each datacenter.
$ ceph orch ps | grep mds
Example output:
mds.cephfs.ceph3.cjpbqo  ceph3  running (17m)  117s ago  17m  16.1M  -  16.2.9
mds.cephfs.ceph6.lqmgqt  ceph6  running (17m)  117s ago  17m  16.1M  -  16.2.9
Create the CephFS volume.
$ ceph fs volume create cephfs
Note: The ceph fs volume create command also creates the needed data and meta CephFS pools. For more information, see Configuring and Mounting Ceph File Systems.

Check the Ceph status to verify how the MDS daemons have been deployed. Ensure that the state is active, where ceph6 is the primary MDS for this filesystem and ceph3 is the secondary MDS.

$ ceph fs status
Example output:
cephfs - 0 clients
======
RANK  STATE   MDS                  ACTIVITY     DNS   INOS  DIRS  CAPS
 0    active  cephfs.ceph6.ggjywj  Reqs: 0 /s   10    13    12    0
POOL                 TYPE      USED   AVAIL
cephfs.cephfs.meta   metadata  96.0k  284G
cephfs.cephfs.data   data      0      284G
STANDBY MDS
cephfs.ceph3.ogcqkl
Verify that RGW services are active.
$ ceph orch ps | grep rgw
Example output:
rgw.objectgw.ceph3.kkmxgb  ceph3  *:8080  running (7m)  3m ago  7m  52.7M  -  16.2.9
rgw.objectgw.ceph6.xmnpah  ceph6  *:8080  running (7m)  3m ago  7m  53.3M  -  16.2.9
3.5.3. Configuring Red Hat Ceph Storage stretch mode
Once the Red Hat Ceph Storage cluster is fully deployed using cephadm, use the following procedure to configure the stretch cluster mode. The new stretch mode is designed to handle the 2-site case.
Procedure
Check the current election strategy being used by the monitors with the ceph mon dump command. By default in a ceph cluster, the election strategy is set to classic.
ceph mon dump | grep election_strategy
Example output:
dumped monmap epoch 9
election_strategy: 1
Change the monitor election to connectivity.
ceph mon set election_strategy connectivity
Run the previous ceph mon dump command again to verify the election_strategy value.
$ ceph mon dump | grep election_strategy
Example output:
dumped monmap epoch 10
election_strategy: 3
To know more about the different election strategies, see Configuring monitor election strategy.
Set the location for all our Ceph monitors:
ceph mon set_location ceph1 datacenter=DC1
ceph mon set_location ceph2 datacenter=DC1
ceph mon set_location ceph4 datacenter=DC2
ceph mon set_location ceph5 datacenter=DC2
ceph mon set_location ceph7 datacenter=DC3
Verify that each monitor has its appropriate location.
$ ceph mon dump
Example output:
epoch 17
fsid dd77f050-9afe-11ec-a56c-029f8148ea14
last_changed 2022-03-04T07:17:26.913330+0000
created 2022-03-03T14:33:22.957190+0000
min_mon_release 16 (pacific)
election_strategy: 3
0: [v2:10.0.143.78:3300/0,v1:10.0.143.78:6789/0] mon.ceph1; crush_location {datacenter=DC1}
1: [v2:10.0.155.185:3300/0,v1:10.0.155.185:6789/0] mon.ceph4; crush_location {datacenter=DC2}
2: [v2:10.0.139.88:3300/0,v1:10.0.139.88:6789/0] mon.ceph5; crush_location {datacenter=DC2}
3: [v2:10.0.150.221:3300/0,v1:10.0.150.221:6789/0] mon.ceph7; crush_location {datacenter=DC3}
4: [v2:10.0.155.35:3300/0,v1:10.0.155.35:6789/0] mon.ceph2; crush_location {datacenter=DC1}
Create a CRUSH rule that makes use of this OSD crush topology by installing the ceph-base RPM package in order to use the crushtool command:

$ dnf -y install ceph-base
To know more about CRUSH ruleset, see Ceph CRUSH ruleset.
Get the compiled CRUSH map from the cluster:
$ ceph osd getcrushmap > /etc/ceph/crushmap.bin
Decompile the CRUSH map and convert it to a text file in order to be able to edit it:
$ crushtool -d /etc/ceph/crushmap.bin -o /etc/ceph/crushmap.txt
Add the following rule to the CRUSH map by editing the text file /etc/ceph/crushmap.txt at the end of the file.

$ vim /etc/ceph/crushmap.txt

rule stretch_rule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}
# end crush map
This example is applicable for active applications in both OpenShift Container Platform clusters.
Note: The rule id has to be unique. In the example, we only have one more crush rule with id 0, hence we are using id 1. If your deployment has more rules created, then use the next free id.

The CRUSH rule declared contains the following information:
Rule name
- Description: A unique name for identifying the rule.
- Value: stretch_rule

id
- Description: A unique whole number for identifying the rule.
- Value: 1

type
- Description: Describes a rule for either a storage drive replicated or erasure-coded.
- Value: replicated

min_size
- Description: If a pool makes fewer replicas than this number, CRUSH will not select this rule.
- Value: 1

max_size
- Description: If a pool makes more replicas than this number, CRUSH will not select this rule.
- Value: 10

step take default
- Description: Takes the root bucket called default, and begins iterating down the tree.

step choose firstn 0 type datacenter
- Description: Selects the datacenter bucket, and goes into its subtrees.

step chooseleaf firstn 2 type host
- Description: Selects the number of buckets of the given type. In this case, it is two different hosts located in the datacenter it entered at the previous level.

step emit
- Description: Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule.
Compile the new CRUSH map from the file /etc/ceph/crushmap.txt and convert it to a binary file called /etc/ceph/crushmap2.bin:

$ crushtool -c /etc/ceph/crushmap.txt -o /etc/ceph/crushmap2.bin
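As an optional sanity check before injecting the map, crushtool can simulate placements against the new rule; rule id 1 and four replicas are assumed here to match the stretch layout of two hosts per datacenter:

$ crushtool -i /etc/ceph/crushmap2.bin --test --rule 1 --num-rep 4 --show-mappings | head -5
# Each mapping line should list OSDs spread across both datacenters.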
Inject the new crushmap we created back into the cluster:
$ ceph osd setcrushmap -i /etc/ceph/crushmap2.bin
Example output:
17
Note: The number 17 is a counter and it will increase (18, 19, and so on) depending on the changes you make to the crush map.
Verify that the stretched rule created is now available for use.
ceph osd crush rule ls
Example output:
replicated_rule stretch_rule
Enable the stretch cluster mode.
$ ceph mon enable_stretch_mode ceph7 stretch_rule datacenter
In this example, ceph7 is the arbiter node, stretch_rule is the crush rule we created in the previous step and datacenter is the dividing bucket.

Verify that all our pools are using the stretch_rule CRUSH rule we have created in our Ceph cluster:

$ for pool in $(rados lspools);do echo -n "Pool: ${pool}; ";ceph osd pool get ${pool} crush_rule;done
Example output:
Pool: device_health_metrics; crush_rule: stretch_rule
Pool: cephfs.cephfs.meta; crush_rule: stretch_rule
Pool: cephfs.cephfs.data; crush_rule: stretch_rule
Pool: .rgw.root; crush_rule: stretch_rule
Pool: default.rgw.log; crush_rule: stretch_rule
Pool: default.rgw.control; crush_rule: stretch_rule
Pool: default.rgw.meta; crush_rule: stretch_rule
Pool: rbdpool; crush_rule: stretch_rule
This indicates that a working Red Hat Ceph Storage stretched cluster with arbiter mode is now available.
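Optionally, confirm that stretch mode has adjusted pool replication as expected; when stretch mode is active, replicated pools are typically set to size 4 with min_size 2 so that two copies land in each datacenter (the pool name rbdpool is from the earlier step):

$ ceph osd pool get rbdpool size
$ ceph osd pool get rbdpool min_size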
3.6. Installing OpenShift Data Foundation on managed clusters
To configure storage replication between the two OpenShift Container Platform clusters, the OpenShift Data Foundation operator must first be installed on each managed cluster.
Prerequisites
- Ensure that you have met the hardware requirements for OpenShift Data Foundation external deployments. For a detailed description of the hardware requirements, see External mode requirements.
Procedure
- Install and configure the latest OpenShift Data Foundation cluster on each of the managed clusters.
After installing the operator, create a StorageSystem using the option Full deployment type and Connect with external storage platform, where your Backing storage type is Red Hat Ceph Storage.
For detailed instructions, refer to Deploying OpenShift Data Foundation in external mode.
Use the following flags with the ceph-external-cluster-details-exporter.py script.

At a minimum, you must use the following three flags with the ceph-external-cluster-details-exporter.py script:

--rbd-data-pool-name
- With the name of the RBD pool that was created during RHCS deployment for OpenShift Container Platform. For example, the pool can be called rbdpool.

--rgw-endpoint
- Provide the endpoint in the format <ip_address>:<port>. It is the RGW IP of the RGW daemon running on the same site as the OpenShift Container Platform cluster that you are configuring.

--run-as-user
- With a different client name for each site.

The following flags are optional if default values were used during the RHCS deployment:

--cephfs-filesystem-name
- With the name of the CephFS filesystem we created during RHCS deployment for OpenShift Container Platform. The default filesystem name is cephfs.

--cephfs-data-pool-name
- With the name of the CephFS data pool we created during RHCS deployment for OpenShift Container Platform. The default pool is called cephfs.data.

--cephfs-metadata-pool-name
- With the name of the CephFS metadata pool we created during RHCS deployment for OpenShift Container Platform. The default pool is called cephfs.meta.
Run the following command on the bootstrap node ceph1 to get the IP for the RGW endpoints in datacenter1 and datacenter2:

ceph orch ps | grep rgw.objectgw

Example output:

rgw.objectgw.ceph3.mecpzm  ceph3  *:8080  running (5d)  31s ago  7w  204M  -  16.2.7-112.el8cp
rgw.objectgw.ceph6.mecpzm  ceph6  *:8080  running (5d)  31s ago  7w  204M  -  16.2.7-112.el8cp

host ceph3.example.com
host ceph6.example.com

Example output:

ceph3.example.com has address 10.0.40.24
ceph6.example.com has address 10.0.40.66
Run ceph-external-cluster-details-exporter.py with the parameters that are configured for the first OpenShift Container Platform managed cluster cluster1 on the bootstrap node ceph1.

python3 ceph-external-cluster-details-exporter.py --rbd-data-pool-name rbdpool --cephfs-filesystem-name cephfs --cephfs-data-pool-name cephfs.cephfs.data --cephfs-metadata-pool-name cephfs.cephfs.meta --rgw-endpoint XXX.XXX.XXX.XXX:8080 --run-as-user client.odf.cluster1 > ocp-cluster1.json

Note: Modify the --rgw-endpoint value XXX.XXX.XXX.XXX according to your environment.
Run ceph-external-cluster-details-exporter.py with the parameters that are configured for the second OpenShift Container Platform managed cluster cluster2 on the bootstrap node ceph1.

python3 ceph-external-cluster-details-exporter.py --rbd-data-pool-name rbdpool --cephfs-filesystem-name cephfs --cephfs-data-pool-name cephfs.cephfs.data --cephfs-metadata-pool-name cephfs.cephfs.meta --rgw-endpoint XXX.XXX.XXX.XXX:8080 --run-as-user client.odf.cluster2 > ocp-cluster2.json

Note: Modify the --rgw-endpoint value XXX.XXX.XXX.XXX according to your environment.
- Save the two files generated in the bootstrap cluster (ceph1), ocp-cluster1.json and ocp-cluster2.json, to your local machine.
- Use the contents of the file ocp-cluster1.json on the OpenShift Container Platform console on cluster1 where external OpenShift Data Foundation is being deployed.
- Use the contents of the file ocp-cluster2.json on the OpenShift Container Platform console on cluster2 where external OpenShift Data Foundation is being deployed.
- Review the settings and then select Create StorageSystem.
Validate the successful deployment of OpenShift Data Foundation on each managed cluster with the following command:
$ oc get storagecluster -n openshift-storage ocs-external-storagecluster -o jsonpath='{.status.phase}{"\n"}'
For the Multicloud Gateway (MCG):
$ oc get noobaa -n openshift-storage noobaa -o jsonpath='{.status.phase}{"\n"}'
Wait for the status result to be Ready for both queries on the Primary managed cluster and the Secondary managed cluster.
- On the OpenShift Web Console, navigate to Installed Operators → OpenShift Data Foundation → Storage System → ocs-external-storagecluster-storagesystem → Resources. Verify that the Status of StorageCluster is Ready and has a green tick mark next to it.
Enable read affinity for RBD and CephFS volumes to be served from the nearest datacenter.
On the Primary managed cluster, label all the nodes.
$ oc label nodes --all metro-dr.openshift-storage.topology.io/datacenter=DC1
Execute the following commands to enable read affinity:
$ oc patch storageclusters.ocs.openshift.io -n openshift-storage ocs-external-storagecluster -p '{"spec":{"csi":{"readAffinity":{"enabled":true,"crushLocationLabels":["metro-dr.openshift-storage.topology.io/datacenter"]}}}}' --type=merge
$ oc delete po -n openshift-storage -l 'app in (csi-cephfsplugin,csi-rbdplugin)'
On the Secondary managed cluster, label all the nodes:
$ oc label nodes --all metro-dr.openshift-storage.topology.io/datacenter=DC2
Execute the following commands to enable read affinity:
$ oc patch storageclusters.ocs.openshift.io -n openshift-storage ocs-external-storagecluster -p '{"spec":{"csi":{"readAffinity":{"enabled":true,"crushLocationLabels":["metro-dr.openshift-storage.topology.io/datacenter"]}}}}' --type=merge
$ oc delete po -n openshift-storage -l 'app in (csi-cephfsplugin,csi-rbdplugin)'
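To confirm that the patch was applied on each managed cluster, you can read the readAffinity block back from the StorageCluster resource; this is an illustrative check, not a required step:

$ oc get storagecluster ocs-external-storagecluster -n openshift-storage -o jsonpath='{.spec.csi.readAffinity}{"\n"}'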
3.7. Installing OpenShift Data Foundation Multicluster Orchestrator operator
OpenShift Data Foundation Multicluster Orchestrator is a controller that is installed from OpenShift Container Platform’s OperatorHub on the Hub cluster.
Procedure
- On the Hub cluster, navigate to OperatorHub and use the keyword filter to search for ODF Multicluster Orchestrator.
- Click ODF Multicluster Orchestrator tile.
Keep all default settings and click Install.
Ensure that the operator resources are installed in the openshift-operators project and available to all namespaces.

Note: The ODF Multicluster Orchestrator also installs the OpenShift DR Hub Operator on the RHACM hub cluster as a dependency.

Verify that the operator Pods are in a Running state. The OpenShift DR Hub operator is also installed at the same time in the openshift-operators namespace.

$ oc get pods -n openshift-operators
Example output:
NAME                                        READY   STATUS    RESTARTS   AGE
odf-multicluster-console-6845b795b9-blxrn   1/1     Running   0          4d20h
odfmo-controller-manager-f9d9dfb59-jbrsd    1/1     Running   0          4d20h
ramen-hub-operator-6fb887f885-fss4w         2/2     Running   0          4d20h
3.8. Configuring SSL access across clusters
Configure network (SSL) access between the primary and secondary clusters so that metadata can be stored on the alternate cluster in a Multicloud Gateway (MCG) object bucket using a secure transport protocol and in the Hub cluster for verifying access to the object buckets.
If all of your OpenShift clusters are deployed using a signed and valid set of certificates for your environment then this section can be skipped.
Procedure
Extract the ingress certificate for the Primary managed cluster and save the output to primary.crt.

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > primary.crt
Extract the ingress certificate for the Secondary managed cluster and save the output to secondary.crt.

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > secondary.crt
Create a new ConfigMap file to hold the remote cluster's certificate bundle with filename cm-clusters-crt.yaml.

Note: There could be more or less than three certificates for each cluster, as shown in this example file. Also, ensure that the certificate contents are correctly indented after you copy and paste from the primary.crt and secondary.crt files that were created before.

apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    <copy contents of cert1 from primary.crt here>
    -----END CERTIFICATE-----

    -----BEGIN CERTIFICATE-----
    <copy contents of cert2 from primary.crt here>
    -----END CERTIFICATE-----

    -----BEGIN CERTIFICATE-----
    <copy contents of cert3 from primary.crt here>
    -----END CERTIFICATE-----

    -----BEGIN CERTIFICATE-----
    <copy contents of cert1 from secondary.crt here>
    -----END CERTIFICATE-----

    -----BEGIN CERTIFICATE-----
    <copy contents of cert2 from secondary.crt here>
    -----END CERTIFICATE-----

    -----BEGIN CERTIFICATE-----
    <copy contents of cert3 from secondary.crt here>
    -----END CERTIFICATE-----
kind: ConfigMap
metadata:
  name: user-ca-bundle
  namespace: openshift-config
Create the ConfigMap on the Primary managed cluster, Secondary managed cluster, and the Hub cluster.
$ oc create -f cm-clusters-crt.yaml
Example output:
configmap/user-ca-bundle created
Patch default proxy resource on the Primary managed cluster, Secondary managed cluster, and the Hub cluster.
$ oc patch proxy cluster --type=merge --patch='{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
Example output:
proxy.config.openshift.io/cluster patched
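You can optionally confirm on each cluster that the proxy now references the new bundle; the expected value is the name of the ConfigMap created above:

$ oc get proxy cluster -o jsonpath='{.spec.trustedCA.name}{"\n"}'
user-ca-bundle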
3.9. Creating Disaster Recovery Policy on Hub cluster
The OpenShift Disaster Recovery Policy (DRPolicy) resource specifies the OpenShift Container Platform clusters participating in the disaster recovery solution and the desired replication interval. DRPolicy is a cluster scoped resource that users can apply to applications that require a disaster recovery solution.
The ODF MultiCluster Orchestrator Operator facilitates the creation of each DRPolicy and the corresponding DRClusters through the Multicluster Web console.
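For reference, the DRPolicy that the console generates for a Metro-DR (synchronous) relationship is a small cluster-scoped resource similar to the following sketch; the policy and cluster names are examples, and exact fields can vary between versions:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: ocp4perf1-ocp4perf2        # example policy name
spec:
  drClusters:                      # the two managed clusters in the DR relationship
  - ocp4perf1
  - ocp4perf2
  schedulingInterval: 0m           # 0m indicates synchronous (Metro-DR) replication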
Prerequisites
- Ensure that there is a minimum set of two managed clusters.
Procedure
- On the OpenShift console, navigate to All Clusters → Data Services → Disaster recovery.
- On the Overview tab, click Create a disaster recovery policy or navigate to the Policies tab and click Create DRPolicy.
- Enter the Policy name. Ensure that each DRPolicy has a unique name (for example: ocp4perf1-ocp4perf2).
- Select two clusters from the list of managed clusters with which this new policy will be associated.
- Replication policy is automatically set to sync based on the OpenShift clusters selected.
- Click Create.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with your unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'
Example output:
Succeeded
When a DRPolicy is created, along with it, two DRCluster resources are also created. It could take up to 10 minutes for all three resources to be validated and for the status to show as Succeeded.

Note: Editing of SchedulingInterval, ReplicationClassSelector, VolumeSnapshotClassSelector and DRClusters field values is not supported in the DRPolicy.

Verify the object bucket access from the Hub cluster to both the Primary managed cluster and the Secondary managed cluster.
Get the names of the DRClusters on the Hub cluster.
$ oc get drclusters
Example output:
NAME        AGE
ocp4perf1   4m42s
ocp4perf2   4m42s
Check S3 access to each bucket created on each managed cluster. Use the DRCluster validation command, where <drcluster_name> is replaced with your unique name.
Note: Editing of Region and S3ProfileName field values is not supported in DRClusters.

$ oc get drcluster <drcluster_name> -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Example output:
Succeeded
Note: Make sure to run commands for both DRClusters on the Hub cluster.
Verify that the OpenShift DR Cluster operator installation was successful on the Primary managed cluster and the Secondary managed cluster.
$ oc get csv,pod -n openshift-dr-system
Example output:
NAME                                                                       DISPLAY                         VERSION   REPLACES   PHASE
clusterserviceversion.operators.coreos.com/odr-cluster-operator.v4.15.0   Openshift DR Cluster Operator   4.15.0               Succeeded
clusterserviceversion.operators.coreos.com/volsync-product.v0.8.0         VolSync                         0.8.0                Succeeded

NAME                                             READY   STATUS    RESTARTS   AGE
pod/ramen-dr-cluster-operator-6467cf5d4c-cc8kz   2/2     Running   0          3d12h
You can also verify that OpenShift DR Cluster Operator is installed successfully on the OperatorHub of each managed cluster.

Verify that the secret is propagated correctly on the Primary managed cluster and the Secondary managed cluster.

oc get secrets -n openshift-dr-system | grep Opaque
Match the output with the s3SecretRef from the Hub cluster:
oc get cm -n openshift-operators ramen-hub-operator-config -oyaml
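A convenient way to narrow the hub-side output down to the values that must match is to filter the ConfigMap for the s3SecretRef entries; this is only an illustrative filter on the command shown above:

$ oc get cm -n openshift-operators ramen-hub-operator-config -o yaml | grep -B2 -A4 s3SecretRef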
3.10. Configure DRClusters for fencing automation
This configuration is required for enabling fencing prior to application failover. In order to prevent writes to the persistent volume from the cluster which is hit by a disaster, OpenShift DR instructs Red Hat Ceph Storage (RHCS) to fence the nodes of the cluster from the RHCS external storage. This section guides you on how to add the IPs or the IP Ranges for the nodes of the DRCluster.
3.10.1. Add node IP addresses to DRClusters
Find the IP addresses for all of the OpenShift nodes in the managed clusters by running this command in the Primary managed cluster and the Secondary managed cluster.
$ oc get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
Example output:
10.70.56.118
10.70.56.193
10.70.56.154
10.70.56.242
10.70.56.136
10.70.56.99
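If your nodes do not publish an ExternalIP address, which is common on bare-metal or on-premises clusters, a similar query against the InternalIP addresses can be used instead; pick whichever address type the Red Hat Ceph Storage cluster actually sees traffic from:

$ oc get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'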
Once you have the IP addresses, the DRCluster resources can be modified for each managed cluster.

Find the DRCluster names on the Hub Cluster.
$ oc get drcluster
Example output:
NAME        AGE
ocp4perf1   5m35s
ocp4perf2   5m35s
Edit each DRCluster to add your unique IP addresses after replacing <drcluster_name> with your unique name.

$ oc edit drcluster <drcluster_name>

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  s3ProfileName: s3profile-<drcluster_name>-ocs-external-storagecluster
  ## Add this section
  cidrs:
    - <IP_Address1>/32
    - <IP_Address2>/32
    - <IP_Address3>/32
    - <IP_Address4>/32
    - <IP_Address5>/32
    - <IP_Address6>/32
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
There could be more than six IP addresses.

Modify this DRCluster configuration also for the IP addresses on the Secondary managed cluster in the peer DRCluster resource (for example, ocp4perf2).
3.10.2. Add fencing annotations to DRClusters
Add the following annotations to all the DRCluster resources. These annotations include details needed for the NetworkFence resource created later in these instructions (prior to testing application failover).
Replace <drcluster_name> with your unique name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  ## Add this section
  annotations:
    drcluster.ramendr.openshift.io/storage-clusterid: openshift-storage
    drcluster.ramendr.openshift.io/storage-driver: openshift-storage.rbd.csi.ceph.com
    drcluster.ramendr.openshift.io/storage-secret-name: rook-csi-rbd-provisioner
    drcluster.ramendr.openshift.io/storage-secret-namespace: openshift-storage
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Make sure to add these annotations for both DRCluster resources (for example: ocp4perf1 and ocp4perf2).
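Optionally, verify that the annotations are present on each DRCluster resource before moving on; this illustrative check simply prints the annotations map:

$ oc get drcluster <drcluster_name> -o jsonpath='{.metadata.annotations}{"\n"}'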
3.11. Create sample application for testing disaster recovery solution
OpenShift Data Foundation disaster recovery (DR) solution supports disaster recovery for Subscription-based and ApplicationSet-based applications that are managed by RHACM. For more details, see Subscriptions and ApplicationSet documentation.
The following sections detail how to create an application and apply a DRPolicy to an application.
Subscription-based applications
OpenShift users that do not have cluster-admin permissions should see the knowledgebase article on how to assign necessary permissions to an application user for executing disaster recovery actions.
ApplicationSet-based applications
OpenShift users that do not have cluster-admin permissions cannot create ApplicationSet-based applications.
3.11.1. Subscription-based applications
3.11.1.1. Creating a sample Subscription-based application
In order to test failover from the Primary managed cluster to the Secondary managed cluster and relocate, we need a sample application.
Prerequisites
- When creating an application for general consumption, ensure that the application is deployed to ONLY one cluster.
- Use the sample application called busybox as an example.
- Ensure all external routes of the application are configured using either Global Traffic Manager (GTM) or Global Server Load Balancing (GLSB) service for traffic redirection when the application fails over or is relocated.
As a best practice, group Red Hat Advanced Cluster Management (RHACM) subscriptions that belong together so that they refer to a single Placement Rule to DR protect them as a group. Further, create them as a single application for a logical grouping of the subscriptions for future DR actions like failover and relocate.
Note: If unrelated subscriptions refer to the same Placement Rule for placement actions, they are also DR protected as the DR workflow controls all subscriptions that reference the Placement Rule.
Procedure
- On the Hub cluster, navigate to Applications and click Create application.
- Select type as Subscription.
- Enter your application Name (for example, busybox) and Namespace (for example, busybox-sample).
- In the Repository location for resources section, select Repository type Git. Enter the Git repository URL for the sample application, the github Branch and Path where the resources busybox Pod and PVC will be created.
Use the sample application repository as https://github.com/red-hat-storage/ocm-ramen-samples where the Branch is release-4.16 and the Path is busybox-odr-metro.
- Scroll down in the form until you see Deploy application resources on clusters with all specified labels.
- Select the global Cluster sets or the one that includes the correct managed clusters for your environment.
- Add a label <name> with its value set to the managed cluster name.
- Click Create which is at the top right hand corner.
On the follow-on screen go to the Topology tab. You should see that there are all Green checkmarks on the application topology.
Note: To get more information, click on any of the topology elements and a window will appear on the right of the topology view.
Validating the sample application deployment.
Now that the busybox application has been deployed to your preferred cluster, the deployment can be validated.
Log in to your managed cluster where busybox was deployed by RHACM.

$ oc get pods,pvc -n busybox-sample

Example output:

NAME                          READY   STATUS    RESTARTS   AGE
pod/busybox-67bf494b9-zl5tr   1/1     Running   0          77s

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-c732e5fe-daaf-4c4d-99dd-462e04c18412   5Gi        RWO            ocs-storagecluster-ceph-rbd   77s
3.11.1.2. Apply Data policy to sample application
Prerequisites
- Ensure that both managed clusters referenced in the Data policy are reachable. If not, the application will not be protected for disaster recovery until both clusters are online.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications.
- Click the Actions menu at the end of the application to view the list of available actions.
- Click Manage data policy → Assign data policy.
- Select Policy and click Next.
- Select an Application resource and then use the PVC label selector to select the PVC label for the selected application resource.
Note: You can select more than one PVC label for the selected application resources. You can also use the Add application resource option to add multiple resources.
- After adding all the application resources, click Next.
- Review the Policy configuration details and click Assign. The newly assigned Data policy is displayed on the Manage data policy modal list view.
Verify that you can view the assigned policy details on the Applications page.
- On the Applications page, navigate to the Data policy column and click the policy link to expand the view.
- Verify that you can see the number of policies assigned along with failover and relocate status.
- Click View more details to view the status of ongoing activities with the policy in use with the application.
- After you apply DRPolicy to the applications, confirm whether ClusterDataProtected is set to True in the drpc yaml output.
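A minimal CLI way to check this condition, assuming the sample application namespace busybox-sample used earlier, is to filter the DRPlacementControl status conditions; the jsonpath expression below is illustrative:

$ oc get drpc -n busybox-sample -o jsonpath='{range .items[*]}{.metadata.name}{": ClusterDataProtected="}{.status.conditions[?(@.type=="ClusterDataProtected")].status}{"\n"}{end}'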
3.11.2. ApplicationSet-based applications
3.11.2.1. Creating ApplicationSet-based applications
Prerequisite
- Ensure that the Red Hat OpenShift GitOps operator is installed on all three clusters: Hub cluster, Primary managed cluster and Secondary managed cluster. For instructions, see Installing Red Hat OpenShift GitOps Operator in web console.
- On the Hub cluster, ensure that both Primary and Secondary managed clusters are registered to GitOps. For registration instructions, see Registering managed clusters to GitOps. Then check whether the Placement used by the GitOpsCluster resource to register both managed clusters has the tolerations to deal with cluster unavailability. You can verify whether the following tolerations are added to the Placement using the command oc get placement <placement-name> -n openshift-gitops -o yaml.

tolerations:
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
- key: cluster.open-cluster-management.io/unavailable
  operator: Exists

In case the tolerations are not added, see Configuring application placement tolerations for Red Hat Advanced Cluster Management and OpenShift GitOps.
- Ensure that you have created the ClusterRoleBinding yaml on both the Primary and Secondary managed clusters. For instructions, see the Prerequisites chapter in the RHACM documentation.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications and click Create application.
- Choose the application type as Argo CD ApplicationSet - Pull model.
- In the General step, enter your Application set name.
- Select Argo server openshift-gitops and Requeue time as 180 seconds.
- Click Next.
- In the Repository location for resources section, select Repository type Git. Enter the Git repository URL for the sample application, the github Branch and Path where the resources busybox Pod and PVC will be created.
  - Use the sample application repository as https://github.com/red-hat-storage/ocm-ramen-samples
  - Select Revision as release-4.16
  - Choose Path as busybox-odr-metro.
- Enter the Remote namespace value (for example, busybox-sample) and click Next.
- Choose the Sync policy settings as per your requirement or go with the default selections, and then click Next.
You can choose one or more options.
- In Label expressions, add a label <name> with its value set to the managed cluster name.
- Click Next.
- Review the setting details and click Submit.
3.11.2.2. Apply Data policy to sample ApplicationSet-based application
Prerequisites
- Ensure that both managed clusters referenced in the Data policy are reachable. If not, the application will not be protected for disaster recovery until both clusters are online.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications.
- Click the Actions menu at the end of the application to view the list of available actions.
- Click Manage data policy → Assign data policy.
- Select Policy and click Next.
- Select an Application resource and then use the PVC label selector to select the PVC label for the selected application resource.
Note: You can select more than one PVC label for the selected application resources.
- After adding all the application resources, click Next.
- Review the Policy configuration details and click Assign. The newly assigned Data policy is displayed on the Manage data policy modal list view.
Verify that you can view the assigned policy details on the Applications page.
- On the Applications page, navigate to the Data policy column and click the policy link to expand the view.
- Verify that you can see the number of policies assigned along with failover and relocate status.
- After you apply DRPolicy to the applications, confirm whether ClusterDataProtected is set to True in the drpc yaml output.
3.11.3. Deleting sample application
This section provides instructions for deleting the sample application busybox
using the RHACM console.
When deleting a DR protected application, access to both clusters that belong to the DRPolicy is required. This is to ensure that all protected API resources and resources in the respective S3 stores are cleaned up as part of removing the DR protection. If access to one of the clusters is not healthy, deleting the DRPlacementControl
resource for the application, on the hub, would remain in the Deleting state.
Prerequisites
- These instructions to delete the sample application should not be executed until the failover and relocate testing is completed and the application is ready to be removed from RHACM and the managed clusters.
Procedure
- On the RHACM console, navigate to Applications.
- Search for the sample application to be deleted (for example, busybox).
- Click the Action Menu (⋮) next to the application you want to delete.
- Click Delete application. When Delete application is selected, a new screen appears asking if the application related resources should also be deleted.
- Select the Remove application related resources checkbox to delete the Subscription and PlacementRule.
- Click Delete. This deletes the busybox application on the Primary managed cluster (or whichever cluster the application was running on).
- In addition to the resources deleted using the RHACM console, delete the DRPlacementControl if it is not auto-deleted after deleting the busybox application.
  - Log in to the OpenShift Web console for the Hub cluster and navigate to Installed Operators for the project busybox-sample. For ApplicationSet applications, select the project openshift-gitops.
  - Click OpenShift DR Hub Operator and then click the DRPlacementControl tab.
  - Click the Action Menu (⋮) next to the busybox application DRPlacementControl that you want to delete.
  - Click Delete DRPlacementControl.
  - Click Delete.
This process can be used to delete any application with a DRPlacementControl resource. If you prefer the CLI, see the sketch below.
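The following is a minimal sketch of the same DRPlacementControl cleanup from the Hub cluster. The namespace is an assumption: Subscription-based DRPCs live in the application namespace (for example, busybox-sample), while ApplicationSet-based DRPCs live in openshift-gitops.
$ oc get drpc -n busybox-sample
$ oc delete drpc <drpc_name> -n busybox-sample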
3.12. Subscription-based application failover between managed clusters
Perform a failover when a managed cluster becomes unavailable for any reason. This failover method is application-based.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
- When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.
  - Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
  - Check the status of both the managed clusters individually before performing the failover operation.
  However, the failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Procedure
Enable fencing on the Hub cluster.
Open a CLI terminal and edit the DRCluster resource, where <drcluster_name> is your unique name.
Caution: Once the managed cluster is fenced, all communication from applications to the OpenShift Data Foundation external storage cluster will fail and some Pods will be in an unhealthy state (for example: CreateContainerError, CrashLoopBackOff) on the cluster that is now fenced.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Add this line
  clusterFence: Fenced
  cidrs: [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Verify the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique identifier.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Fenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are now in the blocklist.
$ ceph osd blocklist ls
Example output
cidr:10.1.161.1:0/32 2028-10-30T22:30:03.585634+0000
cidr:10.1.161.14:0/32 2028-10-30T22:30:02.483561+0000
cidr:10.1.161.51:0/32 2028-10-30T22:30:01.272267+0000
cidr:10.1.161.63:0/32 2028-10-30T22:30:05.099655+0000
cidr:10.1.161.129:0/32 2028-10-30T22:29:58.335390+0000
cidr:10.1.161.130:0/32 2028-10-30T22:29:59.861518+0000
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of the application row to view the list of available actions.
- Click Failover application.
- After the Failover application modal is shown, select the policy and the target cluster to which the associated application will fail over in case of a disaster.
- Click the Select subscription group dropdown to verify the default selection or modify this setting. By default, the subscription group that replicates the application resources is selected.
- Check the status of the Failover readiness.
  - If the status is Ready with a green tick, it indicates that the target cluster is ready for the failover to start. Proceed to step 7.
  - If the status is Unknown or Not ready, then wait until the status changes to Ready.
- Click Initiate. The busybox application is now failing over to the Secondary managed cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link. You can also confirm the state from the CLI, as shown in the sketch below.
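As an alternative to the console view, the following is a minimal sketch of a CLI check run on the Hub cluster. The wide output of the DRPlacementControl resources shows the PROGRESSION of each application and, depending on version, its current state.
$ oc get drpc -A -o wide
The failed over application should report FailedOver once the failover completes.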
3.13. ApplicationSet-based application failover between managed clusters
Perform a failover when a managed cluster becomes unavailable for any reason. This failover method is application-based.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
- When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.
  - Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
  - Check the status of both the managed clusters individually before performing the failover operation.
  However, the failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Procedure
Enable fencing on the Hub cluster.
Open a CLI terminal and edit the DRCluster resource, where <drcluster_name> is your unique name.
Caution: Once the managed cluster is fenced, all communication from applications to the OpenShift Data Foundation external storage cluster will fail and some Pods will be in an unhealthy state (for example: CreateContainerError, CrashLoopBackOff) on the cluster that is now fenced.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Add this line
  clusterFence: Fenced
  cidrs: [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Verify the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique identifier.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Fenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are now in the blocklist.
$ ceph osd blocklist ls
Example output
cidr:10.1.161.1:0/32 2028-10-30T22:30:03.585634+0000
cidr:10.1.161.14:0/32 2028-10-30T22:30:02.483561+0000
cidr:10.1.161.51:0/32 2028-10-30T22:30:01.272267+0000
cidr:10.1.161.63:0/32 2028-10-30T22:30:05.099655+0000
cidr:10.1.161.129:0/32 2028-10-30T22:29:58.335390+0000
cidr:10.1.161.130:0/32 2028-10-30T22:29:59.861518+0000
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of the application row to view the list of available actions.
- Click Failover application.
- When the Failover application modal is shown, verify that the details presented are correct and check the status of the Failover readiness. If the status is Ready with a green tick, it indicates that the target cluster is ready for the failover to start.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the ongoing activities associated with the policy in use with the application.
3.14. Relocating Subscription-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
- When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
  - Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
  - Check the status of both the managed clusters individually before performing the relocate operation.
- Verify that applications were cleaned up from the cluster before unfencing it.
Procedure
Disable fencing on the Hub cluster.
Edit the DRCluster resource for this cluster, replacing <drcluster_name> with a unique name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  cidrs: [...]
  ## Modify this line
  clusterFence: Unfenced
  [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Gracefully reboot the OpenShift Container Platform nodes that were Fenced. A reboot is required to resume the I/O operations after unfencing to avoid any further recovery orchestration failures. Reboot all nodes of the cluster by following the steps in the procedure Rebooting a node gracefully.
Note: Make sure that all the nodes are initially cordoned and drained before you reboot them, and then uncordon the nodes after the reboot. A minimal sketch of this cycle is shown below.
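The linked procedure has the authoritative steps; the following is only a sketch of the cordon, drain, reboot, and uncordon cycle for a single node, with an illustrative node name placeholder (repeat for each node in the cluster):
$ oc adm cordon <node_name>
$ oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data --force
# Reboot the node, wait for it to return to the Ready status, then:
$ oc adm uncordon <node_name>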
After all OpenShift nodes are rebooted and are in a Ready status, verify that all Pods are in a healthy state by running this command on the Primary managed cluster (or whichever cluster has been Unfenced).
$ oc get pods -A | egrep -v 'Running|Completed'
Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
The output for this query should be zero Pods before proceeding to the next step.
Important: If there are Pods still in an unhealthy status because of severed storage communication, troubleshoot and resolve before continuing. Because the storage cluster is external to OpenShift, it also has to be properly recovered after a site outage for OpenShift applications to be healthy.
Alternatively, you can use the OpenShift Web Console dashboards and Overview tab to assess the health of applications and the external ODF storage cluster. The detailed OpenShift Data Foundation dashboard is found by navigating to Storage → Data Foundation.
Verify that the Unfenced cluster is in a healthy state. Validate the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique name.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Unfenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are NOT in the blocklist.
$ ceph osd blocklist ls
Ensure that you do not see the IPs added during fencing.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of the application row to view the list of available actions.
- Click Relocate application.
- When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate in case of a disaster.
- By default, the subscription group that will deploy the application resources is selected. Click the Select subscription group dropdown to verify the default selection or modify this setting.
Check the status of the Relocation readiness.
  - If the status is Ready with a green tick, it indicates that the target cluster is ready for the relocation to start. Proceed to step 7.
  - If the status is Unknown or Not ready, then wait until the status changes to Ready.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link. You can also confirm the state from the CLI, as shown in the sketch below.
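A CLI alternative for the same check, as a sketch, assuming the DRPlacementControl publishes its state in .status.phase (replace the name and namespace with those of your application):
$ oc get drpc <drpc_name> -n <application_namespace> -o jsonpath='{.status.phase}{"\n"}'
The expected value after a successful relocation is Relocated.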
3.15. Relocating an ApplicationSet-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
- When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
  - Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
  - Check the status of both the managed clusters individually before performing the relocate operation.
- Verify that applications were cleaned up from the cluster before unfencing it.
Procedure
Disable fencing on the Hub cluster.
Edit the DRCluster resource for this cluster, replacing <drcluster_name> with a unique name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  cidrs: [...]
  ## Modify this line
  clusterFence: Unfenced
  [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Gracefully reboot the OpenShift Container Platform nodes that were Fenced. A reboot is required to resume the I/O operations after unfencing to avoid any further recovery orchestration failures. Reboot all nodes of the cluster by following the steps in the procedure Rebooting a node gracefully.
Note: Make sure that all the nodes are initially cordoned and drained before you reboot them, and then uncordon the nodes after the reboot.
After all OpenShift nodes are rebooted and are in a Ready status, verify that all Pods are in a healthy state by running this command on the Primary managed cluster (or whichever cluster has been Unfenced).
$ oc get pods -A | egrep -v 'Running|Completed'
Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
The output for this query should be zero Pods before proceeding to the next step.
Important: If there are Pods still in an unhealthy status because of severed storage communication, troubleshoot and resolve before continuing. Because the storage cluster is external to OpenShift, it also has to be properly recovered after a site outage for OpenShift applications to be healthy.
Alternatively, you can use the OpenShift Web Console dashboards and Overview tab to assess the health of applications and the external ODF storage cluster. The detailed OpenShift Data Foundation dashboard is found by navigating to Storage → Data Foundation.
Verify that the Unfenced cluster is in a healthy state. Validate the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique name.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Unfenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are NOT in the blocklist.
$ ceph osd blocklist ls
Ensure that you do not see the IPs added during fencing.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of the application row to view the list of available actions.
- Click Relocate application.
- When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate in case of a disaster.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the relocation status associated with the policy in use with the application.
3.16. Disaster recovery protection for discovered applications
Red Hat OpenShift Data Foundation now provides disaster recovery (DR) protection and support for workloads that are deployed in one of the managed clusters directly without using Red Hat Advanced Cluster Management (RHACM). These workloads are called discovered applications.
The workloads that are deployed using RHACM are now called managed applications. When a workload is deployed directly on one of the managed clusters without using RHACM, then those workloads are called discovered applications. Though these workload details can be seen on the RHACM console, the application lifecycle (create, delete, edit) is not managed by RHACM.
3.16.1. Prerequisites for disaster recovery protection of discovered applications
This section provides instructions to guide you through the prerequisites for protecting discovered applications. This includes tasks such as assigning a data policy and initiating DR actions such as failover and relocate.
- Ensure that all the DR configurations have been installed on the Primary managed cluster and the Secondary managed cluster.
Install the OADP 1.4 operator.
Note: Any version before OADP 1.4 will not work for protecting discovered applications.
- On the Primary and Secondary managed clusters, navigate to OperatorHub and use the keyword filter to search for OADP.
- Click the OADP tile.
- Keep all default settings and click Install. Ensure that the operator resources are installed in the openshift-adp project.
Note: If OADP 1.4 is installed after the DR configuration has been completed, then the ramen-dr-cluster-operator pods on the Primary managed cluster and the Secondary managed cluster in the namespace openshift-dr-system must be restarted (deleted and recreated), as shown in the sketch below.
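A minimal sketch of that restart, run on each managed cluster. The pod is deleted by name so that its controller recreates it; the exact pod name carries a generated suffix, shown by the first command:
$ oc get pods -n openshift-dr-system
$ oc delete pod <ramen-dr-cluster-operator-pod-name> -n openshift-dr-system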
[Optional] Add CACertificates to the ramen-hub-operator-config ConfigMap.
Configure network (SSL) access between the primary and secondary clusters so that metadata can be stored on the alternate cluster in a Multicloud Gateway (MCG) object bucket using a secure transport protocol, and in the Hub cluster for verifying access to the object buckets.
Note: If all of your OpenShift clusters are deployed using a signed and valid set of certificates for your environment, then this section can be skipped.
If you are using self-signed certificates, then you have already created a ConfigMap named user-ca-bundle in the openshift-config namespace and added this ConfigMap to the default Proxy cluster resource.
Find the encoded value for the CACertificates.
$ oc get configmap user-ca-bundle -n openshift-config -o jsonpath="{['data']['ca-bundle\.crt']}" |base64 -w 0
Add this base64 encoded value to the configmap ramen-hub-operator-config on the Hub cluster. The example below shows where to add the CACertificates.
$ oc edit configmap ramen-hub-operator-config -n openshift-operators
[...]
    ramenOpsNamespace: openshift-dr-ops
    s3StoreProfiles:
    - s3Bucket: odrbucket-36bceb61c09c
      s3CompatibleEndpoint: https://s3-openshift-storage.apps.hyper3.vmw.ibmfusion.eu
      s3ProfileName: s3profile-hyper3-ocs-storagecluster
      s3Region: noobaa
      s3SecretRef:
        name: 60f2ea6069e168346d5ad0e0b5faa59bb74946f
      caCertificates: {input base64 encoded value here}
    - s3Bucket: odrbucket-36bceb61c09c
      s3CompatibleEndpoint: https://s3-openshift-storage.apps.hyper4.vmw.ibmfusion.eu
      s3ProfileName: s3profile-hyper4-ocs-storagecluster
      s3Region: noobaa
      s3SecretRef:
        name: cc237eba032ad5c422fb939684eb633822d7900
      caCertificates: {input base64 encoded value here}
[...]
Verify that there are DR secrets created in the OADP operator default namespace openshift-adp on the Primary managed cluster and the Secondary managed cluster. The DR secrets that were created when the first DRPolicy was created will be similar to the secrets below. The DR secret name is preceded with the letter v.
$ oc get secrets -n openshift-adp
NAME                                       TYPE     DATA   AGE
v60f2ea6069e168346d5ad0e0b5faa59bb74946f   Opaque   1      3d20h
vcc237eba032ad5c422fb939684eb633822d7900   Opaque   1      3d20h
[...]
Note: There will be one DR created secret for each managed cluster in the openshift-adp namespace.
Verify if the Data Protection Application (DPA) is already installed on each managed cluster in the OADP namespace openshift-adp. If it is not already created, then follow the next step to create this resource.
Create the DPA by copying the following YAML definition content to dpa.yaml.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  labels:
    app.kubernetes.io/component: velero
  name: velero
  namespace: openshift-adp
spec:
  backupImages: false
  configuration:
    nodeAgent:
      enable: false
      uploaderType: restic
    velero:
      defaultPlugins:
      - openshift
      - aws
      noDefaultBackupLocation: true
Create the DPA resource.
$ oc create -f dpa.yaml -n openshift-adp
dataprotectionapplication.oadp.openshift.io/velero created
Verify that the OADP resources are created and are in Running state.
$ oc get pods,dpa -n openshift-adp
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/openshift-adp-controller-manager-7b64b74fcd-msjbs   1/1     Running   0          5m30s
pod/velero-694b5b8f5c-b4kwg                             1/1     Running   0          3m31s

NAME                                                 AGE
dataprotectionapplication.oadp.openshift.io/velero   3m31s
3.16.2. Creating a sample discovered application
In order to test failover from the Primary managed cluster to the Secondary managed cluster and relocate for discovered applications, you need a sample application that is installed without using the RHACM create application capability.
Procedure
Log in to the Primary managed cluster and clone the sample application repository.
$ git clone https://github.com/red-hat-storage/ocm-ramen-samples.git
Verify that you are on the main branch.
$ cd ~/ocm-ramen-samples
$ git branch
* main
The correct directory should be used when creating the sample application based on your scenario, metro or regional.
Note: Only applications using CephRBD or block volumes are supported for discovered applications.
$ ls workloads/deployment | egrep -v 'cephfs|k8s|base'
odr-metro-rbd
odr-regional-rbd
Create a project named busybox-discovered on both the Primary and Secondary managed clusters.
$ oc new-project busybox-discovered
Create the busybox application on the Primary managed cluster. This sample application example is for Metro-DR using a block (Ceph RBD) volume.
$ oc apply -k workloads/deployment/odr-metro-rbd -n busybox-discovered
persistentvolumeclaim/busybox-pvc created
deployment.apps/busybox created
Note: The OpenShift Data Foundation Disaster Recovery solution now extends protection to discovered applications that span across multiple namespaces.
Verify that busybox is running in the correct project on the Primary managed cluster.
$ oc get pods,pvc,deployment -n busybox-discovered
NAME                           READY   STATUS    RESTARTS   AGE
pod/busybox-796fccbb95-qmxjf   1/1     Running   0          18s

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-b20e4129-902d-47c7-b962-040ad64130c4   1Gi        RWO            ocs-storagecluster-ceph-rbd   <unset>                 18s

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/busybox   1/1     1            1           18
3.16.3. Enrolling a sample discovered application for disaster recovery protection
This section guides you on how to apply an existing DR Policy to a discovered application from the Protected applications tab.
Prerequisites
- Ensure that Disaster Recovery has been configured and that at least one DR Policy has been created.
Procedure
- On the RHACM console, navigate to Disaster recovery → Protected applications tab.
- Click Enroll application to start configuring existing applications for DR protection.
- Select ACM discovered applications.
- In the Namespace page, choose the DR cluster, which is the name of the Primary managed cluster where busybox is installed.
- Select the namespace where the application is installed, for example, busybox-discovered.
  Note: If you have a workload spread across multiple namespaces, then you can select all of those namespaces to DR protect.
- Choose a unique Name, for example busybox-rbd, for the discovered application and click Next.
- In the Configuration page, the Resource label is used to protect your resources, where you can set which resources will be included in the kubernetes-object backup and which volume's persistent data will be replicated. Resource label is selected by default.
- Provide the Label expressions and PVC label selector. Choose the label appname=busybox for both the kubernetes-objects and the PVC(s).
- Click Next.
In the Replication page, select an existing DR Policy and the kubernetes-objects backup interval.
Note: It is recommended to choose the same duration for the PVC data replication and the kubernetes-object backup interval (for example, 5 minutes).
- Click Next.
Review the configuration and click Save.
Use the Back button to go back to the previous screens and correct any issues.
Verify that the Application volumes (PVCs) and the Kubernetes-objects backup have a Healthy status before proceeding to DR Failover and Relocate testing. You can view the status of your Discovered applications on the Protected applications tab.
To see the status of the DRPC, run the following command on the Hub cluster:
$ oc get drpc {drpc_name} -o wide -n openshift-dr-ops
The discovered applications store resources such as DRPlacementControl (DRPC) and Placement on the Hub cluster in a new namespace called openshift-dr-ops. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
To see the status of the VolumeReplicationGroup (VRG) for discovered applications, run the following command on the managed cluster where the busybox application was manually installed.
$ oc get vrg {vrg_name} -n openshift-dr-ops
The VRG resource is stored in the namespace openshift-dr-ops after a DR Policy is assigned to the discovered application. The VRG name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
3.16.4. Discovered application failover and relocate
A protected Discovered application can Failover or Relocate to its peer cluster similar to managed applications. However, there are some additional steps for discovered applications since RHACM does not manage the lifecycle of the application as it does for Managed applications.
This section guides you through the Failover and Relocate process for a protected discovered application.
Never initiate a Failover or Relocate of an application when one or both resource types are in a Warning or Critical status.
3.16.4.1. Failover disaster recovery protected discovered application
This section guides you on how to failover a discovered application which is disaster recovery protected.
Prerequisites
- Ensure that the application namespace is created in both managed clusters (for example, busybox-discovered), as shown in the sketch below.
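If the namespace is missing on either cluster, a minimal sketch of creating it, using the same command that was used when the sample application was first deployed:
$ oc new-project busybox-discovered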
Procedure
Enable fencing on the Hub cluster.
Open a CLI terminal and edit the DRCluster resource, where <drcluster_name> is your unique name.
Caution: Once the managed cluster is fenced, all communication from applications to the OpenShift Data Foundation external storage cluster will fail and some Pods will be in an unhealthy state (for example: CreateContainerError, CrashLoopBackOff) on the cluster that is now fenced.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Add this line
  clusterFence: Fenced
  cidrs: [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Verify the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique identifier.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Fenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are now in the blocklist.
$ ceph osd blocklist ls
Example output
cidr:10.1.161.1:0/32 2028-10-30T22:30:03.585634+0000
cidr:10.1.161.14:0/32 2028-10-30T22:30:02.483561+0000
cidr:10.1.161.51:0/32 2028-10-30T22:30:01.272267+0000
cidr:10.1.161.63:0/32 2028-10-30T22:30:05.099655+0000
cidr:10.1.161.129:0/32 2028-10-30T22:29:58.335390+0000
cidr:10.1.161.130:0/32 2028-10-30T22:29:59.861518+0000
- In the RHACM console, navigate to Disaster Recovery → Protected applications tab.
- At the end of the application row, click the Actions menu and choose to initiate Failover.
- In the Failover application modal window, review the status of the application and the target cluster.
- Click Initiate. Wait for the Failover process to complete.
Verify that the busybox application is running on the Secondary managed cluster.
$ oc get pods,pvc -n busybox-discovered
NAME                           READY   STATUS    RESTARTS   AGE
pod/busybox-796fccbb95-qmxjf   1/1     Running   0          2m46s

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-b20e4129-902d-47c7-b962-040ad64130c4   1Gi        RWO            ocs-storagecluster-ceph-rbd   <unset>                 2m57s
Check the progression status of the Failover until the result is WaitOnUserToCleanUp. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
$ oc get drpc {drpc_name} -n openshift-dr-ops -o jsonpath='{.status.progression}{"\n"}'
WaitOnUserToCleanUp
Remove the busybox application from the Primary managed cluster to complete the Failover process.
- Navigate to the Protected applications tab. You will see a message to remove the application.
- Navigate to the cloned repository for busybox and run the following commands on the Primary managed cluster where you failed over from. Use the same directory that was used to create the application (for example, odr-metro-rbd).
$ cd ~/ocm-ramen-samples/
$ git branch
* main
$ oc delete -k workloads/deployment/odr-metro-rbd -n busybox-discovered
persistentvolumeclaim "busybox-pvc" deleted
deployment.apps "busybox" deleted
- After deleting the application, navigate to the Protected applications tab and verify that the busybox resources are both in Healthy status.
3.16.4.2. Relocate disaster recovery protected discovered application
This section guides you on how to relocate a discovered application which is disaster recovery protected.
Procedure
Disable fencing on the Hub cluster.
Edit the DRCluster resource for this cluster, replacing <drcluster_name> with a unique name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  cidrs: [...]
  ## Modify this line
  clusterFence: Unfenced
  [...]
[...]
Example output:
drcluster.ramendr.openshift.io/ocp4perf1 edited
Gracefully reboot the OpenShift Container Platform nodes that were Fenced. A reboot is required to resume the I/O operations after unfencing to avoid any further recovery orchestration failures. Reboot all nodes of the cluster by following the steps in the procedure Rebooting a node gracefully.
Note: Make sure that all the nodes are initially cordoned and drained before you reboot them, and then uncordon the nodes after the reboot.
After all OpenShift nodes are rebooted and are in a Ready status, verify that all Pods are in a healthy state by running this command on the Primary managed cluster (or whichever cluster has been Unfenced).
$ oc get pods -A | egrep -v 'Running|Completed'
Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
The output for this query should be zero Pods before proceeding to the next step.
Important: If there are Pods still in an unhealthy status because of severed storage communication, troubleshoot and resolve before continuing. Because the storage cluster is external to OpenShift, it also has to be properly recovered after a site outage for OpenShift applications to be healthy.
Alternatively, you can use the OpenShift Web Console dashboards and Overview tab to assess the health of applications and the external ODF storage cluster. The detailed OpenShift Data Foundation dashboard is found by navigating to Storage → Data Foundation.
Verify that the Unfenced cluster is in a healthy state. Validate the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique name.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'
Example output:
Unfenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are NOT in the blocklist.
$ ceph osd blocklist ls
Ensure that you do not see the IPs added during fencing.
- In the RHACM console, navigate to Disaster Recovery → Protected applications tab.
- At the end of the application row, click the Actions menu and choose to initiate Relocate.
- In the Relocate application modal window, review the status of the application and the target cluster.
- Click Initiate.
Check the progression status of the Relocate until the result is WaitOnUserToCleanUp. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
$ oc get drpc {drpc_name} -n openshift-dr-ops -o jsonpath='{.status.progression}{"\n"}'
WaitOnUserToCleanUp
Remove the busybox application from the Secondary managed cluster so that the Relocate to the Primary managed cluster can complete.
Navigate to the cloned repository for busybox and run the following commands on the Secondary managed cluster where you relocated from. Use the same directory that was used to create the application (for example, odr-metro-rbd).
$ cd ~/ocm-ramen-samples/
$ git branch
* main
$ oc delete -k workloads/deployment/odr-metro-rbd -n busybox-discovered
persistentvolumeclaim "busybox-pvc" deleted
deployment.apps "busybox" deleted
- After deleting the application, navigate to the Protected applications tab and verify that the busybox resources are both in Healthy status.
Verify that the busybox application is running on the Primary managed cluster.
$ oc get pods,pvc -n busybox-discovered
NAME                           READY   STATUS    RESTARTS   AGE
pod/busybox-796fccbb95-qmxjf   1/1     Running   0          2m46s

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-b20e4129-902d-47c7-b962-040ad64130c4   1Gi        RWO            ocs-storagecluster-ceph-rbd   <unset>                 2m57s
3.16.5. Disable disaster recovery for protected applications
This section guides you to disable disaster recovery resources when you want to delete the protected applications or when the application no longer needs to be protected.
Procedure
- Login to the Hub cluster.
List the DRPlacementControl (DRPC) resources. Each DRPC resource was created when the application was assigned a DR policy.
$ oc get drpc -n openshift-dr-ops
Find the DRPC that has a name that includes the unique identifier that you chose when assigning a DR policy (for example, busybox-rbd) and delete the DRPC.
$ oc delete drpc {drpc_name} -n openshift-dr-ops
List the Placement resources. Each Placement resource was created when the application was assigned a DR policy.
$ oc get placements -n openshift-dr-ops
Find the Placement that has a name that includes the unique identifier that you chose when assigning a DR policy (for example, busybox-rbd-placement-1) and delete the Placement. After deleting both resources, you can confirm the cleanup as shown below.
$ oc delete placements {placement_name} -n openshift-dr-ops
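A quick confirmation sketch from the Hub cluster; neither query should list resources for the application whose protection you just removed:
$ oc get drpc -n openshift-dr-ops
$ oc get placements -n openshift-dr-ops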
3.17. Recovering to a replacement cluster with Metro-DR
When there is a failure with the primary cluster, you get the options to either repair, wait for the recovery of the existing cluster, or replace the cluster entirely if the cluster is irredeemable. This solution guides you when replacing a failed primary cluster with a new cluster and enables failback (relocate) to this new cluster.
In these instructions, we are assuming that a RHACM managed cluster must be replaced after the applications have been installed and protected. For the purposes of this section, the failed RHACM managed cluster is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.
Replacement cluster recovery for Discovered applications is currently not supported. Only Managed applications are supported.
Prerequisite
- Ensure that the Metro-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
- Ensure that the applications are assigned a Data policy which protects them against cluster failure.
Procedure
Perform the following steps on the Hub cluster:
Fence the replacement cluster by using the CLI terminal to edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Add or modify this line
  clusterFence: Fenced
  cidrs: [...]
[...]
- Using the RHACM console, navigate to Applications and failover all protected applications from the failed cluster to the surviving cluster.
Verify and ensure that all protected applications are now running on the surviving cluster.
Note: The PROGRESSION state for each application DRPlacementControl will show as Cleaning Up. This is expected if the replacement cluster is offline or down.
Unfence the replacement cluster.
Using the CLI terminal, edit the DRCluster resource, where <drcluster_name> is the replacement cluster name.
$ oc edit drcluster <drcluster_name>
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
[...]
spec:
  ## Modify this line
  clusterFence: Unfenced
  cidrs: [...]
[...]
Delete the DRCluster for the replacement cluster.
$ oc delete drcluster <drcluster_name> --wait=false
Note: Use --wait=false since the DRCluster will not be deleted until a later step.
Disable disaster recovery on the Hub cluster for each protected application on the surviving cluster.
For each application, edit the Placement and ensure that the surviving cluster is selected.
Note: For Subscription-based applications, the associated Placement can be found in the same namespace on the hub cluster as on the managed clusters. For ApplicationSet-based applications, the associated Placement can be found in the openshift-gitops namespace on the hub cluster.
$ oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
  [...]
spec:
  clusterSets:
  - submariner
  predicates:
  - requiredClusterSelector:
      claimSelector: {}
      labelSelector:
        matchExpressions:
        - key: name
          operator: In
          values:
          - cluster1 <-- Modify to be surviving cluster name
[...]
Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application's VolumeReplicationGroup.
$ oc get vrg -n <application_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
After the protected application Placement resources are all configured to use the surviving cluster and the replacement cluster s3Profile(s) are removed from the protected applications, all DRPlacementControl resources must be deleted from the Hub cluster.
$ oc delete drpc <drpc_name> -n <namespace>
Note: For Subscription-based applications, the associated DRPlacementControl can be found in the same namespace as the managed clusters on the hub cluster. For ApplicationSet-based applications, the associated DRPlacementControl can be found in the openshift-gitops namespace on the hub cluster.
Verify that all DRPlacementControl resources are deleted before proceeding to the next step. This command is a query across all namespaces. There should be no resources found.
$ oc get drpc -A
The last step is to edit each application's Placement and remove the annotation cluster.open-cluster-management.io/experimental-scheduling-disable: "true".
$ oc edit placement <placement_name> -n <namespace>
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  annotations:
    ## Remove this annotation
    cluster.open-cluster-management.io/experimental-scheduling-disable: "true"
  [...]
- Repeat the process detailed in the last step and the sub-steps for every protected application on the surviving cluster. Disabling DR for protected applications is now completed.
On the Hub cluster, run the following script to remove all disaster recovery configurations from the surviving cluster and the hub cluster.
#!/bin/bash
secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
echo $secrets
for secret in $secrets
do
    oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
done
mirrorpeers=$(oc get mirrorpeer -o name)
echo $mirrorpeers
for mp in $mirrorpeers
do
    oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $mp
done
drpolicies=$(oc get drpolicy -o name)
echo $drpolicies
for drp in $drpolicies
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done
drclusters=$(oc get drcluster -o name)
echo $drclusters
for drp in $drclusters
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done
oc delete project openshift-operators
managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
echo $managedclusters
for mc in $managedclusters
do
    secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        set -x
        oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete -n $mc secret/$secret
    done
done
oc delete clusterrolebinding spoke-clusterrole-bindings
Note: This script uses the command oc delete project openshift-operators to remove the Disaster Recovery (DR) operators in this namespace on the hub cluster. If there are other non-DR operators in this namespace, you must install them again from OperatorHub.
After the namespace openshift-operators is automatically created again, add the monitoring label back for collecting the disaster recovery metrics.
$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
On the surviving cluster, ensure that the object bucket created during the DR installation is deleted. Delete the object bucket if it was not removed by the script. The name of the object bucket used for DR starts with odrbucket. If it is still listed, delete it as shown in the sketch below.
$ oc get obc -n openshift-storage
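A minimal deletion sketch, assuming the ObjectBucketClaim name reported by the previous command (the placeholder is illustrative and the name starts with odrbucket):
$ oc delete obc <odrbucket_obc_name> -n openshift-storage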
- On the RHACM console, navigate to Infrastructure → Clusters view.
- Detach the replacement cluster.
- Create a new OpenShift cluster (recovery cluster) and import the new cluster into the RHACM console. For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster.
Install OpenShift Data Foundation operator on the recovery cluster and connect it to the same external Ceph storage system as the surviving cluster. For detailed instructions, refer to Deploying OpenShift Data Foundation in external mode.
Note: Ensure that the OpenShift Data Foundation version is 4.15 (or greater) and that the same version of OpenShift Data Foundation is on the surviving cluster.
- On the hub cluster, install the ODF Multicluster Orchestrator operator from OperatorHub. For instructions, see chapter on Installing OpenShift Data Foundation Multicluster Orchestrator operator.
- Using the RHACM console, navigate to Data Services → Data policies.
- Select Create DRPolicy and name your policy.
- Select the recovery cluster and the surviving cluster.
- Create the policy. For instructions see chapter on Creating Disaster Recovery Policy on Hub cluster.
Proceed to the next step only after the status of the DRPolicy changes to Validated.
- Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
- Relocate the newly protected applications on the surviving cluster back to the new recovery (primary) cluster. Using the RHACM console, navigate to the Applications menu to perform the relocation.
3.18. Hub recovery using Red Hat Advanced Cluster Management [Technology preview]
When your setup has active and passive Red Hat Advanced Cluster Management for Kubernetes (RHACM) hub clusters, and in case where the active hub is down, you can use the passive hub to failover or relocate the disaster recovery protected workloads.
Hub recovery for Metro-DR is a Technology Preview feature and is subject to Technology Preview support limitations. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information, see Technology Preview Features Support Scope.
3.18.1. Configuring passive hub cluster
To perform hub recovery in case the active hub is down or unreachable, follow the procedure in this section to configure the passive hub cluster and then failover or relocate the disaster recovery protected workloads.
Procedure
Ensure that the RHACM operator and MultiClusterHub are installed on the passive hub cluster. See the RHACM installation guide for instructions.
After the operator is successfully installed, a popover with a message that the Web console update is available appears on the user interface. Click Refresh web console from this popover for the console changes to take effect.
- Before hub recovery, configure backup and restore. See Backup and restore topics of RHACM Business continuity guide.
- Install the multicluster orchestrator (MCO) operator along with Red Hat OpenShift GitOps operator on the passive RHACM hub prior to the restore. For instructions to restore your RHACM hub, see Installing OpenShift Data Foundation Multicluster Orchestrator operator.
- Ensure that .spec.cleanupBeforeRestore is set to None for the Restore.cluster.open-cluster-management.io resource. For details, see the Restoring passive resources while checking for backups chapter of the RHACM documentation.
- If SSL access across clusters was configured manually during setup, then re-configure SSL access across clusters. For instructions, see the Configuring SSL access across clusters chapter.
On the passive hub, add a label to the openshift-operators namespace to enable basic monitoring of the VolumeSyncronizationDelay alert using this command. For alert details, see the Disaster recovery alerts chapter.
$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
3.18.2. Switching to passive hub cluster
Use this procedure when active hub is down or unreachable.
Procedure
Restore the backups on the passive hub cluster. For information, see Restoring a hub cluster from backup.
Important: Recovering a failed hub to its passive instance will only restore applications and their DR protected state to its last scheduled backup. Any application that was DR protected after the last scheduled backup would need to be protected again on the new hub.
During the restore procedure, to avoid eviction of resources when ManifestWorks are not regenerated correctly, you can enlarge the AppliedManifestWork eviction grace period.
Verify that the restore is complete.
$ oc -n <restore-namespace> wait restore <restore-name> --for=jsonpath='{.status.phase}'=Finished --timeout=120s
After the restore is completed, on the hub cluster, check for an existing global KlusterletConfig.
- If the global KlusterletConfig exists, then edit it and set the value of the appliedManifestWorkEvictionGracePeriod parameter to a larger value, for example, 24 hours or more.
- If the global KlusterletConfig does not exist, then create the KlusterletConfig using the following yaml:
apiVersion: config.open-cluster-management.io/v1alpha1
kind: KlusterletConfig
metadata:
  name: global
spec:
  appliedManifestWorkEvictionGracePeriod: "24h"
The configuration will be propagated to all the managed clusters automatically.
- Verify that the Primary and Secondary managed clusters are successfully imported into the RHACM console and that they are accessible (see the quick check below). If any of the managed clusters are down or unreachable, then they will not be successfully imported.
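A quick CLI check, as a sketch using the ManagedCluster API that RHACM provides on the hub:
$ oc get managedclusters
Both managed clusters should show as joined and available (the AVAILABLE column reports True) before you proceed.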
- Wait until DRPolicy validation succeeds.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with a unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'
Example output:
Succeeded
- Refresh the RHACM console to make the DR monitoring dashboard tab accessible if it was enabled on the Active hub cluster.
- Once all components are recovered, edit the global KlusterletConfig on the new hub and remove the parameter appliedManifestWorkEvictionGracePeriod and its value.
- If only the active hub cluster is down, restore the hub by performing hub recovery and restoring the backups on the passive hub. If the managed clusters are still accessible, no further action is required.
- If the primary managed cluster is down, along with the active hub cluster, you need to fail over the workloads from the primary managed cluster to the secondary managed cluster. For failover instructions, based on your workload type, see Subscription-based applications or ApplicationSet-based applications.
Verify that the failover is successful. If the Primary managed cluster is also down, then the PROGRESSION status for the workload would be in the Cleaning Up phase until the down Primary managed cluster is back online and successfully imported into the RHACM console.
On the passive hub cluster, run the following command to check the PROGRESSION status.
$ oc get drpc -o wide -A