Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads
The OpenShift Data Foundation Disaster Recovery capabilities for Metropolitan and Regional deployments are now Generally Available, and also include Disaster Recovery with stretch cluster.
Abstract
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Do let us know how we can make it better.
To give feedback, create a Jira ticket:
- Log in to Jira.
- Click Create in the top navigation bar.
- Enter a descriptive title in the Summary field.
- Enter your suggestion for improvement in the Description field. Include links to the relevant parts of the documentation.
- Select Documentation in the Components field.
- Click Create at the bottom of the dialog.
Chapter 1. Introduction to OpenShift Data Foundation Disaster Recovery
Disaster recovery (DR) is the ability to recover and continue business critical applications from natural or human-created disasters. It is a component of the overall business continuity strategy of any major organization, designed to preserve the continuity of business operations during major adverse events.
The OpenShift Data Foundation DR capability enables DR across multiple Red Hat OpenShift Container Platform clusters, and is categorized as follows:
Metro-DR
Metro-DR ensures business continuity during the unavailability of a data center with no data loss. In the public cloud, this is similar to protecting against an Availability Zone failure.
Regional-DR
Regional-DR ensures business continuity during the unavailability of a geographical region, accepting a predictable amount of data loss. In the public cloud, this is similar to protecting against a region failure.
Disaster Recovery with stretch cluster
The stretch cluster solution ensures business continuity with no data loss disaster recovery protection, using OpenShift Data Foundation based synchronous replication in a single OpenShift cluster that is stretched across two data centers with low latency and one arbiter node.
Zone failure in Metro-DR and region failure in Regional-DR are usually expressed using the terms Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
- RPO is a measure of how frequently you take backups or snapshots of persistent data. In practice, the RPO indicates the amount of data that will be lost or need to be reentered after an outage.
- RTO is the amount of downtime a business can tolerate. The RTO answers the question, “How long can it take for our system to recover after we are notified of a business disruption?”
The intent of this guide is to detail the Disaster Recovery steps and commands necessary to be able to failover an application from one OpenShift Container Platform cluster to another and then relocate the same application to the original primary cluster.
Chapter 2. Disaster recovery subscription requirement
Disaster Recovery features supported by Red Hat OpenShift Data Foundation require all of the following prerequisites to successfully implement a disaster recovery solution:
- A valid Red Hat OpenShift Data Foundation Advanced entitlement
- A valid Red Hat Advanced Cluster Management for Kubernetes subscription
Any Red Hat OpenShift Data Foundation Cluster containing PVs participating in active replication either as a source or destination requires OpenShift Data Foundation Advanced entitlement. This subscription should be active on both source and destination clusters.
To learn how subscriptions for OpenShift Data Foundation work, see the knowledgebase article on OpenShift Data Foundation subscriptions.
OpenShift Data Foundation deployed with Multus networking is not supported for Regional Disaster Recovery (Regional-DR) setups.
Chapter 3. Metro-DR solution for OpenShift Data Foundation
This section of the guide provides details of the Metro Disaster Recovery (Metro-DR) steps and commands necessary to fail over an application from one OpenShift Container Platform cluster to another and then fail back the same application to the original primary cluster. In this case the OpenShift Container Platform clusters are created or imported using Red Hat Advanced Cluster Management (RHACM) and are subject to a distance limitation of less than 10 ms RTT latency between the OpenShift Container Platform clusters.
The persistent storage for applications is provided by an external Red Hat Ceph Storage (RHCS) cluster stretched between the two locations with the OpenShift Container Platform instances connected to this storage cluster. An arbiter node with a storage monitor service is required at a third location (different location than where OpenShift Container Platform instances are deployed) to establish quorum for the RHCS cluster in the case of a site outage. This third location can be in the range of ~100ms RTT from the storage cluster connected to the OpenShift Container Platform instances.
This is a general overview of the Metro DR steps required to configure and execute OpenShift Disaster Recovery (ODR) capabilities using OpenShift Data Foundation and RHACM across two distinct OpenShift Container Platform clusters separated by distance. In addition to these two clusters called managed clusters, a third OpenShift Container Platform cluster is required that will be the Red Hat Advanced Cluster Management (RHACM) hub cluster.
You can now easily set up Metropolitan disaster recovery solutions for workloads based on OpenShift virtualization technology using OpenShift Data Foundation. For more information, see the knowledgebase article.
3.1. Components of Metro-DR solution
Metro-DR is composed of Red Hat Advanced Cluster Management for Kubernetes, Red Hat Ceph Storage and OpenShift Data Foundation components to provide application and data mobility across OpenShift Container Platform clusters.
Red Hat Advanced Cluster Management for Kubernetes
Red Hat Advanced Cluster Management (RHACM) provides the ability to manage multiple clusters and application lifecycles. Hence, it serves as a control plane in a multi-cluster environment.
RHACM is split into two parts:
- RHACM Hub: components that run on the multi-cluster control plane.
- Managed clusters: components that run on the clusters that are managed.
For more information about this product, see RHACM documentation and the RHACM “Manage Applications” documentation.
Red Hat Ceph Storage
Red Hat Ceph Storage is a massively scalable, open, software-defined storage platform that combines the most stable version of the Ceph storage system with a Ceph management platform, deployment utilities, and support services. It significantly lowers the cost of storing enterprise data and helps organizations manage exponential data growth. The software is a robust and modern petabyte-scale storage platform for public or private cloud deployments.
For more product information, see Red Hat Ceph Storage.
OpenShift Data Foundation
OpenShift Data Foundation provides the ability to provision and manage storage for stateful applications in an OpenShift Container Platform cluster. It is backed by Ceph as the storage provider, whose lifecycle is managed by Rook in the OpenShift Data Foundation component stack, and Ceph-CSI provides the provisioning and management of Persistent Volumes for stateful applications.
OpenShift DR
OpenShift DR is a disaster recovery orchestrator for stateful applications across a set of peer OpenShift clusters which are deployed and managed using RHACM and provides cloud-native interfaces to orchestrate the life-cycle of an application’s state on Persistent Volumes. These include:
- Protecting an application and its state relationship across OpenShift clusters
- Failing over an application and its state to a peer cluster
- Relocating an application and its state to the previously deployed cluster
OpenShift DR is split into three components:
- ODF Multicluster Orchestrator: Installed on the multi-cluster control plane (RHACM Hub), it orchestrates configuration and peering of OpenShift Data Foundation clusters for Metro and Regional DR relationships.
- OpenShift DR Hub Operator: Automatically installed as part of ODF Multicluster Orchestrator installation on the hub cluster to orchestrate failover or relocation of DR enabled applications.
- OpenShift DR Cluster Operator: Automatically installed on each managed cluster that is part of a Metro and Regional DR relationship to manage the lifecycle of all PVCs of an application.
3.2. Unsupported features for Metro-DR
The following features related to external mode are not supported when using Metro-DR:
- StorageClasses using non-default RADOS namespace
- User created StorageClasses, even when using default RADOS namespace
- Multiple StorageClasses
3.3. Metro-DR deployment workflow
This section provides an overview of the steps required to configure and deploy Metro-DR capabilities using the latest versions of Red Hat OpenShift Data Foundation, Red Hat Ceph Storage (RHCS) and Red Hat Advanced Cluster Management for Kubernetes (RHACM) version 2.10 or later, across two distinct OpenShift Container Platform clusters. In addition to two managed clusters, a third OpenShift Container Platform cluster will be required to deploy the Advanced Cluster Management.
To configure your infrastructure, perform the following steps in the order given:
- Ensure requirements across the Hub, Primary and Secondary OpenShift Container Platform clusters that are part of the DR solution are met. See Requirements for enabling Metro-DR.
- Ensure you meet the requirements for deploying Red Hat Ceph Storage stretch cluster with arbiter. See Requirements for deploying Red Hat Ceph Storage.
- Deploy and configure Red Hat Ceph Storage stretch mode. For instructions on enabling Ceph cluster on two different data centers using stretched mode functionality, see Deploying Red Hat Ceph Storage.
- Install OpenShift Data Foundation operator and create a storage system on Primary and Secondary managed clusters. See Installing OpenShift Data Foundation on managed clusters.
- Install the ODF Multicluster Orchestrator on the Hub cluster. See Installing ODF Multicluster Orchestrator on Hub cluster.
- Configure SSL access between the Hub, Primary and Secondary clusters. See Configuring SSL access across clusters.
- Create a DRPolicy resource for use with applications requiring DR protection across the Primary and Secondary clusters. See Creating Disaster Recovery Policy on Hub cluster.

Note: The Metro-DR solution can only have one DRPolicy.
Testing your disaster recovery solution with:
Subscription-based application:
- Create sample applications. See Creating sample application.
- Test failover and relocate operations using the sample application between managed clusters. See Subscription-based application failover and relocating subscription-based application.
ApplicationSet-based application:
- Create sample applications. See Creating ApplicationSet-based applications.
- Test failover and relocate operations using the sample application between managed clusters. See ApplicationSet-based application failover and relocating ApplicationSet-based application.
Discovered applications:
- Ensure all requirements mentioned in Prerequisites are addressed. See Prerequisites for disaster recovery protection of discovered applications.
- Create a sample discovered application. See Creating a sample discovered application
- Enroll the discovered application. See Enrolling a sample discovered application for disaster recovery protection
- Test failover and relocate. See Discovered application failover and relocate
3.4. Requirements for enabling Metro-DR
The prerequisites to installing a disaster recovery solution supported by Red Hat OpenShift Data Foundation are as follows:
You must have the following OpenShift clusters that have network reachability between them:
- Hub cluster where the Red Hat Advanced Cluster Management (RHACM) for Kubernetes operator is installed.
- Primary managed cluster where OpenShift Data Foundation is running.
- Secondary managed cluster where OpenShift Data Foundation is running.
Note: For configuring a hub recovery setup, you need a fourth cluster which acts as the passive hub. The primary managed cluster (Site-1) can be co-situated with the active RHACM hub cluster while the passive hub cluster is situated along with the secondary managed cluster (Site-2). Alternatively, the active RHACM hub cluster can be placed in a neutral site (Site-3) that is not impacted by the failures of either the primary managed cluster at Site-1 or the secondary cluster at Site-2. In this situation, if a passive hub cluster is used, it can be placed with the secondary cluster at Site-2. For more information, see Configuring passive hub cluster for hub recovery.
Hub recovery is a Technology Preview feature and is subject to Technology Preview support limitations.
Ensure that the RHACM operator and MultiClusterHub are installed on the Hub cluster. See the RHACM installation guide for instructions.
After the operator is successfully installed, the web console automatically reloads to apply the changes. During this process, a temporary error message might appear on the page; this is expected and the message disappears after the refresh completes.
Ensure that application traffic routing and redirection are configured appropriately.
On the Hub cluster
- Navigate to All Clusters → Infrastructure → Clusters.
- Import or create the Primary managed cluster and the Secondary managed cluster using the RHACM console.
- Choose the appropriate options for your environment.
After the managed clusters are successfully created or imported, you can see the list of clusters that were imported or created on the console. For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster.
The OpenShift Container Platform managed clusters and the Red Hat Ceph Storage (RHCS) nodes have distance limitations. The network latency between the sites must be below 10 milliseconds round-trip time (RTT).
3.5. Requirements for deploying Red Hat Ceph Storage stretch cluster with arbiter
Red Hat Ceph Storage is an open-source enterprise platform that provides unified software-defined storage on standard, economical servers and disks. With block, object, and file storage combined into one platform, Red Hat Ceph Storage efficiently and automatically manages all your data, so you can focus on the applications and workloads that use it.
This section provides a basic overview of the Red Hat Ceph Storage deployment. For more complex deployments, refer to the official documentation guide for Red Hat Ceph Storage 7.
Only flash media is supported because stretch mode runs with min_size=1 when degraded. Use stretch mode only with all-flash OSDs. Using all-flash OSDs minimizes the time needed to recover once connectivity is restored, thus minimizing the potential for data loss.
Erasure coded pools cannot be used with stretch mode.
3.5.1. Hardware requirements
For information on minimum hardware requirements for deploying Red Hat Ceph Storage, see Minimum hardware recommendations for containerized Ceph.
Table 3.1. Red Hat Ceph Storage cluster node layout

| Node name | Datacenter | Ceph components |
|---|---|---|
| ceph1 | DC1 | OSD+MON+MGR |
| ceph2 | DC1 | OSD+MON |
| ceph3 | DC1 | OSD+MDS+RGW |
| ceph4 | DC2 | OSD+MON+MGR |
| ceph5 | DC2 | OSD+MON |
| ceph6 | DC2 | OSD+MDS+RGW |
| ceph7 | DC3 | MON |
3.5.2. Software requirements
Use the latest software version of Red Hat Ceph Storage 8.
For more information on the supported Operating System versions for Red Hat Ceph Storage, see knowledgebase article on Red Hat Ceph Storage: Supported configurations.
3.5.3. Network configuration requirements
The recommended Red Hat Ceph Storage configuration is as follows:
- You must have two separate networks, one public network and one private network.
- You must have three different datacenters that support VLANs and subnets for Ceph's private and public networks across all datacenters.

Note: You can use different subnets for each of the datacenters.

- The latencies between the two datacenters running the Red Hat Ceph Storage Object Storage Devices (OSDs) cannot exceed 10 ms RTT. For the arbiter datacenter, this was tested with values as high as 100 ms RTT to the other two OSD datacenters.
Here is an example of a basic network configuration that we have used in this guide:
- DC1: Ceph public/private network: 10.0.40.0/24
- DC2: Ceph public/private network: 10.0.40.0/24
- DC3: Ceph public/private network: 10.0.40.0/24
For more information on the required network environment, see Ceph network configuration.
3.6. Deploying Red Hat Ceph Storage
3.6.1. Node pre-deployment steps
Before installing the Red Hat Ceph Storage Ceph cluster, perform the following steps to fulfill all the requirements needed.
Register all the nodes to the Red Hat Network or Red Hat Satellite and subscribe to a valid pool:
subscription-manager register
subscription-manager subscribe --pool=8a8XXXXXX9e0

Enable access for all the nodes in the Ceph cluster for the following repositories:
- rhel9-for-x86_64-baseos-rpms
- rhel9-for-x86_64-appstream-rpms

subscription-manager repos --disable="*" --enable="rhel9-for-x86_64-baseos-rpms" --enable="rhel9-for-x86_64-appstream-rpms"
- Update the operating system RPMs to the latest version and reboot if needed:

dnf update -y
reboot

- Select a node from the cluster to be your bootstrap node. ceph1 is our bootstrap node in this example going forward.
- Only on the bootstrap node ceph1, enable the ansible-2.9-for-rhel-9-x86_64-rpms and rhceph-6-tools-for-rhel-9-x86_64-rpms repositories:

subscription-manager repos --enable="ansible-2.9-for-rhel-9-x86_64-rpms" --enable="rhceph-6-tools-for-rhel-9-x86_64-rpms"

- Configure the hostname using the bare/short hostname on all the hosts:

hostnamectl set-hostname <short_name>

- Verify the hostname configuration for deploying Red Hat Ceph Storage with cephadm:
$ hostname

Example output:
ceph1
- Modify the /etc/hosts file and add the FQDN entry for the 127.0.0.1 IP by setting the DOMAIN variable with your DNS domain name.
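A minimal sketch of that change, assuming example.domain.com as the DNS domain name; adjust the DOMAIN value and any extra aliases to your environment:

DOMAIN="example.domain.com"
cat <<EOF >/etc/hosts
127.0.0.1 $(hostname).${DOMAIN} $(hostname) localhost localhost.localdomain localhost4 localhost4.localdomain4
::1       $(hostname).${DOMAIN} $(hostname) localhost6 localhost6.localdomain6
EOF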
- Check the long hostname with the FQDN using the hostname -f option:

$ hostname -f

Example output:

ceph1.example.domain.com

Note: To know more about why these changes are required, see Fully Qualified Domain Names vs Bare Host Names.
Run the following steps on the bootstrap node. In our example, the bootstrap node is ceph1.

- Install the cephadm-ansible RPM package:

$ sudo dnf install -y cephadm-ansible

Important: To run the Ansible playbooks, you must have passwordless SSH access to all the nodes that are configured to the Red Hat Ceph Storage cluster. Ensure that the configured user (for example, deployment-user) has root privileges to invoke the sudo command without needing a password.

To use a custom key, configure the selected user (for example, deployment-user) SSH config file to specify the id/key that will be used for connecting to the nodes via SSH:

cat <<EOF > ~/.ssh/config
Host ceph*
  User deployment-user
  IdentityFile ~/.ssh/ceph.pem
EOF
- Build the Ansible inventory:
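A minimal inventory sketch, assuming the seven hosts from Table 3.1 and the /usr/share/cephadm-ansible/inventory path used by the commands that follow; adapt the host list to your environment:

cat <<EOF > /usr/share/cephadm-ansible/inventory
ceph1
ceph2
ceph3
ceph4
ceph5
ceph6
ceph7

[admin]
ceph1
ceph4
EOF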
Note: Here, the hosts ceph1 and ceph4 belonging to two different data centers are configured as part of the [admin] group in the inventory file and are tagged as _admin by cephadm. Each of these admin nodes receives the admin ceph keyring during the bootstrap process so that when one data center is down, we can check the cluster using the other available admin node.

- Verify that ansible can access all nodes using the ping module before running the pre-flight playbook:

$ ansible -i /usr/share/cephadm-ansible/inventory -m ping all -b
- Navigate to the /usr/share/cephadm-ansible directory and run ansible-playbook with relative file paths:

$ ansible-playbook -i /usr/share/cephadm-ansible/inventory /usr/share/cephadm-ansible/cephadm-preflight.yml --extra-vars "ceph_origin=rhcs"

The preflight playbook configures the RHCS dnf repository and prepares the storage cluster for bootstrapping. It also installs podman, lvm2, chronyd, and cephadm. The default location for cephadm-ansible and cephadm-preflight.yml is /usr/share/cephadm-ansible. For additional information, see Running the preflight playbook.
3.6.2. Cluster bootstrapping and service deployment with cephadm utility
The cephadm utility installs and starts a single Ceph Monitor daemon and a Ceph Manager daemon for a new Red Hat Ceph Storage cluster on the local node where the cephadm bootstrap command is run.
In this guide we are going to bootstrap the cluster and deploy all the needed Red Hat Ceph Storage services in one step using a cluster specification yaml file.
If you find issues during the deployment, it may be easier to troubleshoot the errors by dividing the deployment into two steps:
- Bootstrap
- Service deployment
For additional information on the bootstrapping process, see Bootstrapping a new storage cluster.
Procedure
Create a registry.json file to authenticate against the container registry, as follows:
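A minimal sketch of the /root/registry.json file that the cephadm bootstrap command consumes later in this procedure; the registry URL and credentials shown here are placeholders to replace with your own service account details:

cat <<EOF > /root/registry.json
{
  "url": "registry.redhat.io",
  "username": "<registry_service_account_username>",
  "password": "<registry_service_account_password>"
}
EOF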
Create a cluster-spec.yaml file that adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services should run, following Table 3.1.
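An abbreviated sketch of such a specification, assuming the node names, datacenters, and service placement from Table 3.1; the address, device selection, and counts here are placeholders to adapt to your environment:

cat <<EOF > /root/cluster-spec.yaml
service_type: host
addr: 10.0.40.78
hostname: ceph1
location:
  root: default
  datacenter: DC1
labels:
  - osd
  - mon
  - mgr
---
# ...repeat one "service_type: host" entry per remaining node (ceph2 to ceph7),
# placing each under DC1, DC2, or DC3 with the labels from Table 3.1...
---
service_type: mon
placement:
  label: "mon"
---
service_type: mgr
placement:
  label: "mgr"
---
service_type: osd
service_id: all-available-devices
placement:
  label: "osd"
spec:
  data_devices:
    all: true
---
service_type: rgw
service_id: objectgw
placement:
  count: 2
  label: "rgw"
spec:
  rgw_frontend_port: 8080
EOF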
Retrieve the IP for the NIC with the Red Hat Ceph Storage public network configured from the bootstrap node. After substituting 10.0.40.0 with the subnet that you have defined in your Ceph public network, execute the following command:

$ ip a | grep 10.0.40

Example output:
10.0.40.78
Run the cephadm bootstrap command as the root user on the node that will be the initial Monitor node in the cluster. The IP_ADDRESS option is the IP address of the node that you are using to run the cephadm bootstrap command.

Note: If you have configured a different user instead of root for passwordless SSH access, then use the --ssh-user= flag with the cephadm bootstrap command. If you are using non-default or id_rsa SSH key names, then use the --ssh-private-key and --ssh-public-key options with the cephadm command.

$ cephadm bootstrap --ssh-user=deployment-user --mon-ip 10.0.40.78 --apply-spec /root/cluster-spec.yaml --registry-json /root/registry.json

Important: If the local node uses fully-qualified domain names (FQDN), then add the --allow-fqdn-hostname option to cephadm bootstrap on the command line.

Once the bootstrap finishes, review the output of the cephadm bootstrap command.
Verify the status of Red Hat Ceph Storage cluster deployment using the Ceph CLI client from ceph1:
$ ceph -s
Note: It may take several minutes for all the services to start.
It is normal to get a global recovery event while you do not have any OSDs configured.
You can use ceph orch ps and ceph orch ls to further check the status of the services.

Verify that all the nodes are part of the cephadm cluster:

$ ceph orch host ls

Note: You can run Ceph commands directly from the host because ceph1 was configured in the cephadm-ansible inventory as part of the [admin] group. The Ceph admin keys were copied to the host during the cephadm bootstrap process.

Check the current placement of the Ceph monitor services on the datacenters:
$ ceph orch ps | grep mon | awk '{print $1 " " $2}'

Example output:
mon.ceph1  ceph1
mon.ceph2  ceph2
mon.ceph4  ceph4
mon.ceph5  ceph5
mon.ceph7  ceph7

Check the current placement of the Ceph manager services on the datacenters:
$ ceph orch ps | grep mgr | awk '{print $1 " " $2}'

Example output:
mgr.ceph2.ycgwyz  ceph2
mgr.ceph5.kremtt  ceph5

Check the ceph osd crush map layout to ensure that each host has one OSD configured and its status is UP. Also, double-check that each node is under the right datacenter bucket as specified in Table 3.1:

$ ceph osd tree
Create and enable a new RBD block pool:

$ ceph osd pool create rbdpool 32 32
$ ceph osd pool application enable rbdpool rbd

Note: The number 32 at the end of the command is the number of PGs assigned to this pool. The number of PGs can vary depending on several factors like the number of OSDs in the cluster, the expected % used of the pool, and so on. You can use the following calculator to determine the number of PGs needed: Ceph Placement Groups (PGs) per Pool Calculator.
Verify that the RBD pool has been created.
$ ceph osd lspools | grep rbdpool

Example output:
3 rbdpool
Verify that the MDS services are active and have one service located on each datacenter:

$ ceph orch ps | grep mds

Example output:
mds.cephfs.ceph3.cjpbqo  ceph3  running (17m)  117s ago  17m  16.1M  -  16.2.9
mds.cephfs.ceph6.lqmgqt  ceph6  running (17m)  117s ago  17m  16.1M  -  16.2.9

Create the CephFS volume:

$ ceph fs volume create cephfs

Note: The ceph fs volume create command also creates the needed data and metadata CephFS pools. For more information, see Configuring and Mounting Ceph File Systems.

Check the Ceph status to verify how the MDS daemons have been deployed. Ensure that the state is active, where ceph6 is the primary MDS for this filesystem and ceph3 is the secondary MDS:

$ ceph fs status
Verify that RGW services are active:

$ ceph orch ps | grep rgw

Example output:
rgw.objectgw.ceph3.kkmxgb  ceph3  *:8080  running (7m)  3m ago  7m  52.7M  -  16.2.9
rgw.objectgw.ceph6.xmnpah  ceph6  *:8080  running (7m)  3m ago  7m  53.3M  -  16.2.9
3.6.3. Configuring Red Hat Ceph Storage stretch mode
Once the Red Hat Ceph Storage cluster is fully deployed using cephadm, use the following procedure to configure the stretch cluster mode. The new stretch mode is designed to handle the 2-site case.
Procedure
Check the current election strategy being used by the monitors with the ceph mon dump command. By default in a Ceph cluster, the election strategy is set to classic.
ceph mon dump | grep election_strategy

Example output:
dumped monmap epoch 9
election_strategy: 1

Change the monitor election to connectivity:
ceph mon set election_strategy connectivity

Run the previous ceph mon dump command again to verify the election_strategy value:
$ ceph mon dump | grep election_strategy

Example output:
dumped monmap epoch 10
election_strategy: 3

To know more about the different election strategies, see Configuring monitor election strategy.
Set the location for all our Ceph monitors:
ceph mon set_location ceph1 datacenter=DC1
ceph mon set_location ceph2 datacenter=DC1
ceph mon set_location ceph4 datacenter=DC2
ceph mon set_location ceph5 datacenter=DC2
ceph mon set_location ceph7 datacenter=DC3

Verify that each monitor has its appropriate location:
$ ceph mon dump

Create a CRUSH rule that makes use of this OSD crush topology by installing the ceph-base RPM package in order to use the crushtool command:

$ dnf -y install ceph-base

To know more about CRUSH ruleset, see Ceph CRUSH ruleset.
Get the compiled CRUSH map from the cluster:
$ ceph osd getcrushmap > /etc/ceph/crushmap.bin

Decompile the CRUSH map and convert it to a text file in order to be able to edit it:

$ crushtool -d /etc/ceph/crushmap.bin -o /etc/ceph/crushmap.txt
Add the following rule to the CRUSH map by editing the text file /etc/ceph/crushmap.txt at the end of the file:

$ vim /etc/ceph/crushmap.txt
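Reconstructed from the field descriptions that follow, the rule to append looks like this (older releases accept the min_size and max_size lines; drop them if your crushtool rejects them):

rule stretch_rule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}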
This example is applicable for active applications in both OpenShift Container Platform clusters.

Note: The rule id has to be unique. In the example, we only have one more crush rule with id 0, hence we are using id 1. If your deployment has more rules created, then use the next free id.

The CRUSH rule declared contains the following information:
Rule name
- Description: A unique name for identifying the rule.
- Value: stretch_rule

id
- Description: A unique whole number for identifying the rule.
- Value: 1

type
- Description: Describes a rule for either a storage drive replicated or erasure-coded.
- Value: replicated

min_size
- Description: If a pool makes fewer replicas than this number, CRUSH will not select this rule.
- Value: 1

max_size
- Description: If a pool makes more replicas than this number, CRUSH will not select this rule.
- Value: 10

step take default
- Description: Takes the root bucket called default, and begins iterating down the tree.

step choose firstn 0 type datacenter
- Description: Selects the datacenter bucket, and goes into its subtrees.

step chooseleaf firstn 2 type host
- Description: Selects the number of buckets of the given type. In this case, it is two different hosts located in the datacenter it entered at the previous level.

step emit
- Description: Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule.
Compile the new CRUSH map from the file /etc/ceph/crushmap.txt and convert it to a binary file called /etc/ceph/crushmap2.bin:

$ crushtool -c /etc/ceph/crushmap.txt -o /etc/ceph/crushmap2.bin

Inject the new crushmap we created back into the cluster:
$ ceph osd setcrushmap -i /etc/ceph/crushmap2.bin

Example output:
17
Note: The number 17 is a counter and it will increase (18, 19, and so on) depending on the changes you make to the crush map.
Verify that the stretch rule created is now available for use:
ceph osd crush rule ls

Example output:
replicated_rule
stretch_rule

Enable the stretch cluster mode:
$ ceph mon enable_stretch_mode ceph7 stretch_rule datacenter

In this example, ceph7 is the arbiter node, stretch_rule is the crush rule we created in the previous step, and datacenter is the dividing bucket.

Verify that all our pools are using the stretch_rule CRUSH rule we have created in our Ceph cluster:

$ for pool in $(rados lspools); do echo -n "Pool: ${pool}; "; ceph osd pool get ${pool} crush_rule; done

This indicates that a working Red Hat Ceph Storage stretched cluster with arbiter mode is now available.
3.7. Installing OpenShift Data Foundation on managed clusters
To configure storage replication between the two OpenShift Container Platform clusters, the OpenShift Data Foundation operator must first be installed on each managed cluster.
Prerequisites
- Ensure that you have met the hardware requirements for OpenShift Data Foundation external deployments. For a detailed description of the hardware requirements, see External mode requirements.
Procedure
- Install and configure the latest OpenShift Data Foundation cluster on each of the managed clusters.
After installing the operator, create a StorageSystem using the option Full deployment type and Connect with external storage platform, where your Backing storage type is Red Hat Ceph Storage. For detailed instructions, refer to Deploying OpenShift Data Foundation in external mode.
Use the following flags with the ceph-external-cluster-details-exporter.py script.

At a minimum, you must use the following three flags with the ceph-external-cluster-details-exporter.py script:

--rbd-data-pool-name
- With the name of the RBD pool that was created during RHCS deployment for OpenShift Container Platform. For example, the pool can be called rbdpool.

--rgw-endpoint
- Provide the endpoint in the format <ip_address>:<port>. It is the RGW IP of the RGW daemon running on the same site as the OpenShift Container Platform cluster that you are configuring.

--run-as-user
- With a different client name for each site.

The following flags are optional if default values were used during the RHCS deployment:

--cephfs-filesystem-name
- With the name of the CephFS filesystem we created during RHCS deployment for OpenShift Container Platform; the default filesystem name is cephfs.

--cephfs-data-pool-name
- With the name of the CephFS data pool we created during RHCS deployment for OpenShift Container Platform; the default pool is called cephfs.data.

--cephfs-metadata-pool-name
- With the name of the CephFS metadata pool we created during RHCS deployment for OpenShift Container Platform; the default pool is called cephfs.meta.
Run the following command on the bootstrap node ceph1 to get the IP for the RGW endpoints in datacenter1 and datacenter2:

ceph orch ps | grep rgw.objectgw

Example output:

rgw.objectgw.ceph3.mecpzm  ceph3  *:8080  running (5d)  31s ago  7w  204M  -  16.2.7-112.el8cp
rgw.objectgw.ceph6.mecpzm  ceph6  *:8080  running (5d)  31s ago  7w  204M  -  16.2.7-112.el8cp

Resolve the IP addresses of the hosts running the RGW daemons:

host ceph3.example.com
host ceph6.example.com

Example output:

ceph3.example.com has address 10.0.40.24
ceph6.example.com has address 10.0.40.66
Run the ceph-external-cluster-details-exporter.py script with the parameters that are configured for the first OpenShift Container Platform managed cluster cluster1 on the bootstrapped node ceph1:

python3 ceph-external-cluster-details-exporter.py --rbd-data-pool-name rbdpool --cephfs-filesystem-name cephfs --cephfs-data-pool-name cephfs.cephfs.data --cephfs-metadata-pool-name cephfs.cephfs.meta --rgw-endpoint XXX.XXX.XXX.XXX:8080 --run-as-user client.odf.cluster1 > ocp-cluster1.json

Note: Modify the --rgw-endpoint value XXX.XXX.XXX.XXX according to your environment.
Run the ceph-external-cluster-details-exporter.py script with the parameters that are configured for the second OpenShift Container Platform managed cluster cluster2 on the bootstrapped node ceph1:

python3 ceph-external-cluster-details-exporter.py --rbd-data-pool-name rbdpool --cephfs-filesystem-name cephfs --cephfs-data-pool-name cephfs.cephfs.data --cephfs-metadata-pool-name cephfs.cephfs.meta --rgw-endpoint XXX.XXX.XXX.XXX:8080 --run-as-user client.odf.cluster2 > ocp-cluster2.json

Note: Modify the --rgw-endpoint value XXX.XXX.XXX.XXX according to your environment.
- Save the two files generated in the bootstrap cluster (ceph1), ocp-cluster1.json and ocp-cluster2.json, to your local machine.
- Use the contents of the file ocp-cluster1.json on the OpenShift Container Platform console on cluster1 where external OpenShift Data Foundation is being deployed.
- Use the contents of the file ocp-cluster2.json on the OpenShift Container Platform console on cluster2 where external OpenShift Data Foundation is being deployed.
- Review the settings and then select Create StorageSystem.
Validate the successful deployment of OpenShift Data Foundation on each managed cluster with the following command:
$ oc get storagecluster -n openshift-storage ocs-external-storagecluster -o jsonpath='{.status.phase}{"\n"}'

For the Multicloud Gateway (MCG):
$ oc get noobaa -n openshift-storage noobaa -o jsonpath='{.status.phase}{"\n"}'

Wait for the status result to be Ready for both queries on the Primary managed cluster and the Secondary managed cluster.
Verify the storage cluster is healthy.
- On the OpenShift Web Console, navigate to Storage → Data Foundation → Storage System.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- In the Status card of the Block and File tab, verify that the Storage Cluster has a green tick.
Enable read affinity for RBD and CephFS volumes to be served from the nearest datacenter.
On the Primary managed cluster, label all the nodes.
$ oc label nodes --all metro-dr.openshift-storage.topology.io/datacenter=DC1

Execute the following commands to enable read affinity:
$ oc patch storageclusters.ocs.openshift.io -n openshift-storage ocs-external-storagecluster -p '{"spec":{"csi":{"readAffinity":{"enabled":true,"crushLocationLabels":["metro-dr.openshift-storage.topology.io/datacenter"]}}}}' --type=merge

$ oc delete po -n openshift-storage -l 'app in (openshift-storage.cephfs.csi.ceph.com-ctrlplugin,openshift-storage.rbd.csi.ceph.com-ctrlplugin)'

On the Secondary managed cluster, label all the nodes:
$ oc label nodes --all metro-dr.openshift-storage.topology.io/datacenter=DC2

Execute the following commands to enable read affinity:
$ oc patch storageclusters.ocs.openshift.io -n openshift-storage ocs-external-storagecluster -p '{"spec":{"csi":{"readAffinity":{"enabled":true,"crushLocationLabels":["metro-dr.openshift-storage.topology.io/datacenter"]}}}}' --type=merge

$ oc delete po -n openshift-storage -l 'app in (openshift-storage.cephfs.csi.ceph.com-ctrlplugin,openshift-storage.rbd.csi.ceph.com-ctrlplugin)'
3.8. Installing OpenShift Data Foundation Multicluster Orchestrator operator
OpenShift Data Foundation Multicluster Orchestrator is a controller that is installed from OpenShift Container Platform’s OperatorHub on the Hub cluster.
Procedure
- On the Hub cluster, navigate to OperatorHub and use the keyword filter to search for ODF Multicluster Orchestrator.
- Click ODF Multicluster Orchestrator tile.
Keep all default settings and click Install.
Ensure that the operator resources are installed in the openshift-operators project and available to all namespaces.

Note: The ODF Multicluster Orchestrator also installs the OpenShift DR Hub Operator on the RHACM hub cluster as a dependency.

Verify that the operator Pods are in a Running state. The OpenShift DR Hub operator is also installed at the same time in the openshift-operators namespace:

$ oc get pods -n openshift-operators

Example output:
NAME                                        READY   STATUS    RESTARTS   AGE
odf-multicluster-console-6845b795b9-blxrn   1/1     Running   0          4d20h
odfmo-controller-manager-f9d9dfb59-jbrsd    1/1     Running   0          4d20h
ramen-hub-operator-6fb887f885-fss4w         2/2     Running   0          4d20h
3.9. Configuring SSL access across clusters
Configure network (SSL) access between the primary and secondary clusters so that metadata can be stored on the alternate cluster in a Multicloud Gateway (MCG) object bucket using a secure transport protocol and in the Hub cluster for verifying access to the object buckets.
If all of your OpenShift clusters are deployed using a signed and valid set of certificates for your environment then this section can be skipped.
Procedure
Extract the ingress certificate for the Primary managed cluster and save the output to primary.crt:

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > primary.crt

Extract the ingress certificate for the Secondary managed cluster and save the output to secondary.crt:

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > secondary.crt
Create a new ConfigMap file named cm-clusters-crt.yaml to hold the remote cluster's certificate bundle.

Note: There could be more or fewer than three certificates for each cluster, as shown in this example file. Also, ensure that the certificate contents are correctly indented after you copy and paste from the primary.crt and secondary.crt files that were created earlier.
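A sketch of that ConfigMap, assuming the user-ca-bundle name shown in the command output below and the openshift-config namespace (verify the namespace for your clusters); paste the certificate contents from primary.crt and secondary.crt under ca-bundle.crt:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-ca-bundle
  namespace: openshift-config
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    <copy contents from the primary.crt file here>
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    <copy contents from the secondary.crt file here>
    -----END CERTIFICATE-----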
Create the ConfigMap on the Primary managed cluster, Secondary managed cluster, and the Hub cluster:

$ oc create -f cm-clusters-crt.yaml

Example output:
configmap/user-ca-bundle created
Patch the default proxy resource on the Primary managed cluster, Secondary managed cluster, and the Hub cluster:

$ oc patch proxy cluster --type=merge --patch='{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'

Example output:
proxy.config.openshift.io/cluster patched
3.10. Creating Disaster Recovery Policy on Hub cluster
The OpenShift Disaster Recovery Policy (DRPolicy) resource specifies the OpenShift Container Platform clusters participating in the disaster recovery solution and the desired replication interval. DRPolicy is a cluster-scoped resource that users can apply to applications that require a disaster recovery solution.
The ODF MultiCluster Orchestrator Operator facilitates the creation of each DRPolicy and the corresponding DRClusters through the Multicluster Web console.
Prerequisites
- Ensure that there is a minimum set of two managed clusters.
Procedure
- On the OpenShift console, navigate to All Clusters → Data Services → Disaster recovery.
- On the Overview tab, click Create a disaster recovery policy or you can navigate to Policies tab and click Create DRPolicy.
- Enter the Policy name. Ensure that each DRPolicy has a unique name (for example, ocp4perf1-ocp4perf2).
- Select two clusters from the list of managed clusters with which this new policy will be associated.
- Replication policy is automatically set to sync based on the OpenShift clusters selected.
- Click Create.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with your unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'

Example output:
Succeeded
When a DRPolicy is created, two DRCluster resources are also created along with it. It could take up to 10 minutes for all three resources to be validated and for the status to show as Succeeded.

Note: Editing of the SchedulingInterval, ReplicationClassSelector, VolumeSnapshotClassSelector and DRClusters field values is not supported in the DRPolicy.

Verify the object bucket access from the Hub cluster to both the Primary managed cluster and the Secondary managed cluster.
Get the names of the DRClusters on the Hub cluster.
$ oc get drclusters

Example output:
NAME        AGE
ocp4perf1   4m42s
ocp4perf2   4m42s

Check S3 access to each bucket created on each managed cluster. Use the DRCluster validation command, where <drcluster_name> is replaced with your unique name.
Note: Editing of the Region and S3ProfileName field values is not supported in DRClusters.

$ oc get drcluster <drcluster_name> -o jsonpath='{.status.conditions[2].reason}{"\n"}'

Example output:
Succeeded
Note: Make sure to run commands for both DRClusters on the Hub cluster.
Verify that the OpenShift DR Cluster operator installation was successful on the Primary managed cluster and the Secondary managed cluster.
$ oc get csv,pod -n openshift-dr-system | egrep 'odr|ramen'

Example output:
clusterserviceversion.operators.coreos.com/odr-cluster-operator.v4.20.0-rhodf   Openshift DR Cluster Operator   4.20.0-rhodf   odr-cluster-operator.v4.20.0-rhodf   Succeeded
pod/ramen-dr-cluster-operator-77bf74849c-rfn85   2/2   Running   0   7h20m
You can also verify that OpenShift DR Cluster Operator is installed successfully on the OperatorHub of each managed cluster.

Verify that the secret is propagated correctly on the Primary managed cluster and the Secondary managed cluster:
oc get secrets -n openshift-dr-system | grep Opaque

Match the output with the s3SecretRef from the Hub cluster:
oc get cm -n openshift-operators ramen-hub-operator-config -oyaml
3.11. Configure DRClusters for fencing automation
This configuration is required for enabling fencing prior to application failover. In order to prevent writes to the persistent volume from the cluster which is hit by a disaster, OpenShift DR instructs Red Hat Ceph Storage (RHCS) to fence the nodes of the cluster from the RHCS external storage. This section guides you on how to add the IPs or the IP Ranges for the nodes of the DRCluster.
3.11.1. Add node IP addresses to DRClusters
Find the IP addresses for all of the OpenShift nodes in the managed clusters by running this command in the Primary managed cluster and the Secondary managed cluster.
$ oc get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'

Once you have the IP addresses, the DRCluster resources can be modified for each managed cluster.

Find the DRCluster names on the Hub cluster:
$ oc get drcluster

Example output:
NAME        AGE
ocp4perf1   5m35s
ocp4perf2   5m35s

Edit each DRCluster to add your unique IP addresses, after replacing <drcluster_name> with your unique name:

$ oc edit drcluster <drcluster_name>
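A sketch of the edit, assuming the DRCluster specification accepts the node addresses as a cidrs list (verify the field name against the CRD installed on your hub); add one entry per node IP collected above:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: ocp4perf1
spec:
  ## Add this section
  cidrs:
    - <IP_Address1>/32
    - <IP_Address2>/32
    - <IP_Address3>/32
    - <IP_Address4>/32
    - <IP_Address5>/32
    - <IP_Address6>/32
  [...]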
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited
There could be more than six IP addresses.
Modify this DRCluster configuration also for IP addresses on the Secondary managed clusters in the peer DRCluster resource (e.g., ocp4perf2).
3.11.2. Add fencing annotations to DRClusters
Add the following annotations to all the DRCluster resources. These annotations include details needed for the NetworkFence resource created later in these instructions (prior to testing application failover).
Replace <drcluster_name> with your unique name.
$ oc edit drcluster <drcluster_name>
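A sketch of the annotations to add, assuming these are the keys the OpenShift DR (Ramen) cluster operator expects for fencing; verify the exact annotation keys and values against the version installed in your environment before applying:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  ## Add this section
  annotations:
    drcluster.ramendr.openshift.io/storage-clusterid: openshift-storage
    drcluster.ramendr.openshift.io/storage-driver: openshift-storage.rbd.csi.ceph.com
    drcluster.ramendr.openshift.io/storage-secret-name: rook-csi-rbd-provisioner
    drcluster.ramendr.openshift.io/storage-secret-namespace: openshift-storage
  name: ocp4perf1
  [...]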
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited
Make sure to add these annotations for both DRCluster resources (for example: ocp4perf1 and ocp4perf2).
3.12. Create sample application for testing disaster recovery solution
OpenShift Data Foundation disaster recovery (DR) solution supports disaster recovery for ApplicationSet-based applications that are managed by RHACM. For more details, see ApplicationSet documentation.
The following section details how to create an application and apply a DRPolicy to an application.
ApplicationSet-based applications
OpenShift users that do not have cluster-admin permissions cannot create ApplicationSet-based applications.
Discovered applications using RBD block volumes and Ceph FileSystem are not supported when using Metro-DR.
3.12.1. ApplicationSet-based applications
3.12.1.1. Creating ApplicationSet-based applications
Prerequisite
- Ensure that the Red Hat OpenShift GitOps operator is installed on all three clusters: Hub cluster, Primary managed cluster and Secondary managed cluster. For instructions, see Installing Red Hat OpenShift GitOps Operator in web console.
On the Hub cluster, ensure that both Primary and Secondary managed clusters are registered to GitOps. For registration instructions, see Registering managed clusters to GitOps. Then check if the Placement used by the GitOpsCluster resource to register both managed clusters has the tolerations to deal with cluster unavailability. You can verify whether the following tolerations are added to the Placement using the command oc get placement <placement-name> -n openshift-gitops -o yaml.

tolerations:
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
- key: cluster.open-cluster-management.io/unavailable
  operator: Exists

In case the tolerations are not added, see Configuring application placement tolerations for Red Hat Advanced Cluster Management and OpenShift GitOps.
- Ensure that you have created the ClusterRoleBinding yaml on both the Primary and Secondary managed clusters. For instructions, see the Prerequisites chapter in RHACM documentation.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications and click Create application.
- Choose the application type as Argo CD ApplicationSet - Pull model.
- In the General step, enter your Application set name.
-
Select Argo server
openshift-gitopsand Requeue time as180seconds. - Click Next.
-
In the Repository location for resources section, select Repository type
Git. Enter the Git repository URL for the sample application, the github Branch and Path where the resources busybox Pod and PVC will be created.
- Use the sample application repository as https://github.com/red-hat-storage/ocm-ramen-samples
-
Select Revision as
main -
Choose Path as
workloads/deployment/odr-metro-rbd.
- Enter Remote namespace value. (example, busybox-sample) and click Next.
Choose the Sync policy settings as per your requirement or go with the default selections, and then click Next.
You can choose one or more options.
- In Label expressions, add a label <name> with its value set to the managed cluster name.
- Click Next.
- Review the setting details and click Submit.
3.12.1.2. Apply Data policy to sample ApplicationSet-based application
Prerequisites
- Ensure that both managed clusters referenced in the Data policy are reachable. If not, the application will not be protected for disaster recovery until both clusters are online.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications.
- Click the Actions menu at the end of application to view the list of available actions.
- Click Manage disaster recovery.
- Click Enroll application.
- Select Policy name and click Next.
Select an Application resource and then use PVC label selector to select the PVC label for the selected application resource.

Note: You can select more than one PVC label for the selected application resources.
- Click Next.
- In the Enroll managed application modal, review the policy configuration details and click Assign. The newly assigned Data policy details are displayed on the Manage disaster recovery modal.
Verify that you can view the assigned policy details on the Applications page.
On the Applications page, navigate to the DR Status column and view the status. The status will either be healthy or critical. You can click the status to view the last synced time for application volumes and the DR policy assigned.
Note: It may take a few minutes after disaster recovery is assigned for the DR Status to move from critical to healthy.
Failover and relocate status can be viewed in the DR Status column when initiated. These statuses can be clicked to view more details about the action.
DR Status shows either healthy, warning, or critical when failover or relocate is not taking place.
- After you apply DRPolicy to the applications, confirm whether ClusterDataProtected is set to True in the drpc yaml output.
3.12.2. Deleting sample application
This section provides instructions for deleting the sample application busybox using the RHACM console.
When deleting a DR protected application, access to both clusters that belong to the DRPolicy is required. This is to ensure that all protected API resources and resources in the respective S3 stores are cleaned up as part of removing the DR protection. If access to one of the clusters is not healthy, the DRPlacementControl resource for the application on the hub remains in the Deleting state.
Prerequisites
- These instructions to delete the sample application should not be executed until the failover and relocate testing is completed and the application is ready to be removed from RHACM and the managed clusters.
Procedure
- On the RHACM console, navigate to Applications.
- Search for the sample application to be deleted (for example, busybox).
- Click the Action Menu (⋮) next to the application you want to delete.
Click Delete application.
When the Delete application is selected a new screen will appear asking if the application related resources should also be deleted.
- Select Remove application related resources checkbox to delete the Subscription and PlacementRule.
- Click Delete. This will delete the busybox application on the Primary managed cluster (or whatever cluster the application was running on).
In addition to the resources deleted using the RHACM console, delete the DRPlacementControl if it is not auto-deleted after deleting the busybox application.

Log in to the OpenShift Web console for the Hub cluster and navigate to Installed Operators for the project busybox-sample. For ApplicationSet applications, select the project as openshift-gitops.

- Click OpenShift DR Hub Operator and then click the DRPlacementControl tab.
- Click the Action Menu (⋮) next to the busybox application DRPlacementControl that you want to delete.
- Click Delete DRPlacementControl.
- Click Delete.
This process can be used to delete any application with a DRPlacementControl resource.
3.13. Subscription-based application failover between managed clusters
Perform a failover when a managed cluster becomes unavailable, due to any reason. This failover method is application-based.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.

- Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
Check the status of both the managed clusters individually before performing failover operation.
However, failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Procedure
Enable fencing on the Hub cluster.
Open CLI terminal and edit the DRCluster resource, where <drcluster_name> is your unique name.
Important: Once the managed cluster is fenced, all communication from applications to the OpenShift Data Foundation external storage cluster will fail and some Pods will be in an unhealthy state (for example: CreateContainerError, CrashLoopBackOff) on the cluster that is now fenced.

$ oc edit drcluster <drcluster_name>
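The exact snippet edited here was not captured above; as a minimal sketch, assuming the standard Ramen DRCluster API, fencing is requested by setting the clusterFence field in the resource spec:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: <drcluster_name>
spec:
  ## Set this field to fence the cluster (sketch; other spec fields unchanged)
  clusterFence: Fenced
  [...]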
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited

Verify the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique identifier.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'

Example output:
Fenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are now in the blocklist.
$ ceph osd blocklist ls

The output lists the blocklisted node IP addresses.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Failover application.
- After the Failover application modal is shown, select policy and target cluster to which the associated application will failover in case of a disaster.
Click the Select subscription group dropdown to verify the default selection or modify this setting.
By default, the subscription group that replicates for the application resources is selected.
Check the status of the Failover readiness.
-
If the status is
Ready with a green tick, it indicates that the target cluster is ready for failover to start. Proceed to step 7.
If the status is
UnknownorNot ready, then wait until the status changes toReady.
-
If the status is
- Click Initiate. The busybox application is now failing over to the Secondary-managed cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link.
3.14. ApplicationSet-based application failover between managed clusters
Perform a failover when a managed cluster becomes unavailable, due to any reason. This failover method is application-based.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.

- Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
Check the status of both the managed clusters individually before performing failover operation.
However, failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Procedure
Enable fencing on the Hub cluster.
Open CLI terminal and edit the DRCluster resource, where <drcluster_name> is your unique name.
Important: Once the managed cluster is fenced, all communication from applications to the OpenShift Data Foundation external storage cluster will fail and some Pods will be in an unhealthy state (for example: CreateContainerError, CrashLoopBackOff) on the cluster that is now fenced.

$ oc edit drcluster <drcluster_name>
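As in the previous section, a minimal sketch of the change (assuming the standard Ramen DRCluster API) is to set spec.clusterFence to Fenced:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: <drcluster_name>
spec:
  ## Set this field to fence the cluster (sketch; other spec fields unchanged)
  clusterFence: Fenced
  [...]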
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited

Verify the fencing status on the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with your unique identifier.
$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'

Example output:
Fenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are now in the blocklist.
$ ceph osd blocklist ls

The output lists the blocklisted node IP addresses.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Failover application.
- When the Failover application modal is shown, verify the details presented are correct and check the status of the Failover readiness. If the status is Ready with a green tick, it indicates that the target cluster is ready for failover to start.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the ongoing activities associated with the policy in use with the application.
3.15. Relocating Subscription-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
- Navigate to RHACM console → Infrastructure → Clusters → Cluster list tab.
- Check the status of both the managed clusters individually before performing relocate operation.
Verify that applications were cleaned up from the cluster before unfencing it:
- Ensure that either there are no pods, or that no pods are using PVCs.
- PVCs remain in terminating state.
Procedure
Disable fencing on the Hub cluster.
Edit the DRCluster resource for this cluster, replacing <drcluster_name> with a unique name.
$ oc edit drcluster <drcluster_name>
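The exact snippet edited here was not captured above; a minimal sketch, assuming the standard Ramen DRCluster API, is to set spec.clusterFence to Unfenced:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: <drcluster_name>
spec:
  ## Set this field to unfence the cluster (sketch; other spec fields unchanged)
  clusterFence: Unfenced
  [...]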
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited

Gracefully reboot OpenShift Container Platform nodes that were Fenced. A reboot is required to resume the I/O operations after unfencing to avoid any further recovery orchestration failures. Reboot all nodes of the cluster by following the steps in the procedure, Rebooting a node gracefully.

Note: Make sure that all the nodes are initially cordoned and drained before you reboot and perform uncordon operations on the nodes.
After all OpenShift nodes are rebooted and are in a Ready status, verify that all Pods are in a healthy state by running this command on the Primary managed cluster (or whatever cluster has been Unfenced).

$ oc get pods -A | egrep -v 'Running|Completed'

Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
The output for this query should be zero Pods before proceeding to the next step.
Important: If there are Pods still in an unhealthy status because of severed storage communication, troubleshoot and resolve before continuing. Because the storage cluster is external to OpenShift, it also has to be properly recovered after a site outage for OpenShift applications to be healthy.
Alternatively, you can use the OpenShift Web Console dashboards and Overview tab to assess the health of applications and the external ODF storage cluster. The detailed OpenShift Data Foundation dashboard is found by navigating to Storage → Data Foundation.
Verify that the Unfenced cluster is in a healthy state. Validate the fencing status in the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with a unique name.

$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'

Example output:
Unfenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are NOT in the blocklist.
$ ceph osd blocklist ls

Ensure that you do not see the IPs added during fencing.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Relocate application.
- When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate in case of a disaster.
- By default, the subscription group that will deploy the application resources is selected. Click the Select subscription group dropdown to verify the default selection or modify this setting.
Check the status of the Relocation readiness.
-
If the status is
Ready with a green tick, it indicates that the target cluster is ready for relocation to start. Proceed to step 7.
If the status is
UnknownorNot ready, then wait until the status changes toReady.
-
If the status is
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link.
3.16. Relocating an ApplicationSet-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
- Navigate to RHACM console → Infrastructure → Clusters → Cluster list tab.
- Check the status of both the managed clusters individually before performing relocate operation.
Verify that applications were cleaned up from the cluster before unfencing it:
- Ensure that either there are no pods, or that no pods are using PVCs.
- PVCs remain in terminating state.
Procedure
Disable fencing on the Hub cluster.
Edit the DRCluster resource for this cluster, replacing <drcluster_name> with a unique name.
$ oc edit drcluster <drcluster_name>
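As in the previous relocation procedure, a minimal sketch of the change (assuming the standard Ramen DRCluster API) is to set spec.clusterFence to Unfenced:

apiVersion: ramendr.openshift.io/v1alpha1
kind: DRCluster
metadata:
  name: <drcluster_name>
spec:
  ## Set this field to unfence the cluster (sketch; other spec fields unchanged)
  clusterFence: Unfenced
  [...]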
Example output:

drcluster.ramendr.openshift.io/ocp4perf1 edited

Gracefully reboot OpenShift Container Platform nodes that were Fenced. A reboot is required to resume the I/O operations after unfencing to avoid any further recovery orchestration failures. Reboot all nodes of the cluster by following the steps in the procedure, Rebooting a node gracefully.

Note: Make sure that all the nodes are initially cordoned and drained before you reboot and perform uncordon operations on the nodes.
After all OpenShift nodes are rebooted and are in a Ready status, verify that all Pods are in a healthy state by running this command on the Primary managed cluster (or whatever cluster has been Unfenced).

$ oc get pods -A | egrep -v 'Running|Completed'

Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE
The output for this query should be zero Pods before proceeding to the next step.
Important: If there are Pods still in an unhealthy status because of severed storage communication, troubleshoot and resolve before continuing. Because the storage cluster is external to OpenShift, it also has to be properly recovered after a site outage for OpenShift applications to be healthy.
Alternatively, you can use the OpenShift Web Console dashboards and Overview tab to assess the health of applications and the external ODF storage cluster. The detailed OpenShift Data Foundation dashboard is found by navigating to Storage → Data Foundation.
Verify that the Unfenced cluster is in a healthy state. Validate the fencing status in the Hub cluster for the Primary managed cluster, replacing <drcluster_name> with a unique name.

$ oc get drcluster.ramendr.openshift.io <drcluster_name> -o jsonpath='{.status.phase}{"\n"}'

Example output:
Unfenced
Log in to your Ceph cluster and verify that the IPs that belong to the OpenShift Container Platform cluster nodes are NOT in the blocklist.
$ ceph osd blocklist ls

Ensure that you do not see the IPs added during fencing.
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Relocate application.
- When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate in case of a disaster.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the relocation status associated with the policy in use with the application.
Disaster recovery protection for discovered applications is a Technology Preview feature and is subject to Technology Preview support limitations. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information, see Technology Preview Features Support Scope.
3.17. Hub recovery using Red Hat Advanced Cluster Management [Technology preview]
When your setup has active and passive Red Hat Advanced Cluster Management for Kubernetes (RHACM) hub clusters, and in case where the active hub is down, you can use the passive hub to failover or relocate the disaster recovery protected workloads.
Hub recovery for Metro-DR is a Technology Preview feature and is subject to Technology Preview support limitations. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information, see Technology Preview Features Support Scope.
3.17.1. Configuring passive hub cluster
To perform hub recovery in case the active hub is down or unreachable, follow the procedure in this section to configure the passive hub cluster and then failover or relocate the disaster recovery protected workloads.
Procedure
Ensure that the RHACM operator and MultiClusterHub are installed on the passive hub cluster. See RHACM installation guide for instructions.

After the operator is successfully installed, the web console automatically reloads to apply the changes. During this process, a temporary error message might appear on the page; this is expected and disappears after the refresh completes.
- Before hub recovery, configure backup and restore. See Backup and restore topics of RHACM Business continuity guide.
- Install the multicluster orchestrator (MCO) operator along with Red Hat OpenShift GitOps operator on the passive RHACM hub prior to the restore. For instructions to restore your RHACM hub, see Installing OpenShift Data Foundation Multicluster Orchestrator operator.
- Ensure that .spec.cleanupBeforeRestore is set to None for the Restore.cluster.open-cluster-management.io resource. For details, see Restoring passive resources while checking for backups chapter of RHACM documentation.
- If SSL access across clusters was configured manually during setup, then re-configure SSL access across clusters. For instructions, see Configuring SSL access across clusters chapter.
On the passive hub, add a label to the openshift-operators namespace to enable basic monitoring of the VolumeSynchronizationDelay alert using this command. For alert details, see Disaster recovery alerts chapter.

$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
3.17.2. Switching to passive hub cluster
Use this procedure when active hub is down or unreachable.
Procedure
During the restore procedure, to avoid eviction of resources when ManifestWorks are not regenerated correctly, you can enlarge the AppliedManifestWork eviction grace period. On the passive hub cluster, check for an existing global KlusterletConfig.

- If a global KlusterletConfig exists, edit it and set the appliedManifestWorkEvictionGracePeriod parameter to a larger value, for example, 24 hours or more.
- If a global KlusterletConfig does not exist, create the KlusterletConfig using a yaml such as the sketch shown after this list.

The configuration will be propagated to all the managed clusters automatically.
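The yaml itself was not captured above; a minimal sketch, assuming the RHACM global KlusterletConfig resource, is shown below (the grace period value is an example; use 24 hours or more as noted):

apiVersion: config.open-cluster-management.io/v1alpha1
kind: KlusterletConfig
metadata:
  name: global
spec:
  # Example value; any sufficiently large duration such as 24h or more
  appliedManifestWorkEvictionGracePeriod: "24h"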
Restore the backups on the passive hub cluster. For information, see Restoring a hub cluster from backup.
Important: Recovering a failed hub to its passive instance only restores applications and their DR protected state to the last scheduled backup. Any application that was DR protected after the last scheduled backup would need to be protected again on the new hub.
Verify that the restore is complete.
$ oc -n <restore-namespace> wait restore <restore-name> --for=jsonpath='{.status.phase}'=Finished --timeout=120s
- Wait until DRPolicy validation succeeds.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with a unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'

Example output:
Succeeded
- Refresh the RHACM console to make the DR monitoring dashboard tab accessible if it was enabled on the Active hub cluster.
Verify the DRPC output using the following command on the new hub cluster:
$ oc get drpc -A -o wide

If PROGRESSION shows a status of PAUSED, administrative intervention is required to unpause it. PROGRESSION enters the PAUSED state under the following conditions:

- Cluster Query Failure: None of the clusters were successfully queried during the DRPC reconciliation. This situation can occur during hub recovery.
- Action Mismatch: The DRPC action differs from the queried VRG action.
- Cluster Mismatch: The DRPC action and the VRG action are the same, but the Primary VRG is found in a different cluster than the one expected by the DRPC.

Important: If you cannot diagnose and resolve the cause of the pause, contact Red Hat Customer Support.

If PROGRESSION is in either Completed or Cleaning Up, it is safe to proceed.
- Edit the global KlusterletConfig on the new hub and remove the appliedManifestWorkEvictionGracePeriod parameter and its value.

Depending on whether only the active hub cluster, or both the active hub cluster and the primary managed cluster, had been down, follow the next steps based on your scenario:
- If only the active hub cluster had been down, and if the managed clusters are still accessible, no further action is required.
If the primary managed cluster had been down, along with the active hub cluster, you need to fail over the workloads from the primary managed cluster to the secondary managed cluster.
For failover instructions, based on your workload type, see Subscription-based applications or ApplicationSet-based applications.
Verify that the failover is successful. If the Primary managed cluster is also down, then the PROGRESSION status for the workload would be in the Cleaning Up phase until the down Primary managed cluster is back online and successfully imported into the RHACM console.

On the passive hub cluster, run the following command to check the PROGRESSION status.
$ oc get drpc -o wide -A
Chapter 4. Regional-DR solution for OpenShift Data Foundation
This section of the guide provides details of the Regional Disaster Recovery (Regional-DR) steps and commands necessary to be able to failover an application from one OpenShift Container Platform cluster to another and then failback the same application to the original primary cluster. In this case the OpenShift Container Platform clusters will be created or imported using Red Hat Advanced Cluster Management (RHACM).
Though the Regional-DR solution is built on asynchronous data replication and hence could incur some data loss, it provides protection against a broad set of failures.
This is a general overview of the Regional-DR steps required to configure and execute OpenShift Disaster Recovery (ODR) capabilities using OpenShift Data Foundation and RHACM across two distinct OpenShift Container Platform clusters separated by distance. In addition to these two clusters called managed clusters, a third OpenShift Container Platform cluster is required that will be the Red Hat Advanced Cluster Management (RHACM) hub cluster.
Regional‑DR is currently supported only on VMware and bare metal environments.
You can now easily set up Regional disaster recovery solutions for workloads based on OpenShift virtualization technology using OpenShift Data Foundation. For more information, see the knowledgebase article.
4.1. Components of Regional-DR solution
Regional-DR is composed of Red Hat Advanced Cluster Management for Kubernetes and OpenShift Data Foundation components to provide application and data mobility across Red Hat OpenShift Container Platform clusters.
Red Hat Advanced Cluster Management for Kubernetes
Red Hat Advanced Cluster Management (RHACM) provides the ability to manage multiple clusters and application lifecycles. Hence, it serves as a control plane in a multi-cluster environment.
RHACM is split into two parts:
- RHACM Hub: components that run on the multi-cluster control plane.
- Managed clusters: components that run on the clusters that are managed.
For more information about this product, see RHACM documentation and the RHACM “Manage Applications” documentation.
OpenShift Data Foundation
OpenShift Data Foundation provides the ability to provision and manage storage for stateful applications in an OpenShift Container Platform cluster.
OpenShift Data Foundation is backed by Ceph as the storage provider, whose lifecycle is managed by Rook in the OpenShift Data Foundation component stack. Ceph-CSI provides the provisioning and management of Persistent Volumes for stateful applications.
OpenShift Data Foundation stack is now enhanced with the following abilities for disaster recovery:
- Enable RBD block pools for mirroring across OpenShift Data Foundation instances (clusters)
- Ability to mirror specific images within an RBD block pool
- Provides csi-addons to manage per Persistent Volume Claim (PVC) mirroring
OpenShift DR
OpenShift DR is a set of orchestrators to configure and manage stateful applications across a set of peer OpenShift clusters which are managed using RHACM and provides cloud-native interfaces to orchestrate the life-cycle of an application’s state on Persistent Volumes. These include:
- Protecting an application and its state relationship across OpenShift clusters
- Failing over an application and its state to a peer cluster
- Relocating an application and its state to the previously deployed cluster
OpenShift DR is split into three components:
- ODF Multicluster Orchestrator: Installed on the multi-cluster control plane (RHACM Hub), it orchestrates configuration and peering of OpenShift Data Foundation clusters for Metro and Regional DR relationships
- OpenShift DR Hub Operator: Automatically installed as part of ODF Multicluster Orchestrator installation on the hub cluster to orchestrate failover or relocation of DR enabled applications.
- OpenShift DR Cluster Operator: Automatically installed on each managed cluster that is part of a Metro and Regional DR relationship to manage the lifecycle of all PVCs of an application.
4.2. Regional-DR deployment workflow
This section provides an overview of the steps required to configure and deploy Regional-DR capabilities using the latest version of Red Hat OpenShift Data Foundation across two distinct OpenShift Container Platform clusters. In addition to two managed clusters, a third OpenShift Container Platform cluster will be required to deploy the Red Hat Advanced Cluster Management (RHACM).
To configure your infrastructure, perform the below steps in the order given:
- Ensure that the requirements across the three clusters that are part of the DR solution, the Hub, Primary, and Secondary OpenShift Container Platform clusters, are met. See Requirements for enabling Regional-DR.
- Install OpenShift Data Foundation operator and create a storage system on Primary and Secondary managed clusters. See Creating OpenShift Data Foundation cluster on managed clusters.
- Install the ODF Multicluster Orchestrator on the Hub cluster. See Installing ODF Multicluster Orchestrator on Hub cluster.
- Configure SSL access between the Hub, Primary and Secondary clusters. See Configuring SSL access across clusters.
Create a DRPolicy resource for use with applications requiring DR protection across the Primary and Secondary clusters. See Creating Disaster Recovery Policy on Hub cluster.
NoteThere can be more than a single policy.
Testing your disaster recovery solution with:
ApplicationSet-based application:
- Create sample applications. See Creating ApplicationSet-based applications.
- Test failover and relocate operations using the sample application between managed clusters. See ApplicationSet-based application failover and relocating ApplicationSet-based application.
Discovered applications
- Ensure all requirements mentioned in Prerequisites are addressed. See Prerequisites for disaster recovery protection of discovered applications
- Create a sample discovered application. See Creating a sample discovered application
- Enroll the discovered application. See Enrolling a sample discovered application for disaster recovery protection
- Test failover and relocate. See Discovered application failover and relocate
4.3. Requirements for enabling Regional-DR
The prerequisites to installing a disaster recovery solution supported by Red Hat OpenShift Data Foundation are as follows:
You must have three OpenShift clusters that have network reachability between them:
- Hub cluster where Red Hat Advanced Cluster Management (RHACM) for Kubernetes operator is installed.
- Primary managed cluster where OpenShift Data Foundation is running.
- Secondary managed cluster where OpenShift Data Foundation is running.
Note: For configuring hub recovery setup, you need a fourth cluster which acts as the passive hub. The primary managed cluster (Site-1) can be co-situated with the active RHACM hub cluster while the passive hub cluster is situated along with the secondary managed cluster (Site-2). Alternatively, the active RHACM hub cluster can be placed in a neutral site (Site-3) that is not impacted by the failures of either the primary managed cluster at Site-1 or the secondary cluster at Site-2. In this situation, if a passive hub cluster is used it can be placed with the secondary cluster at Site-2. For more information, see Configuring passive hub cluster for hub recovery.
Ensure that the RHACM operator and MultiClusterHub are installed on the Hub cluster. See the RHACM installation guide for instructions and the Submariner networking requirements table for network prerequisites.
After the operator is successfully installed, the web console automatically reloads to apply the changes. During this process, a temporary error message might appear on the page and this is expected and disappears after the refresh completes.
Ensure that application traffic routing and redirection are configured appropriately.
On the Hub cluster
- Navigate to All Clusters → Infrastructure → Clusters.
- Import or create the Primary managed cluster and the Secondary managed cluster using the RHACM console.
- Choose the appropriate options for your environment.
For instructions, see Creating a cluster and Importing a target managed cluster to the hub cluster.
Connect the private OpenShift cluster and service networks using the RHACM Submariner add-ons. Verify that the two clusters have non-overlapping service and cluster private networks. Otherwise, ensure that the Globalnet is enabled during the Submariner add-ons installation.
Run the following command for each of the managed clusters to determine if Globalnet needs to be enabled. The following example is for non-overlapping cluster and service networks so Globalnet would not be enabled.
$ oc get networks.config.openshift.io cluster -o json | jq .spec

Run the command on both the Primary cluster and the Secondary cluster and compare the clusterNetwork and serviceNetwork ranges to determine whether they overlap.
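The original example outputs were lost in conversion; the following illustrative outputs (all CIDRs are placeholders) show non-overlapping cluster and service networks, for which Globalnet is not needed:

Example output for Primary cluster (illustrative):

{
  "clusterNetwork": [
    {
      "cidr": "10.5.0.0/16",
      "hostPrefix": 23
    }
  ],
  "networkType": "OVNKubernetes",
  "serviceNetwork": [
    "172.30.0.0/16"
  ]
}

Example output for Secondary cluster (illustrative):

{
  "clusterNetwork": [
    {
      "cidr": "10.6.0.0/16",
      "hostPrefix": 23
    }
  ],
  "networkType": "OVNKubernetes",
  "serviceNetwork": [
    "172.31.0.0/16"
  ]
}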
Additionally, if Submariner and OpenShift Data Foundation are already installed on the managed clusters, use the OpenShift Data Foundation command line interface (CLI) tool to get additional information about the clusters. This information can determine the need for enabling Globalnet during Submariner installation based on the clusters' service and private networks.
Download the OpenShift Data Foundation CLI tool from the customer portal.
Run the following command on one of the two managed clusters, where <PeerManagedClusterName(ClusterID)> is the name of the peer OpenShift Data Foundation cluster:

$ odf get dr-prereq <PeerManagedClusterName(ClusterID)>

If Submariner is not installed with Globalnet on clusters with overlapping services, the following output shows:
Info: Submariner is installed. Info: Globalnet is required. Info: Globalnet is not enabled.
Note: This requires Submariner to be uninstalled and then reinstalled with Globalnet enabled.
If Submariner is installed with Globalnet on clusters with overlapping services, the following output shows:
Info: Submariner is installed. Info: Globalnet is required. Info: Globalnet is enabled.
If Submariner is installed without Globalnet on clusters with non-overlapping services, the following output shows:
Info: Submariner is installed. Info: Globalnet is not required. Info: Globalnet is not enabled.
If Submariner is installed with Globalnet on clusters with non-overlapping services, the following output shows:
Info: Submariner is installed. Info: Globalnet is not required. Info: Globalnet is enabled.
For more information, see Submariner documentation.
4.4. Creating an OpenShift Data Foundation cluster on managed clusters
In order to configure storage replication between the two OpenShift Container Platform clusters, create an OpenShift Data Foundation storage system after you install the OpenShift Data Foundation operator.
Refer to OpenShift Data Foundation deployment guides and instructions that are specific to your infrastructure (VMware, Bare metal).
Procedure
Install and configure the latest OpenShift Data Foundation cluster on each of the managed clusters.
For information about the OpenShift Data Foundation deployment, refer to your infrastructure specific deployment guides (VMware, Bare metal).
Validate the successful deployment of OpenShift Data Foundation on each managed cluster with the following command:
$ oc get storagecluster -n openshift-storage ocs-storagecluster -o jsonpath='{.status.phase}{"\n"}'

For the Multicloud Gateway (MCG):

$ oc get noobaa -n openshift-storage noobaa -o jsonpath='{.status.phase}{"\n"}'

If the status result is Ready for both queries on the Primary managed cluster and the Secondary managed cluster, then continue with the next step.

Verify the status of the storage cluster:
- On the OpenShift Web Console, navigate to Storage → Data Foundation → Storage System.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- In the Status card of the Block and File tab, verify that the Storage Cluster has a green tick.
[Optional] If Globalnet was enabled when Submariner was installed, then edit the StorageCluster after the OpenShift Data Foundation install finishes.

For Globalnet networks, manually edit the StorageCluster yaml to add the clusterID and set enabled to true. Replace <clustername> with your RHACM imported or newly created managed cluster name. Edit the StorageCluster on both the Primary managed cluster and the Secondary managed cluster.

Warning: Do not make this change in the StorageCluster unless you enabled Globalnet when Submariner was installed.

$ oc edit storagecluster -o yaml -n openshift-storage

spec:
  network:
    multiClusterService:
      clusterID: <clustername>
      enabled: true

Important: If multiClusterService is enabled, it cannot be disabled or undone as it fails over the MONs and restarts the OSDs with GlobalNet IP addresses which cannot be changed once assigned.

After the above changes are made,
- Wait for the OSD pods to restart and OSD services to be created.
- Wait for all MONS to failover.
Ensure that the MONS and OSD services are exported.
$ oc get serviceexport -n openshift-storage

- Ensure that the cluster is in a Ready state and cluster health has a green tick indicating Health ok. Verify using step 3.
4.5. Installing OpenShift Data Foundation Multicluster Orchestrator operator
OpenShift Data Foundation Multicluster Orchestrator is a controller that is installed from OpenShift Container Platform’s OperatorHub on the Hub cluster.
Procedure
- On the Hub cluster, navigate to OperatorHub and use the keyword filter to search for ODF Multicluster Orchestrator.
- Click ODF Multicluster Orchestrator tile.
Keep all default settings and click Install.
Ensure that the operator resources are installed in the openshift-operators project and available to all namespaces.

Note: The ODF Multicluster Orchestrator also installs the Openshift DR Hub Operator on the RHACM hub cluster as a dependency.

Verify that the operator Pods are in a Running state. The OpenShift DR Hub operator is also installed at the same time in the openshift-operators namespace.

$ oc get pods -n openshift-operators

Example output:
NAME READY STATUS RESTARTS AGE odf-multicluster-console-6845b795b9-blxrn 1/1 Running 0 4d20h odfmo-controller-manager-f9d9dfb59-jbrsd 1/1 Running 0 4d20h ramen-hub-operator-6fb887f885-fss4w 2/2 Running 0 4d20h
4.5.1. [Optional] Connecting storage clusters with the ocs-provider-server ServiceExport
[Optional] If Globalnet was enabled when Submariner was installed, use this section to connect the StorageClusters with the ocs-provider-server ServiceExport.
Procedure
Create a ServiceExport called ocs-provider-server. Copy the following yaml to ocs-provider-server.yaml and create the resource on both the Primary managed cluster and the Secondary managed cluster.

apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: ocs-provider-server
  namespace: openshift-storage
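The command used to create the resource is not shown above; the standard approach is to apply the saved file on each managed cluster, for example:

$ oc create -f ocs-provider-server.yaml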
Add an annotation to the StorageCluster. To add the annotation, run the following command on both the Primary managed cluster and the Secondary managed cluster.

Note: The <managedcluster_name> refers to the ManagedCluster name in RHACM for the Primary managed cluster or the Secondary managed cluster name.

$ oc annotate storagecluster ocs-storagecluster -n openshift-storage ocs.openshift.io/api-server-exported-address=<managedcluster_name>.ocs-provider-server.openshift-storage.svc.clusterset.local:50051
4.6. Configuring SSL access across clusters
Configure network (SSL) access between the primary and secondary clusters so that metadata can be stored on the alternate cluster in a Multicloud Gateway (MCG) object bucket using a secure transport protocol and in the Hub cluster for verifying access to the object buckets.
If all of your OpenShift clusters are deployed using a signed and valid set of certificates for your environment then this section can be skipped.
Procedure
Extract the ingress certificate for the Primary managed cluster and save the output to primary.crt.

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > primary.crt

Extract the ingress certificate for the Secondary managed cluster and save the output to secondary.crt.

$ oc get cm default-ingress-cert -n openshift-config-managed -o jsonpath="{['data']['ca-bundle\.crt']}" > secondary.crt
Create a new ConfigMap file to hold the remote cluster's certificate bundle with filename cm-clusters-crt.yaml.

Note: There could be more or fewer than three certificates for each cluster as shown in this example file. Also, ensure that the certificate contents are correctly indented after you copy and paste from the primary.crt and secondary.crt files that were created before.
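The example file itself was not captured above; a minimal sketch of cm-clusters-crt.yaml, assuming the user-ca-bundle name and openshift-config namespace used by the proxy patch later in this procedure, is:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-ca-bundle
  namespace: openshift-config
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    <copy contents from the primary.crt file here>
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    <copy contents from the secondary.crt file here>
    -----END CERTIFICATE-----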
Create the ConfigMap on the Primary managed cluster, Secondary managed cluster, and the Hub cluster.

$ oc create -f cm-clusters-crt.yaml

Example output:
configmap/user-ca-bundle created
Patch the default proxy resource on the Primary managed cluster, Secondary managed cluster, and the Hub cluster.
$ oc patch proxy cluster --type=merge --patch='{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'

Example output:
proxy.config.openshift.io/cluster patched
4.7. Creating Disaster Recovery Policy on Hub cluster
Openshift Disaster Recovery Policy (DRPolicy) resource specifies OpenShift Container Platform clusters participating in the disaster recovery solution and the desired replication interval. DRPolicy is a cluster scoped resource that users can apply to applications that require Disaster Recovery solution.
The ODF MultiCluster Orchestrator Operator facilitates the creation of each DRPolicy and the corresponding DRClusters through the Multicluster Web console.
Prerequisites
- Ensure that there is a minimum set of two managed clusters.
Procedure
- On the OpenShift console, navigate to All Clusters → Data Services → Disaster recovery.
- On the Overview tab, click Create a disaster recovery policy or you can navigate to Policies tab and click Create DRPolicy.
- Enter Policy name. Ensure that each DRPolicy has a unique name (for example: ocp4bos1-ocp4bos2-5m).
- Select two clusters from the list of managed clusters with which this new policy will be associated.
- Replication policy is automatically set to Asynchronous based on the OpenShift clusters selected and a Sync schedule option becomes available. Set the Sync schedule.

Important: For every desired replication interval a new DRPolicy must be created with a unique name (such as: ocp4bos1-ocp4bos2-10m). The same clusters can be selected but the Sync schedule can be configured with a different replication interval in minutes/hours/days. The minimum is one minute.

Optional: Expand Advanced settings and check the Enable disaster recovery support for restored and cloned PersistentVolumeClaims (For Data Foundation only) checkbox. See Disaster recovery protection for cloned and restored RBD volumes for more details.
Note: This option should only be used with discovered applications.
- Click Create.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with your unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'

Example output:
Succeeded
When a DRPolicy is created, along with it, two DRCluster resources are also created. It could take up to 10 minutes for all three resources to be validated and for the status to show as Succeeded.

Note: Editing of SchedulingInterval, ReplicationClassSelector, VolumeSnapshotClassSelector and DRClusters field values is not supported in the DRPolicy.

Verify the object bucket access from the Hub cluster to both the Primary managed cluster and the Secondary managed cluster.
Get the names of the DRClusters on the Hub cluster.
$ oc get drclusters

Example output:
NAME AGE ocp4bos1 4m42s ocp4bos2 4m42s
Check S3 access to each bucket created on each managed cluster. Use the DRCluster validation command, where <drcluster_name> is replaced with your unique name.
Note: Editing of Region and S3ProfileName field values is not supported in DRClusters.

$ oc get drcluster <drcluster_name> -o jsonpath='{.status.conditions[2].reason}{"\n"}'

Example output:
Succeeded
Note: Make sure to run commands for both DRClusters on the Hub cluster.
Verify that the OpenShift DR Cluster operator installation was successful on the Primary managed cluster and the Secondary managed cluster.
$ oc get csv,pod -n openshift-dr-system | egrep 'odr|ramen'

Example output:
clusterserviceversion.operators.coreos.com/odr-cluster-operator.v4.20.0-rhodf Openshift DR Cluster Operator 4.20.0-rhodf odr-cluster-operator.v4.20.0-rhodf Succeeded pod/ramen-dr-cluster-operator-77bf74849c-rfn85 2/2 Running 0 7h20m
You can also verify that OpenShift DR Cluster Operator is installed successfully on the OperatorHub of each managed cluster.

Verify that the StorageClusterPeer is in a Peered state on the Primary managed cluster and the Secondary managed cluster:

Note: <managedcluster_name> refers to the ManagedCluster name in RHACM. On the Primary managed cluster use the Secondary managed cluster name. On the Secondary managed cluster use the Primary managed cluster name.

$ oc get storageclusterpeer <managedcluster_name>-peer -n openshift-storage -oyaml | yq '.status.state'

Example output:
Peered
Note: When the initial DRPolicy is created, the VolSync operator is installed automatically in the volsync-system project on each managed cluster. VolSync is used to set up volume replication between two clusters to protect CephFS-based PVCs. The replication feature is enabled by default.
Verify the status of the OpenShift Data Foundation mirroring daemon health on the Primary managed cluster and the Secondary managed cluster.
$ oc get cephblockpoolradosnamespaces ocs-storagecluster-cephblockpool-builtin-implicit -n openshift-storage -o jsonpath='{.status.mirroringStatus.summary}{"\n"}'
Example output:
{"daemon_health":"OK","group_health":"OK","group_states":{},"health":"OK","image_health":"OK","image_states":{},"states":{}}{"daemon_health":"OK","group_health":"OK","group_states":{},"health":"OK","image_health":"OK","image_states":{},"states":{}}Copy to Clipboard Copied! Toggle word wrap Toggle overflow ImportantIt could take up to 10 minutes for the
daemon_healthandhealthto go from Warning to OK. If the status does not become OK eventually, then use the RHACM console to verify that the Submariner connection between managed clusters is still in a healthy state. Do not proceed until all values are OK.
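A hedged CLI alternative to the console check is to inspect the Submariner add-on status from the Hub cluster; the add-on and resource names below are assumptions based on a typical RHACM deployment:
$ oc get managedclusteraddons -A | grep submariner
$ oc get managedclusteraddon submariner -n <managedcluster_name> -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'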
4.8. Disaster recovery protection for cloned and restored RBD volumes
Cloned and restored RBD volumes are copy-on-write clones of the corresponding datasources. Such RBD volumes are still linked to the parent volume to optimize space consumption and Ceph cluster performance. In order to be able to mirror RBD images backing such volumes, they need to be separated from the parent RBD image.
Selecting the advanced option in Step 7 of Creating Disaster Recovery Policy on Hub cluster instructs the CephCSI driver to perform the necessary operations on the cloned or restored RBD volumes to prepare them for Disaster Recovery protection. This operation has the following consequences:
- Space consumption may increase up to the total capacity of all cloned and restored RBD volumes being protected. It is recommended to make sure there is enough space available in the cluster to avoid the cluster becoming full (see the capacity check sketch after this list).
- Ceph cluster resource usage increases.
- The time taken for the operation to complete depends on the number of RBD volumes, the size of each volume, Ceph cluster resource availability, and the client I/O load on the RBD volumes. Therefore, the time required to complete the operation cannot be predicted in advance.
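A hedged way to check available capacity before enabling this option, assuming the rook-ceph toolbox pod has been enabled in the openshift-storage namespace:
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph df
Compare the MAX AVAIL value of the pools backing the protected volumes against the total capacity of the cloned and restored RBD volumes you intend to protect.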
4.9. Create sample application for testing disaster recovery solution
The OpenShift Data Foundation disaster recovery (DR) solution supports ApplicationSet-based applications that are managed by RHACM. For more details, see the ApplicationSet documentation.
The following section details how to create an application and apply a DRPolicy to an application.
ApplicationSet-based applications
OpenShift users that do not have cluster-admin permissions cannot create ApplicationSet-based applications.
Regional-DR now supports leveraging CephRBD volumes using non-default replica-2 storage classes that are managed by OpenShift Data Foundation.
4.9.1. ApplicationSet-based applications
4.9.1.1. Creating ApplicationSet-based applications
Prerequisite
- Ensure that the Red Hat OpenShift GitOps operator is installed on all three clusters: Hub cluster, Primary managed cluster and Secondary managed cluster. For instructions, see Installing Red Hat OpenShift GitOps Operator in web console.
On the Hub cluster, ensure that both Primary and Secondary managed clusters are registered to GitOps. For registration instructions, see Registering managed clusters to GitOps. Then check that the Placement used by the GitOpsCluster resource to register both managed clusters has the tolerations to deal with cluster unavailability. You can verify that the following tolerations are added to the Placement using the command oc get placement <placement-name> -n openshift-gitops -o yaml.
tolerations:
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
- key: cluster.open-cluster-management.io/unavailable
  operator: Exists
If the tolerations are not added, see Configuring application placement tolerations for Red Hat Advanced Cluster Management and OpenShift GitOps, or apply them with a patch as sketched after these prerequisites.
- Ensure that you have created the ClusterRoleBinding yaml on both the Primary and Secondary managed clusters. For instructions, see the Prerequisites chapter in the RHACM documentation.
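If the tolerations are missing, one hedged way to add them is to patch the Placement directly on the Hub cluster. This is a sketch only; the Placement name is a placeholder and a merge patch of the list replaces any tolerations already present in your environment:
$ oc patch placement <placement-name> -n openshift-gitops --type merge -p '{"spec":{"tolerations":[{"key":"cluster.open-cluster-management.io/unreachable","operator":"Exists"},{"key":"cluster.open-cluster-management.io/unavailable","operator":"Exists"}]}}'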
Procedure
- On the Hub cluster, navigate to All Clusters → Applications and click Create application.
- Choose the application type as Argo CD ApplicationSet - Pull model.
- In the General step, enter your Application set name.
- Select Argo server openshift-gitops and Requeue time as 180 seconds.
- Click Next.
- In the Repository location for resources section, select Repository type Git. Enter the Git repository URL for the sample application, and the GitHub Branch and Path where the resources busybox Pod and PVC will be created.
- Use the sample application repository as https://github.com/red-hat-storage/ocm-ramen-samples
- Select Revision as main.
- Choose one of the following Paths:
  - workloads/deployment/odr-regional-rbd to use RBD Regional-DR.
  - workloads/deployment/odr-regional-cephfs to use CephFS Regional-DR.
- Enter the Remote namespace value (for example, busybox-sample) and click Next.
Choose the Sync policy settings as per your requirement or go with the default selections, and then click Next.
You can choose one or more options.
- In Label expressions, add a label <name> with its value set to the managed cluster name.
- Click Next.
- Review the setting details and click Submit.
4.9.1.2. [Optional] Resolving consistency group issues after upgrading from 4.19 to 4.20
If your application was created on ODF 4.19 with consistency groups enabled, upgrading to ODF 4.20 will trigger the alert UnsupportedConsistencyGroupingEnabled. Click each alert of this type and locate the impacted application name in the Description section.
This alert appears on the Hub cluster if you have enabled monitoring for disaster recovery.
This alert indicates that the existing consistency group configuration is not supported in 4.20 and requires remediation before normal operations can continue. Follow the steps in this section if this is the case for your application.
Disable disaster recovery after the upgrade:
When DR is disabled for discovered applications, VolumeGroupReplication and VolumeReplication (VR) resources created using the older naming format may not be fully removed, leaving stale VolumeGroupReplication resources in the environment. Check for any stale VolumeGroupReplication resources.
On the managed clusters, run the following:
$ oc get volumegroupreplication -A
If any stale VolumeGroupReplication resources remain after DR has been successfully disabled for all CG-enabled applications, each resource must be manually deleted using the following command:
$ oc delete volumegroupreplication <volumegroupreplication_name> -n <application-namespace>
Remove the VolumeGroupReplicationClass from each managed cluster.
Find the name of the VolumeGroupReplicationClass on each managed cluster:
$ oc get volumegroupreplicationclass
Delete the VolumeGroupReplicationClass on each managed cluster:
$ oc delete volumegroupreplicationclass <volumegroupreplicationclass_name>
- Re-enable DR after confirming that the stale VolumeGroupReplication resources and any VolumeGroupReplicationClass resources have been removed.
4.9.1.3. Apply Data policy to sample ApplicationSet-based application
Prerequisites
- Ensure that both managed clusters referenced in the Data policy are reachable. If not, the application will not be protected for disaster recovery until both clusters are online.
- If you had enabled consistency groups in OpenShift Data Foundation 4.19, follow the instructions in Resolving consistency group issues after upgrading from 4.19 to 4.20.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications.
- Click the Actions menu at the end of the application row to view the list of available actions.
- Click Manage disaster recovery.
- Click Enroll application.
- Select Policy name and click Next.
Select an Application resource and then use the PVC label selector to select the PVC label for the selected application resource.
Note: You can select more than one PVC label for the selected application resources.
- Click Next.
- In the Enroll managed application modal, review the policy configuration details and click Assign. The newly assigned Data policy details are displayed on the Manage disaster recovery modal.
Verify that you can view the assigned policy details on the Applications page.
On the Applications page, navigate to the DR Status column and view the status. The status will either be healthy or critical. You can click the status to view the last synced time for application volumes and the DR policy assigned.
Note: It may take a few minutes after disaster recovery is assigned for the DR Status to move from critical to healthy.
Failover and relocate status can be viewed in the DR Status column when initiated. These statuses can be clicked to view more details about the action.
DR Status shows either healthy, warning, or critical when failover or relocate is not taking place.
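The same information can also be read from the command line. A hedged example that lists all DRPlacementControl resources on the Hub cluster together with their current state and sync details:
$ oc get drpc -A -o wide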
Optional: For RADOS block device (RBD), verify volumereplication and volumereplicationgroup on the primary cluster.
$ oc get volumereplication -A
Example output:
NAMESPACE        NAME                                      AGE    VOLUMEREPLICATIONCLASS                  SOURCEKIND               SOURCENAME                                                     DESIREDSTATE   CURRENTSTATE
busybox-sample   vr-bc4a166e-13bc-45cd-99f0-8bd516f7ffcc   112s   rbd-volumereplicationclass-1625360775   VolumeGroupReplication   vgr-07a0050bee1f25720f5cae8d5daa6230-busybox-placement-drpc   primary        Primary
$ oc get volumereplicationgroup -A
Example output:
NAMESPACE        NAME                     DESIREDSTATE   CURRENTSTATE
busybox-sample   busybox-placement-drpc   primary        Primary
4.9.2. Deleting sample application
This section provides instructions for deleting the sample application busybox using the RHACM console.
When deleting a DR protected application, access to both clusters that belong to the DRPolicy is required. This is to ensure that all protected API resources and resources in the respective S3 stores are cleaned up as part of removing the DR protection. If access to one of the clusters is not healthy, the DRPlacementControl resource for the application on the hub remains in the Deleting state after deletion.
Prerequisites
- These instructions to delete the sample application should not be executed until the failover and relocate testing is completed and the application is ready to be removed from RHACM and the managed clusters.
Procedure
- On the RHACM console, navigate to Applications.
- Search for the sample application to be deleted (for example, busybox).
- Click the Action Menu (⋮) next to the application you want to delete.
Click Delete application.
When Delete application is selected, a new screen appears asking whether the application-related resources should also be deleted.
- Select Remove application related resources checkbox to delete the Subscription and PlacementRule.
- Click Delete. This will delete the busybox application on the Primary managed cluster (or whatever cluster the application was running on).
In addition to the resources deleted using the RHACM console, delete the DRPlacementControl if it is not auto-deleted after deleting the busybox application.
Log in to the OpenShift Web console for the Hub cluster and navigate to Installed Operators for the project busybox-sample. For ApplicationSet applications, select the project as openshift-gitops.
- Click OpenShift DR Hub Operator and then click the DRPlacementControl tab.
- Click the Action Menu (⋮) next to the busybox application DRPlacementControl that you want to delete.
- Click Delete DRPlacementControl.
- Click Delete.
This process can be used to delete any application with a DRPlacementControl resource.
4.10. Subscription-based application failover between managed clusters
Failover is a process that transitions an application from a primary cluster to a secondary cluster in the event of a primary cluster failure. While failover provides the ability for the application to run on the secondary cluster with minimal interruption, making an uninformed failover decision can have adverse consequences, such as complete data loss in the event of unnoticed replication failure from primary to secondary cluster. If a significant amount of time has gone by since the last successful replication, it’s best to wait until the failed primary is recovered.
LastGroupSyncTime is a critical metric that reflects the time since the last successful replication occurred for all PVCs associated with an application. In essence, it measures the synchronization health between the primary and secondary clusters. So, prior to initiating a failover from one cluster to another, check for this metric and only initiate the failover if the LastGroupSyncTime is within a reasonable time in the past.
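Before initiating a failover, you can turn the reported lastGroupSyncTime into an age and compare it against your replication interval. A minimal sketch, assuming GNU date on the workstation and a single DRPC of interest (the placeholder names are not from the original document):
$ LAST_SYNC=$(oc get drpc <drpc_name> -n <namespace> -o jsonpath='{.status.lastGroupSyncTime}')
$ echo "Seconds since last group sync: $(( $(date -u +%s) - $(date -u -d "$LAST_SYNC" +%s) ))"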
During the course of failover the Ceph-RBD mirror deployment on the failover cluster is scaled down to ensure a clean failover for volumes that are backed by Ceph-RBD as the storage provisioner.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.
- Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
Check the status of both the managed clusters individually before performing failover operation.
However, failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Run the following command on the Hub cluster to check if lastGroupSyncTime is within an acceptable data loss window when compared to the current time.
$ oc get drpc -o yaml -A | grep lastGroupSyncTime
Example output:
[...] lastGroupSyncTime: "2023-07-10T12:40:10Z"
Procedure
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Failover application.
- After the Failover application modal is shown, select policy and target cluster to which the associated application will failover in case of a disaster.
Click the Select subscription group dropdown to verify the default selection or modify this setting.
By default, the subscription group that replicates for the application resources is selected.
Check the status of the Failover readiness.
- If the status is Ready with a green tick, it indicates that the target cluster is ready for failover to start. Proceed to step 7.
- If the status is Unknown or Not ready, then wait until the status changes to Ready.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the failover is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate. The busybox application is now failing over to the Secondary-managed cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link.
- Verify that you can see one or more policy names and the ongoing activities (Last sync time and Activity status) associated with the policy in use with the application.
4.11. ApplicationSet-based application failover between managed clusters
Failover is a process that transitions an application from a primary cluster to a secondary cluster in the event of a primary cluster failure. While failover provides the ability for the application to run on the secondary cluster with minimal interruption, making an uninformed failover decision can have adverse consequences, such as complete data loss in the event of unnoticed replication failure from primary to secondary cluster. If a significant amount of time has gone by since the last successful replication, it’s best to wait until the failed primary is recovered.
LastGroupSyncTime is a critical metric that reflects the time since the last successful replication occurred for all PVCs associated with an application. In essence, it measures the synchronization health between the primary and secondary clusters. So, prior to initiating a failover from one cluster to another, check for this metric and only initiate the failover if the LastGroupSyncTime is within a reasonable time in the past.
During the course of failover the Ceph-RBD mirror deployment on the failover cluster is scaled down to ensure a clean failover for volumes that are backed by Ceph-RBD as the storage provisioner.
Prerequisites
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update.
- Navigate to the RHACM console → Infrastructure → Clusters → Cluster list tab.
Check the status of both the managed clusters individually before performing failover operation.
However, failover operation can still be performed when the cluster you are failing over to is in a Ready state.
Run the following command on the Hub cluster to check if lastGroupSyncTime is within an acceptable data loss window when compared to the current time.
$ oc get drpc -o yaml -A | grep lastGroupSyncTime
Example output:
[...] lastGroupSyncTime: "2023-07-10T12:40:10Z"
Procedure
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Failover application.
When the Failover application modal is shown, verify the details presented are correct and check the status of the Failover readiness. If the status is Ready with a green tick, it indicates that the target cluster is ready for failover to start.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the failover is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as FailedOver for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the ongoing activities associated with the policy in use with the application.
4.12. Relocating Subscription-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
- Navigate to RHACM console → Infrastructure → Clusters → Cluster list tab.
- Check the status of both the managed clusters individually before performing relocate operation.
Perform relocate when lastGroupSyncTime is within the replication interval (for example, 5 minutes) when compared to the current time. This is recommended to minimize the Recovery Time Objective (RTO) for any single application.
Run this command on the Hub cluster:
$ oc get drpc -o yaml -A | grep lastGroupSyncTime
Example output:
[...] lastGroupSyncTime: "2023-07-10T12:40:10Z"
Compare the output time (UTC) to the current time to validate that all lastGroupSyncTime values are within their application replication interval. If not, wait to relocate until this is true for all lastGroupSyncTime values.
Procedure
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Relocate application.
- When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate.
- By default, the subscription group that will deploy the application resources is selected. Click the Select subscription group dropdown to verify the default selection or modify this setting.
Check the status of the Relocation readiness.
- If the status is Ready with a green tick, it indicates that the target cluster is ready for relocation to start. Proceed to step 7.
- If the status is Unknown or Not ready, then wait until the status changes to Ready.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the relocate is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, click the View more details link.
- Verify that you can see one or more policy names and the ongoing activities (Last sync time and Activity status) associated with the policy in use with the application.
4.13. Relocating an ApplicationSet-based application between managed clusters
Relocate an application to its preferred location when all managed clusters are available.
Prerequisite
- If your setup has active and passive RHACM hub clusters, see Hub recovery using Red Hat Advanced Cluster Management.
When the primary cluster is in a state other than Ready, check the actual status of the cluster as it might take some time to update. Relocate can only be performed when both primary and preferred clusters are up and running.
- Navigate to RHACM console → Infrastructure → Clusters → Cluster list tab.
- Check the status of both the managed clusters individually before performing relocate operation.
Perform relocate when lastGroupSyncTime is within the replication interval (for example, 5 minutes) when compared to the current time. This is recommended to minimize the Recovery Time Objective (RTO) for any single application.
Run this command on the Hub cluster:
$ oc get drpc -o yaml -A | grep lastGroupSyncTime
Example output:
[...] lastGroupSyncTime: "2023-07-10T12:40:10Z"
Compare the output time (UTC) to the current time to validate that all lastGroupSyncTime values are within their application replication interval. If not, wait to relocate until this is true for all lastGroupSyncTime values.
Procedure
- On the Hub cluster, navigate to Applications.
- Click the Actions menu at the end of application row to view the list of available actions.
- Click Relocate application.
When the Relocate application modal is shown, select the policy and the target cluster to which the associated application will relocate.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the relocate is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate. The busybox resources are now created on the target cluster.
- Close the modal window and track the status using the Data policy column on the Applications page.
Verify that the activity status shows as Relocated for the application.
- Navigate to the Applications → Overview tab.
- In the Data policy column, click the policy link for the application you applied the policy to.
- On the Data policy popover, verify that you can see one or more policy names and the relocation status associated with the policy in use with the application.
4.14. Disaster recovery protection for discovered applications
Red Hat OpenShift Data Foundation now provides disaster recovery (DR) protection and support for workloads that are deployed in one of the managed clusters directly without using Red Hat Advanced Cluster Management (RHACM). These workloads are called discovered applications.
Workloads deployed using RHACM are termed managed applications, while those deployed directly on one of the managed clusters without using RHACM are called discovered applications. Although RHACM displays the details of both types of workloads, it does not manage the lifecycle (create, delete, edit) of discovered applications.
4.14.1. Prerequisites for disaster recovery protection of discovered applications
This section provides instructions to guide you through the prerequisites for protecting discovered applications. This includes tasks such as assigning a data policy and initiating DR actions such as failover and relocate.
- Ensure that all the DR configurations have been installed on the Primary managed cluster and the Secondary managed cluster.
Install the OADP 1.4 operator (or greater).
Note: Any version before OADP 1.4 will not work for protecting discovered applications.
- On the Primary and Secondary managed clusters, navigate to OperatorHub and use the keyword filter to search for OADP.
- Click the OADP tile.
- Keep all default settings and click Install. Ensure that the operator resources are installed in the openshift-adp project.
Note: If OADP 1.4 is installed after DR configuration has been completed, then the ramen-dr-cluster-operator pods on the Primary managed cluster and the Secondary managed cluster in namespace openshift-dr-system must be restarted (deleted and recreated).
[Optional] Add caCertificates to the ramen-hub-operator-config ConfigMap.
Configure network (SSL) access between the primary and secondary clusters so that metadata can be stored on the alternate cluster in a Multicloud Gateway (MCG) object bucket using a secure transport protocol, and in the Hub cluster for verifying access to the object buckets.
Note: If all of your OpenShift clusters are deployed using a signed and valid set of certificates for your environment, then this section can be skipped.
If you are using self-signed certificates, then you have already created a ConfigMap named user-ca-bundle in the openshift-config namespace and added this ConfigMap to the default Proxy cluster resource. This means you need to add the caCertificates parameter to the configmap ramen-hub-operator-config with the encoded value.
Find the encoded value for the caCertificates.
$ oc get configmap user-ca-bundle -n openshift-config -o jsonpath="{['data']['ca-bundle\.crt']}" | base64 -w 0
Add this base64 encoded value to the configmap ramen-hub-operator-config on the Hub cluster. The example below shows where to add caCertificates.
$ oc edit configmap ramen-hub-operator-config -n openshift-operators
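A hedged sketch of where the caCertificates value typically lands inside the ConfigMap data; the ramen_manager_config.yaml key and the s3StoreProfiles layout shown here are assumptions, so match them to the profiles that already exist in your ConfigMap and repeat the parameter for each profile:
data:
  ramen_manager_config.yaml: |
    [...]
    s3StoreProfiles:
    - s3ProfileName: <profile for the Primary managed cluster>
      s3Bucket: <bucket name>
      s3CompatibleEndpoint: <MCG endpoint>
      s3Region: <region>
      s3SecretRef:
        name: <secret name>
      caCertificates: <base64 encoded value from the previous command>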
Verify that there are DR secrets created in the OADP operator default namespace openshift-adp on the Primary managed cluster and the Secondary managed cluster. The DR secrets that were created when the first DRPolicy was created will be similar to the secrets below. The DR secret name is preceded with the letter v.
$ oc get secrets -n openshift-adp
NAME                                       TYPE     DATA   AGE
v60f2ea6069e168346d5ad0e0b5faa59bb74946f   Opaque   1      3d20h
vcc237eba032ad5c422fb939684eb633822d7900   Opaque   1      3d20h
[...]
Note: There will be one DR-created secret for each managed cluster in the openshift-adp namespace.
Verify if the Data Protection Application (DPA) is already installed on each managed cluster in the OADP namespace openshift-adp. If not already created, then follow the next step to create this resource.
Create the DPA by copying the following YAML definition content to dpa.yaml.
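The full DPA content depends on your environment; a minimal hedged sketch that is commonly used with Ramen-protected discovered applications follows (the plugin list and nodeAgent settings are assumptions, so adjust them to your OADP configuration):
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero
  namespace: openshift-adp
spec:
  backupImages: false
  configuration:
    nodeAgent:
      enable: false
      uploaderType: restic
    velero:
      defaultPlugins:
        - openshift
      noDefaultBackupLocation: true
Create the DPA resource.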
$ oc create -f dpa.yaml -n openshift-adp
dataprotectionapplication.oadp.openshift.io/velero created
Verify that the OADP resources are created and are in Running state.
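A hedged way to confirm this on each managed cluster (exact pod names vary):
$ oc get pods,dataprotectionapplications -n openshift-adp
The velero pod should be in Running state and the DataProtectionApplication should be reconciled.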
4.14.2. Creating a sample discovered application
In order to test failover from the Primary managed cluster to the Secondary managed cluster and relocate for discovered applications, you need a sample application that is installed without using the RHACM create application capability.
Regional-DR now supports leveraging CephRBD volumes using non-default replica-2 storage classes that are managed by OpenShift Data Foundation.
Procedure
Log in to the Primary managed cluster and clone the sample application repository.
$ git clone https://github.com/red-hat-storage/ocm-ramen-samples.git
Verify that you are on the main branch.
$ cd ~/ocm-ramen-samples
$ git branch
* main
Use the correct directory when creating the sample application based on your scenario, metro or regional. To find your directory:
$ ls workloads/deployment | egrep -v 'k8s|base'
odr-metro-rbd
odr-regional-rbd
odr-regional-cephfs
Create a project named busybox-discovered on both the Primary and Secondary managed clusters.
$ oc new-project busybox-discovered
Create the busybox application on the Primary managed cluster. This sample application example is for Regional-DR using a Ceph RBD volume. CephFS volumes can be used as well.
$ oc apply -k workloads/deployment/odr-regional-rbd -n busybox-discovered
persistentvolumeclaim/busybox-pvc created
deployment.apps/busybox created
Note: The OpenShift Data Foundation Disaster Recovery solution now extends protection to discovered applications that span multiple namespaces.
Verify that busybox is running in the correct project on the Primary managed cluster.
$ oc get pods,pvc,deployment -n busybox-discovered
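Optionally, a hedged check that the sample PVC carries the appname=busybox label, which is used later when enrolling the application for disaster recovery protection:
$ oc get pvc -n busybox-discovered --show-labels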
4.14.3. [Optional] Resolving consistency group issues after upgrading from 4.19 to 4.20
If your application was created on ODF 4.19 with consistency groups enabled, upgrading to ODF 4.20 will trigger the alert UnsupportedConsistencyGroupingEnabled. Click each alert of this type and locate the impacted application name in the Description section.
This alert appears on the Hub cluster if you have enabled monitoring for disaster recovery.
This alert indicates that the existing consistency group configuration is not supported in 4.20 and requires remediation before normal operations can continue. Follow the steps in this section if this is the case for your application.
Disable disaster recovery after the upgrade:
When DR is disabled for discovered applications, VolumeGroupReplication and VolumeReplication (VR) resources created using the older naming format may not be fully removed, leaving stale VolumeGroupReplication resources in the environment. Check for any stale VolumeGroupReplication resources.
On the managed clusters, run the following:
$ oc get volumegroupreplication -A
If any stale VolumeGroupReplication resources remain after DR has been successfully disabled for all CG-enabled applications, each resource must be manually deleted using the following command:
$ oc delete volumegroupreplication <volumegroupreplication_name> -n <application-namespace>
Remove the VolumeGroupReplicationClass from each managed cluster.
Find the name of the VolumeGroupReplicationClass on each managed cluster:
$ oc get volumegroupreplicationclass
Delete the VolumeGroupReplicationClass on each managed cluster:
$ oc delete volumegroupreplicationclass <volumegroupreplicationclass_name>
- Re-enable DR after confirming that the stale VolumeGroupReplication resources and any VolumeGroupReplicationClass resources have been removed.
4.14.4. Enrolling a sample discovered application for disaster recovery protection
This section guides you on how to apply an existing DR Policy to a discovered application from the Protected applications tab.
Prerequisites
- Ensure that Disaster Recovery has been configured and that at least one DR Policy has been created.
- If you had enabled consistency groups in OpenShift Data Foundation 4.19, follow the instructions in Resolving consistency group issues after upgrading from 4.19 to 4.20.
Procedure
- On RHACM console, navigate to Disaster recovery → Protected applications tab.
- Click Enroll application to start configuring existing applications for DR protection.
- Select ACM discovered applications.
- In the Namespace page, choose the DR cluster, which is the name of the Primary managed cluster where busybox is installed.
- Select the namespace where the application is installed. For example, busybox-discovered.
Note: If you have a workload spread across multiple namespaces, then you can select all of those namespaces to DR protect.
- Choose a unique Name, for example busybox-rbd, for the discovered application and click Next.
- In the Configuration page, select either Resource label or Recipe.
- Resource label is used to protect your resources, where you can set which resources will be included in the Kubernetes object backup and which volumes' persistent data will be replicated. If you selected Resource label, provide label expressions and a PVC label selector. Choose the label appname=busybox for both the Kubernetes objects and for the PVCs.
- If you selected Recipe, then from the Recipe list select the name of the recipe.
Important: The recipe resource must be created in the application namespace on both managed clusters before enrolling an application for disaster recovery.
- [Optional]: If you are using a recipe and want to add the recipe’s parameters, under Recipe parameters, add the Key and Value (optional), then click Add parameter.
- Click Next.
In the Replication page, select an existing DR Policy and the kubernetes-objects backup interval.
Note: It is recommended to choose the same duration for the PVC data replication and kubernetes-object backup interval (for example, 5 minutes).
- Click Next.
Review the configuration and click Save.
Use the Back button to go back to the screen to correct any issues.
Verify that the Application volumes (PVCs) and the Kubernetes-objects backup have a Healthy status before proceeding to DR Failover and Relocate testing. You can view the status of your Discovered applications on the Protected applications tab.
To see the status of the DRPC, run the following command on the Hub cluster:
$ oc get drpc {drpc_name} -o wide -n openshift-dr-ops
Discovered applications store resources such as DRPlacementControl (DRPC) and Placement on the Hub cluster in a new namespace called openshift-dr-ops. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
To see the status of the VolumeReplicationGroup (VRG) for discovered applications, run the following command on the managed cluster where the busybox application was manually installed.
$ oc get vrg {vrg_name} -n openshift-dr-ops
The VRG resource is stored in the namespace openshift-dr-ops after a DR Policy is assigned to the discovered application. The VRG name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
4.14.5. Discovered application failover and relocate
A protected Discovered application can Failover or Relocate to its peer cluster similar to managed applications. However, there are some additional steps for discovered applications since RHACM does not manage the lifecycle of the application as it does for Managed applications.
This section guides you through the Failover and Relocate process for a protected discovered application.
Never initiate a Failover or Relocate of an application when one or both resource types are in a Warning or Critical status.
4.14.5.1. Failover disaster recovery protected discovered application
This section guides you on how to failover a discovered application which is disaster recovery protected.
Prerequisites
- Ensure that the application namespace is created in both managed clusters (for example, busybox-discovered).
Procedure
- In the RHACM console, navigate to Disaster Recovery → Protected applications tab.
- At the end of the application row, click on the Actions menu and choose to initiate Failover.
In the Failover application modal window, review the status of the application and the target cluster.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the failover is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate. Wait for the Failover process to complete.
Verify that the busybox application is running on the Secondary managed cluster.
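A hedged way to verify this, run on the Secondary managed cluster using the same resources created earlier:
$ oc get pods,pvc,deployment -n busybox-discovered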
Check the progression status of Failover until the result is WaitOnUserToCleanUp. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
$ oc get drpc {drpc_name} -n openshift-dr-ops -o jsonpath='{.status.progression}{"\n"}'
WaitOnUserToCleanUp
Remove the busybox application from the Primary managed cluster to complete the Failover process.
- Navigate to the Protected applications tab. You will see a message to remove the application.
Navigate to the cloned repository for busybox and run the following commands on the Primary managed cluster where you failed over from. Use the same directory that was used to create the application (for example, odr-regional-rbd).
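A hedged sketch of the cleanup, assuming the application was created from the odr-regional-rbd directory as in the earlier example:
$ cd ~/ocm-ramen-samples
$ oc delete -k workloads/deployment/odr-regional-rbd -n busybox-discovered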
- After deleting the application, navigate to the Protected applications tab and verify that the busybox resources are both in Healthy status.
4.14.5.2. Relocate disaster recovery protected discovered application
This section guides you on how to relocate a discovered application which is disaster recovery protected.
Procedure
- In the RHACM console, navigate to Disaster Recovery → Protected applications tab.
- At the end of the application row, click on the Actions menu and choose to initiate Relocate.
In the Relocate application modal window, review the status of the application and the target cluster.
Important: If there are data inconsistencies caused by synchronization delays, a warning message appears stating Inconsistent data on target cluster. This alerts you to the possibility of data loss if the relocate is initiated. The message is no longer displayed when data synchronization is complete.
- Click Initiate.
Check the progression status of Relocate until the result is WaitOnUserToCleanUp. The DRPC name can be identified by the unique Name configured in prior steps (for example, busybox-rbd).
$ oc get drpc {drpc_name} -n openshift-dr-ops -o jsonpath='{.status.progression}{"\n"}'
WaitOnUserToCleanUp
Remove the busybox application from the Secondary managed cluster before the Relocate to the Primary managed cluster is completed.
Navigate to the cloned repository for busybox and, on the Secondary managed cluster where you relocated from, run the same oc delete commands that were used during the Failover cleanup. Use the same directory that was used to create the application (for example, odr-regional-rbd).
- After deleting the application, navigate to the Protected applications tab and verify that the busybox resources are both in Healthy status.
Verify that the busybox application is running on the Primary managed cluster.
4.14.6. Disable disaster recovery for protected applications
This section guides you to disable disaster recovery resources when you want to delete the protected applications or when the application no longer needs to be protected.
Procedure
- Log in to the Hub cluster.
List the DRPlacementControl (DRPC) resources. Each DRPC resource was created when the application was assigned a DR policy.
$ oc get drpc -n openshift-dr-ops
Find the DRPC that has a name that includes the unique identifier that you chose when assigning a DR policy (for example, busybox-rbd) and delete the DRPC.
$ oc delete drpc {drpc_name} -n openshift-dr-ops
List the Placement resources. Each Placement resource was created when the application was assigned a DR policy.
$ oc get placements -n openshift-dr-ops
Find the Placement that has a name that includes the unique identifier that you chose when assigning a DR policy (for example, busybox-rbd-placement-1) and delete the Placement.
$ oc delete placements {placement_name} -n openshift-dr-ops
4.15. Recovering to a replacement cluster with Regional-DR
When there is a failure with the primary cluster, you have the option to repair it, wait for the recovery of the existing cluster, or replace the cluster entirely if it is unrecoverable. This solution guides you through replacing a failed primary cluster with a new cluster and enabling failback (relocate) to the new cluster.
In these instructions, we assume that an RHACM managed cluster must be replaced after the applications have been installed and protected. For purposes of this section, the RHACM managed cluster being replaced is the replacement cluster, the cluster that is not replaced is the surviving cluster, and the new cluster is the recovery cluster.
Prerequisite
- Ensure that the Regional-DR environment has been configured with applications installed using Red Hat Advanced Cluster Management (RHACM).
- Ensure that the applications are assigned a Data policy which protects them against cluster failure.
Procedure
On the surviving cluster, back up the clientProfileMapping:
$ oc get clientprofilemapping ocs-storagecluster -n openshift-storage -o yaml > clientProfileMappingBeforeRecovery.yaml
Failover all protected applications on the failed replacement cluster to the surviving cluster on the RHACM console.
For managed applications:
On the Hub cluster, navigate to Applications. Click the Actions (⋮) and select Failover application.
For discovered applications:
Navigate to All Clusters → Data Services → Disaster recovery → Protected applications tab. Click the Actions (⋮) and select Failover.
Validate that all protected applications are running on the surviving cluster before moving to the next step.
Note: The PROGRESSION state for each managed application DRPlacementControl shows as Cleaning Up. For discovered applications the PROGRESSION will be WaitOnUserToCleanUp. This is to be expected if the replacement cluster is offline or down.
From the Hub cluster, delete the DRCluster for the replacement cluster.
$ oc delete drcluster <drcluster_name> --wait=false
Note: Use --wait=false since the DRCluster will not be deleted until a later step.
Remove disaster recovery for each protected application on the surviving cluster. Perform all the sub-steps on the hub cluster.
For each managed application, edit the Placement and ensure that the surviving cluster is selected:
$ oc edit placement <placement_name> -n <namespace>
Note: For Subscription-based applications, the associated Placement can be found in the same namespace on the hub cluster as on the managed clusters. For ApplicationSet-based applications, the associated Placement can be found in the openshift-gitops namespace on the hub cluster.
For each managed application, edit the PlacementDecision and ensure that the replacement cluster is deleted.
$ oc edit placementdecision <placementdecision_name> -n <namespace> --subresource=status
Note: For Subscription-based applications, the associated PlacementDecision can be found in the same namespace on the hub cluster as on the managed clusters. For ApplicationSet-based applications, the associated PlacementDecision can be found in the openshift-gitops namespace on the hub cluster.
Verify that the s3Profile is removed for the replacement cluster by running the following command on the surviving cluster for each protected application's VolumeReplicationGroup (VRG).
Note: For managed applications, the associated VRG is found in the application namespace. For discovered applications, the associated VRG is in the openshift-dr-ops namespace.
$ oc get vrg -n <vrg_namespace> -o jsonpath='{.items[0].spec.s3Profiles}' | jq
Remove disaster recovery from the applications on the RHACM console:
For managed applications:
- On the Hub cluster, navigate to All Clusters → Applications.
- In the Overview tab, at the end of the protected application row from the action menu (⋮), select Manage disaster recovery.
- Click Remove disaster recovery.
- Click Confirm remove.
For discovered applications:
- Navigate to All Clusters → Data Services → Disaster recovery → Protected applications tab.
- At the end of the application row, click on the action menu (⋮) and choose Remove disaster recovery.
- Click Remove in the next prompt.
- Repeat the process detailed in the last step and the sub-steps for every protected application on the surviving cluster. Removing disaster recovery for protected applications is now completed.
Remove all disaster recovery configurations for the surviving cluster and the replacement cluster:
From the hub cluster, delete the DRPolicies associated with the surviving cluster and replacement cluster slated for DR removal:
$ oc delete drpolicy {drpolicy_name1 drpolicy_name2 ...}
From the hub cluster, delete the mirrorpeer associated with the surviving cluster and replacement cluster slated for DR removal:
$ oc delete mirrorpeer {mirrorpeer_name}
From the hub cluster, patch the mirrorpeer to remove the finalizer:
$ oc patch mirrorpeer {mirrorpeer_name} -p '{"metadata":{"finalizers":null}}' --type=merge
From the surviving cluster, disable mirroring on the cephblockpoolradosnamespace:
$ oc patch cephblockpoolradosnamespace <cephblockpoolradosnamespace_name> --type merge -p '{"spec":{"mirroring": null}}' -n openshift-storage
From the surviving cluster, disable mirroring on the cephblockpool:
$ oc patch cephblockpool <cephblockpool_name> --type json -p '[{"op": "replace", "path": "/spec/mirroring", "value": {}}]' -n openshift-storage
Uninstall Submariner for only the replacement cluster (failed cluster) using the RHACM console.
Navigate to Infrastructure → Clusters → Clustersets → Submariner add-ons view and uninstall Submariner for only the replacement cluster.
Note: The uninstall process of Submariner for the replacement cluster (failed cluster) will stay GREEN and not complete until the replacement cluster has been detached from the RHACM console.
- Navigate back to Clusters view and detach replacement cluster.
- Create new OpenShift cluster (recovery cluster) and import into Infrastructure → Clusters view.
- Add the new recovery cluster to the Clusterset used by Submariner.
- Install Submariner add-ons only for the new recovery cluster.
Note: If GlobalNet is used for the surviving cluster, make sure to enable GlobalNet for the recovery cluster as well.
Install OpenShift Data Foundation on the recovery cluster. The OpenShift Data Foundation version must be 4.17 (or greater) and must match the version on the surviving cluster.
For managed applications using ApplicationSets, the OpenShift GitOps operator must be installed. For instructions, see Installing Red Hat OpenShift GitOps Operator in web console.
For discovered applications, the OADP operator and the Data Protection Application (DPA) must be installed. For instructions, see Prerequisites for disaster recovery protection of discovered applications.
Note: Make sure to follow the optional instructions in the documentation to modify the OpenShift Data Foundation storage cluster on the recovery cluster if GlobalNet has been enabled when installing Submariner.
Using the RHACM console, navigate to Data Services → Disaster recovery → Policies tab.
- Select Create DRPolicy and name your policy.
- Select the recovery cluster and the surviving cluster.
- Create the policy. For instructions see chapter on Creating Disaster Recovery Policy on Hub cluster.
Proceed to the next step only after the status of the DRPolicy changes to Validated.
Apply the DRPolicy to the applications on the surviving cluster that were originally protected before the replacement cluster failed.
Note: For discovered applications, make sure to create the application namespace on the recovery cluster before assigning the policy.
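To confirm from the command line that the DRPolicy has reached the Validated state before applying it, you can run the same check on the hub cluster that is used later in the hub recovery procedure; the reason reports Succeeded once validation passes:
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'
Succeeded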
Ensure that there is a poolID mapping between the blockpools of the replacement cluster and the surviving cluster, as well as between the failed cluster and the recovery cluster.
Open the clientProfileMappingBeforeRecovery.yaml created in step 1 for editing:
$ vi clientProfileMappingBeforeRecovery.yaml
In the clientProfileMappingBeforeRecovery.yaml file, delete the lines and replace the values as noted.
- Save the clientProfileMappingBeforeRecovery.yaml file as clientProfileMappingBeforeRecovery-surviving.yaml.
Create the clientProfileMappingBeforeRecovery-surviving.yaml resource on the surviving cluster:
$ oc create -f clientProfileMappingBeforeRecovery-surviving.yaml -n openshift-storage
On the recovery cluster, get the blockPoolIdMapping between the recovery cluster and the surviving cluster:
$ oc get clientprofilemapping ocs-storagecluster -n openshift-storage -o yaml | yq '.spec.mappings.[].blockPoolIdMapping'
Example output for two cephblockpools (the default is one cephblockpool; your blockPoolIds may be different):
- - "3" <-- Note this value for a future step
  - "3"
- - "13" <-- Note this value for a future step
  - "14"
Copy the contents of the clientProfileMappingBeforeRecovery-surviving.yaml into a new yaml:
$ cp clientProfileMappingBeforeRecovery-surviving.yaml clientProfileMappingAfterRecovery.yaml
Edit the clientProfileMappingAfterRecovery.yaml with the correct blockPoolIds:
$ vi clientProfileMappingAfterRecovery.yaml
The following is an example yaml with the required changes. Replace the blockPoolIds with the values you noted earlier.
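A minimal illustrative sketch of the edited section, assuming the spec.mappings[].blockPoolIdMapping layout returned by the yq query above; every other field in clientProfileMappingAfterRecovery.yaml stays exactly as copied from the surviving-cluster file, and the quoted IDs are the hypothetical values noted earlier:
spec:
  mappings:
    - blockPoolIdMapping:
        - - "3"   # first value of the pair, replace with a blockPoolId you noted
          - "3"   # second value of the pair
        - - "13"
          - "14"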
Create the clientProfileMappingAfterRecovery.yaml resource on the recovery cluster:
$ oc create -f clientProfileMappingAfterRecovery.yaml -n openshift-storage
Relocate the newly protected applications on the surviving cluster back to the new recovery cluster using the RHACM console.
For managed applications:
- On the Hub cluster, navigate to All Clusters → Applications.
- In the Overview tab, at the end of the application row, from the action menu (⋮), select Manage disaster recovery.
- Click Relocate.
For discovered applications:
- Navigate to All Clusters → Data Services → Disaster recovery → Protected applications tab.
- At the end of the application row, click on the action menu (⋮) and choose Relocate.
4.16. Viewing Recovery Point Objective values for disaster recovery enabled applications
The Recovery Point Objective (RPO) value is the most recent sync time of persistent data from the cluster where the application is currently active to its peer. This sync time helps determine how much data will be lost during a failover.
This RPO value is applicable only for Regional-DR during failover. Relocation ensures there is no data loss during the operation, as all peer clusters are available.
You can view the Recovery Point Objective (RPO) value of all the protected volumes for their workload on the Hub cluster.
Procedure
- On the Hub cluster, navigate to Applications → Overview tab.
In the Data policy column, click the policy link for the application you applied the policy to.
A Data Policies modal page appears with the number of disaster recovery policies applied to each application along with failover and relocation status.
On the Data Policies modal page, click the View more details link.
A detailed Data Policies modal page is displayed that shows the policy names and the ongoing activities (Last sync, Activity status) associated with the policy that is applied to the application.
The Last sync time reported in the modal page represents the most recent sync time of all volumes that are DR protected for the application.
4.17. Hub recovery using Red Hat Advanced Cluster Management
When your setup has active and passive Red Hat Advanced Cluster Management for Kubernetes (RHACM) hub clusters, and the active hub is down, you can use the passive hub to fail over or relocate the disaster recovery protected workloads.
4.17.1. Configuring passive hub cluster
To perform hub recovery in case the active hub is down or unreachable, follow the procedure in this section to configure the passive hub cluster and then failover or relocate the disaster recovery protected workloads.
Procedure
Ensure that the RHACM operator and MultiClusterHub are installed on the passive hub cluster. See the RHACM installation guide for instructions.
After the operator is successfully installed, the web console automatically reloads to apply the changes. During this process, a temporary error message might appear on the page; this is expected and disappears after the refresh completes.
- Before hub recovery, configure backup and restore. See Backup and restore topics of RHACM Business continuity guide.
- Install the multicluster orchestrator (MCO) operator along with Red Hat OpenShift GitOps operator on the passive RHACM hub prior to the restore. For instructions to restore your RHACM hub, see Installing OpenShift Data Foundation Multicluster Orchestrator operator.
- Ensure that .spec.cleanupBeforeRestore is set to None for the Restore.cluster.open-cluster-management.io resource. For details, see the Restoring passive resources while checking for backups chapter of the RHACM documentation.
- If SSL access across clusters was configured manually during setup, then re-configure SSL access across clusters. For instructions, see the Configuring SSL access across clusters chapter.
On the passive hub, add a label to the openshift-operators namespace to enable basic monitoring of the VolumeSyncronizationDelay alert using this command. For alert details, see the Disaster recovery alerts chapter.
$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
4.17.2. Switching to passive hub cluster
Use this procedure when the active hub is down or unreachable.
Procedure
During the restore procedure, to avoid eviction of resources when ManifestWorks are not regenerated correctly, you can enlarge the AppliedManifestWork eviction grace period. On the passive hub cluster, check for an existing global KlusterletConfig.
- If the global KlusterletConfig exists, edit it and set the appliedManifestWorkEvictionGracePeriod parameter to a larger value, for example, 24 hours or more.
- If the global KlusterletConfig does not exist, create the KlusterletConfig using the following yaml:
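A minimal sketch of a global KlusterletConfig, assuming the config.open-cluster-management.io/v1alpha1 API and a 24-hour grace period; adjust the value to suit your recovery window:
apiVersion: config.open-cluster-management.io/v1alpha1
kind: KlusterletConfig
metadata:
  name: global
spec:
  appliedManifestWorkEvictionGracePeriod: "24h"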
The configuration will be propagated to all the managed clusters automatically.
Restore the backups on the passive hub cluster. For information, see Restoring a hub cluster from backup.
Important: Recovering a failed hub to its passive instance only restores applications and their DR protected state to the last scheduled backup. Any application that was DR protected after the last scheduled backup needs to be protected again on the new hub.
Verify that the restore is complete.
$ oc -n <restore-namespace> wait restore <restore-name> --for=jsonpath='{.status.phase}'=Finished --timeout=120s
Verify that the Primary and Secondary managed clusters are successfully imported into the RHACM console and that they are accessible. If any of the managed clusters are down or unreachable, they will not be successfully imported.
Wait until DRPolicy validation succeeds before performing any DR operation.
If the previously down managed cluster comes back online after the hub recovery process, ensure that it is successfully imported and tied to the new hub. If the cluster does not import and tie to the new hub, delete and recreate the restore-acm resource:
$ oc get restore restore-acm -o yaml -n open-cluster-management-backup
Then verify that the cluster is properly connected to the new hub.
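A hedged sketch of the delete and recreate sequence; it assumes the resource YAML is first saved to a local file (here named restore-acm.yaml) using the command above:
$ oc get restore restore-acm -n open-cluster-management-backup -o yaml > restore-acm.yaml
$ oc delete restore restore-acm -n open-cluster-management-backup
$ oc create -f restore-acm.yaml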
Note: Submariner is automatically installed once the managed clusters are imported on the passive hub.
Verify that the DRPolicy is created successfully. Run this command on the Hub cluster for each of the DRPolicy resources created, where <drpolicy_name> is replaced with a unique name.
$ oc get drpolicy <drpolicy_name> -o jsonpath='{.status.conditions[].reason}{"\n"}'
Example output:
Succeeded
- Refresh the RHACM console to make the DR monitoring dashboard tab accessible if it was enabled on the Active hub cluster.
Verify the DRPC output using the following command on the new hub cluster:
$ oc get drpc -A -o wide
If PROGRESSION shows a status of PAUSED, administrative intervention is required to unpause it. PROGRESSION enters the PAUSED state under the following conditions:
- Cluster Query Failure: None of the clusters were successfully queried during the DRPC reconciliation. This situation can occur during hub recovery.
- Action Mismatch: The DRPC action differs from the queried VRG action.
- Cluster Mismatch: The DRPC action and the VRG action are the same, but the Primary VRG is found in a different cluster than the one expected by the DRPC.
Important: If you cannot diagnose and resolve the cause of the pause, contact Red Hat Customer Support.
If PROGRESSION is in either Completed or Cleaning up, it is safe to proceed.
- Edit the global KlusterletConfig on the new hub and remove the appliedManifestWorkEvictionGracePeriod parameter and its value.
Depending on whether only the active hub cluster, or both the active hub cluster and the primary managed cluster, had been down, follow the next steps based on your scenario:
- If only the active hub cluster had been down, and if the managed clusters are still accessible, no further action is required.
If the primary managed cluster had been down, along with the active hub cluster, you need to fail over the workloads from the primary managed cluster to the secondary managed cluster.
For failover instructions, based on your workload type, see Subscription-based applications or ApplicationSet-based applications.
Verify that the failover is successful. If the Primary managed cluster is also down, the PROGRESSION status for the workload remains in the Cleaning Up phase until the down Primary managed cluster is back online and successfully imported into the RHACM console.
On the passive hub cluster, run the following command to check the PROGRESSION status.
$ oc get drpc -o wide -A
4.18. Enabling granular disaster recovery for individual or groups of virtual machines in a namespace [Technology preview]
Granular disaster recovery for individual or groups of virtual machines in a namespace is a Technology Preview feature and is subject to Technology Preview support limitations. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information, see Technology Preview Features Support Scope.
You can enable granular disaster recovery (DR) for individual virtual machines (VMs) or VM groups within the same namespace, allowing independent failover and relocation actions.
Key highlights
- Granular VM Control: Each discovered and ACM-managed VM can have its own DR policy, enabling independent DR operations without impacting other VMs in the namespace.
- Namespace Partitioning: Supports multiple DRPCs within a namespace, avoiding the limitations of namespace-level DR protection.
Scenarios Supported:
- Independent DR for discovered and ACM-managed VMs.
- Namespace-level DR remains for non-VM applications.
Upgrade Considerations:
- Existing DRPCs are treated as namespace-level.
- Users must disable DR before switching to VM-specific protection.
Manage disaster recovery of virtual machines from Virtual Machines page
Note: The existing UI for namespace-level DR remains unchanged.
During failover of a CNV discovered application, if the primary managed cluster is down and other CNV applications are running in different namespaces on either of the managed clusters, there have been cases where the VM fails to start on the failover cluster. This issue is due to MAC address conflicts.
To prevent this issue, it is recommended to configure disjoint MAC address ranges for each cluster from Day 1. For example:
On the first cluster:
$ oc patch configmap kubemacpool-mac-range-config -n openshift-cnv --type merge -p '{"data":{"RANGE_START":"02:00:00:00:00:00","RANGE_END":"02:00:FF:FF:FF:FF"}}'
On the second cluster:
$ oc patch configmap kubemacpool-mac-range-config -n openshift-cnv --type merge -p '{"data":{"RANGE_START":"02:01:00:00:00:00","RANGE_END":"02:01:FF:FF:FF:FF"}}'
Then restart the kubemacpool-mac-controller-manager:
$ oc get pods -n openshift-cnv|grep kubemacpool-mac-controller-manager
kubemacpool-mac-controller-manager-544ff96b4f-dgb7x 2/2 Running 3 30h
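One hedged way to restart the controller is to delete its pod so that the owning Deployment recreates it; the pod name below is the example name returned by the previous command and will differ in your cluster:
$ oc delete pod kubemacpool-mac-controller-manager-544ff96b4f-dgb7x -n openshift-cnv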
Chapter 5. Disaster recovery with stretch cluster for OpenShift Data Foundation
Red Hat OpenShift Data Foundation deployment can be stretched between two different geographical locations to provide the storage infrastructure with disaster recovery capabilities. When faced with a disaster, such as when one of the two locations is partially or completely unavailable, OpenShift Data Foundation deployed on the OpenShift Container Platform deployment must be able to survive. This solution is available only for metropolitan spanned data centers with specific latency requirements between the servers of the infrastructure.
The stretch cluster solution is designed for deployments where latencies do not exceed 10 ms maximum round-trip time (RTT) between the zones containing data volumes. For Arbiter nodes, follow the latency requirements specified for etcd; see Guidance for Red Hat OpenShift Container Platform Clusters - Deployments Spanning Multiple Sites (Data Centers/Regions). Contact Red Hat Customer Support if you are planning to deploy with higher latencies.
The following diagram shows the simplest deployment for a stretched cluster:
OpenShift nodes and OpenShift Data Foundation daemons
In the diagram the OpenShift Data Foundation monitor pod deployed in the Arbiter zone has a built-in tolerance for the master nodes. The diagram shows the master nodes in each Data Zone which are required for a highly available OpenShift Container Platform control plane. Also, it is important that the OpenShift Container Platform nodes in one of the zones have network connectivity with the OpenShift Container Platform nodes in the other two zones.
You can now easily set up disaster recovery with stretch cluster for workloads based on OpenShift virtualization technology using OpenShift Data Foundation. For more information, see OpenShift Virtualization in OpenShift Container Platform guide.
5.1. Requirements for enabling stretch cluster
- Ensure you have addressed OpenShift Container Platform requirements for deployments spanning multiple sites. For more information, see knowledgebase article on cluster deployments spanning multiple sites.
- Ensure that you have at least three OpenShift Container Platform master nodes in three different zones. One master node in each of the three zones.
- Ensure that you have at least four OpenShift Container Platform worker nodes evenly distributed across the two Data Zones.
- For stretch clusters on bare metal, use an SSD drive as the root drive for OpenShift Container Platform master nodes.
- Ensure that each node is pre-labeled with its zone label. For more information, see the Applying topology zone labels to OpenShift Container Platform node section.
- The stretch cluster solution is designed for deployments where latencies do not exceed 10 ms between zones. Contact Red Hat Customer Support if you are planning to deploy with higher latencies.
Flexible scaling and Arbiter cannot both be enabled at the same time because they have conflicting scaling logic. With Flexible scaling, you can add one node at a time to your OpenShift Data Foundation cluster. Whereas in an Arbiter cluster, you need to add at least one node in each of the two data zones.
5.2. Applying topology zone labels to OpenShift Container Platform nodes
During a site outage, the zone that has the arbiter function makes use of the arbiter label. These labels are arbitrary and must be unique for the three locations.
For example, you can label the nodes as follows:
topology.kubernetes.io/zone=arbiter for Master0
topology.kubernetes.io/zone=datacenter1 for Master1, Worker1, Worker2
topology.kubernetes.io/zone=datacenter2 for Master2, Worker3, Worker4
To apply the labels to the node:
$ oc label node <NODENAME> topology.kubernetes.io/zone=<LABEL>
<NODENAME> is the name of the node.
<LABEL> is the topology zone label.
To validate the labels using the example labels for the three zones:
$ oc get nodes -l topology.kubernetes.io/zone=<LABEL> -o name
<LABEL> is the topology zone label.
Alternatively, you can run a single command to see all the nodes with their zones.
$ oc get nodes -L topology.kubernetes.io/zone
The stretch cluster topology zone labels are now applied to the appropriate OpenShift Container Platform nodes to define the three locations.
5.3. Installing Local Storage Operator
Install the Local Storage Operator from the Software Catalog before creating Red Hat OpenShift Data Foundation clusters on local storage devices.
Procedure
- Log in to the OpenShift Web Console.
- Click Ecosystem → Software Catalog.
- Type local storage in the Filter by keyword box to find the Local Storage Operator from the list of operators, and click on it.
Set the following options on the Install Operator page:
- Update channel as stable.
- Installation mode as A specific namespace on the cluster.
- Installed Namespace as Operator recommended namespace openshift-local-storage.
- Update approval as Automatic.
- Click Install.
Verification steps
- Verify that the Local Storage Operator shows a green tick indicating successful installation.
5.4. Installing Red Hat OpenShift Data Foundation Operator
You can install Red Hat OpenShift Data Foundation Operator using the Red Hat OpenShift Container Platform Software Catalog.
Prerequisites
- Access to an OpenShift Container Platform cluster using an account with cluster-admin and Operator installation permissions.
- You must have at least four worker nodes evenly distributed across two data centers in the Red Hat OpenShift Container Platform cluster.
- For additional resource requirements, see Planning your deployment.
When you need to override the cluster-wide default node selector for OpenShift Data Foundation, you can use the following command in the command-line interface to specify a blank node selector for the openshift-storage namespace (create the openshift-storage namespace in this case):
$ oc annotate namespace openshift-storage openshift.io/node-selector=
- Taint a node as infra to ensure only Red Hat OpenShift Data Foundation resources are scheduled on that node. This helps you save on subscription costs. For more information, see the How to use dedicated worker nodes for Red Hat OpenShift Data Foundation chapter in the Managing and Allocating Storage Resources guide.
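A hedged sketch of labeling and tainting a node so that only OpenShift Data Foundation workloads are scheduled on it; verify the exact label and taint keys against the referenced chapter before applying them:
$ oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""
$ oc adm taint nodes <node-name> node.ocs.openshift.io/storage="true":NoSchedule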
Procedure
- Log in to the OpenShift Web Console.
- Click Ecosystem → Software Catalog.
- Scroll or type OpenShift Data Foundation into the Filter by keyword box to search for the OpenShift Data Foundation Operator.
- Click Install.
Set the following options on the Install Operator page:
- Update Channel as stable-4.20.
- Installation Mode as A specific namespace on the cluster.
- Installed Namespace as Operator recommended namespace openshift-storage. If the openshift-storage namespace does not exist, it is created during the operator installation.
Select Approval Strategy as Automatic or Manual.
If you select Automatic updates, then the Operator Lifecycle Manager (OLM) automatically upgrades the running instance of your Operator without any intervention.
If you selected Manual updates, then the OLM creates an update request. As a cluster administrator, you must then manually approve that update request to update the Operator to a newer version.
- Ensure that the Enable option is selected for the Console plugin.
- Click Install.
Verification steps
- After the operator is successfully installed, the web console automatically reloads to apply the changes. During this process, a temporary error message might appear on the page; this is expected and disappears after the refresh completes.
In the Web Console:
- Navigate to Installed Operators and verify that the OpenShift Data Foundation Operator shows a green tick indicating successful installation.
- Navigate to Storage and verify if the Data Foundation dashboard is available.
Next steps
5.5. Creating OpenShift Data Foundation cluster
Prerequisites
- Ensure that you have met all the requirements in Requirements for enabling stretch cluster section.
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation → Storage Systems → Create StorageSystem.
- In the Backing storage page, select the Create a new StorageClass using the local storage devices option.
Click Next.
Important: You are prompted to install the Local Storage Operator if it is not already installed. Click Install, and follow the procedure as described in Installing Local Storage Operator.
In the Create local volume set page, provide the following information:
Enter a name for the LocalVolumeSet and the StorageClass.
By default, the local volume set name appears for the storage class name. You can change the name.
Choose one of the following:
Disks on all nodes
Uses the available disks that match the selected filters on all the nodes.
Disks on selected nodes
Uses the available disks that match the selected filters only on selected nodes.
Important: If the nodes selected do not match the OpenShift Data Foundation cluster requirement of an aggregated 30 CPUs and 72 GiB of RAM, a minimal cluster is deployed.
For minimum starting node requirements, see the Resource requirements section in the Planning guide.
- Select SSD or NVMe to build a supported configuration. You can select HDDs for unsupported test installations.
Expand the Advanced section and set the following options:
Volume Mode
Block is selected by default.
Device Type
Select one or more device types from the dropdown list.
Disk Size
Set a minimum size of 100GB for the device and maximum available size of the device that needs to be included.
Maximum Disks Limit
This indicates the maximum number of PVs that can be created on a node. If this field is left empty, then PVs are created for all the available disks on the matching nodes.
Click Next.
A pop-up to confirm the creation of LocalVolumeSet is displayed.
- Click Yes to continue.
In the Capacity and nodes page, configure the following:
Available raw capacity is populated with the capacity value based on all the attached disks associated with the storage class. This takes some time to show up.
The Selected nodes list shows the nodes based on the storage class.
Select Enable arbiter checkbox if you want to use the stretch clusters. This option is available only when all the prerequisites for arbiter are fulfilled and the selected nodes are populated. For more information, see Arbiter stretch cluster requirements in Requirements for enabling stretch cluster.
Select the arbiter zone from the dropdown list.
Choose a performance profile for Configure performance.
You can also configure the performance profile after the deployment using the Configure performance option from the options menu of the StorageSystems tab.
Important: Before selecting a resource profile, make sure to check the current availability of resources within the cluster. Opting for a higher resource profile in a cluster with insufficient resources might lead to installation failures. For more information about resource requirements, see Resource requirement for performance profiles.
- Click Next.
Optional: In the Security and network page, configure the following based on your requirement:
- To enable encryption, select Enable data encryption for block and file storage.
Select one of the following Encryption level:
- Cluster-wide encryption to encrypt the entire cluster (block and file).
- StorageClass encryption to create encrypted persistent volume (block only) using encryption enabled storage class.
Optional: Select the Connect to an external key management service checkbox. This is optional for cluster-wide encryption.
- From the Key Management Service Provider drop-down list, either select Vault or Thales CipherTrust Manager (using KMIP). If you selected Vault, go to the next step. If you selected Thales CipherTrust Manager (using KMIP), go to step iii.
Select an Authentication Method.
- Using Token authentication method
- Enter a unique Connection Name, host Address of the Vault server ('https://<hostname or ip>'), Port number and Token.
Expand Advanced Settings to enter additional settings and certificate details based on your
Vaultconfiguration:- Enter the Key Value secret path in the Backend Path that is dedicated and unique to OpenShift Data Foundation.
- Optional: Enter TLS Server Name and Vault Enterprise Namespace.
- Upload the respective PEM encoded certificate file to provide the CA Certificate, Client Certificate and Client Private Key .
- Click Save and skip to step iv.
- Using Kubernetes authentication method
- Enter a unique Vault Connection Name, host Address of the Vault server ('https://<hostname or ip>'), Port number and Role name.
Expand Advanced Settings to enter additional settings and certificate details based on your
Vaultconfiguration:- Enter the Key Value secret path in the Backend Path that is dedicated and unique to OpenShift Data Foundation.
- Optional: Enter TLS Server Name and Authentication Path if applicable.
- Upload the respective PEM encoded certificate file to provide the CA Certificate, Client Certificate and Client Private Key .
- Click Save and skip to step iv.
To use Thales CipherTrust Manager (using KMIP) as the KMS provider, follow the steps below:
- Enter a unique Connection Name for the Key Management service within the project.
In the Address and Port sections, enter the IP of Thales CipherTrust Manager and the port where the KMIP interface is enabled. For example:
- Address: 123.34.3.2
- Port: 5696
- Upload the Client Certificate, CA certificate, and Client Private Key.
- If StorageClass encryption is enabled, enter the Unique Identifier to be used for encryption and decryption generated above.
- The TLS Server field is optional and used when there is no DNS entry for the KMIP endpoint. For example, kmip_all_<port>.ciphertrustmanager.local.
Network is set to Default (OVN) if you are using a single network.
You can switch to Custom (Multus) if you are using multiple network interfaces and then choose any one of the following:
- Select a Public Network Interface from the dropdown.
- Select a Cluster Network Interface from the dropdown.
Note: If you are using only one additional network interface, select the single NetworkAttachmentDefinition, that is, ocs-public-cluster, for the Public Network Interface, and leave the Cluster Network Interface blank.
- Click Next.
In the Review and create page, review the configuration details.
To modify any configuration settings, click Back to go back to the previous configuration page.
- Click Create StorageSystem.
Verification steps
To verify the final Status of the installed storage cluster:
- On the OpenShift Web Console, navigate to Storage → Data Foundation → Storage System.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- In the Status card of the Block and File tab, verify that the Storage Cluster has a green tick.
For arbiter mode of deployment:
- In the OpenShift Web Console, navigate to Storage → Data Foundation → View Storage in the Storage cluster card.
- From the actions menu, select Edit Storage Cluster to view the YAML.
Search for the arbiter key in the spec section and ensure that enable is set to true.
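A hedged excerpt of what the relevant part of the StorageCluster spec typically looks like when arbiter mode is enabled; field names other than arbiter.enable are shown for orientation only and may differ in your cluster:
spec:
  arbiter:
    enable: true
  nodeTopologies:
    arbiterLocation: arbiter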
- To verify that all the components for OpenShift Data Foundation are successfully installed, see Verifying your OpenShift Data Foundation installation.
5.5.1. Creating NetworkFenceClass
NetworkFenceClass is required to prevent corruption of volume contents due to non-graceful node shutdowns.
Procedure
Create the NetworkFenceClass from the following YAML and verify that the csiaddonsnode objects are populated with the IP address and Ceph cluster ID.
Note: You need to create one or more NetworkFenceClass resources depending on the number of storage clusters connected to the OpenShift Container Platform cluster.
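A hedged sketch of a NetworkFenceClass, assuming the csiaddons.openshift.io/v1alpha1 API; the placeholder values must be replaced with the secret name and namespace fetched from the ocs-storagecluster-ceph-rbd storageclass as described below:
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFenceClass
metadata:
  name: networkfenceclass-sample
spec:
  provisioner: openshift-storage.rbd.csi.ceph.com
  parameters:
    csiaddons.openshift.io/networkfence-secret-name: <node-stage-secret-name from the storageclass>
    csiaddons.openshift.io/networkfence-secret-namespace: <node-stage-secret-namespace from the storageclass>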
- provisioner: specifies the name of the storage provisioner.
- parameters: specifies storage provider specific parameters.
- csiaddons.openshift.io/networkfence-secret-name: specifies the name of the secret required for the network fencing operation. This can be fetched from parameters.csi.storage.k8s.io/node-stage-secret-name in the ocs-storagecluster-ceph-rbd storageclass.
- csiaddons.openshift.io/networkfence-secret-namespace: specifies the namespace in which the secret is located. This can be fetched from parameters.csi.storage.k8s.io/node-stage-secret-namespace in the ocs-storagecluster-ceph-rbd storageclass.
Run the following command to return the csiaddonsnode objects. The node objects belong to the daemonset pod (RBD) that has the IP address that needs to be fenced.
$ oc get csiaddonsnode -n openshift-storage
5.6. Verifying OpenShift Data Foundation deployment
To verify that OpenShift Data Foundation is deployed correctly:
5.6.1. Verifying the state of the pods
Procedure
- Click Workloads → Pods from the OpenShift Web Console.
Select openshift-storage from the Project drop-down list.
Note: If the Show default projects option is disabled, use the toggle button to list all the default projects.
For more information about the expected number of pods for each component and how it varies depending on the number of nodes, see Table 5.1, “Pods corresponding to OpenShift Data Foundation cluster”.
Click the Running and Completed tabs to verify that the following pods are in Running and Completed state:
Note: The available pods depend on the cluster configuration. When the cluster is deployed as a standalone Multicloud Object Gateway, the rook-ceph-operator-* pods are not available. Similarly, when the cluster is deployed without the Multicloud Object Gateway, noobaa-* pods are not available.
Table 5.1. Pods corresponding to OpenShift Data Foundation cluster (Component / Corresponding pods)
OpenShift Data Foundation Operator
-
ocs-operator-* (1 pod on any worker node) -
ocs-metrics-exporter-* (1 pod on any worker node) -
odf-operator-controller-manager-* (1 pod on any worker node) -
odf-console-* (1 pod on any worker node) -
csi-addons-controller-manager-* (1 pod on any worker node)
Rook-ceph Operator
rook-ceph-operator-*(1 pod on any worker node)
Multicloud Object Gateway
-
noobaa-operator-* (1 pod on any worker node) -
noobaa-core-* (1 pod on any storage node) -
noobaa-db-pg-cluster-1 and noobaa-db-pg-cluster-2 (2 instances of the MCG DB pod on any storage node)
noobaa-endpoint-* (1 pod on any storage node) -
cnpg-controller-manager-* (1 pod on any storage node)
MON
rook-ceph-mon-*(5 pods are distributed across 3 zones, 2 per data-center zones and 1 in arbiter zone)
MGR
rook-ceph-mgr-*(2 pods on any storage node)
MDS
rook-ceph-mds-ocs-storagecluster-cephfilesystem-*(2 pods are distributed across 2 data-center zones)
RGW
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-*(2 pods are distributed across 2 data-center zones)
CSI
cephfs-
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-* (2 pods distributed across storage nodes) -
openshift-storage.cephfs.csi.ceph.com-nodeplugin-* (1 pod on each storage node)
-
nfs-
openshift-storage.nfs.csi.ceph.com-ctrlplugin-* (2 pods distributed across storage nodes) -
openshift-storage.nfs.csi.ceph.com-nodeplugin-* (1 pod on each storage node)
-
rbd-
openshift-storage.rbd.csi.ceph.com-ctrlplugin-* (2 pods distributed across storage nodes) -
openshift-storage.rbd.csi.ceph.com-nodeplugin-* (1 pod on each storage node)
-
rook-ceph-crashcollector
rook-ceph-crashcollector-*(1 pod on each storage node and 1 pod in arbiter zone)
OSD
-
rook-ceph-osd-* (1 pod for each device) -
rook-ceph-osd-prepare-* (1 pod for each device)
-
5.6.2. Verifying the OpenShift Data Foundation cluster is healthy
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- In the Status card of the Block and File tab, verify that the Storage Cluster has a green tick.
- In the Details card, verify that the cluster information is displayed.
For more information on the health of the OpenShift Data Foundation cluster using the Block and File dashboard, see Monitoring OpenShift Data Foundation.
5.6.3. Verifying the Multicloud Object Gateway is healthy
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation.
In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- In the Status card of the Object tab, verify that both Object Service and Data Resiliency have a green tick.
- In the Details card, verify that the MCG information is displayed.
For more information on the health of the OpenShift Data Foundation cluster using the object service dashboard, see Monitoring OpenShift Data Foundation.
To avoid data loss, it is recommended to take a backup of NooBaa DB PVC regularly. If NooBaa DB fails and cannot be recovered, then you can revert to the latest backed-up version. For instructions on backing up your NooBaa DB, follow the steps in the knowledgebase article, Perform a One-Time Backup of the Database for the Multicloud Object Gateway.
5.6.4. Verifying that the specific storage classes exist
Procedure
- Click Storage → Storage Classes from the left pane of the OpenShift Web Console.
Verify that the following storage classes are created with the OpenShift Data Foundation cluster creation:
-
ocs-storagecluster-ceph-rbd -
ocs-storagecluster-cephfs -
openshift-storage.noobaa.io -
ocs-storagecluster-ceph-rgw
-
5.7. Install Zone Aware Sample Application
Deploy a zone aware sample application to validate whether an OpenShift Data Foundation stretch cluster setup is configured correctly.
With latency between the data zones, you can expect to see performance degradation compared to an OpenShift cluster with low latency between nodes and zones (for example, all nodes in the same location). The rate or amount of performance degradation depends on the latency between the zones and on the application behavior using the storage (such as heavy write traffic). Ensure that you test the critical applications with a stretch cluster configuration to ensure sufficient application performance for the required service levels.
A ReadWriteMany (RWX) Persistent Volume Claim (PVC) is created using the ocs-storagecluster-cephfs storage class. Multiple pods use the newly created RWX PVC at the same time. The application used is called File Uploader.
The following demonstrates how an application is spread across topology zones so that it is still available in the event of a site outage:
This demonstration is possible since this application shares the same RWX volume for storing files. It works for persistent data access as well because Red Hat OpenShift Data Foundation is configured as a stretched cluster with zone awareness and high availability.
Create a new project.
$ oc new-project my-shared-storage
Deploy the example PHP application called file-uploader.
$ oc new-app openshift/php:latest~https://github.com/mashetty330/openshift-php-upload-demo --name=file-uploader
Example Output:
View the build log and wait until the application is deployed.
$ oc logs -f bc/file-uploader -n my-shared-storage
Example Output:
The command prompt returns out of the tail mode after you see Push successful.
Note: The new-app command deploys the application directly from the git repository and does not use the OpenShift template, hence the OpenShift route resource is not created by default. You need to create the route manually.
5.7.1. Scaling the application after installation
Procedure
Scale the application to four replicas and expose its services to make the application zone aware and available.
$ oc expose svc/file-uploader -n my-shared-storage
$ oc scale --replicas=4 deploy/file-uploader -n my-shared-storage
$ oc get pods -o wide -n my-shared-storage
You should have four file-uploader pods in a few minutes. Repeat the above command until there are four file-uploader pods in the Running status.
Create a PVC and attach it to the application.
$ oc set volume deploy/file-uploader --add --name=my-shared-storage \
  -t pvc --claim-mode=ReadWriteMany --claim-size=10Gi \
  --claim-name=my-shared-storage --claim-class=ocs-storagecluster-cephfs \
  --mount-path=/opt/app-root/src/uploaded \
  -n my-shared-storage
This command:
- Creates a PVC.
- Updates the application deployment to include a volume definition.
- Updates the application deployment to attach a volume mount into the specified mount-path.
- Creates a new deployment with the four application pods.
Check the result of adding the volume.
$ oc get pvc -n my-shared-storage
Example Output:
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
my-shared-storage   Bound    pvc-5402cc8a-e874-4d7e-af76-1eb05bd2e7c7   10Gi       RWX            ocs-storagecluster-cephfs   52s
ACCESS MODEis set to RWX.All the four
file-uploaderpods are using the same RWX volume. Without this access mode, OpenShift does not attempt to attach multiple pods to the same Persistent Volume (PV) reliably. If you attempt to scale up the deployments that are using ReadWriteOnce (RWO) PV, the pods may get colocated on the same node.
5.7.2. Modify Deployment to be Zone Aware
Currently, the file-uploader Deployment is not zone aware and can schedule all the pods in the same zone. In this case, if there is a site outage then the application is unavailable. For more information, see Controlling pod placement by using pod topology spread constraints.
Add the pod placement rule in the application deployment configuration to make the application zone aware.
Run the following command, and review the output:
$ oc get deployment file-uploader -o yaml -n my-shared-storage | less
Example Output:
Edit the deployment to use the topology zone labels.
$ oc edit deployment file-uploader -n my-shared-storage
StartandEnd(shown in the output in the previous step):Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output:
Example output:
deployment.apps/file-uploader edited
Scale down the deployment to zero pods and then back to four pods. This is needed because the deployment changed in terms of pod placement.
- Scaling down to zero pods
$ oc scale deployment file-uploader --replicas=0 -n my-shared-storage
Example output:
deployment.apps/file-uploader scaled
- Scaling up to four pods
$ oc scale deployment file-uploader --replicas=4 -n my-shared-storage
Example output:
deployment.apps/file-uploader scaled
Verify that the four pods are spread across the four nodes in datacenter1 and datacenter2 zones.
$ oc get pods -o wide -n my-shared-storage | egrep '^file-uploader' | grep -v build | awk '{print $7}' | sort | uniq -c
Example output:
1 perf1-mz8bt-worker-d2hdm
1 perf1-mz8bt-worker-k68rv
1 perf1-mz8bt-worker-ntkp8
1 perf1-mz8bt-worker-qpwsr
Search for the zone labels used.
$ oc get nodes -L topology.kubernetes.io/zone | grep datacenter | grep -v master
Example output:
perf1-mz8bt-worker-d2hdm   Ready   worker   35d   v1.20.0+5fbfd19   datacenter1
perf1-mz8bt-worker-k68rv   Ready   worker   35d   v1.20.0+5fbfd19   datacenter1
perf1-mz8bt-worker-ntkp8   Ready   worker   35d   v1.20.0+5fbfd19   datacenter2
perf1-mz8bt-worker-qpwsr   Ready   worker   35d   v1.20.0+5fbfd19   datacenter2
Find the route that is created.
$ oc get route file-uploader -n my-shared-storage -o jsonpath --template="http://{.spec.host}{'\n'}"
Example Output:
http://file-uploader-my-shared-storage.apps.cluster-ocs4-abdf.ocs4-abdf.sandbox744.opentlc.com
The web application lists all the uploaded files and offers the ability to upload new ones as well as to download the existing data. Right now, there is nothing.
Select an arbitrary file from your local machine and upload it to the application.
- Click Choose file to select an arbitrary file.
Click Upload.
Figure 5.1. A simple PHP-based file upload tool
- Click List uploaded files to see the list of all currently uploaded files.
The OpenShift Container Platform image registry, ingress routing, and monitoring services are not zone aware.
5.8. Recovering OpenShift Data Foundation stretch cluster
Given that the stretch cluster disaster recovery solution is to provide resiliency in the face of a complete or partial site outage, it is important to understand the different methods of recovery for applications and their storage.
How the application is architected determines how soon it becomes available again on the active zone.
There are different methods of recovery for applications and their storage depending on the site outage. The recovery time depends on the application architecture. The different methods of recovery are as follows:
5.8.1. Understanding zone failure
For the purpose of this section, zone failure is considered a failure where all OpenShift Container Platform master and worker nodes in a zone are no longer communicating with the resources in the second data zone (for example, powered down nodes). If communication between the data zones is still partially working (intermittently up or down), the cluster, storage, and network admins should disconnect the communication path between the data zones for recovery to succeed.
After you install the sample application, power off the OpenShift Container Platform nodes (at least the nodes with OpenShift Data Foundation devices) to test the failure of a data zone, and validate that your file-uploader application is available and that you can upload new files.
To prevent corruption of volume contents during a non-graceful node shutdown, follow the steps in Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown.
5.8.2. Recovering zone-aware HA applications with RWX storage
Applications that are deployed with topologyKey: topology.kubernetes.io/zone have one or more replicas scheduled in each data zone and use shared storage, that is, a ReadWriteMany (RWX) CephFS volume. The replicas in the failed zone terminate themselves after a few minutes, and new pods are rolled out but remain in Pending state until the zones are recovered.
An example of this type of application is detailed in the Install Zone Aware Sample Application section.
During zone recovery if application pods go into CrashLoopBackOff (CLBO) state with permission denied error while mounting the CephFS volume, then restart the nodes where the pods are scheduled. Wait for some time and then check if the pods are running again.
To prevent corruption of volume contents during a non-graceful node shutdown, follow the steps in Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown.
5.8.3. Recovering HA applications with RWX storage
Applications that are using topologyKey: kubernetes.io/hostname or no topology configuration have no protection against all of the application replicas being in the same zone.
This can happen even with podAntiAffinity and topologyKey: kubernetes.io/hostname in the Pod spec because this anti-affinity rule is host-based and not zone-based.
If this happens and all replicas are located in the zone that fails, the application using ReadWriteMany (RWX) storage takes 6-8 minutes to recover on the active zone. This pause is for the OpenShift Container Platform nodes in the failed zone to become NotReady (60 seconds) and then for the default pod eviction timeout to expire (300 seconds).
In order to prevent corruption of volume contents during a non-graceful node shutdown, follow the steps in Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown.
5.8.4. Recovering applications with RWO storage
Applications that use ReadWriteOnce (RWO) storage have a known behavior described in this Kubernetes issue. Because of this issue, if there is a data zone failure, any application pods in that zone mounting RWO volumes (for example, cephrbd based volumes) are stuck with Terminating status after 6-8 minutes and are not re-created on the active zone without manual intervention.
To prevent corruption of volume contents during a non-graceful node shutdown, follow the steps in Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown.
5.8.5. Recovering StatefulSet pods
Pods that are part of a StatefulSet have a similar issue as pods mounting ReadWriteOnce (RWO) volumes. More information is referenced in the Kubernetes resource StatefulSet considerations.
To prevent corruption of volume contents during a non-graceful node shutdown, follow the steps in Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown.
5.8.6. Fencing and tainting nodes to prevent volume corruption during non-graceful node shutdown
The following procedure needs to be executed during a non-graceful node shutdown to prevent corruption of volume contents.
Prerequisites
- Ensure that the NetworkFenceClass is created by following the steps in the Creating NetworkFenceClass section of Creating OpenShift Data Foundation cluster.
Procedure
Create a NetworkFence CR to fence the IP of the storage nodes that are down.
Get the NodeName that needs to be fenced:
$ oc get node
Get the csiaddonsnode object corresponding to the node found in the previous step:
$ oc get csiaddonsnode -n openshift-storage | grep -i <nodename> | grep -i daemonset
Get the IP address from the csiaddonsnode object found in the previous step, create the NetworkFence CR, and ensure that it is in the Fenced state:
$ oc get csiaddonsnode <csi-addons-name> -o jsonpath='{.status.networkFenceClientStatus}'
[{"ClientDetails":[{"cidrs":["10.244.0.1/32"],"id":"a815fe8e-eabd-4e87-a6e8-78cebfb67d08"}],"networkFenceClassName":"networkfenceclass-sample"}]
Replace <csi-addons-name> with the csiaddonsnode object found in the earlier step.
Create the NetworkFence CR for the above CIDR with the OpenShift node name for easy identification:
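A hedged sketch of the NetworkFence CR, assuming the csiaddons.openshift.io/v1alpha1 API and using the CIDR returned for the csiaddonsnode object; the parameter and secret values come from the ocs-storagecluster-ceph-rbd storageclass as described below:
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: <openshift-node-name>
spec:
  driver: openshift-storage.rbd.csi.ceph.com
  fenceState: Fenced
  cidrs:
    - 10.244.0.1/32
  secret:
    name: <provisioner-secret-name from the storageclass>
    namespace: <provisioner-secret-namespace from the storageclass>
  parameters:
    clusterID: <clusterID from the storageclass>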
- parameters.clusterID: specifies the cluster ID and can be fetched from parameters.clusterID in the ocs-storagecluster-ceph-rbd storageclass.
- secret.name: specifies the name of the secret required for the network fencing operation. This can be fetched from parameters.csi.storage.k8s.io/provisioner-secret-name in the ocs-storagecluster-ceph-rbd storageclass.
- secret.namespace: specifies the namespace in which the secret is located. This can be fetched from parameters.csi.storage.k8s.io/provisioner-secret-namespace in the ocs-storagecluster-ceph-rbd storageclass.
- Verify that network fencing has been done successfully:
$ oc get networkfence
Example output:
NAME                    DRIVER                               CIDRS               FENCESTATE   AGE   RESULT
<openshift-node-name>   openshift-storage.rbd.csi.ceph.com   ["10.244.0.1/32"]   Fenced       42h   Succeeded
Add a taint for the nodes that are down:
$ oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
Monitor application relocation to surviving nodes.
Perform the above steps for all the nodes that are down.
Recovery after the node is back
After the nodes are back online, you need to unfence the node and remove the taints on the node to schedule the pods on these nodes.
Change the state of NetworkFence to Unfenced in the CR and wait for it to become unfenced.
Set the NetworkFence to Unfenced:
$ oc patch networkfence <network-fence-object-name> -p '{"spec":{"fenceState":"Unfenced"}}' --type=merge
Verify that the network unfencing is done successfully:
oc get networkfence
$ oc get networkfenceCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output:
NAME DRIVER CIDRS FENCESTATE AGE RESULT <openshift-node-name> openshift-storage.rbd.csi.ceph.com ["10.244.0.1/32"] unfenced 42h Succeeded
NAME DRIVER CIDRS FENCESTATE AGE RESULT <openshift-node-name> openshift-storage.rbd.csi.ceph.com ["10.244.0.1/32"] unfenced 42h SucceededCopy to Clipboard Copied! Toggle word wrap Toggle overflow
- Reboot the node.
- Untaint the nodes that were previously tainted:

  $ oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-

- Repeat these steps for all nodes.
Chapter 6. Disabling disaster recovery for a disaster recovery enabled application
This section guides you to disable disaster recovery (DR) for an application deployed using Red Hat Advanced Cluster Management (RHACM).
6.1. Disabling DR managed applications
- For each application, edit the associated Placement on the Hub cluster, and set it to the managed cluster where the application is currently running (a hedged example of such an edit is sketched at the end of this section):

  $ oc edit placement <placement_name> -n <namespace>
- On the Hub cluster, navigate to All Clusters → Applications.
- In the Overview tab, at the end of the protected application row, select Manage disaster recovery from the action menu.
- Click Remove disaster recovery.
- Click Confirm remove.
Warning: Your application will lose disaster recovery protection, preventing volume synchronization (replication) between clusters.
The application continues to be visible in the Applications Overview menu but the Data policy is removed.
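The Placement edit referenced in the first step is not reproduced above. One hedged way to pin the Placement to the cluster where the application currently runs is sketched below, assuming the cluster.open-cluster-management.io/v1beta1 Placement API; the label key and cluster value are illustrative assumptions and may differ from the exact change your environment requires.

  apiVersion: cluster.open-cluster-management.io/v1beta1
  kind: Placement
  metadata:
    name: <placement_name>
    namespace: <namespace>
  spec:
    predicates:
      - requiredClusterSelector:
          labelSelector:
            matchExpressions:
              - key: name                # assumed label; RHACM labels managed clusters with name=<cluster-name>
                operator: In
                values:
                  - <cluster-currently-running-the-application>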
6.2. Disabling DR discovered applications
- In the RHACM console, navigate to All Clusters → Data Services → Protected applications tab.
- At the end of the application row, click on the Actions menu and choose Remove disaster recovery.
- Click Remove in the next prompt.
Warning: Your application will lose disaster recovery protection, preventing volume synchronization (replication) between clusters.
Due to a known issue, CephFS PVCs can become orphaned after disabling DR protection for a discovered workload on the cluster where the application is not running. To clean up the orphaned CephFS PVCs, manually delete them using the following command:
$ oc delete pvc <pvc-name> -n <application-namespace>
The application is no longer listed in the Protected applications tab once DR protection is removed.
Chapter 7. Uninstalling disaster recovery from hub and managed clusters
7.1. Uninstalling disaster recovery from hub and managed clusters
There are two levels of uninstallation for both Regional-DR and Metro-DR:
Partial Uninstallation: This involves removing all DR resources and configurations from two or more peer managed clusters defined in a DRPolicy.
Complete Uninstallation: This is a full uninstall of the DR setup and includes the following steps:
- Remove all DRPolicies and mirrorpeers from all peer clusters.
- Remove the DR operators and associated resources from the hub cluster.
7.1.1. Before uninstalling disaster recovery
Before uninstalling DR, you must first remove disaster recovery protection from all protected applications, both managed and discovered. Follow these steps to remove disaster recovery protection.
Both managed clusters must be online and healthy before a DR uninstall is attempted.
Find all DRPolicies associated with the peer clusters you want to remove DR from using one of the following options:
Using the RHACM console
- On the hub cluster, navigate to All Clusters → Data Services → Disaster Recovery → Policies.
- Identify any DRPolicies associated with the peer clusters you want to remove DR from.
Using the CLI
- Get the names of the peer clusters from the RHACM console. On the hub cluster, navigate to All Clusters → Infrastructure → Clusters.
Run the following commands on the hub cluster, replacing the cluster names with the peer clusters found in the previous step:
$ CLUSTER_A="{cluster1_name}"
$ CLUSTER_B="{cluster2_name}"
$ oc get drpolicies -o json | jq -r --arg c1 "$CLUSTER_A" --arg c2 "$CLUSTER_B" '.items[] | select((.spec.drClusters | sort) == ([$c1, $c2] | sort)) .metadata.name'

- Identify any DRPolicies associated with the peer clusters you want to remove DR from.
Find any existing DRPlacementControl (DRPC) resources associated with your peer managed clusters. Run the following on the hub cluster for each DRPolicy identified in step 1:
$ DRPOLICY_NAME="{drpolicy_name}"

Replace drpolicy_name with the DRPolicy name.

$ oc get drplacementcontrols -A -o json | jq -r --arg drpc "$DRPOLICY_NAME" \
    '.items[] | select(.spec.drPolicyRef.name == $drpc) | "\(.metadata.namespace)/\(.metadata.name)"'

If any DRPCs were found, remove DR protection from the associated applications.
Remove DR protection from managed applications:
- On the Hub cluster, navigate to All Clusters → Applications.
- In the Overview tab, at the end of the protected application row, select Manage disaster recovery from the action menu (⋮).
- Click Remove disaster recovery.
- Click Confirm remove.
Remove DR protection from discovered applications:
- In the RHACM console, navigate to All Clusters → Data Services → Protected applications tab.
- At the end of the application row, click on the action menu (⋮) and choose Remove disaster recovery.
- Click Remove in the next prompt.
Run the following on the hub cluster to find the associated mirrorpeer:
$ export DR_POLICY={drpolicy_name}

Replace drpolicy_name with one of the DRPolicy names found in step 1.

$ DR_CLUSTERS=$(oc get drpolicy $DR_POLICY -o json | jq -r '.spec.drClusters | sort | join(",")')

$ oc get mirrorpeers -A -o json | jq -r --arg dr_clusters "$DR_CLUSTERS" '.items[] | {name: .metadata.name, clusters: (.spec.items | map(.clusterName) | sort | join(","))} | select(.clusters == $dr_clusters) | .name'
7.1.2. Uninstalling disaster recovery for peer managed clusters
Follow these steps to uninstall DR resources for the peer managed clusters.
You will need the cluster names, DRPolicies, and mirrorpeer identified in the previous section.
Procedure
From the hub cluster, delete the DRPolicies associated with the peer managed clusters slated for DR removal.
There will be one or more DRPolicies found in the prior section that need to be deleted.
$ oc delete drpolicy {drpolicy_name1 drpolicy_name2 ...}

Replace drpolicy_name1, and so on, with the DRPolicy names found in the previous section.

From the hub cluster, delete the mirrorpeer associated with the peer managed clusters slated for DR removal.
There will be one mirrorpeer per DRPolicy found in the prior section that needs to be deleted.
$ oc delete mirrorpeer {mirrorpeer_name}
If you are only removing DR for specific peer clusters, do not proceed to uninstall DR operators or other shared resources. This concludes the partial DR uninstall process.
7.1.3. Uninstalling disaster recovery operators
If your goal is to completely remove DR from your OpenShift environment, make sure you have followed all the previous sections to uninstall DR from every peer managed cluster.
Procedure
On the hub cluster, delete the subscriptions.operators and clusterserviceversion (csv) for the OpenShift Data Foundation Multicluster Orchestrator and OpenShift DR Hub Operator:

$ oc delete subscriptions.operators.coreos.com -l operators.coreos.com/odr-hub-operator.openshift-operators -n openshift-operators
$ oc delete subscriptions.operators.coreos.com -l operators.coreos.com/odf-multicluster-orchestrator.openshift-operators -n openshift-operators
$ oc delete csv -l operators.coreos.com/odr-hub-operator.openshift-operators -n openshift-operators
$ oc delete csv -l operators.coreos.com/odf-multicluster-orchestrator.openshift-operators -n openshift-operators

Alternatively, you can delete the ODF Multicluster Orchestrator and DR Hub Operator from the RHACM console:
- Go to Ecosystem → Installed Operators.
- Locate the DR Hub Operator.
- Click the action menu (⋮) and select Uninstall Operator.
- Repeat the same for the Multicluster Orchestrator.
From the hub cluster, delete the managedclusterview for each peer cluster:

$ oc delete managedclusterview -l multicluster.odf.openshift.io/created-by=odf-multicluster-managedcluster-controller -A

From the hub cluster, delete the dynamic console plugin for the Multicluster Orchestrator operator:

$ oc delete consoleplugins.console.openshift.io odf-multicluster-console

From the hub cluster, delete the configmap odf-client-info:

$ oc delete configmap odf-client-info -n openshift-operators

From the hub cluster, delete the odf-multicluster-console service:

$ oc delete service odf-multicluster-console -n openshift-operators
At this point, all DR-related operator components and configurations should be fully removed from the hub cluster and all managed clusters.
Chapter 8. Monitoring disaster recovery health
8.1. Enable monitoring for disaster recovery
Use this procedure to enable basic monitoring for your disaster recovery setup.
Procedure
- On the Hub cluster, open a terminal window
- Add the following label to the openshift-operators namespace:

  $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
You must always add this label for the Regional-DR solution.
8.2. Enabling disaster recovery dashboard on Hub cluster
This section guides you to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.
For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts and application count.
For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.
Prerequisites
Ensure that you have already installed the following:
- OpenShift Container Platform version 4.17 and have administrator privileges.
- ODF Multicluster Orchestrator with the console plugin enabled.
- Red Hat Advanced Cluster Management for Kubernetes 2.11 (RHACM) from Software Catalog. For instructions on how to install, see Installing RHACM.
- Ensure you have enabled observability on RHACM. See Enabling observability guidelines.
Procedure
- On the Hub cluster, open a terminal window and perform the next steps.
- Create the configmap file named observability-metrics-custom-allowlist.yaml.

  You can use the following YAML to list the disaster recovery metrics on the Hub cluster (a hedged example is sketched after this procedure). For details, see Adding custom metrics. To know more about ramen metrics, see Disaster recovery metrics.

- In the open-cluster-management-observability namespace, run the following command:

  $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml

  After observability-metrics-custom-allowlist.yaml is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.

  To exclude a specific managed cluster from collecting the observability data, add the following label to the cluster: observability: disabled.
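A minimal sketch of the allowlist ConfigMap, assuming the RHACM observability custom allowlist format referenced in Adding custom metrics; it simply lists the ramen metrics described in the Disaster recovery metrics section.

  kind: ConfigMap
  apiVersion: v1
  metadata:
    name: observability-metrics-custom-allowlist
    namespace: open-cluster-management-observability
  data:
    metrics_list.yaml: |
      names:                                     # ramen metrics to collect from the managed clusters
        - ramen_last_sync_timestamp_seconds
        - ramen_policy_schedule_interval_seconds
        - ramen_last_sync_duration_seconds
        - ramen_last_sync_data_bytes
        - ramen_workload_protection_status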
8.3. Viewing health status of disaster recovery replication relationships
Prerequisites
Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see chapter Enabling disaster recovery dashboard on Hub cluster.
Procedure
- On the Hub cluster, ensure All Clusters option is selected.
- Refresh the console to make the DR monitoring dashboard tab accessible.
- Navigate to Data Services and click Data policies.
- On the Overview tab, you can view the health status of the operators, clusters, and applications. A green tick indicates that the operators are running and available.
- Click the Disaster recovery tab to view a list of DR policy details and connected applications.
8.4. Viewing health status of ApplicationSet-based applications
You can view the health status of your ApplicationSet-based applications on the user interface. This health status can alert you to replication delays and make for easier troubleshooting.
Procedure
- On the Hub cluster, navigate to All Clusters → Applications.
In the DR Status column of the application, view the status:
- Healthy: The last group sync time is less than 2X the sync interval.
- Warning: The last group sync time is greater than 2X and less than 3X the sync interval.
- Critical: The last group sync time is greater than 3X the sync interval.
You can also view the application status for virtual machine (VM) based workloads:
- On the Hub cluster, navigate to All Clusters → Infrastructure → Virtual Machines → Status column.
8.5. Disaster recovery metrics
These are the ramen metrics that are scraped by Prometheus.
- ramen_last_sync_timestamp_seconds
- ramen_policy_schedule_interval_seconds
- ramen_last_sync_duration_seconds
- ramen_last_sync_data_bytes
- ramen_workload_protection_status
Query these metrics from the Hub cluster where Red Hat Advanced Cluster Management for Kubernetes (RHACM operator) is installed.
8.5.1. Last synchronization timestamp in seconds
This metric reports, in Unix seconds, the time of the most recent successful synchronization of all PVCs per application.

- Metric name: ramen_last_sync_timestamp_seconds
- Metric type: Gauge
- Labels:
  - ObjType: Type of the object, here it is DRPC
  - ObjName: Name of the object, here it is the DRPC name
  - ObjNamespace: DRPC namespace
  - Policyname: Name of the DRPolicy
  - SchedulingInterval: Scheduling interval value from DRPolicy
- Metric value: The value is set as Unix seconds, which is obtained from lastGroupSyncTime in the DRPC status.
8.5.2. Policy schedule interval in seconds
This gives the scheduling interval in seconds from DRPolicy.
- Metric name: ramen_policy_schedule_interval_seconds
- Metric type: Gauge
- Labels:
  - Policyname: Name of the DRPolicy
- Metric value: This is set to the scheduling interval in seconds, which is taken from the DRPolicy.
8.5.3. Last synchronization duration in seconds
This represents the longest time taken to sync from the most recent successful synchronization of all PVCs per application.
- Metric name: ramen_last_sync_duration_seconds
- Metric type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is the DRPC name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value: The value is taken from lastGroupSyncDuration in the DRPC status.
8.5.4. Total bytes transferred from most recent synchronization
This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.
- Metric name: ramen_last_sync_data_bytes
- Metric type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is the DRPC name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value: The value is taken from lastGroupSyncBytes in the DRPC status.
8.5.5. Workload protection status
This value provides the application protection status per application that is DR protected.
- Metric name: ramen_workload_protection_status
- Metric type: Gauge
- Labels:
  - ObjType: Type of the object, here it is DRPC
  - ObjName: Name of the object, here it is the DRPC name
  - ObjNamespace: DRPC namespace
- Metric value: The value is either "1" or "0", where "1" indicates that application DR protection is healthy and "0" indicates that application protection is degraded and potentially unprotected.
8.6. Disaster recovery alerts
This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.
Recording rules
Record: ramen_sync_duration_seconds

- Expression:

  sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))

- Purpose: The time interval between the volume group’s last sync time and the current time, in seconds.

Record: ramen_rpo_difference

- Expression:

  ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})

- Purpose: The difference between the expected sync delay and the actual sync delay taken by the volume replication group.

Record: count_persistentvolumeclaim_total

- Expression:

  count(kube_persistentvolumeclaim_info)

- Purpose: Count of all PVCs from the managed cluster.
Alerts
Alert: VolumeSynchronizationDelay

- Impact: Critical
- Purpose: Actual sync delay taken by the volume replication group is three times the expected sync delay.
- YAML: See the hedged rule sketch after the alert list.

Alert: VolumeSynchronizationDelay

- Impact: Warning
- Purpose: Actual sync delay taken by the volume replication group is twice the expected sync delay.
- YAML: See the hedged rule sketch after the alert list.

Alert: WorkloadUnprotected

- Impact: Critical
- Purpose: Application protection status is degraded for more than 10 minutes.
- YAML: See the hedged rule sketch after the alert list.
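A minimal sketch of what these three rules could look like, reconstructed from the recording rules and purposes above; the expressions, thresholds, for durations, and description wording are assumptions, not the shipped rule definitions.

  groups:
    - name: ramen-dr-alerts
      rules:
        - alert: VolumeSynchronizationDelay
          expr: ramen_rpo_difference >= 3            # assumed: sync delay at least 3x the scheduled interval
          for: 5m                                    # assumed evaluation window
          labels:
            severity: critical
          annotations:
            description: Volume synchronization is taking more than three times the scheduled snapshot interval.
        - alert: VolumeSynchronizationDelay
          expr: ramen_rpo_difference >= 2 and ramen_rpo_difference < 3   # assumed: between 2x and 3x
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Volume synchronization is taking more than twice the scheduled snapshot interval.
        - alert: WorkloadUnprotected
          expr: ramen_workload_protection_status == 0                    # "0" indicates degraded protection
          for: 10m
          labels:
            severity: critical
          annotations:
            description: Application DR protection has been degraded for more than 10 minutes.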
Chapter 9. Troubleshooting disaster recovery
9.1. Troubleshooting Metro-DR
9.1.1. A statefulset application stuck after failover
- Problem
While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".
Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets. This activity was left to the cluster administrator or a software operator managing the StatefulSets. Due to this, the PVCs of the StatefulSets were left untouched when their Pods were deleted. This prevents Ramen from relocating an application to its preferred cluster.

- Resolution
  If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:

  $ oc get pvc -n <namespace>

  For each bound PVC in that namespace that belongs to the StatefulSet, run:

  $ oc delete pvc <pvcname> -n <namespace>

  Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and then gets deleted.

  Run the following command:

  $ oc get drpc -n <namespace> -o wide

  After a few seconds to a few minutes, the PROGRESSION reports "Completed" and relocation is complete.

- Result
  The workload is relocated to the preferred cluster.
BZ reference: [2118270]
9.1.2. DR policies protect all applications in the same namespace
- Problem
  While only a single application is selected to be used by a DR policy, all applications in the same namespace will be protected. This results in PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or all PVCs in the namespace if the selector is missing, being managed multiple times by replication management, which can cause data corruption or invalid operations based on individual DRPlacementControl actions.

- Resolution
  Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace (a hedged example is sketched at the end of this section). It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.

BZ reference: [2128860]
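A minimal sketch of a DRPlacementControl that uses spec.pvcSelector to select only a uniquely labeled workload's PVCs, assuming the ramendr.openshift.io/v1alpha1 API; the names and the appname label are illustrative placeholders, not values from this guide.

  apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    name: <application-drpc-name>
    namespace: <application-namespace>
  spec:
    drPolicyRef:
      name: <drpolicy-name>
    placementRef:
      kind: Placement
      name: <application-placement-name>
    pvcSelector:
      matchLabels:
        appname: <unique-workload-label>   # unique label applied to this workload's PVCs only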
9.1.3. Relocate or failback might be stuck in Initiating state
- Problem
  When a primary cluster is down and comes back online while the secondary goes down, relocate or failback might be stuck in the Initiating state.

- Resolution
  To avoid this situation, cut off all access from the old active hub to the managed clusters.

  Alternatively, you can scale down the ApplicationSet controller on the old active hub cluster either before moving workloads or when they are in the clean-up phase.

  On the old active hub, scale down the following two controllers using these commands:

  $ oc scale deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager --replicas=0
  $ oc scale statefulset -n openshift-gitops openshift-gitops-application-controller --replicas=0
BZ reference: [2243804]
9.2. Troubleshooting Regional-DR
9.2.1. rbd-mirror daemon health is in warning state
- Problem
  There appear to be numerous cases where a WARNING gets reported if the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.

  Following a network disconnection, the rbd-mirror daemon health is in the warning state while the connectivity between both the managed clusters is fine.

- Resolution
  Run the following command in the toolbox and look for leader: false:

  rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'

  If you see leader: false in the output, it indicates that there is a daemon startup issue and the most likely root cause is problems reliably connecting to the secondary cluster.

  Workaround: Move the rbd-mirror pod to a different node by simply deleting the pod and verify that it has been rescheduled on another node.

  leader: true or no output
BZ reference: [2118627]
9.2.2. volsync-rsync-src pod is in error state as it is unable to resolve the destination hostname
- Problem
  The VolSync source pod is unable to resolve the hostname of the VolSync destination pod. The log of the VolSync pod consistently shows an error message over an extended period of time, similar to the following log snippet.

  $ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz

  Example output:

  VolSync rsync container version: ACM-0.6.0-ce9a280
  Syncing data to volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local:22 ...
  ssh: Could not resolve hostname volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local: Name or service not known

- Resolution
  Restart submariner-lighthouse-agent on both nodes:

  $ oc delete pod -l app=submariner-lighthouse-agent -n submariner-operator
9.2.3. Cleanup and data sync for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post failover
- Problem
  ApplicationSet based workload deployments to managed clusters are not garbage collected in cases when the hub cluster fails and is recovered to a standby hub cluster, while the workload has been failed over to a surviving managed cluster. The cluster that the workload was failed over from rejoins the new recovered standby hub.

  ApplicationSets that are DR protected, with a regional DRPolicy, hence start firing the VolumeSynchronizationDelay alert. Further, such DR protected workloads cannot be failed over to the peer cluster or relocated to the peer cluster as data is out of sync between the two clusters.

- Resolution
  The workaround requires that the openshift-gitops operators can own the workload resources that are orphaned on the managed cluster that rejoined the hub after a failover of the workload was performed from the new recovered hub. To achieve this, take the following steps:

  - Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace. Inspect the placement label value for the ApplicationSet in this field: spec.generators.clusterDecisionResource.labelSelector.matchLabels. This would be the name of the Placement resource <placement-name>.

  - Ensure that there exists a PlacementDecision for the Placement referenced by the ApplicationSet:

    $ oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>

    This results in a single PlacementDecision that places the workload in the currently desired failover cluster.

  - Create a new PlacementDecision for the ApplicationSet pointing to the cluster where it should be cleaned up (a hedged example manifest is sketched at the end of this section).

  - Update the newly created PlacementDecision with a status subresource:

    decision-status.yaml:

    status:
      decisions:
        - clusterName: <managedcluster-name-to-clean-up> # This would be the cluster from where the workload was failed over, NOT the current workload cluster
          reason: FailoverCleanup

    $ oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge

  - Watch and ensure that the Application resource for the ApplicationSet has been placed on the desired cluster:

    $ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>

    In the output, check if the SYNC STATUS shows as Synced and the HEALTH STATUS shows as Healthy.

  - Delete the PlacementDecision that was created in step (3), so that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up>:

    $ oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>

  ApplicationSets that are DR protected, with a regional DRPolicy, stop firing the VolumeSynchronizationDelay alert.
BZ reference: [2268594]
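A minimal sketch of the PlacementDecision referenced in step 3, assuming the cluster.open-cluster-management.io/v1beta1 API; the name and label values are illustrative placeholders that must match your ApplicationSet's Placement.

  apiVersion: cluster.open-cluster-management.io/v1beta1
  kind: PlacementDecision
  metadata:
    name: <placement-name>-decision-<n>                                  # follows the naming pattern used in the patch command
    namespace: openshift-gitops
    labels:
      cluster.open-cluster-management.io/placement: <placement-name>     # ties the decision to the ApplicationSet Placement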
9.2.4. Secondary transition failed as PVC is potentially in use by a pod
- Problem
  If you see the error Secondary transition failed as PVC is potentially in use by a pod after a failover and the failed cluster has recovered, it usually means application cleanup was not completed on the primary cluster. This step is critical to release the PVC and allow replication to resume.

- Resolution
  Complete cleanup on the primary cluster. Replication should resume automatically.

  If replication does not resume, perform a Relocate action to move the application to the secondary cluster. This ensures recovery using the last synchronized data.
9.3. Troubleshooting 2-site stretch cluster with Arbiter
9.3.1. Recovering workload pods stuck in ContainerCreating state post zone recovery
- Problem
  After performing complete zone failure and recovery, the workload pods are sometimes stuck in the ContainerCreating state with any of the below errors:

  - MountDevice failed to create newCsiDriverClient: driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
  - MountDevice failed for volume <volume_name> : rpc error: code = Aborted desc = an operation with the given Volume ID <volume_id> already exists
  - MountVolume.SetUp failed for volume <volume_name> : rpc error: code = Internal desc = staging path <path> for volume <volume_id> is not a mountpoint

- Resolution
  If the workload pods are stuck with any of the above mentioned errors, perform the following workarounds:

  For a ceph-fs workload stuck in ContainerCreating:

  - Restart the nodes where the stuck pods are scheduled.
  - Delete these stuck pods.
  - Verify that the new pods are running.

  For a ceph-rbd workload stuck in ContainerCreating that does not self recover after some time:

  - Restart the csi-rbd plugin pods in the nodes where the stuck pods are scheduled.
  - Verify that the new pods are running.