Chapter 14. Backing up and restoring Data Grid clusters
Data Grid Operator lets you back up and restore Data Grid cluster state for disaster recovery and to migrate Data Grid resources between clusters.
14.1. Backup and Restore CRs
Backup and Restore CRs save in-memory data at runtime so you can easily recreate Data Grid clusters.
Applying a Backup or Restore CR creates a new pod that joins the Data Grid cluster as a zero-capacity member, which means it does not require cluster rebalancing or state transfer to join.
For backup operations, the pod iterates over cache entries and other resources and creates an archive, a .zip file, in the /opt/infinispan/backups directory on the persistent volume (PV).
Performing backups does not significantly impact performance because the other pods in the Data Grid cluster only need to respond to the backup pod as it iterates over cache entries.
For restore operations, the pod retrieves Data Grid resources from the archive on the PV and applies them to the Data Grid cluster.
When either the backup or restore operation completes, the pod leaves the cluster and is terminated.
Reconciliation
Data Grid Operator does not reconcile Backup and Restore CRs, which means that backup and restore operations are "one-time" events.
Modifying an existing Backup or Restore CR instance does not perform an operation or have any effect. If you want to update .spec fields, you must create a new instance of the Backup or Restore CR.
14.2. Backing up Data Grid clusters
Create a backup file that stores Data Grid cluster state to a persistent volume.
Prerequisites
- Create an Infinispan CR with spec.service.type: DataGrid.
- Ensure there are no active client connections to the Data Grid cluster.

  Data Grid backups do not provide snapshot isolation and data modifications are not written to the archive after the cache is backed up.
  To archive the exact state of the cluster, you should always disconnect any clients before you back it up.
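
As a minimal sketch of the first prerequisite, an Infinispan CR that requests the Data Grid service type might look like the following. The cluster name source-cluster and the replica count are placeholders chosen for illustration, and the infinispan.org/v1 API version is an assumption that you should check against the CRDs installed in your environment.

```yaml
# Hypothetical Infinispan CR; the name and replica count are illustrative only.
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: source-cluster
spec:
  replicas: 2
  service:
    # Backup and Restore CRs require Data Grid service pods.
    type: DataGrid
```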
Procedure
- Name the Backup CR with the metadata.name field.
- Specify the Data Grid cluster to back up with the spec.cluster field.
- Configure the persistent volume claim (PVC) that adds the backup archive to the persistent volume (PV) with the spec.volume.storage and spec.volume.storageClassName fields.
- Optionally include spec.resources fields to specify which Data Grid resources you want to back up.

  If you do not include any spec.resources fields, the Backup CR creates an archive that contains all Data Grid resources. If you do specify spec.resources fields, the Backup CR creates an archive that contains those resources only.

  You can also use the * wildcard character to select every resource of a given type.
- Apply your Backup CR.

      $ oc apply -f my-backup.yaml
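
The procedure above can be sketched as a single manifest. The following example is illustrative rather than authoritative: the infinispan.org/v2alpha1 API version, the cluster name, the storage size, the storage class, and all resource names are assumptions or placeholders, so adjust them to match your environment. The first document backs up named resources only. The second shows the * wildcard, quoted so that YAML parsers treat it as a string.

```yaml
# Sketch of my-backup.yaml; every value below is a placeholder.
apiVersion: infinispan.org/v2alpha1
kind: Backup
metadata:
  name: my-backup                # name of the Backup CR
spec:
  cluster: source-cluster        # Data Grid cluster to back up
  volume:
    storage: 1Gi                 # size requested by the PVC
    storageClassName: my-storage-class
  resources:                     # optional; omit to back up everything
    caches:
      - cache-one
      - cache-two
    counters:
      - counter-name
    protoSchemas:
      - authors.proto
---
# Variant that uses the wildcard to include all caches and all schemas.
apiVersion: infinispan.org/v2alpha1
kind: Backup
metadata:
  name: my-backup-all-caches     # hypothetical second instance
spec:
  cluster: source-cluster
  volume:
    storage: 1Gi
    storageClassName: my-storage-class
  resources:
    caches:
      - "*"
    protoSchemas:
      - "*"
```

Because Backup CRs are not reconciled, each document here represents a separate one-time operation with its own metadata.name.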
Verification
- Check that the status.phase field has a status of Succeeded in the Backup CR and that Data Grid logs have the following message:

      ISPN005044: Backup file created 'my-backup.zip'

- Run the following command to check that the backup is successfully created:

      $ oc describe Backup my-backup
14.3. Restoring Data Grid clusters
Restore Data Grid cluster state from a backup archive.
Prerequisites
- Create a Backup CR on a source cluster.
- Create a target Data Grid cluster of Data Grid service pods.

  Note: If you restore an existing cache, the operation overwrites the data in the cache but not the cache configuration.

  For example, you back up a distributed cache named mycache on the source cluster. You then restore mycache on a target cluster where it already exists as a replicated cache. In this case, the data from the source cluster is restored and mycache continues to have a replicated configuration on the target cluster.
- Ensure there are no active client connections to the target Data Grid cluster you want to restore.

  Cache entries that you restore from a backup can overwrite more recent cache entries.
  For example, if a client performs a cache.put(k=2) operation and you then restore a backup that contains k=1, the restore replaces the newer value with the older one.
Procedure
- Name the Restore CR with the metadata.name field.
- Specify a Backup CR to use with the spec.backup field.
- Specify the Data Grid cluster to restore with the spec.cluster field.
- Optionally add the spec.resources field to restore specific resources only.
- Apply your Restore CR.

      $ oc apply -f my-restore.yaml
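
A Restore CR assembled from the steps above might look like the following sketch. The infinispan.org/v2alpha1 API version, the target cluster name, and the resource names are assumptions made for illustration. The spec.backup value refers to the Backup CR named my-backup that the verification message in this section also mentions.

```yaml
# Sketch of my-restore.yaml; cluster and resource names are placeholders.
apiVersion: infinispan.org/v2alpha1
kind: Restore
metadata:
  name: my-restore               # name of the Restore CR
spec:
  backup: my-backup              # Backup CR that produced the archive
  cluster: target-cluster        # Data Grid cluster to restore into
  resources:                     # optional; omit to restore everything in the archive
    caches:
      - cache-one
    protoSchemas:
      - authors.proto
```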
Verification
- Check that the status.phase field has a status of Succeeded in the Restore CR and that Data Grid logs have the following message:

      ISPN005045: Restore 'my-backup' complete

You should then open the Data Grid Console or establish a CLI connection to verify data and Data Grid resources are restored as expected.
14.4. Backup and restore status
Backup and Restore CRs include a status.phase field that provides the status for each phase of the operation.
| Status | Description |
|---|---|
| Initializing | The system has accepted the request and the controller is preparing the underlying resources to create the pod. |
| Initialized | The controller has prepared all underlying resources successfully. |
| Running | The pod is created and the operation is in progress on the Data Grid cluster. |
| Succeeded | The operation has completed successfully on the Data Grid cluster and the pod is terminated. |
| Failed | The operation did not successfully complete and the pod is terminated. |
| Unknown | The controller cannot obtain the status of the pod or determine the state of the operation. This condition typically indicates a temporary communication error with the pod. |
14.4.1. Handling failed backup and restore operations
If the status.phase field of the Backup or Restore CR is Failed, you should examine pod logs to determine the root cause before you attempt the operation again.
Procedure
- Examine the logs for the pod that performed the failed operation.

  Pods are terminated but remain available until you delete the Backup or Restore CR.

      $ oc logs <backup|restore_pod_name>

- Resolve any error conditions or other causes of failure as indicated by the pod logs.
- Create a new instance of the Backup or Restore CR and attempt the operation again.