2.10.3. Recovering Failed Node Hosts
Important
This section presumes you have backed up the /var/lib/openshift directory. See Section 2.10.2, “Backing Up Node Host Files” for more information.
A failed node host can be recovered if the /var/lib/openshift gear directory had fault tolerance and can be restored. SELinux contexts must be preserved with the gear directory for recovery to succeed. This scenario rarely occurs, especially when node hosts are virtual machines in a fault-tolerant infrastructure rather than physical machines. Scaled applications cannot be recovered onto a node host with a different IP address than the original node host.
Procedure 2.7. To Recover a Failed Node Host:
- Create a node host with the same host name and IP address as the one that failed.
- The host name DNS A record can be adjusted if the IP address must be different. However, note that the application CNAME and database records all point to the host name and cannot be easily changed.
- Ensure the ruby193-mcollective service is not running on the new node host:
# service ruby193-mcollective stop
- Copy all the configuration files in the /etc/openshift directory from the failed node host to the new node host, and ensure that the gear profile is the same.
- Attach and mount the backup to /var/lib/openshift, ensuring the usrquota mount option is used:
# echo "/dev/path/to/backup/partition /var/lib/openshift/ ext4 defaults,usrquota 0 0" >> /etc/fstab
# mount -a
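Appending to /etc/fstab with echo adds a duplicate line each time the step is repeated. The following is a minimal sketch of an idempotent variant, assuming an ext4 backup partition as in the command above. The add_fstab_entry function name and its optional third argument (an alternative fstab path, useful for dry runs) are hypothetical and not part of OpenShift Enterprise.

```shell
#!/bin/sh
# Hypothetical helper (not an OpenShift tool): append an fstab line with the
# usrquota option only if the mount point is not already listed.
# Usage: add_fstab_entry DEVICE MOUNT_POINT [FSTAB_FILE]
add_fstab_entry() {
    device="$1"
    mount_point="$2"
    fstab="${3:-/etc/fstab}"
    mp="${mount_point%/}"   # compare mount points without a trailing slash

    # Scan non-comment lines; the second field of an fstab line is the mount point.
    if awk -v mp="$mp" '
        $1 !~ /^#/ { m = $2; sub(/\/$/, "", m); if (m == mp) found = 1 }
        END { exit !found }' "$fstab"; then
        echo "fstab entry for $mount_point already present; nothing added" >&2
        return 0
    fi

    # Same fields as the command in the procedure (ext4 is an assumption).
    printf '%s %s ext4 defaults,usrquota 0 0\n' "$device" "$mount_point" >> "$fstab"
}
```

As root, this could be called as `add_fstab_entry /dev/path/to/backup/partition /var/lib/openshift/` and followed by `mount -a`, exactly as in the step above.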
- Reinstate quotas on the /var/lib/openshift directory:
# quotacheck -cmug /var/lib/openshift
# restorecon /var/lib/openshift/aquota.user
# quotaon /var/lib/openshift
- Run the oo-admin-regenerate-gear-metadata tool, available starting in OpenShift Enterprise 2.1.6, on the new node host to replace and recover the failed gear data. This browses each existing gear on the gear data volume and ensures it has the correct entries in certain files, and if necessary, performs any fixes:
# oo-admin-regenerate-gear-metadata
This script attempts to regenerate gear entries for:
  * /etc/passwd
  * /etc/shadow
  * /etc/group
  * /etc/cgrules.conf
  * /etc/cgconfig.conf
  * /etc/security/limits.d
Proceed? [yes/NO]: yes
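To get a rough idea of what the tool has to repair, the restored gear directories can be compared against /etc/passwd. The following read-only sketch is not how oo-admin-regenerate-gear-metadata works internally; it assumes that each non-hidden directory directly under /var/lib/openshift is a gear whose directory name is also its login name. The gears_missing_passwd function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical read-only check (not an OpenShift tool): print the names of
# gear directories that have no matching login name in the passwd file.
# Usage: gears_missing_passwd [GEAR_BASE] [PASSWD_FILE]
gears_missing_passwd() {
    gear_base="${1:-/var/lib/openshift}"
    passwd_file="${2:-/etc/passwd}"

    # The glob skips hidden directories, which are assumed not to be gears.
    for dir in "$gear_base"/*/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        name=$(basename "$dir")
        # Compare against the first colon-separated field (the login name).
        if ! awk -F: -v u="$name" '$1 == u { found = 1 } END { exit !found }' "$passwd_file"; then
            printf '%s\n' "$name"
        fi
    done
}
```

Running `gears_missing_passwd` with no arguments on the new node host lists the gears that still lack a passwd entry; it changes nothing on disk.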
The oo-admin-regenerate-gear-metadata tool will not make any changes unless it finds missing entries. Note that this tool can be added to a node host deployment script.
Alternatively, if you are using OpenShift Enterprise 2.1.5 or earlier, replace the /etc/passwd file on the new node host with the content from the original, failed node host. If this backup file was lost, see Section 2.10.4, “Recreating /etc/passwd Entries” for instructions on recreating the /etc/passwd file.
- When the oo-admin-regenerate-gear-metadata tool completes, it runs the oo-accept-node command and reports the output:
Running oo-accept-node to check node consistency...
...
FAIL: user 54fe156faf1c09b9a900006f does not have quotas imposed. This can be addressed by running: oo-devel-node set-quota --with-container-uuid 54fe156faf1c09b9a900006f --blocks 2097152 --inodes 80000
If there are any quota errors, run the suggested quota command, then run the oo-accept-node command again to ensure the problem has been resolved:
# oo-devel-node set-quota --with-container-uuid 54fe156faf1c09b9a900006f --blocks 2097152 --inodes 80000
# oo-accept-node
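When many gears report quota failures, copying each suggested command by hand is error-prone. The following sketch collects the suggestions for review. It assumes each failure and its suggested command appear on a single line, worded as in the sample output above; the suggested_quota_fixes function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter (not an OpenShift tool): read oo-accept-node output on
# stdin and print only the suggested "oo-devel-node set-quota" commands.
# Assumes the message wording shown in the sample output of this procedure.
suggested_quota_fixes() {
    sed -n 's/^.*This can be addressed by running: \(oo-devel-node set-quota .*\)$/\1/p'
}
```

For example, `oo-accept-node 2>&1 | suggested_quota_fixes > quota-fixes.sh` writes the commands to a file that can be inspected before it is run. The filter only prints the commands and deliberately does not execute them.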
- Reboot the new node host to activate all changes, start the gears, and allow MCollective and other services to run.