Chapter 5. Optimizing persistent storage
5.1. Overview
Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are used efficiently.
This guide primarily focuses on optimizing persistent storage. Local ephemeral storage, used for data that exists only for the lifetime of a pod, offers fewer options. Ephemeral storage is available only if you enabled the ephemeral storage Technology Preview in OpenShift Container Platform 3.10; this feature is disabled by default. See Configuring Ephemeral Storage for more information.
5.2. General storage guidelines
The following table lists the available persistent storage technologies for OpenShift Container Platform.
Storage type | Description | Examples |
---|---|---|
Block | Presented to the operating system (OS) as a block device. Suitable for applications that need full control of storage and that operate at a low level on files, bypassing the file system. Also referred to as a storage area network (SAN). Non-shareable: only one client at a time can mount an endpoint of this type. | converged mode/independent mode GlusterFS [a], iSCSI, Fibre Channel, Ceph RBD, OpenStack Cinder, AWS EBS [a], Dell/EMC Scale.IO, VMware vSphere Volume, GCE Persistent Disk [a], Azure Disk |
File | Presented to the OS as a file system export to be mounted. Also referred to as network attached storage (NAS). Concurrency, latency, file locking mechanisms, and other capabilities vary widely between protocols, implementations, vendors, and scales. | converged mode/independent mode GlusterFS [a], RHEL NFS, NetApp NFS [b], Azure File, Vendor NFS, Vendor GlusterFS [c], AWS EFS |
Object | Accessible through a REST API endpoint. Configurable for use in the OpenShift Container Platform registry. Applications must build their drivers into the application and/or container. | converged mode/independent mode GlusterFS [a], Ceph Object Storage (RADOS Gateway), OpenStack Swift, Aliyun OSS, AWS S3, Google Cloud Storage, Azure Blob Storage, Vendor S3 [c], Vendor Swift [c] |
[a] converged mode/independent mode GlusterFS, Ceph RBD, OpenStack Cinder, AWS EBS, Azure Disk, GCE persistent disk, and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OpenShift Container Platform.
[b] NetApp NFS supports dynamic PV provisioning when using the Trident plugin.
[c] Vendor GlusterFS, Vendor S3, and Vendor Swift supportability and configurability may vary.
As of OpenShift Container Platform 3.6.1, converged mode GlusterFS (a hyperconverged or cluster-hosted storage solution) and independent mode GlusterFS (an externally hosted storage solution) provide interfaces for block, file, and object storage for use by the OpenShift Container Platform registry, logging, and metrics.
5.3. Storage recommendations
The following table summarizes the recommended and configurable storage technologies for the given OpenShift Container Platform cluster application.
Storage type | ROX [a] | RWX [b] | Registry | Scaled registry | Metrics | Logging | Apps |
---|---|---|---|---|---|---|---|
Block | Yes [c] | No | Configurable | Not configurable | Recommended | Recommended | Recommended |
File | Yes [c] | Yes | Configurable | Configurable | Configurable [d] | Configurable [e] | Recommended |
Object | Yes | Yes | Recommended | Recommended | Not configurable | Not configurable | Not configurable [f] |
[a] ReadOnlyMany
[b] ReadWriteMany
[c] This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, and Azure Disk.
[d] For metrics, it is an anti-pattern to use any shared storage and a single volume (RWX). By default, metrics deploys with one volume per Cassandra replica.
[e] For logging, using any shared storage would be an anti-pattern. One volume per logging-es is required.
[f] Object storage is not consumed through OpenShift Container Platform's PVs or persistent volume claims (PVCs). Applications must integrate with the object storage REST API.
A scaled registry is an OpenShift Container Platform registry where three or more pod replicas are running.
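For example, an application that needs storage shared across several pods requests the RWX access mode through a persistent volume claim (PVC). The following is a minimal sketch, not taken from any specific deployment; the claim name shared-data, the size, and the glusterfs-storage storage class are placeholder values for your environment:
# oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                    # placeholder claim name
spec:
  accessModes:
  - ReadWriteMany                      # RWX: multiple pods can mount the volume read-write
  resources:
    requests:
      storage: 10Gi                    # placeholder size
  storageClassName: glusterfs-storage  # placeholder class backed by file storage
EOF
Block-backed PVs typically offer only ReadWriteOnce (RWO), which is why block storage is listed with No in the RWX column of the table above.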
5.3.1. Specific application storage recommendations
Testing shows issues with using the RHEL NFS server as a storage backend for core services. This includes the OpenShift Container Registry and Quay, Cassandra for metrics storage, and Elasticsearch for logging storage. Therefore, using the RHEL NFS server to back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that might have been completed against these OpenShift Container Platform core components.
5.3.1.1. Registry
In a non-scaled/high-availability (HA) OpenShift Container Platform registry cluster deployment:
- The preferred storage technology is object storage followed by block storage. The storage technology does not need to support RWX access mode.
- The storage technology must ensure read-after-write consistency. NAS storage (with the exception of converged mode/independent mode GlusterFS, which uses an object storage interface) is not recommended for an OpenShift Container Platform registry cluster deployment with production workloads.
- While hostPath volumes are configurable for a non-scaled/HA OpenShift Container Platform registry, they are not recommended for cluster deployment.
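As an illustration, a non-scaled registry can be moved onto a PVC-backed volume with the oc set volume command. This is a hedged sketch: the deployment configuration name docker-registry matches a default installation, and registry-storage-claim is a hypothetical, pre-created claim bound to suitable block or file storage:
# Replace the registry's default volume with an existing PVC (claim name is hypothetical)
# oc set volume dc/docker-registry --add --overwrite \
    --name=registry-storage -t pvc --claim-name=registry-storage-claim
Object storage is not consumed through a PVC; it is configured in the registry configuration file instead, as sketched in the scaled registry section that follows.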
5.3.1.2. Scaled registry
In a scaled/HA OpenShift Container Platform registry cluster deployment:
- The preferred storage technology is object storage. The storage technology must support RWX access mode and must ensure read-after-write consistency.
- File storage and block storage are not recommended for a scaled/HA OpenShift Container Platform registry cluster deployment with production workloads.
- NAS storage (with the exception of converged mode/independent mode GlusterFS, which uses an object storage interface) is not recommended for a scaled/HA OpenShift Container Platform registry cluster deployment with production workloads.
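Because object storage is not consumed through PVs/PVCs, a scaled registry is typically pointed at an object store through its configuration file. The following sketch shows what the storage section of a registry config.yml might look like with the S3 driver; the bucket, region, and credentials are placeholders, and the exact file location and delivery mechanism (for example, a secret) depend on how your registry is deployed:
storage:
  cache:
    blobdescriptor: inmemory
  s3:
    accesskey: <access_key>      # placeholder credentials
    secretkey: <secret_key>
    region: us-east-1            # placeholder region
    bucket: registry-bucket      # placeholder bucket name
    encrypt: true
    secure: true
  delete:
    enabled: true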
5.3.1.3. Metrics
In an OpenShift Container Platform hosted metrics cluster deployment:
- The preferred storage technology is block storage.
- NAS storage (with the exception of converged mode/independent mode GlusterFS, which uses a block storage interface over iSCSI) is not recommended for a hosted metrics cluster deployment with production workloads.
Testing shows issues with using the RHEL NFS server as a storage backend for core services, including Cassandra for metrics storage. Therefore, using NFS to back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that might have been completed against these OpenShift Container Platform core components.
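For example, when metrics are deployed with the openshift-ansible playbooks, block-backed volumes can be requested through dynamic provisioning with inventory variables such as the following. This is a sketch that assumes your default storage class provisions block storage; the size is a placeholder:
[OSEv3:vars]
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_storage_type=dynamic   # one dynamically provisioned PV per Cassandra replica
openshift_metrics_cassandra_pvc_size=20Gi          # placeholder size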
5.3.1.4. Logging
In an OpenShift Container Platform hosted logging cluster deployment:
- The preferred storage technology is block storage.
- NAS storage (with the exception of converged mode/independent mode GlusterFS, which uses a block storage interface over iSCSI) is not recommended for a hosted logging cluster deployment with production workloads.
Testing shows issues with using the RHEL NFS server as a storage backend for core services, including Elasticsearch for logging storage. Therefore, using NFS to back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that might have been completed against these OpenShift Container Platform core components.
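Similarly, the following is a hedged sketch of inventory variables that give each Elasticsearch (logging-es) pod its own dynamically provisioned volume when logging is deployed with the openshift-ansible playbooks; the cluster size and volume size are placeholders:
[OSEv3:vars]
openshift_logging_install_logging=true
openshift_logging_es_cluster_size=3          # placeholder; one volume is created per logging-es pod
openshift_logging_es_pvc_dynamic=true        # request dynamically provisioned PVs
openshift_logging_es_pvc_size=100Gi          # placeholder size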
5.3.1.5. Applications
Application use cases vary from application to application, as described in the following examples:
- Storage technologies that support dynamic PV provisioning have low mount-time latencies and are not tied to specific nodes, which helps maintain a healthy cluster.
- Application developers are responsible for knowing and understanding the storage requirements for their application, and how it works with the provided storage to ensure that issues do not occur when an application scales or interacts with the storage layer.
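For example, dynamic PV provisioning is driven by a storage class. The following sketch defines a class backed by AWS EBS gp2 volumes (block storage); the class name fast-block is a placeholder, and the provisioner must match your platform:
# oc create -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-block                  # placeholder class name
provisioner: kubernetes.io/aws-ebs  # use the provisioner that matches your environment
parameters:
  type: gp2
EOF
Applications then request storage by referencing the class in a PVC, as in the earlier access mode example, and the PV is created and bound on demand.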
5.3.2. Other specific application storage recommendations
- OpenShift Container Platform Internal etcd: For the best etcd reliability, use the storage technology with the lowest consistent latency.
- OpenStack Cinder: OpenStack Cinder tends to be adept in ROX access mode use cases.
- Databases: Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block storage.
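As a rough way to gauge whether a candidate etcd volume delivers consistently low latency, you can run an fdatasync-heavy fio job against it. This is a sketch only; the test directory, size, and block size are illustrative values for etcd-style write patterns, not Red Hat-mandated thresholds:
# mkdir -p /var/lib/etcd/fio-test
# fio --name=etcd-sync-test --directory=/var/lib/etcd/fio-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
# rm -rf /var/lib/etcd/fio-test
In the output, look at the fsync/fdatasync latency percentiles; the lower and more consistent they are, the better the volume suits etcd.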
5.4. Choosing a graph driver
Container runtimes store images and containers in a graph driver (a pluggable storage technology), such as DeviceMapper and OverlayFS. Each has advantages and disadvantages.
For more information about OverlayFS, including supportability and usage caveats, see the Red Hat Enterprise Linux (RHEL) 7 Release Notes for your version.
Name | Description | Benefits | Limitations |
---|---|---|---|
OverlayFS | Combines a lower (parent) and upper (child) filesystem and a working directory (on the same filesystem as the child). The lower filesystem is the base image, and when you create new containers, a new upper filesystem is created containing the deltas. | Faster container start up and page cache sharing between containers that share an image, which can reduce overall memory utilization. | Not POSIX compliant. |
Device Mapper Thin Provisioning | Uses LVM, Device Mapper, and the dm-thinp kernel module. It differs from loop-lvm by removing the loopback device and talking straight to a raw partition (no filesystem). | Recommended over loop-lvm when Device Mapper is used in production; the thin pool is automatically extended by LVM as storage is consumed. | Requires a dedicated block device or partition and explicit setup, for example with docker-storage-setup. |
Device Mapper loop-lvm | Uses the Device Mapper thin provisioning module (dm-thin-pool) to implement copy-on-write (CoW) snapshots. For each Device Mapper graph location, a thin pool is created based on two block devices, one for data and one for metadata. By default, these block devices are created automatically by using loopback mounts of automatically created sparse files. | It works out of the box, so it is useful for prototyping and development purposes. | Strongly discouraged for production use; loopback devices degrade performance. |
For better performance, Red Hat strongly recommends using the OverlayFS storage driver rather than Device Mapper. However, if you are already using Device Mapper in a production environment, Red Hat strongly recommends using thin provisioning for container images and container root file systems. Otherwise, use overlay2 for the Docker engine or OverlayFS for CRI-O.
Using a loop device can degrade performance. While you can continue to use it, the following warning message is logged:
devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section.
To ease storage configuration, use the docker-storage-setup utility, which automates much of the configuration:
Edit the /etc/sysconfig/docker-storage-setup file to specify the device driver:
STORAGE_DRIVER=devicemapper
Or
STORAGE_DRIVER=overlay2
Note: If you are using CRI-O, specify STORAGE_DRIVER=overlay to use overlay2.
If you have a separate disk drive dedicated to Docker storage (for example, /dev/xvdb), add the following to the /etc/sysconfig/docker-storage-setup file:
DEVS=/dev/xvdb
VG=docker_vg
Restart the docker-storage-setup service:
# systemctl restart docker-storage-setup
After the restart, docker-storage-setup sets up a volume group named docker_vg and creates a thin-pool logical volume. Documentation for thin provisioning on RHEL is available in the LVM Administrator Guide. View the newly created volumes with the lsblk command:
# lsblk /dev/xvdb
NAME                               MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdb                               202:16   0  20G  0 disk
└─xvdb1                            202:17   0  10G  0 part
  ├─docker_vg-docker--pool_tmeta   253:0    0  12M  0 lvm
  │ └─docker_vg-docker--pool       253:2    0 6.9G  0 lvm
  └─docker_vg-docker--pool_tdata   253:1    0 6.9G  0 lvm
    └─docker_vg-docker--pool       253:2    0 6.9G  0 lvm
Note: Thin-provisioned volumes are not mounted and have no file system (individual containers do have an XFS file system), thus they do not show up in df output.
To verify that Docker is using an LVM thin pool, and to monitor disk space use, run the docker info command:
# docker info | egrep -i 'storage|pool|space|filesystem'
Storage Driver: overlay2 1
 Backing Filesystem: extfs
1 The docker info output when using overlay2.
# docker info | egrep -i 'storage|pool|space|filesystem'
Storage Driver: devicemapper 1
 Pool Name: docker_vg-docker--pool 2
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data Space Used: 62.39 MB
 Data Space Total: 6.434 GB
 Data Space Available: 6.372 GB
 Metadata Space Used: 40.96 kB
 Metadata Space Total: 16.78 MB
 Metadata Space Available: 16.74 MB
1 The docker info output when using devicemapper.
2 The pool name corresponds to the VG=docker_vg volume group specified in the /etc/sysconfig/docker-storage-setup file.
By default, a thin pool is configured to use 40% of the underlying block device. As you use the storage, LVM automatically extends the thin pool up to 100%. This is why the Data Space Total value does not match the full size of the underlying LVM device. This auto-extend technique was used to unify the storage approach taken in both Red Hat Enterprise Linux and Red Hat Atomic Host, which only uses a single partition.
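To watch the thin pool fill and auto-extend, you can report its data and metadata usage with lvs; the volume group name matches the docker_vg example above:
# Report thin pool usage; data_percent and metadata_percent show how full the pool is
# lvs -o lv_name,vg_name,lv_size,data_percent,metadata_percent docker_vg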
In development, Docker in Red Hat distributions defaults to a loopback mounted sparse file. To see if your system is using the loopback mode:
# docker info | grep loop0
 Data file: /dev/loop0
For production workloads, Red Hat strongly recommends the overlay2 storage driver; if you must use Device Mapper, use thin-pool mode rather than loopback devices.
OverlayFS is also supported for container runtime use cases as of Red Hat Enterprise Linux 7.2, and provides faster start up times and page cache sharing, which can potentially improve density by reducing overall memory utilization.
5.4.1. Benefits of using OverlayFS or DeviceMapper with SELinux
The main advantage of the OverlayFS graph is Linux page cache sharing among containers that share an image on the same node. This attribute of OverlayFS leads to reduced input/output (I/O) during container startup (and, thus, faster container startup time by several hundred milliseconds), as well as reduced memory usage when similar images are running on a node. Both of these results are beneficial in many environments, especially those that aim to optimize for density and have a high container churn rate (such as a build farm), or those that have significant overlap in image content.
Page cache sharing is not possible with DeviceMapper because thin-provisioned devices are allocated on a per-container basis.
DeviceMapper is the default Docker storage configuration on Red Hat Enterprise Linux. The use of OverlayFS as the container storage technology is under evaluation and moving Red Hat Enterprise Linux to OverlayFS as the default in future releases is under consideration.
5.4.2. Comparing the Overlay and Overlay2 graph drivers
OverlayFS is a type of union file system. It allows you to overlay one file system on top of another. Changes are recorded in the upper file system, while the lower file system remains unmodified. This allows multiple users to share a file-system image, such as a container or a DVD-ROM, where the base image is on read-only media.
OverlayFS layers two directories on a single Linux host and presents them as a single directory. These directories are called layers, and the unification process is referred to as a union mount.
OverlayFS uses one of two graph drivers, overlay or overlay2. As of Red Hat Enterprise Linux 7.2, overlay became a supported graph driver. As of Red Hat Enterprise Linux 7.4, overlay2 became supported. SELinux on the docker daemon became supported in Red Hat Enterprise Linux 7.4. See the Red Hat Enterprise Linux release notes for information on using OverlayFS with your version of RHEL, including supportability and usage caveats.
The overlay2 driver natively supports up to 128 lower OverlayFS layers, but the overlay driver works only with a single lower OverlayFS layer. Because of this capability, the overlay2 driver provides better performance for layer-related Docker commands, such as docker build, and consumes fewer inodes on the backing filesystem.
Because the overlay driver works with a single lower OverlayFS layer, you cannot implement multi-layered images as multiple OverlayFS layers. Instead, each image layer is implemented as its own directory under /var/lib/docker/overlay. Hard links are then used as a space-efficient way to reference data shared with lower layers.
Docker recommends using the overlay2 driver with OverlayFS rather than the overlay driver, because it is more efficient in terms of inode utilization.
You need kernel version 3.10.0-693 or higher to use overlay2 with RHEL or CentOS.
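As a quick sanity check before switching drivers, you can confirm the kernel prerequisite and verify which storage driver is currently active:
# The reported kernel release must be 3.10.0-693 or later
# uname -r
# Print the storage driver the Docker daemon is currently using
# docker info --format '{{.Driver}}'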