Chapter 4. Configuring Resource Isolation on Hyper-converged Nodes


Whether through the default hyperconverged-ceph.yaml file (in Section 3.1, “Pure HCI”) or the custom OsdCompute role (in Section 3.2, “Mixed HCI”), the director creates hyper-converged nodes by co-locating Ceph OSD and Compute services. Without further tuning, however, this co-location risks resource contention between the Ceph and Compute services, as neither is aware of the other’s presence on the same host. Resource contention can degrade service, which in turn offsets the benefits provided by hyper-convergence.

To prevent contention, you need to configure resource isolation for both Ceph and Compute services. The following sub-sections describe how to do so.

4.1. Reserve CPU and Memory Resources for Compute

By default, the Compute service parameters do not take into account the co-location of Ceph OSD services on the same node. Hyper-converged nodes therefore need to be tuned to maintain stability and to maximize the number of possible instances. Use the computations in this section as a guideline for an optimal baseline, then adjust the settings to find an acceptable trade-off between determinism and instance-hosting capacity. The examples provided in this document prioritize determinism and uptime.

The following Heat template parameters control how the Compute services consume memory and CPU resources on a node:

reserved_host_memory

This is the amount of memory (in MB) to reserve for the host node. To determine an appropriate value for hyper-converged nodes, assume that each OSD consumes 3GB of memory. Given a node with 256GB of memory and 10 OSDs, you can allocate 30GB of memory for Ceph, leaving 226GB for Compute. With that much memory, a node can host, for example, 113 instances using 2GB of memory each.

However, you still need to consider the additional overhead the hypervisor requires per instance. Assuming this overhead is 0.5GB, the same node can only host 90 instances, that is, 226GB divided by 2.5GB per instance. The amount of memory to reserve for the host node (that is, memory the Compute service should not use) is:

(In * Ov) + (Os * RA)

Where:

  • In: number of instances
  • Ov: amount of overhead memory needed per instance
  • Os: number of OSDs on the node
  • RA: amount of RAM that each OSD should have

With 90 instances, this gives us (90 * 0.5) + (10 * 3) = 75GB. The Compute service expects this value in MB, namely 75000.

The following Python code provides this computation, using the example values from this section as inputs:

mem = 256                     # example: total node RAM in GB
GB_per_OSD = 3                # RAM (in GB) to reserve per Ceph OSD
osds = 10                     # number of OSDs on the node
average_guest_size = 2        # RAM (in GB) per instance
GB_overhead_per_guest = 0.5   # hypervisor overhead (in GB) per instance
MB_per_GB = 1000

left_over_mem = mem - (GB_per_OSD * osds)
number_of_guests = int(left_over_mem /
    (average_guest_size + GB_overhead_per_guest))
nova_reserved_mem_MB = MB_per_GB * (
    (GB_per_OSD * osds) +
    (number_of_guests * GB_overhead_per_guest))  # 75000 with these example inputs

cpu_allocation_ratio

The Compute scheduler uses this ratio when choosing the Compute nodes on which to deploy an instance. By default, it is 16.0 (that is, 16:1). This means that if a node has 56 cores, the Compute scheduler will schedule enough instances to consume 896 vCPUs before it considers the node unable to host any more.

To determine a suitable cpu_allocation_ratio for a hyper-converged node, assume that each Ceph OSD uses at least one core (more if the workload is I/O-intensive and the node has no SSDs). On a node with 56 cores and 10 OSDs, this leaves 46 cores for Compute. If each instance used 100 per cent of the CPU it receives, the ratio would simply be the number of instance vCPUs divided by the number of cores; that is, 46 / 56 = 0.8. However, because instances do not normally consume 100 per cent of their allocated CPUs, you can raise the cpu_allocation_ratio by taking the anticipated utilization into account when determining the number of required guest vCPUs.

So, if we predict that instances will use only 10 per cent (or 0.1) of their allocated vCPUs, the number of vCPUs for instances can be expressed as 46 / 0.1 = 460. When this value is divided by the number of cores (56), the ratio rises to approximately 8.2.

The following Python code provides this computation, again using the example values from this section:

cores = 56                   # example: total cores on the node
osds = 10                    # number of OSDs on the node
cores_per_OSD = 1.0
average_guest_util = 0.1     # 10%
nonceph_cores = cores - (cores_per_OSD * osds)
guest_vCPUs = nonceph_cores / average_guest_util
cpu_allocation_ratio = guest_vCPUs / cores  # approximately 8.2 with these example inputs
Tip

You can also use the script in Section A.2.1, “Compute CPU and Memory Calculator” to compute baseline values for both reserved_host_memory and cpu_allocation_ratio.

After computing the values you want to use, include them as defaults for HCI nodes. To do so, create a new environment file named compute.yaml in ~/templates that contains your reserved_host_memory and cpu_allocation_ratio values. For pure HCI deployments, it should contain the following:

parameter_defaults:
  NovaComputeExtraConfig:  # 1
    nova::compute::reserved_host_memory: 181000
    nova::cpu_allocation_ratio: 8.2
1
The NovaComputeExtraConfig line applies all its nested parameters to all Compute roles. In a pure HCI deployment, all Compute nodes are also hyper-converged.

For mixed HCI, ~/templates/compute.yaml should contain:

parameter_defaults:
  OsdComputeExtraConfig:  # 1
    nova::compute::reserved_host_memory: 181000
    nova::cpu_allocation_ratio: 8.2
1
The OsdComputeExtraConfig line is a custom resource that applies all nested settings to the custom OsdCompute role, which we defined in Section 3.2, “Mixed HCI”.

4.2. Configure Ceph NUMA Pinning

When applying a hyper-converged role on a host that features NUMA, you can improve determinism by pinning the Ceph OSD services to one of the available NUMA sockets. When you do, pin the Ceph Storage services to the socket with the network IRQ and the storage controller. Doing this helps address the Ceph OSD’s heavy usage of network I/O.

You can orchestrate this through a simple shell script that takes a network interface as an argument and applies the necessary NUMA-related settings to the interface. This network interface will presumably be the one the Ceph OSD uses. Create this script (numa-systemd-osd.sh) in ~/templates.
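
Before choosing which interface to pass to the script, you can confirm which NUMA node a candidate interface is attached to, either with lstopo-no-graphics or by reading sysfs. The following is a minimal Python sketch that reads the sysfs attribute; the interface name em2 is only an example:

import pathlib

interface = "em2"  # example name; use the interface the Ceph OSD services will use
numa_node_path = pathlib.Path(f"/sys/class/net/{interface}/device/numa_node")

if numa_node_path.exists():
    node = numa_node_path.read_text().strip()
    # A value of -1 means the kernel does not associate this device with a NUMA node
    print(f"{interface} is attached to NUMA node {node}")
else:
    print(f"{interface} exposes no NUMA information (it may be a virtual device)")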

Important

See Section A.2.2, “Custom Script to Configure NUMA Pinning for Ceph OSD Services” for the contents of numa-systemd-osd.sh, including a more detailed description.

The numa-systemd-osd.sh script will also attempt to install NUMA configuration tools. As such, the overcloud nodes must also be registered with Red Hat, as described in Registering the Nodes (from Red Hat Ceph Storage for the Overcloud).

To run this script on the overcloud, first create a new Heat template named ceph-numa-pinning-template.yaml in ~/templates with the following contents:

heat_template_version: 2014-10-16

parameters:
  servers:
    type: json

resources:
  ExtraConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      inputs:
        - name: OSD_NUMA_INTERFACE
      config: {get_file: numa-systemd-osd.sh} # 1

  ExtraDeployments:
    type: OS::Heat::SoftwareDeployments
    properties:
      servers: {get_param: servers}
      config: {get_resource: ExtraConfig}
      input_values:
        OSD_NUMA_INTERFACE: 'em2' # 2
      actions: ['CREATE'] # 3
1
The get_file function reads the contents of ~/templates/numa-systemd-osd.sh. This script must be able to take a network interface as an input (in this case, OSD_NUMA_INTERFACE) and perform the necessary NUMA-related configuration for it. See Section A.2.2, “Custom Script to Configure NUMA Pinning for Ceph OSD Services” for the contents of this script, along with a detailed description of how it works.
Important

On a Pure HCI deployment, you will need to edit the top-level IF statement in the ~/templates/numa-systemd-osd.sh script. See Section A.2.2, “Custom Script to Configure NUMA Pinning for Ceph OSD Services” for details.

2
The OSD_NUMA_INTERFACE variable specifies the network interface that the Ceph OSD services should use (in this example, em2). The ~/templates/numa-systemd-osd.sh script will apply the necessary NUMA settings to this interface.
3
As we only specify CREATE in actions, the script will only run during the initial deployment, and not during an update.

For all deployments, you can determine the interface to use for OSD_NUMA_INTERFACE from either the StorageNetwork variable or the StorageMgmtNetwork variable. Read-heavy workloads benefit from using the StorageNetwork interface, while write-heavy workloads benefit from using the StorageMgmtNetwork interface.

If the Ceph OSD service uses a virtual network interface (for example, a bond), use the name of the network devices that make up the bond, not the bond itself. For example, if bond1 uses em2 and em4, then set OSD_NUMA_INTERFACE to either em2 or em4 (not bond1). If the bond combines NICs which are not on the same NUMA node (as confirmed by lstopo-no-graphics), then do not use numa-systemd-osd.sh.
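
If you are not sure which physical NICs a bond uses, you can list its members and their NUMA nodes from sysfs before deciding on a value for OSD_NUMA_INTERFACE. The following is a minimal Python sketch; the bond name bond1 is only an example:

import pathlib

bond = "bond1"  # example name; adjust to your environment
slaves_path = pathlib.Path(f"/sys/class/net/{bond}/bonding/slaves")
members = slaves_path.read_text().split() if slaves_path.exists() else []

for nic in members:
    numa_path = pathlib.Path(f"/sys/class/net/{nic}/device/numa_node")
    node = numa_path.read_text().strip() if numa_path.exists() else "unknown"
    print(f"{nic}: NUMA node {node}")

If the members report different NUMA nodes, the bond spans NUMA nodes and, as noted above, numa-systemd-osd.sh should not be used.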

After creating the ceph-numa-pinning-template.yaml template, create an environment file named ceph-numa-pinning.yaml in ~/templates with the following contents:

resource_registry:
  OS::TripleO::NodeExtraConfigPost: /home/stack/templates/ceph-numa-pinning-template.yaml

This environment file will allow you to invoke the ceph-numa-pinning-template.yaml template later on in Chapter 6, Deployment.

4.3. Reduce Ceph Backfill and Recovery Operations

When a Ceph OSD is removed, Ceph uses backfill and recovery operations to rebalance the cluster. Ceph does this to keep multiple copies of data according to the placement group policy. These operations use system resources; as such, if a Ceph cluster is under load its performance will drop as it diverts resources to backfill and recovery.

To mitigate this performance impact during OSD removal, you can reduce the priority of backfill and recovery operations. The trade-off is that there are fewer data replicas for a longer time, which puts the data at slightly greater risk.

To configure the priority of backfill and recovery operations, add an environment file named ceph-backfill-recovery.yaml to ~/templates containing the following:

parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osd_recovery_op_priority: 3 # 1
    ceph::profile::params::osd_recovery_max_active: 3 # 2
    ceph::profile::params::osd_max_backfills: 1 # 3
1
The osd_recovery_op_priority sets the priority for recovery operations, relative to the OSD client OP priority.
2
The osd_recovery_max_active sets the number of active recovery requests per OSD at any one time. More requests accelerate recovery, but they place an increased load on the cluster. Set this to 1 if you want to reduce latency.
3
The osd_max_backfills sets the maximum number of backfills allowed to or from a single OSD.
Important

The values used in this sample are the current defaults. Unless you are planning to use ceph-backfill-recovery.yaml with different values, you do not need to add it to your deployment.
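
If you later want to verify the values that a running OSD is actually using, you can query its admin socket on the node that hosts the OSD. The following is a minimal Python sketch that wraps the ceph daemon command; the OSD ID osd.0 is only an example, and the script must run on the node hosting that OSD:

import subprocess

osd = "osd.0"  # example ID; adjust to an OSD hosted on this node
for option in ("osd_recovery_op_priority",
               "osd_recovery_max_active",
               "osd_max_backfills"):
    # Query the OSD's admin socket for the current value of each option
    result = subprocess.run(
        ["ceph", "daemon", osd, "config", "get", option],
        capture_output=True, text=True, check=True)
    print(result.stdout.strip())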
