Chapter 3. Subsystems and Tunable Parameters
Subsystems are kernel modules that are aware of cgroups. Typically, they are resource controllers that allocate varying levels of system resources to different cgroups. However, subsystems could be programmed for any other interaction with the kernel where the need exists to treat different groups of processes differently. The application programming interface (API) to develop new subsystems is documented in cgroups.txt in the kernel documentation, installed on your system at /usr/share/doc/kernel-doc-kernel-version/Documentation/cgroups/ (provided by the kernel-doc package). The latest version of the cgroups documentation is also available online at http://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt. Note, however, that the features in the latest documentation might not match those available in the kernel installed on your system.
State objects that contain the subsystem parameters for a cgroup are represented as pseudofiles within the cgroup virtual file system. These pseudofiles can be manipulated by shell commands or their equivalent system calls. For example, cpuset.cpus is a pseudofile that specifies which CPUs a cgroup is permitted to access. If /cgroup/cpuset/webserver is a cgroup for the web server that runs on a system, and the following command is executed,
~]# echo 0,2 > /cgroup/cpuset/webserver/cpuset.cpus
the value 0,2 is written to the cpuset.cpus pseudofile and therefore restricts any tasks whose PIDs are listed in /cgroup/cpuset/webserver/tasks to CPU 0 and CPU 2 on the system.
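The write shown above can be wrapped in a small helper that refuses to create a stray file when the cgroup path or parameter name is mistyped. The following is a minimal sketch only: the function name cg_set is hypothetical, and the demonstration runs against a temporary directory that stands in for /cgroup/cpuset/webserver, so it needs neither root privileges nor a mounted hierarchy.

```shell
# cg_set CGROUP_DIR PARAMETER VALUE
# Hypothetical helper: writes VALUE to an existing pseudofile and prints
# what the pseudofile contains afterwards.
cg_set() {
    local file="$1/$2"
    # In a real cgroup the pseudofile already exists; never create a new file.
    if [ ! -e "$file" ]; then
        echo "cg_set: no such parameter: $file" >&2
        return 1
    fi
    printf '%s\n' "$3" > "$file" || return 1
    cat "$file"
}

# Demonstration against a stand-in directory (an assumption, not a real cgroup).
demo=$(mktemp -d)
touch "$demo/cpuset.cpus"
cg_set "$demo" cpuset.cpus 0,2
rm -rf "$demo"
```

On a system with the cpuset hierarchy mounted as described, the equivalent call would be cg_set /cgroup/cpuset/webserver cpuset.cpus 0,2, run as root.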
3.1. blkio
The Block I/O (blkio) subsystem controls and monitors access to I/O on block devices by tasks in cgroups. Writing values to some of these pseudofiles limits access or bandwidth, and reading values from some of these pseudofiles provides information on I/O operations.
The blkio subsystem offers two policies for controlling access to I/O:
- Proportional weight division: implemented in the Completely Fair Queuing (CFQ) I/O scheduler, this policy allows you to assign weights to specific cgroups. This means that each cgroup has a set percentage (depending on the weight of the cgroup) of all I/O operations reserved. For more information, refer to Section 3.1.1, “Proportional Weight Division Tunable Parameters”.
- I/O throttling (upper limit): this policy is used to set an upper limit for the number of I/O operations performed by a specific device. This means that a device can have a limited rate of read or write operations. For more information, refer to Section 3.1.2, “I/O Throttling Tunable Parameters”.
Important
Currently, the Block I/O subsystem does not work for buffered write operations. It is primarily targeted at direct I/O, although it works for buffered read operations.
3.1.1. Proportional Weight Division Tunable Parameters
- blkio.weight
- specifies the relative proportion (weight) of block I/O access available by default to a cgroup, in the range from 100 to 1000. This value is overridden for specific devices by the blkio.weight_device parameter. For example, to assign a default weight of 500 to a cgroup for access to block devices, run:
~]# echo 500 > blkio.weight
- blkio.weight_device
- specifies the relative proportion (weight) of I/O access on specific devices available to a cgroup, in the range from 100 to 1000. The value of this parameter overrides the value of the blkio.weight parameter for the devices specified. Values take the format of major:minor weight, where major and minor are device types and node numbers specified in Linux Allocated Devices, otherwise known as the Linux Devices List and available from https://www.kernel.org/doc/html/v4.11/admin-guide/devices.html. For example, to assign a weight of 500 to a cgroup for access to /dev/sda, run:
~]# echo 8:0 500 > blkio.weight_device
In the Linux Allocated Devices notation, 8:0 represents /dev/sda.
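Rather than looking up the major and minor numbers in the Linux Allocated Devices list, you can read them from the device node itself. The following is a minimal sketch that assumes GNU stat (from coreutils) is available; the function name dev_numbers is hypothetical. GNU stat prints device numbers in hexadecimal, so the helper converts them to the decimal form that the blkio pseudofiles expect.

```shell
# dev_numbers DEVICE_NODE
# Hypothetical helper: prints the decimal major:minor pair of a device node.
dev_numbers() {
    local hex
    # %t is the major and %T the minor device number, both in hexadecimal;
    # -L follows symbolic links such as those under /dev/disk/.
    hex=$(stat -L -c '%t %T' "$1") || return 1
    # Split the two hexadecimal fields into $1 and $2 (local to this function).
    set -- $hex
    printf '%d:%d\n' "0x$1" "0x$2"
}

# /dev/null is used here only because it exists on every Linux system.
dev_numbers /dev/null
```

On a system where /dev/sda has the numbers described above, echo "$(dev_numbers /dev/sda) 500" > blkio.weight_device would be equivalent to the earlier command.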
3.1.2. I/O Throttling Tunable Parameters
- blkio.throttle.read_bps_device
- specifies the upper limit on the rate of read operations that a device can perform, in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform read operations at a maximum of 10 MBps (10485760 bytes per second), run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
- blkio.throttle.read_iops_device
- specifies the upper limit on the rate of read operations that a device can perform, in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which read operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 read operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
- blkio.throttle.write_bps_device
- specifies the upper limit on the rate of write operations that a device can perform, in bytes per second. Entries have three fields: major, minor, and bytes_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and bytes_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform write operations at a maximum of 10 MBps (10485760 bytes per second), run:
~]# echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
- blkio.throttle.write_iops_device
- specifies the upper limit on the rate of write operations that a device can perform, in operations per second. Entries have three fields: major, minor, and operations_per_second. Major and minor are device types and node numbers specified in Linux Allocated Devices, and operations_per_second is the upper limit rate at which write operations can be performed. For example, to allow the /dev/sda device to perform a maximum of 10 write operations per second, run:
~]# echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
- blkio.throttle.io_serviced
- reports the number of I/O operations performed on specific devices by a cgroup as seen by the throttling policy. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
- blkio.throttle.io_service_bytes
- reports the number of bytes transferred to or from specific devices by a cgroup. The only difference between blkio.io_service_bytes and blkio.throttle.io_service_bytes is that the former is not updated when the CFQ scheduler is operating on a request queue. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of transferred bytes.
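The byte-rate values in the throttling examples above are easier to read when they are derived from a rate in MiB per second. The following is a minimal sketch of a helper that builds an entry for the blkio.throttle.read_bps_device or blkio.throttle.write_bps_device pseudofiles; the function name throttle_bps_entry is hypothetical, and the helper only formats the entry without writing it anywhere.

```shell
# throttle_bps_entry MAJOR:MINOR MIB_PER_SECOND
# Hypothetical helper: prints a "major:minor bytes_per_second" entry.
throttle_bps_entry() {
    # Accept only decimal device numbers and a positive whole number of MiB.
    if ! printf '%s\n' "$1" | grep -Eq '^[0-9]+:[0-9]+$' ||
       ! printf '%s\n' "$2" | grep -Eq '^[1-9][0-9]*$'; then
        echo "usage: throttle_bps_entry MAJOR:MINOR MIB_PER_SECOND" >&2
        return 1
    fi
    # 1 MiB is 1024 * 1024 bytes.
    echo "$1 $(( $2 * 1024 * 1024 ))"
}

throttle_bps_entry 8:0 10    # prints: 8:0 10485760
```

The printed entry matches the value written in the examples above, so throttle_bps_entry 8:0 10 > /cgroup/blkio/test/blkio.throttle.read_bps_device would have the same effect as the earlier echo command.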
3.1.3. blkio Common Tunable Parameters
The following parameters may be used for either of the policies listed in Section 3.1, “blkio”.
- blkio.reset_stats
- resets the statistics recorded in the other pseudofiles. Write an integer to this file to reset the statistics for this cgroup.
- blkio.time
- reports the time that a cgroup had I/O access to specific devices. Entries have three fields: major, minor, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, and time is the length of time in milliseconds (ms).
- blkio.sectors
- reports the number of sectors transferred to or from specific devices by a cgroup. Entries have three fields: major, minor, and sectors. Major and minor are device types and node numbers specified in Linux Allocated Devices, and sectors is the number of disk sectors.
- blkio.avg_queue_size
- reports the average queue size for I/O operations by a cgroup, over the entire length of time of the group's existence. The queue size is sampled every time a queue for this cgroup receives a timeslice. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
- blkio.group_wait_time
- reports the total time (in nanoseconds, ns) a cgroup spent waiting for a timeslice for one of its queues. The report is updated every time a queue for this cgroup gets a timeslice, so if you read this pseudofile while the cgroup is waiting for a timeslice, the report will not contain time spent waiting for the operation currently queued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
- blkio.empty_time
- reports the total time (in nanoseconds, ns) a cgroup spent without any pending requests. The report is updated every time a queue for this cgroup has a pending request, so if you read this pseudofile while the cgroup has no pending requests, the report will not contain time spent in the current empty state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
- blkio.idle_time
- reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than the requests already in other queues or from other groups. The report is updated every time the group is no longer idling, so if you read this pseudofile while the cgroup is idling, the report will not contain time spent in the current idling state. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
- blkio.dequeue
- reports the number of times requests for I/O operations by a cgroup were dequeued by specific devices. Entries have three fields: major, minor, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, and number is the number of times requests by the group were dequeued. Note that this report is available only if CONFIG_DEBUG_BLK_CGROUP=y is set on the system.
- blkio.io_serviced
- reports the number of I/O operations performed on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and number. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and number represents the number of operations.
- blkio.io_service_bytes
- reports the number of bytes transferred to or from specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and bytes. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and bytes is the number of transferred bytes.
- blkio.io_service_time
- reports the total time between request dispatch and request completion for I/O operations on specific devices by a cgroup as seen by the CFQ scheduler. Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
- blkio.io_wait_time
- reports the total time I/O operations on specific devices by a cgroup spent waiting for service in the scheduler queues. When you interpret this report, note:
- the time reported can be greater than the total time elapsed, because the time reported is the cumulative total of all I/O operations for the cgroup rather than the time that the cgroup itself spent waiting for I/O operations. To find the time that the group as a whole has spent waiting, use the blkio.group_wait_time parameter.
- if the device has a queue_depth > 1, the time reported only includes the time until the request is dispatched to the device, not any time spent waiting for service while the device reorders requests.
Entries have four fields: major, minor, operation, and time. Major and minor are device types and node numbers specified in Linux Allocated Devices, operation represents the type of operation (read, write, sync, or async) and time is the length of time in nanoseconds (ns). The time is reported in nanoseconds rather than a larger unit so that this report is meaningful even for solid-state devices.
- blkio.io_merged
- reports the number of block I/O (bio) requests merged into requests for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).
- blkio.io_queued
- reports the number of requests queued for I/O operations by a cgroup. Entries have two fields: number and operation. Number is the number of requests, and operation represents the type of operation (read, write, sync, or async).
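The four-field reports described above (major, minor, operation, and a value) can be summarized with a short awk filter. The following is a minimal sketch only: the function name io_by_op is hypothetical, and the sample input uses invented numbers. It also assumes that the kernel prints the device as major:minor in a single column, capitalizes the operation names, and appends total lines, so the filter matches operations case-insensitively and ignores any line that does not have exactly three columns.

```shell
# io_by_op MAJOR:MINOR
# Hypothetical helper: reads a report such as blkio.io_service_bytes on
# standard input and prints the read and write values for one device.
io_by_op() {
    awk -v dev="$1" '
        # Keep only "device operation value" lines for the requested device.
        NF == 3 && $1 == dev {
            op = tolower($2)
            if (op == "read" || op == "write") total[op] += $3
        }
        END { printf "read=%d write=%d\n", total["read"], total["write"] }
    '
}

# Invented sample data in the assumed report layout.
io_by_op 8:0 <<'EOF'
8:0 Read 4096
8:0 Write 8192
8:0 Sync 4096
8:0 Async 8192
8:0 Total 12288
Total 12288
EOF
```

Against a real cgroup, the input would come from the pseudofile instead, for example io_by_op 8:0 < /cgroup/blkio/test/blkio.io_service_bytes.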
3.1.4. Example Usage
Refer to Example 3.1, “blkio proportional weight division” for a simple test of running two dd processes in two different cgroups with different blkio.weight values.
Example 3.1. blkio proportional weight division
- Mount the blkio subsystem:
~]# mount -t cgroup -o blkio blkio /cgroup/blkio/
- Create two cgroups for the blkio subsystem:
~]# mkdir /cgroup/blkio/test1/
~]# mkdir /cgroup/blkio/test2/
- Set blkio weights in the previously created cgroups:
~]# echo 1000 > /cgroup/blkio/test1/blkio.weight
~]# echo 500 > /cgroup/blkio/test2/blkio.weight
- Create two large files:
~]# dd if=/dev/zero of=file_1 bs=1M count=4000
~]# dd if=/dev/zero of=file_2 bs=1M count=4000
The above commands create two files (file_1 and file_2) of size 4 GB.
- For each of the test cgroups, execute a dd command (which reads the contents of a file and outputs it to the null device) on one of the large files:
~]# cgexec -g blkio:test1 time dd if=file_1 of=/dev/null
~]# cgexec -g blkio:test2 time dd if=file_2 of=/dev/null
Both commands will output their completion time once they have finished.
- While the two dd processes are running, you can monitor the performance in real time by using the iotop utility. To install the iotop utility, execute, as root, the yum install iotop command. The following is an example of the output as seen in the iotop utility while running the previously started dd processes:
Total DISK READ: 83.16 M/s | Total DISK WRITE: 0.00 B/s
    TIME   TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO       COMMAND
15:18:04  15071 be/4  root  27.64 M/s  0.00 B/s    0.00 %   92.30 %  dd if=file_2 of=/dev/null
15:18:04  15069 be/4  root  55.52 M/s  0.00 B/s    0.00 %   88.48 %  dd if=file_1 of=/dev/null
To get the most accurate result in Example 3.1, “blkio proportional weight division”, prior to the execution of the dd commands, flush all file system buffers and free pagecache, dentries and inodes using the following commands:
~]# sync
~]# echo 3 > /proc/sys/vm/drop_caches
Additionally, you can enable group isolation which provides stronger isolation between groups at the expense of throughput. When group isolation is disabled, fairness can be expected only for a sequential workload. By default, group isolation is enabled and fairness can be expected for random I/O workloads as well. To enable group isolation, use the following command:
~]# echo 1 > /sys/block/<disk_device>/queue/iosched/group_isolation
where <disk_device> stands for the name of the desired device, for example sda.
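As a rough check on the results of Example 3.1, you can compute the share of I/O that each cgroup would receive if bandwidth were divided strictly in proportion to the blkio.weight values. The following is a minimal sketch under that simplifying assumption (actual throughput also depends on the workload, caching, and other activity on the device); the function name weight_shares is hypothetical.

```shell
# weight_shares WEIGHT...
# Hypothetical helper: prints each weight with its percentage of the sum
# of all weights, using integer division (fractions are truncated).
weight_shares() {
    local w sum=0
    for w in "$@"; do
        sum=$(( sum + w ))
    done
    # Avoid dividing by zero when no weights (or only zeros) are given.
    [ "$sum" -gt 0 ] || return 1
    for w in "$@"; do
        echo "$w $(( w * 100 / sum ))%"
    done
}

weight_shares 1000 500
# prints:
# 1000 66%
# 500 33%
```

These proportions are close to the iotop output shown in Example 3.1, where the dd process in test1 read at 55.52 M/s and the one in test2 at 27.64 M/s out of a total of 83.16 M/s.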