3.7. memory
The memory subsystem generates automatic reports on memory resources used by the tasks in a cgroup, and sets limits on memory use of those tasks.
Note
By default, the memory subsystem uses 40 bytes of memory per physical page on x86_64 systems. These resources are consumed even if memory is not used in any hierarchy. If you do not plan to use the memory subsystem, you can disable it to reduce the resource consumption of the kernel.
To permanently disable the memory subsystem, open the /boot/grub/grub.conf configuration file as root and append the following text to the line that starts with the kernel keyword:
cgroup_disable=memory
For more information on working with /boot/grub/grub.conf, see the Configuring the GRUB Boot Loader chapter in the Red Hat Enterprise Linux 6 Deployment Guide.
To temporarily disable the memory subsystem for a single session, perform the following steps when starting the system:
- At the GRUB boot screen, press any key to enter the GRUB interactive menu.
- Select Red Hat Enterprise Linux with the version of the kernel that you want to boot and press the a key to modify the kernel parameters.
- Type cgroup_disable=memory at the end of the line and press Enter to exit GRUB edit mode.
With cgroup_disable=memory enabled, memory is not visible as an individually mountable subsystem and it is not automatically mounted when mounting all cgroups in a single hierarchy. Note that memory is currently the only subsystem that can be effectively disabled with cgroup_disable to save resources. Using this option with other subsystems only disables their usage, but does not cut their resource consumption. However, other subsystems do not consume as many resources as the memory subsystem.
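To confirm whether the boot parameter took effect, one option is to read the kernel's subsystem table. The following is a minimal sketch, assuming the system exposes /proc/cgroups with its usual four columns (subsystem name, hierarchy, number of cgroups, enabled flag); the function name subsys_enabled and its optional second argument are illustrative only and are not part of any cgroup tooling.

```shell
#!/bin/sh
# Print the "enabled" column (1 or 0) for one cgroup subsystem.
# $1: subsystem name, for example memory
# $2: table to read; defaults to /proc/cgroups (the override is a
#     hypothetical convenience so the function can be tried on a copy)
subsys_enabled() {
    awk -v name="$1" '
        $1 == name { print $4; found = 1 }
        END { if (!found) exit 1 }
    ' "${2:-/proc/cgroups}"
}

# Report the state of the memory subsystem on this system, if it is listed.
if state=$(subsys_enabled memory 2>/dev/null); then
    echo "memory subsystem enabled flag: $state"
else
    echo "memory subsystem not listed"
fi
```

A flag of 0 would be the expected result after booting with cgroup_disable=memory, and 1 otherwise.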
The following tunable parameters are available for the memory subsystem:
- memory.stat
- reports a wide range of memory statistics, as described in the following table:
Table 3.2. Values reported by memory.stat

| Statistic | Description |
|---|---|
| cache | page cache, including tmpfs (shmem), in bytes |
| rss | anonymous and swap cache, not including tmpfs (shmem), in bytes |
| mapped_file | size of memory-mapped files, including tmpfs (shmem), in bytes |
| pgpgin | number of pages paged into memory |
| pgpgout | number of pages paged out of memory |
| swap | swap usage, in bytes |
| active_anon | anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes |
| inactive_anon | anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes |
| active_file | file-backed memory on active LRU list, in bytes |
| inactive_file | file-backed memory on inactive LRU list, in bytes |
| unevictable | memory that cannot be reclaimed, in bytes |
| hierarchical_memory_limit | memory limit for the hierarchy that contains the memory cgroup, in bytes |
| hierarchical_memsw_limit | memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes |

Additionally, each of these statistics other than hierarchical_memory_limit and hierarchical_memsw_limit has a counterpart prefixed total_ that reports not only on the cgroup, but on all its children as well. For example, swap reports the swap usage by a cgroup and total_swap reports the total swap usage by the cgroup and all its child groups.
When you interpret the values reported by memory.stat, note how the various statistics inter-relate:
active_anon + inactive_anon = anonymous memory + file cache for tmpfs + swap cache
Therefore, active_anon + inactive_anon ≠ rss, because rss does not include tmpfs.
active_file + inactive_file = cache - size of tmpfs
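The relationships between the memory.stat values can be checked against a live cgroup. The sketch below assumes the file consists of "name value" pairs, one per line; the function names stat_value and stat_summary, and the output labels anon_lru, file_lru, and tmpfs_estimate, are made up for this illustration. The tmpfs figure is only the second relation rearranged (cache minus the two file LRU lists), so treat it as an estimate.

```shell
#!/bin/sh
# Print the value of one statistic from a memory.stat-style file.
# $1: statistic name, $2: path to the file
stat_value() {
    awk -v key="$1" '
        $1 == key { print $2; found = 1 }
        END { if (!found) exit 1 }
    ' "$2"
}

# Print the derived sums, one "label value" pair per line.
# $1: path to a memory.stat-style file
stat_summary() {
    awk '
        { v[$1] = $2 }
        END {
            anon = v["active_anon"] + v["inactive_anon"]
            file = v["active_file"] + v["inactive_file"]
            # %.0f avoids the 32-bit limit some awk versions apply to %d
            printf "anon_lru %.0f\n", anon
            printf "file_lru %.0f\n", file
            printf "tmpfs_estimate %.0f\n", v["cache"] - file
        }
    ' "$1"
}
```

For example, stat_summary /cgroup/memory/lab1/memory.stat would print the three derived values for the lab1 cgroup used elsewhere in this section; comparing anon_lru with stat_value rss on the same file shows the difference that tmpfs accounts for.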
- memory.usage_in_bytes
- reports the total current memory usage by processes in the cgroup (in bytes).
- memory.memsw.usage_in_bytes
- reports the sum of current memory usage plus swap space used by processes in the cgroup (in bytes).
- memory.max_usage_in_bytes
- reports the maximum memory used by processes in the cgroup (in bytes).
- memory.memsw.max_usage_in_bytes
- reports the maximum amount of memory and swap space used by processes in the cgroup (in bytes).
- memory.limit_in_bytes
- sets the maximum amount of user memory (including file cache). If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units: k or K for kilobytes, m or M for megabytes, and g or G for gigabytes. For example, to set the limit to 1 gigabyte, execute:
~]# echo 1G > /cgroup/memory/lab1/memory.limit_in_bytes
You cannot use memory.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.
Write -1 to memory.limit_in_bytes to remove any existing limits.
- memory.memsw.limit_in_bytes
- sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units: k or K for kilobytes, m or M for megabytes, and g or G for gigabytes.
You cannot use memory.memsw.limit_in_bytes to limit the root cgroup; you can only apply values to groups lower in the hierarchy.
Write -1 to memory.memsw.limit_in_bytes to remove any existing limits.
Important
It is important to set the memory.limit_in_bytes parameter before setting the memory.memsw.limit_in_bytes parameter: attempting to do so in the reverse order results in an error. This is because memory.memsw.limit_in_bytes becomes available only after all memory limitations (previously set in memory.limit_in_bytes) are exhausted.
Consider the following example: setting memory.limit_in_bytes = 2G and memory.memsw.limit_in_bytes = 4G for a certain cgroup allows processes in that cgroup to allocate 2 GB of memory and, once that is exhausted, another 2 GB of swap only. The memory.memsw.limit_in_bytes parameter represents the sum of memory and swap. Processes in a cgroup that does not have the memory.memsw.limit_in_bytes parameter set can potentially use up all the available swap (after exhausting the set memory limitation) and trigger an Out Of Memory situation caused by the lack of available swap.
The order in which the memory.limit_in_bytes and memory.memsw.limit_in_bytes parameters are set in the /etc/cgconfig.conf file is important as well. The following is a correct example of such a configuration:
memory {
    memory.limit_in_bytes = 1G;
    memory.memsw.limit_in_bytes = 1G;
}
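Because memory.memsw.limit_in_bytes is the sum of memory and swap, the swap a cgroup can use is the difference between the two limits. The following sketch works through that arithmetic for suffixed values; to_bytes and swap_allowance are hypothetical helper names, the suffixes are treated as powers of 1024 (consistent with 100 MB being written as 104857600 in this guide), and no input validation is attempted. The check that the combined limit is not below the memory limit is a sanity check added for the example.

```shell
#!/bin/sh
# Convert a limit such as 512, 64k, 256M or 1G to bytes.
to_bytes() {
    case "$1" in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo $(( $1 )) ;;
    esac
}

# Print how many bytes of swap remain available once the memory limit
# is exhausted.
# $1: value intended for memory.limit_in_bytes
# $2: value intended for memory.memsw.limit_in_bytes
# Returns non-zero if the combined limit is smaller than the memory limit.
swap_allowance() {
    mem=$(to_bytes "$1")
    memsw=$(to_bytes "$2")
    if [ "$memsw" -lt "$mem" ]; then
        echo "memsw limit ($memsw) is below memory limit ($mem)" >&2
        return 1
    fi
    echo $(( memsw - mem ))
}

# The 2G/4G example: 4294967296 - 2147483648 leaves 2 GB of swap.
swap_allowance 2G 4G   # prints 2147483648
```

With both limits set to 1G, as in the cgconfig.conf example, the result is 0: the cgroup cannot use any swap beyond its memory limit.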
- memory.failcnt
- reports the number of times that memory usage has reached the limit set in memory.limit_in_bytes.
- memory.memsw.failcnt
- reports the number of times that combined memory and swap usage has reached the limit set in memory.memsw.limit_in_bytes.
- memory.soft_limit_in_bytes
- enables flexible sharing of memory. Under normal circumstances, control groups are allowed to use as much of the memory as needed, constrained only by their hard limits set with the memory.limit_in_bytes parameter. However, when the system detects memory contention or low memory, control groups are forced to restrict their consumption to their soft limits. For example, to set the soft limit to 256 MB, execute:
~]# echo 256M > /cgroup/memory/lab1/memory.soft_limit_in_bytes
This parameter accepts the same suffixes as memory.limit_in_bytes to represent units. To have any effect, the soft limit must be set below the hard limit. If lowering the memory usage to the soft limit does not solve the contention, cgroups are pushed back as much as possible to make sure that one control group does not starve the others of memory. Note that soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups.
- memory.force_empty
- when set to 0, empties memory of all pages used by tasks in the cgroup. This interface can only be used when the cgroup has no tasks. If memory cannot be freed, it is moved to a parent cgroup if possible. Use the memory.force_empty parameter before removing a cgroup to avoid moving out-of-use page caches to its parent cgroup.
- memory.swappiness
- sets the tendency of the kernel to swap out process memory used by tasks in this cgroup instead of reclaiming pages from the page cache. This is the same tendency, calculated the same way, as set in /proc/sys/vm/swappiness for the system as a whole. The default value is 60. Values lower than 60 decrease the kernel's tendency to swap out process memory, values greater than 60 increase the kernel's tendency to swap out process memory, and values greater than 100 permit the kernel to swap out pages that are part of the address space of the processes in this cgroup.
Note that a value of 0 does not prevent process memory being swapped out; swap out might still happen when there is a shortage of system memory because the global virtual memory management logic does not read the cgroup value. To lock pages completely, use mlock() instead of cgroups.
You cannot change the swappiness of the following groups:
    - the root cgroup, which uses the swappiness set in /proc/sys/vm/swappiness.
    - a cgroup that has child groups below it.
- memory.move_charge_at_immigrate
- allows moving charges associated with a task along with task migration. Charging is a way of giving a penalty to cgroups which access shared pages too often. These penalties, also called charges, are by default not moved when a task migrates from one cgroup to another. The pages allocated from the original cgroup still remain charged to it; the charge is dropped when the page is freed or reclaimed.
With memory.move_charge_at_immigrate enabled, the pages associated with a task are taken from the old cgroup and charged to the new cgroup. The following example shows how to enable memory.move_charge_at_immigrate:
~]# echo 1 > /cgroup/memory/lab1/memory.move_charge_at_immigrate
Charges are moved only when the moved task is a leader of a thread group. If there is not enough memory for the task in the destination cgroup, an attempt to reclaim memory is performed. If the reclaim is not successful, the task migration is aborted.
To disable memory.move_charge_at_immigrate, execute:
~]# echo 0 > /cgroup/memory/lab1/memory.move_charge_at_immigrate
- memory.use_hierarchy
- contains a flag (0 or 1) that specifies whether memory usage should be accounted for throughout a hierarchy of cgroups. If enabled (1), the memory subsystem reclaims memory from the children of any process that exceeds its memory limit. By default (0), the subsystem does not reclaim memory from a task's children.
- memory.oom_control
- contains a flag (0 or 1) that enables or disables the Out of Memory killer for a cgroup. If enabled (0), tasks that attempt to consume more memory than they are allowed are immediately killed by the OOM killer. The OOM killer is enabled by default in every cgroup using the memory subsystem; to disable it, write 1 to the memory.oom_control file:
~]# echo 1 > /cgroup/memory/lab1/memory.oom_control
When the OOM killer is disabled, tasks that attempt to use more memory than they are allowed are paused until additional memory is freed.
The memory.oom_control file also reports the OOM status of the current cgroup under the under_oom entry. If the cgroup is out of memory and tasks in it are paused, the under_oom entry reports the value 1.
The memory.oom_control file is capable of reporting an occurrence of an OOM situation using the notification API. For more information, refer to Section 2.13, “Using the Notification API” and Example 3.3, “OOM Control and Notifications”.
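When polling rather than using the notification API, the two entries in memory.oom_control can be turned into a one-line status. This is a minimal sketch that assumes the file contains the oom_kill_disable and under_oom entries as "name value" pairs; the function name oom_state and its output wording are invented for the example.

```shell
#!/bin/sh
# Summarize a memory.oom_control-style file on one line.
# $1: path to the file
oom_state() {
    awk '
        $1 == "oom_kill_disable" { disabled = $2 }
        $1 == "under_oom"        { under = $2 }
        END {
            # oom_kill_disable 1 means the OOM killer is switched off
            printf "killer=%s ", (disabled == 1 ? "disabled" : "enabled")
            # under_oom 1 means the cgroup is currently out of memory
            printf "state=%s\n", (under == 1 ? "under_oom" : "ok")
        }
    ' "$1"
}
```

For example, oom_state /cgroup/memory/lab1/memory.oom_control could be run periodically from a monitoring script to spot cgroups whose tasks are paused.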
3.7.1. Example Usage
Example 3.3. OOM Control and Notifications
The following example demonstrates how the OOM killer takes action when a task in a cgroup attempts to use more memory than allowed, and how a notification handler can report OOM situations:
- Attach the memory subsystem to a hierarchy and create a cgroup:
~]# mount -t cgroup -o memory memory /cgroup/memory
~]# mkdir /cgroup/memory/blue
- Set the amount of memory which tasks in the blue cgroup can use to 100 MB:
~]# echo 104857600 > /cgroup/memory/blue/memory.limit_in_bytes
- Change into the blue directory and make sure the OOM killer is enabled:
~]# cd /cgroup/memory/blue
blue]# cat memory.oom_control
oom_kill_disable 0
under_oom 0
- Move the current shell process into the tasks file of the blue cgroup so that all other processes started in this shell are automatically moved to the blue cgroup:
blue]# echo $$ > tasks
- Start a test program that attempts to allocate a large amount of memory exceeding the limit you set in step 2. As soon as the blue cgroup runs out of free memory, the OOM killer kills the test program and reports Killed to the standard output:
blue]# ~/mem-hog
Killed
The following is an example of such a test program[5]:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

int main(int argc, char *argv[])
{
    char *p;

again:
    /* Allocate and touch memory in progressively smaller chunks
       until allocations of every size fail. */
    while ((p = (char *)malloc(GB)))
        memset(p, 0, GB);

    while ((p = (char *)malloc(MB)))
        memset(p, 0, MB);

    while ((p = (char *)malloc(KB)))
        memset(p, 0, KB);

    /* Wait briefly and try again; the program never exits on its own. */
    sleep(1);

    goto again;

    return 0;
}
- Disable the OOM killer and rerun the test program. This time, the test program remains paused waiting for additional memory to be freed:
blue]# echo 1 > memory.oom_control
blue]# ~/mem-hog
- While the test program is paused, note that the under_oom state of the cgroup has changed to indicate that the cgroup is out of available memory:
~]# cat /cgroup/memory/blue/memory.oom_control
oom_kill_disable 1
under_oom 1
Reenabling the OOM killer immediately kills the test program.
- To receive notifications about every OOM situation, create a program as specified in Section 2.13, “Using the Notification API”. For example[6]:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/eventfd.h>
#include <errno.h>
#include <stdint.h>   /* uint64_t */
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* read, write, close */

static inline void die(const char *msg)
{
    fprintf(stderr, "error: %s: %s(%d)\n", msg, strerror(errno), errno);
    exit(EXIT_FAILURE);
}

static inline void usage(void)
{
    fprintf(stderr, "usage: oom_eventfd_test <cgroup.event_control> <memory.oom_control>\n");
    exit(EXIT_FAILURE);
}

#define BUFSIZE 256

int main(int argc, char *argv[])
{
    char buf[BUFSIZE];
    int efd, cfd, ofd, wb;
    uint64_t u;

    if (argc != 3)
        usage();

    /* Create the eventfd on which the kernel signals OOM events. */
    if ((efd = eventfd(0, 0)) == -1)
        die("eventfd");

    if ((cfd = open(argv[1], O_WRONLY)) == -1)
        die("cgroup.event_control");

    if ((ofd = open(argv[2], O_RDONLY)) == -1)
        die("memory.oom_control");

    /* Register the eventfd by writing "<eventfd> <oom_control fd>". */
    if ((wb = snprintf(buf, BUFSIZE, "%d %d", efd, ofd)) >= BUFSIZE)
        die("buffer too small");

    if (write(cfd, buf, wb) == -1)
        die("write cgroup.event_control");

    if (close(cfd) == -1)
        die("close cgroup.event_control");

    /* Block until an event arrives, report it, and wait for the next one. */
    for (;;) {
        if (read(efd, &u, sizeof(uint64_t)) != sizeof(uint64_t))
            die("read eventfd");

        printf("mem_cgroup oom event received\n");
    }

    return 0;
}
The above program detects OOM situations in a cgroup specified as an argument on the command line and reports them using the mem_cgroup oom event received string to the standard output.
- Run the above notification handler program in a separate console, specifying the blue cgroup's control files as arguments:
~]$ ./oom_notification /cgroup/memory/blue/cgroup.event_control /cgroup/memory/blue/memory.oom_control
- In a different console, run the mem-hog test program to create an OOM situation and see the oom_notification program report it in the standard output:
blue]# ~/mem-hog
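The setup part of Example 3.3 (creating the cgroup, writing its limit, and moving a process into it) can be collected into small shell functions when the experiment needs repeating. The following is only a sketch: create_limited_cgroup and join_cgroup are hypothetical names, the hierarchy is assumed to be mounted already, and the functions must be run as root against a real cgroup hierarchy to have any effect.

```shell
#!/bin/sh
# Create a cgroup under an already mounted memory hierarchy and write
# its memory limit.
# $1: hierarchy root (for example /cgroup/memory)
# $2: cgroup name (for example blue)
# $3: limit, in bytes or with a k, m, or g suffix
create_limited_cgroup() {
    mkdir -p "$1/$2" || return 1
    echo "$3" > "$1/$2/memory.limit_in_bytes"
}

# Move a process into the cgroup by writing its PID to the tasks file.
# $1: hierarchy root, $2: cgroup name
# $3: PID to move; defaults to the current shell, as in the example
join_cgroup() {
    echo "${3:-$$}" > "$1/$2/tasks"
}
```

With these definitions, create_limited_cgroup /cgroup/memory blue 104857600 followed by join_cgroup /cgroup/memory blue would correspond to the cgroup creation, limit, and task steps of the example, after which ~/mem-hog can be started from the same shell.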