2.9. GFS2 Node Locking
In order to get the best performance from a GFS2 file system, it is important to understand some of the basic theory of its operation. A single node file system is implemented alongside a cache, the purpose of which is to eliminate latency of disk accesses when using frequently requested data. In Linux, the page cache (and historically the buffer cache) provides this caching function.
With GFS2, each node has its own page cache which may contain some portion of the on-disk data. GFS2 uses a locking mechanism called glocks (pronounced gee-locks) to maintain the integrity of the cache between nodes. The glock subsystem provides a cache management function which is implemented using the distributed lock manager (DLM) as the underlying communication layer.
The glocks provide protection for the cache on a per-inode basis, so there is one lock per inode which is used for controlling the caching layer. If that glock is granted in shared mode (DLM lock mode: PR) then the data under that glock may be cached upon one or more nodes at the same time, so that all the nodes may have local access to the data.
If the glock is granted in exclusive mode (DLM lock mode: EX) then only a single node may cache the data under that glock. This mode is used by all operations which modify the data (such as the write system call).
If another node requests a glock which cannot be granted immediately, then the DLM sends a message to the node or nodes which currently hold the glocks blocking the new request to ask them to drop their locks. Dropping glocks can be (by the standards of most file system operations) a long process. Dropping a shared glock requires only that the cache be invalidated, which is relatively quick and proportional to the amount of cached data.
Dropping an exclusive glock requires a log flush, and writing back any changed data to disk, followed by the invalidation as per the shared glock.
The difference between a single node file system and GFS2, then, is that a single node file system has a single cache and GFS2 has a separate cache on each node. In both cases, latency to access cached data is of a similar order of magnitude, but the latency to access uncached data is much greater in GFS2 if another node has previously cached that same data.
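The grant and demote rules described above can be summarized in a small model. This is only an illustrative sketch, not GFS2 or DLM code: the `Mode` enum, the function names, the step descriptions and the node names are all hypothetical, and only the shared (PR) and exclusive (EX) modes are modeled.

```python
from enum import Enum


class Mode(Enum):
    SHARED = "PR"     # data may be cached on several nodes at once
    EXCLUSIVE = "EX"  # data may be cached on one node only


def conflicts(requested, held):
    """Two glock modes conflict unless both are shared."""
    return requested is Mode.EXCLUSIVE or held is Mode.EXCLUSIVE


def demote_steps(held):
    """Work a node must do before it can drop a glock it holds."""
    steps = ["invalidate cached pages"]
    if held is Mode.EXCLUSIVE:
        # An exclusive holder may have dirty data, so it must flush first.
        steps = ["flush log", "write back dirty data"] + steps
    return steps


def steps_before_grant(requested, holders):
    """Map each blocking node to the work it must finish first.

    `holders` maps a (hypothetical) node name to the mode it holds.
    """
    return {
        node: demote_steps(held)
        for node, held in holders.items()
        if conflicts(requested, held)
    }
```

For example, `steps_before_grant(Mode.SHARED, {"node2": Mode.SHARED})` returns an empty plan, whereas a shared request against an exclusive holder returns all three demote steps for that node. This is why uncached reads are slow when another node has recently written the same inode.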
Operations such as read (buffered), stat, and readdir only require a shared glock. Operations such as write (buffered), mkdir, rmdir, and unlink require an exclusive glock. Direct I/O read/write operations require a deferred glock if no allocation is taking place, or an exclusive glock if the write requires an allocation (that is, extending the file, or hole filling).
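The mapping from operation to glock mode given above can be written as a lookup. This is a sketch for reasoning about workloads only; the function name `required_glock` and its parameters are hypothetical and do not correspond to any GFS2 interface.

```python
SHARED, DEFERRED, EXCLUSIVE = "shared", "deferred", "exclusive"

# Buffered (non-direct) operations listed in the text above.
_BUFFERED = {
    "read": SHARED,
    "stat": SHARED,
    "readdir": SHARED,
    "write": EXCLUSIVE,
    "mkdir": EXCLUSIVE,
    "rmdir": EXCLUSIVE,
    "unlink": EXCLUSIVE,
}


def required_glock(op, direct_io=False, allocates=False):
    """Return the glock mode an operation needs, per the rules above."""
    if direct_io:
        if op not in ("read", "write"):
            raise ValueError("direct I/O applies to read and write only")
        # A direct write that extends the file or fills a hole allocates.
        if op == "write" and allocates:
            return EXCLUSIVE
        return DEFERRED
    return _BUFFERED[op]
```

A directory that receives frequent `mkdir` or `unlink` calls from several nodes therefore needs an exclusive glock each time, while any number of nodes can run `readdir` on it concurrently.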
There are two main performance considerations which follow from this. First, read-only operations parallelize extremely well across a cluster, since they can run independently on every node. Second, operations requiring an exclusive glock can reduce performance if multiple nodes are contending for access to the same inode(s). Consideration of the working set on each node is thus an important factor in GFS2 file system performance, for example when you perform a file system backup as described in Section 2.6, “File System Backups”.
A further consequence of this is that we recommend the use of the noatime and nodiratime mount options with GFS2 whenever possible. This prevents reads from requiring exclusive locks to update the atime timestamp.
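One way to check whether this recommendation is being followed is to scan the mount table for GFS2 file systems that lack the noatime option. The sketch below assumes input in the usual `/proc/mounts` layout (device, mount point, type, options); the function name and the sample devices and mount points are made up for illustration.

```python
def gfs2_mounts_missing_noatime(mounts_text):
    """Return mount points of gfs2 file systems mounted without noatime."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "gfs2":
            continue
        options = fields[3].split(",")
        if "noatime" not in options:
            missing.append(fields[1])
    return missing


# Hypothetical sample in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/mapper/vg-lv1 /mnt/gfs2a gfs2 rw,noatime,nodiratime 0 0
/dev/mapper/vg-lv2 /mnt/gfs2b gfs2 rw,relatime 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""
```

On a cluster node, the same function could be given the contents of `/proc/mounts`, for example `gfs2_mounts_missing_noatime(open("/proc/mounts").read())`.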
For users who are concerned about the working set or caching efficiency, GFS2 provides tools that allow you to monitor the performance of a GFS2 file system: Performance Co-Pilot, as described in Appendix A, GFS2 Performance Analysis with Performance Co-Pilot, and GFS2 tracepoints, as described in Appendix B, GFS2 Tracepoints and the debugfs glocks File.
Note
Due to the way in which GFS2's caching is implemented, the best performance is obtained when either of the following takes place:
- An inode is used in a read-only fashion across all nodes.
- An inode is written or modified from a single node only.
Note that inserting and removing entries from a directory during file creation and deletion counts as writing to the directory inode.
It is possible to break this rule provided that it is broken relatively infrequently. Ignoring this rule too often will result in a severe performance penalty.
If you mmap() a file on GFS2 with a read/write mapping, but only read from it, this only counts as a read. On GFS, though, it counts as a write, so GFS2 is much more scalable with mmap() I/O.
If you do not set the noatime mount parameter, then reads will also result in writes to update the file timestamps. We recommend that all GFS2 users mount with noatime unless they have a specific requirement for atime.
2.9.1. Issues with POSIX Locking
When using POSIX locking, you should take the following into account:
- Use of flocks will yield faster processing than use of POSIX locks.
- Programs using POSIX locks in GFS2 should avoid using the GETLK function since, in a clustered environment, the process ID may be for a different node in the cluster.
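As a minimal sketch of the first point, an application that only needs whole-file mutual exclusion can take an flock instead of a POSIX byte-range lock. The example uses Python's standard `fcntl.flock`; the helper names `flocked` and `append_line` are hypothetical, and nothing here is specific to GFS2.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def flocked(path):
    """Hold an exclusive flock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def append_line(path, text):
    """Append one line to a file while holding the flock."""
    with flocked(path) as fd:
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, (text + "\n").encode())
```

Because the lock is simply taken and released, the code never asks which process holds a conflicting lock, and so avoids the process ID ambiguity described for GETLK.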
2.9.2. Performance Tuning with GFS2
It is usually possible to alter the way in which a troublesome application stores its data in order to gain a considerable performance advantage.
A typical example of a troublesome application is an email server. These are often laid out with a spool directory containing files for each user (mbox), or with a directory for each user containing a file for each message (maildir). When requests arrive over IMAP, the ideal arrangement is to give each user an affinity to a particular node. That way their requests to view and delete email messages will tend to be served from the cache on that one node. Obviously, if that node fails, then the session can be restarted on a different node.
When mail arrives by means of SMTP, then again the individual nodes can be set up so as to pass a certain user's mail to a particular node by default. If the default node is not up, then the message can be saved directly into the user's mail spool by the receiving node. Again this design is intended to keep particular sets of files cached on just one node in the normal case, but to allow direct access in the case of node failure.
This setup allows the best use of GFS2's page cache and also makes failures transparent to the application, whether IMAP or SMTP.
Backup is often another tricky area. Again, if it is possible it is greatly preferable to back up the working set of each node directly from the node which is caching that particular set of inodes. If you have a backup script which runs at a regular point in time, and that seems to coincide with a spike in the response time of an application running on GFS2, then there is a good chance that the cluster may not be making the most efficient use of the page cache.
Obviously, if you are in the (enviable) position of being able to stop the application in order to perform a backup, then this will not be a problem. On the other hand, if a backup is run from just one node, then after it has completed a large portion of the file system will be cached on that node, with a performance penalty for subsequent accesses from other nodes. This can be mitigated to a certain extent by dropping the VFS page cache on the backup node after the backup has completed with the following command:
echo -n 3 >/proc/sys/vm/drop_caches
However, this is not as good a solution as taking care to ensure the working set on each node is either shared, mostly read-only across the cluster, or accessed largely from a single node.
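If the cache drop is used anyway, it can be tied to the backup job so that it is not forgotten. The wrapper below is a sketch under stated assumptions: the function name and the `drop_path` parameter are hypothetical, the backup command is whatever your site uses, and writing to `/proc/sys/vm/drop_caches` requires root. The `os.sync()` call is an addition not taken from the text above; it is there because only clean cache entries can be dropped.

```python
import os
import subprocess


def backup_then_drop_caches(backup_cmd, drop_path="/proc/sys/vm/drop_caches"):
    """Run a backup command, then drop the VFS caches on this node.

    `backup_cmd` is an argument list for the site's own backup tool.
    `drop_path` is a parameter only so the function can be tried out
    against an ordinary file.
    """
    # Raises CalledProcessError if the backup fails; caches stay untouched.
    subprocess.run(backup_cmd, check=True)
    # Write out dirty data first so that more of the cache can be dropped.
    os.sync()
    # Same effect as: echo -n 3 >/proc/sys/vm/drop_caches
    with open(drop_path, "w") as f:
        f.write("3")
```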
2.9.3. Troubleshooting GFS2 Performance with the GFS2 Lock Dump
If your cluster performance is suffering because of inefficient use of GFS2 caching, you may see large and increasing I/O wait times. You can make use of GFS2's lock dump information to determine the cause of the problem.
This section provides an overview of the GFS2 lock dump. For a more complete description of the GFS2 lock dump, see Appendix B, GFS2 Tracepoints and the debugfs glocks File.
The GFS2 lock dump information can be gathered from the debugfs file, which can be found at the following path name, assuming that debugfs is mounted on /sys/kernel/debug/:
/sys/kernel/debug/gfs2/fsname/glocks
The content of the file is a series of lines. Each line starting with G: represents one glock, and the following lines, indented by a single space, represent an item of information relating to the glock immediately before them in the file.
The best way to use the debugfs file is to use the cat command to take a copy of the complete content of the file while the application is experiencing problems (this might take a long time if you have a large amount of RAM and a lot of cached inodes), and then to look through the resulting data at a later date.
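Taking the copy can also be scripted so that each snapshot gets a timestamped name. This is a sketch only: the function name, the destination file naming scheme and the `debugfs_root` parameter are hypothetical, and the source path simply follows the layout given above.

```python
import os
import shutil
import time


def snapshot_glocks(fsname, dest_dir, debugfs_root="/sys/kernel/debug"):
    """Copy the glocks file for `fsname` into `dest_dir`; return the new path."""
    src = os.path.join(debugfs_root, "gfs2", fsname, "glocks")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(dest_dir, f"glocks-{fsname}-{stamp}.txt")
    # Plain read/write loop, equivalent to: cat src > dest
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Reading the real file requires root and a mounted debugfs. It is better to write the copy to local storage than to the GFS2 file system under investigation, so that saving the dump does not add to the contention.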
Note
It can be useful to make two copies of the debugfs file, one a few seconds or even a minute or two after the other. By comparing the holder information in the two traces relating to the same glock number, you can tell whether the workload is making progress (it is just slow) or whether it has become stuck (which is always a bug and should be reported to Red Hat support immediately).
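The comparison of two copies can be partly automated. The sketch below groups the H: lines under each glock number and reports the glocks whose holder lines are identical in both copies; these are candidates for closer inspection, not proof of a hang. The function names are hypothetical and the sample dumps are invented for illustration, so real dumps contain more fields per line.

```python
def holders_by_glock(dump_text):
    """Map each glock number (the n: field) to its list of H: lines."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("G:"):
            current = next(
                (f[2:] for f in line.split() if f.startswith("n:")), None
            )
            if current is not None:
                result[current] = []
        elif line.startswith(" H:") and current is not None:
            result[current].append(line.strip())
    return result


def unchanged_glocks(first_dump, second_dump):
    """Glocks that have holders and whose holder lines did not change."""
    first = holders_by_glock(first_dump)
    second = holders_by_glock(second_dump)
    return sorted(n for n in first if first[n] and first[n] == second.get(n))


# Invented sample data: glock 2/f7f is unchanged, 2/1a2b has a new holder.
FIRST = """\
G:  s:EX n:2/f7f f:yq t:EX d:EX/0 a:0 r:4
 H: s:EX f:H e:0 p:4466 [postmark]
 H: s:SH f:W e:0 p:4470 [postmark]
G:  s:SH n:2/1a2b f:q t:SH d:EX/0 a:0 r:3
 H: s:SH f:H e:0 p:4501 [postmark]
"""
SECOND = """\
G:  s:EX n:2/f7f f:yq t:EX d:EX/0 a:0 r:4
 H: s:EX f:H e:0 p:4466 [postmark]
 H: s:SH f:W e:0 p:4470 [postmark]
G:  s:SH n:2/1a2b f:q t:SH d:EX/0 a:0 r:3
 H: s:SH f:H e:0 p:4530 [postmark]
"""
```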
Lines in the debugfs file starting with H: (holders) represent lock requests either granted or waiting to be granted. The flags field on the holders line (f:) shows which: the 'W' flag refers to a waiting request, and the 'H' flag refers to a granted request. The glocks which have large numbers of waiting requests are likely to be those which are experiencing particular contention.
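Counting the waiting requests per glock is straightforward to script. The following sketch relies only on the layout described above (G: lines carrying an n: field, followed by indented H: lines carrying an f: flags field); the function name is hypothetical and the sample dump is invented for illustration.

```python
def waiters_per_glock(dump_text):
    """Return (glock number, waiting request count) pairs, busiest first."""
    counts = {}
    current = None
    for line in dump_text.splitlines():
        fields = line.split()
        if line.startswith("G:"):
            current = next((f[2:] for f in fields if f.startswith("n:")), None)
        elif line.startswith(" H:") and current is not None:
            flags = next((f[2:] for f in fields if f.startswith("f:")), "")
            if "W" in flags:  # 'W' marks a request still waiting
                counts[current] = counts.get(current, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Invented sample data in the layout described above.
SAMPLE_DUMP = """\
G:  s:EX n:2/f7f f:yq t:EX d:EX/0 a:0 r:5
 H: s:EX f:H e:0 p:4466 [postmark]
 H: s:SH f:W e:0 p:4470 [postmark]
 H: s:SH f:AW e:0 p:4471 [postmark]
G:  s:EX n:3/258028 f:yI t:EX d:EX/0 a:3 r:4
 H: s:EX f:tW e:0 p:4480 [postmark]
G:  s:SH n:2/1a2b f:q t:SH d:EX/0 a:0 r:3
 H: s:SH f:H e:0 p:4501 [postmark]
"""
```

With the sample above, glock 2/f7f has two waiting requests and 3/258028 has one, so 2/f7f is listed first.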
Table 2.1, “Glock flags” shows the meanings of the different glock flags and Table 2.2, “Glock holder flags” shows the meanings of the different glock holder flags.
Table 2.1. Glock flags

Flag | Name | Meaning |
---|---|---|
b | Blocking | Valid when the locked flag is set, and indicates that the operation that has been requested from the DLM may block. This flag is cleared for demotion operations and for "try" locks. The purpose of this flag is to allow gathering of stats of the DLM response time independent from the time taken by other nodes to demote locks. |
d | Pending demote | A deferred (remote) demote request |
D | Demote | A demote request (local or remote) |
f | Log flush | The log needs to be committed before releasing this glock |
F | Frozen | Replies from remote nodes ignored - recovery is in progress. This flag is not related to file system freeze, which uses a different mechanism, but is used only in recovery. |
i | Invalidate in progress | In the process of invalidating pages under this glock |
I | Initial | Set when DLM lock is associated with this glock |
l | Locked | The glock is in the process of changing state |
L | LRU | Set when the glock is on the LRU list |
o | Object | Set when the glock is associated with an object (that is, an inode for type 2 glocks, and a resource group for type 3 glocks) |
p | Demote in progress | The glock is in the process of responding to a demote request |
q | Queued | Set when a holder is queued to a glock, and cleared when the glock is held, but there are no remaining holders. Used as part of the algorithm that calculates the minimum hold time for a glock. |
r | Reply pending | Reply received from remote node is awaiting processing |
y | Dirty | Data needs flushing to disk before releasing this glock |

Table 2.2. Glock holder flags

Flag | Name | Meaning |
---|---|---|
a | Async | Do not wait for glock result (will poll for result later) |
A | Any | Any compatible lock mode is acceptable |
c | No cache | When unlocked, demote DLM lock immediately |
e | No expire | Ignore subsequent lock cancel requests |
E | Exact | Must have exact lock mode |
F | First | Set when holder is the first to be granted for this lock |
H | Holder | Indicates that requested lock is granted |
p | Priority | Enqueue holder at the head of the queue |
t | Try | A "try" lock |
T | Try 1CB | A "try" lock that sends a callback |
W | Wait | Set while waiting for request to complete |
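
When reading a dump, it can help to expand a flags string into the names from the two tables above. This is a small convenience sketch; the dictionary and function names are hypothetical, and the contents are copied from the tables.

```python
# Flag letters from the glock flags table (the f: field on G: lines).
GLOCK_FLAGS = {
    "b": "Blocking", "d": "Pending demote", "D": "Demote",
    "f": "Log flush", "F": "Frozen", "i": "Invalidate in progress",
    "I": "Initial", "l": "Locked", "L": "LRU", "o": "Object",
    "p": "Demote in progress", "q": "Queued", "r": "Reply pending",
    "y": "Dirty",
}

# Flag letters from the glock holder flags table (the f: field on H: lines).
HOLDER_FLAGS = {
    "a": "Async", "A": "Any", "c": "No cache", "e": "No expire",
    "E": "Exact", "F": "First", "H": "Holder", "p": "Priority",
    "t": "Try", "T": "Try 1CB", "W": "Wait",
}


def decode_flags(flags, table):
    """Translate each flag letter into its name; unknown letters are kept."""
    return [table.get(ch, f"unknown ({ch})") for ch in flags]
```

Note that the same letter can mean different things in the two tables (for example F and p), so the table must match the kind of line the flags came from.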
Having identified a glock which is causing a problem, the next step is to find out which inode it relates to. The glock number (n: on the G: line) indicates this. It is of the form type/number; if type is 2, then the glock is an inode glock and the number is an inode number. To track down the inode, you can then run the following command:
find -inum number
where number is the inode number converted from the hex format in the glocks file into decimal.
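The hex-to-decimal conversion is easy to get wrong by hand, so a short helper can do it. In this sketch the function names and the mount point `/mnt/gfs2` are hypothetical; the helper only builds the find argument list and does not run it.

```python
def inode_number(glock_name):
    """Convert an inode glock name such as '2/f7f' to a decimal inode number."""
    type_str, number_hex = glock_name.split("/")
    if int(type_str) != 2:
        raise ValueError("only type 2 glocks refer to inodes")
    return int(number_hex, 16)  # the glocks file prints the number in hex


def find_command(glock_name, mountpoint):
    """Build the find invocation that locates the inode under `mountpoint`."""
    return ["find", mountpoint, "-inum", str(inode_number(glock_name))]
```

For example, `find_command("2/f7f", "/mnt/gfs2")` yields the arguments for `find /mnt/gfs2 -inum 3967`, since hex f7f is decimal 3967.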
Warning
If you run the find command on a file system when it is experiencing lock contention, you are likely to make the problem worse. It is a good idea to stop the application before running the find command when you are looking for contended inodes.
Table 2.3, “Glock types” shows the meanings of the different glock types.
Table 2.3. Glock types

Type number | Lock type | Use |
---|---|---|
1 | Trans | Transaction lock |
2 | Inode | Inode metadata and data |
3 | Rgrp | Resource group metadata |
4 | Meta | The superblock |
5 | Iopen | Inode last closer detection |
6 | Flock | flock(2) syscall |
8 | Quota | Quota operations |
9 | Journal | Journal mutex |
If the glock that was identified was of a different type, then it is most likely to be of type 3 (resource group). If you see significant numbers of processes waiting for other types of glock under normal loads, report this to Red Hat support.
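To see at a glance whether waiting requests fall outside the expected inode and resource group types, per-glock waiter counts can be grouped by glock type. This sketch assumes you already have a mapping from glock number (type/number) to a count of waiting requests, however it was obtained; the function names and sample values are hypothetical, and the type names come from the glock types table.

```python
# Type numbers and names from the glock types table.
GLOCK_TYPES = {
    1: "Trans", 2: "Inode", 3: "Rgrp", 4: "Meta",
    5: "Iopen", 6: "Flock", 8: "Quota", 9: "Journal",
}


def waiters_by_type(waiters):
    """Sum waiting-request counts per glock type name."""
    totals = {}
    for name, count in waiters.items():
        type_number = int(name.split("/")[0])
        label = GLOCK_TYPES.get(type_number, f"unknown ({type_number})")
        totals[label] = totals.get(label, 0) + count
    return totals


def unexpected_types(totals):
    """Types with waiters other than Inode and Rgrp, which merit a report."""
    return sorted(
        label for label, count in totals.items()
        if count > 0 and label not in ("Inode", "Rgrp")
    )
```

For example, `waiters_by_type({"2/f7f": 2, "3/258028": 1, "8/10": 1})` groups the counts under Inode, Rgrp and Quota, and `unexpected_types` then singles out Quota.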
If you do see a number of waiting requests queued on a resource group lock, there may be a number of reasons for this. One is that there are a large number of nodes compared to the number of resource groups in the file system. Another is that the file system may be very nearly full (requiring, on average, longer searches for free blocks). The situation in both cases can be improved by adding more storage and using the gfs2_grow command to expand the file system.