
11.15. Managing Split-brain


Split-brain is a state of data inconsistency that occurs when different data sources in a cluster have different ideas about what the correct, current state of that data should be. This can happen because of the way servers are partitioned in a network design, or because of a failure condition in which servers stop communicating and synchronizing their data with each other.
In Red Hat Gluster Storage, split-brain is a term applicable to Red Hat Gluster Storage volumes in a replicate configuration. A file is said to be in split-brain when the copies of the same file in different bricks that constitute the replica-pair have mismatching data and/or metadata contents that conflict with each other, and automatic healing is not possible. In this scenario, you can decide which is the correct file (source) and which is the one that requires healing (sink) by inspecting the mismatching files on the backend bricks.
The AFR translator in glusterFS makes use of extended attributes to keep track of the operations on a file. These attributes determine which brick is the correct source when a file requires healing. If the files are clean, the extended attributes are all zeroes indicating that no heal is necessary. When a heal is required, they are marked in such a way that there is a distinguishable source and sink and the heal can happen automatically. But, when a split-brain occurs, these extended attributes are marked in such a way that both bricks mark themselves as sources, making automatic healing impossible.
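For example, you can inspect these attributes directly on a brick using the getfattr command. The following is an illustrative sketch only; the brick path, volume name, and attribute values are assumptions, and all-zero trusted.afr changelog values indicate that no heal is pending:
# getfattr -d -m . -e hex /rhgs/brick0/file1
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick0/file1
trusted.afr.test-volume-client-0=0x000000000000000000000000
trusted.afr.test-volume-client-1=0x000000000000000000000000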
Split-brain occurs when a difference exists between multiple copies of the same file, and Red Hat Gluster Storage is unable to determine which version is correct. Applications are restricted from executing certain operations like read and write on the disputed file when split-brain happens. Attempting to access the files results in the application receiving an input/output error on the disputed file.
The three types of split-brain that occur in Red Hat Gluster Storage are:
  • Data split-brain: the contents of the file differ across the bricks in the replica set and automatic healing is not possible.
  • Metadata split-brain: the metadata of the file (for example, user-defined extended attributes, permissions, or ownership) differs across the bricks and automatic healing is not possible.
  • Entry split-brain: the file has a different GFID on each brick of the replica, or there is a type mismatch (for example, the entry is a file on one brick and a directory on another). This is also referred to as entry/type-mismatch split-brain in this chapter.

11.15.1. Preventing Split-brain

To prevent split-brain in the trusted storage pool, you must configure server-side and client-side quorum.

11.15.1.1. Configuring Server-Side Quorum

The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain. If an additional failure occurs, the trusted storage pool will become unavailable. If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.
After configuring the quorum ratio at the trusted storage pool level, you must enable the quorum on a particular volume by setting cluster.server-quorum-type volume option as server. For more information on this volume option, see Section 11.1, “Configuring Volume Options”.
Configuration of the quorum is necessary to prevent network partitions in the trusted storage pool. A network partition is a scenario where a small set of nodes can communicate with each other across a functioning part of a network, but cannot communicate with a different set of nodes in another part of the network. This can cause undesirable situations, such as split-brain in a distributed system. To prevent a split-brain situation, all the nodes in at least one of the partitions must stop running to avoid inconsistencies.
This quorum is on the server side, that is, the glusterd service. Whenever the glusterd service on a machine observes that the quorum is not met, it brings down the bricks to prevent data split-brain. When the network connections are brought back up and the quorum is restored, the bricks in the volume are brought back up. When the quorum is not met for a volume, any commands that update the volume configuration, or that add or detach peers, are not allowed. Note that the glusterd service not running and the network connection between two machines being down are treated equally.
You can configure the quorum percentage ratio for a trusted storage pool. If the percentage ratio of the quorum is not met due to network outages, the bricks of the volume participating in the quorum in those nodes are taken offline. By default, the quorum is met if the percentage of active nodes is more than 50% of the total storage nodes. However, if the quorum ratio is manually configured, then the quorum is met only if the percentage of active storage nodes of the total storage nodes is greater than or equal to the set value.
To configure the quorum ratio, use the following command:
# gluster volume set all cluster.server-quorum-ratio PERCENTAGE
For example, to set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
You must enable the quorum on a particular volume for it to participate in the server-side quorum, by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server

Important

For a two-node trusted storage pool, it is important to set the quorum ratio to be greater than 50% so that two nodes separated from each other do not both believe they have a quorum.
For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node is also taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective. To prevent this situation, a dummy node that does not contain any bricks can be added to the trusted storage pool. This ensures that even if one of the nodes which contains data goes offline, the other node remains online. Note that if the dummy node and one of the data nodes go offline, the brick on the other node is also taken offline, resulting in data unavailability.
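For example, assuming an additional host named dummy-node that is reachable from the existing nodes, the following command adds it to the trusted storage pool as a brick-less peer (the host name is illustrative):
# gluster peer probe dummy-node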

11.15.1.2. Configuring Client-Side Quorum

By default, when replication is configured, clients can modify files as long as at least one brick in the replica group is available. If network partitioning occurs, different clients are only able to connect to different bricks in a replica set, potentially allowing different clients to modify a single file simultaneously.
For example, imagine a three-way replicated volume is accessed by two clients, C1 and C2, who both want to modify the same file. If network partitioning occurs such that client C1 can only access brick B1, and client C2 can only access brick B2, then both clients are able to modify the file independently, creating split-brain conditions on the volume. The file becomes unusable, and manual intervention is required to correct the issue.
Client-side quorum allows administrators to set a minimum number of bricks that a client must be able to access in order to allow data in the volume to be modified. If client-side quorum is not met, files in the replica set are treated as read-only. This is useful when three-way replication is configured.
Client-side quorum is configured on a per-volume basis, and applies to all replica sets in a volume. If client-side quorum is not met for X of Y replica sets, only those X replica sets are treated as read-only; the remaining replica sets continue to allow data modification.
Previously, the replica subvolume turned read-only when quorum was not met. As of rhgs-3.4.3, the subvolume instead becomes unavailable, because all file operations fail with an ENOTCONN error rather than EROFS. This means the cluster.quorum-reads volume option is also no longer supported.

Client-Side Quorum Options

cluster.quorum-count
The minimum number of bricks that must be available in order for writes to be allowed. This is set on a per-volume basis. Valid values are between 1 and the number of bricks in a replica set. This option is used by the cluster.quorum-type option to determine write behavior.
This option is used in conjunction with the cluster.quorum-type=fixed option to specify the number of bricks that must be active to participate in the quorum. If quorum-type is auto, this option has no significance.
cluster.quorum-type
Determines when the client is allowed to write to a volume. Valid values are fixed and auto.
If cluster.quorum-type is fixed, writes are allowed as long as the number of bricks available in the replica set is greater than or equal to the value of the cluster.quorum-count option.
If cluster.quorum-type is auto, writes are allowed when at least 50% of the bricks in a replica set are available. In a replica set with an even number of bricks, if exactly 50% of the bricks are available, the first brick in the replica set must be available in order for writes to continue.
In a three-way replication setup, it is recommended to set cluster.quorum-type to auto to avoid split-brains. If the quorum is not met, the replica pair becomes read-only.

Example 11.7. Client-Side Quorum

Consider a volume with three replica groups, A, B, and C. If the client-side quorum is not met for replica group A, only replica group A becomes read-only. Replica groups B and C continue to allow data modifications.
Configure the client-side quorum using cluster.quorum-type and cluster.quorum-count options.
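For example, the following commands require at least two bricks of each replica set to be available before writes are allowed; VOLNAME and the count of 2 are placeholders to adapt to your own volume and replica size:
# gluster volume set VOLNAME cluster.quorum-type fixed
# gluster volume set VOLNAME cluster.quorum-count 2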

Important

When you integrate Red Hat Gluster Storage with Red Hat Enterprise Virtualization, client-side quorum is enabled when you run the gluster volume set VOLNAME group virt command. In a two-replica setup, if the first brick in the replica pair is offline, virtual machines are paused because quorum is not met and writes are disallowed.
Consistency is achieved at the cost of fault tolerance. If fault-tolerance is preferred over consistency, disable client-side quorum with the following command:
# gluster volume reset VOLNAME quorum-type
Example - Setting up server-side and client-side quorum to avoid split-brain scenario

This example describes how to set server-side and client-side quorum on a Distributed-Replicate volume to avoid split-brain scenarios. The configuration in this example is a 3 x 3 (9 bricks) Distributed-Replicate setup.

# gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 0df52d58-bded-4e5d-ac37-4c82f7c89cfh
Status: Created
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: server1:/rhgs/brick1
Brick2: server2:/rhgs/brick2
Brick3: server3:/rhgs/brick3
Brick4: server4:/rhgs/brick4
Brick5: server5:/rhgs/brick5
Brick6: server6:/rhgs/brick6
Brick7: server7:/rhgs/brick7
Brick8: server8:/rhgs/brick8
Brick9: server9:/rhgs/brick9
Setting Server-side Quorum
Enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server
Set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
Setting Client-side Quorum
Set the quorum-type option to auto to allow writes to the file only if the percentage of active replicate bricks is more than 50% of the total number of bricks that constitute that replica.
# gluster volume set VOLNAME quorum-type auto
In this example, as there are three bricks in each replica set, at least two bricks must be up and running to allow writes.

Important

At least n/2 bricks need to be up for the quorum to be met. If the number of bricks (n) in a replica set is an even number, the n/2 bricks that are up must include the first (primary) brick of the replica set. If n is an odd number, any n/2 bricks can be up; the primary brick need not be up and running to allow writes. For example, in a two-brick replica set the first brick must be up, whereas in a three-brick replica set any two bricks are sufficient.

11.15.2. Recovering from File Split-brain

You can recover from data and metadata split-brain using one of the following methods:
  • From the mount point, as described in Section 11.15.2.1, “Recovering File Split-brain from the Mount Point”.
  • From the gluster CLI, as described in Section 11.15.2.2, “Recovering File Split-brain from the gluster CLI”.
For information on resolving entry/type-mismatch split-brain, see Chapter 23, Manually Recovering File Split-brain .

11.15.2.1.  Recovering File Split-brain from the Mount Point

Steps to recover from a split-brain from the mount point

  1. You can use a set of getfattr and setfattr commands to detect the data and meta-data split-brain status of a file and resolve split-brain from the mount point.

    Important

    This process for split-brain resolution from the mount point will not work on NFS mounts, as NFS does not provide extended attribute support.
    In this example, the test-volume volume has bricks brick0, brick1, brick2 and brick3.
    # gluster volume info test-volume
    Volume Name: test-volume
    Type: Distributed-Replicate
    Status: Started
    Number of Bricks: 2 x 2 = 4
    Transport-type: tcp
    Bricks:
    Brick1: test-host:/rhgs/brick0
    Brick2: test-host:/rhgs/brick1
    Brick3: test-host:/rhgs/brick2
    Brick4: test-host:/rhgs/brick3
    Directory structure of the bricks is as follows:
    # tree -R /rhgs/brick?
    /rhgs/brick0
    ├── dir
    │   └── a
    └── file100
    
    /rhgs/brick1
    ├── dir
    │   └── a
    └── file100
    
    /rhgs/brick2
    ├── dir
    ├── file1
    ├── file2
    └── file99
    
    /rhgs/brick3
    ├── dir
    ├── file1
    ├── file2
    └── file99
    In the following output, some of the files in the volume are in split-brain.
    # gluster volume heal test-volume info split-brain
    Brick test-host:/rhgs/brick0/
    /file100
    /dir
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick1/
    /file100
    /dir
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick2/
    /file99
    <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
    Number of entries in split-brain: 2
    
    Brick test-host:/rhgs/brick3/
    <gfid:05c4b283-af58-48ed-999e-4d706c7b97d5>
    <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696>
    Number of entries in split-brain: 2
    To know data or meta-data split-brain status of a file:
    # getfattr -n replica.split-brain-status <path-to-file>
    The above command, executed from the mount point, indicates whether a file is in data or metadata split-brain. This command is not applicable to entry/type-mismatch split-brain.
    For example,
    • file100 is in meta-data split-brain. Executing the above mentioned command for file100 gives :
      # getfattr -n replica.split-brain-status file100
      # file: file100
      replica.split-brain-status="data-split-brain:no    metadata-split-brain:yes    Choices:test-client-0,test-client-1"
    • file1 is in data split-brain.
      # getfattr -n replica.split-brain-status file1
      # file: file1
      replica.split-brain-status="data-split-brain:yes    metadata-split-brain:no    Choices:test-client-2,test-client-3"
    • file99 is in both data and meta-data split-brain.
      # getfattr -n replica.split-brain-status file99
      # file: file99
      replica.split-brain-status="data-split-brain:yes    metadata-split-brain:yes    Choices:test-client-2,test-client-3"
    • dir is in entry/type-mismatch split-brain but, as mentioned earlier, the above command does not detect entry/type-mismatch split-brain. Hence, the command displays The file is not under data or metadata split-brain. For information on resolving entry/type-mismatch split-brain, see Chapter 23, Manually Recovering File Split-brain.
      # getfattr -n replica.split-brain-status dir
      # file: dir
      replica.split-brain-status="The file is not under data or metadata split-brain"
    • file2 is not in any kind of split-brain.
      # getfattr -n replica.split-brain-status file2
      # file: file2
      replica.split-brain-status="The file is not under data or metadata split-brain"
  2. Analyze the files in data and meta-data split-brain and resolve the issue

    When you perform operations like cat, getfattr, and so on from the mount on files in split-brain, they return an input/output error. To analyze such files further, you can use the setfattr command.

    # setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>
    Using this command, a particular brick can be chosen to access the file in split-brain.
    For example,
    file1 is in data split-brain, and trying to read from the file throws an input/output error.
    # cat file1
    cat: file1: Input/output error
    The split-brain choices provided for file1 are test-client-2 and test-client-3.
    Setting test-client-2 as the split-brain choice for file1 serves reads of the file from brick2.
    # setfattr -n replica.split-brain-choice -v test-client-2 file1
    Now, you can perform operations on the file. For example, read operations on the file:
    # cat file1
    xyz
    Similarly, to inspect the file from the other choice, set replica.split-brain-choice to test-client-3.
    Trying to inspect the file from an incorrect choice errors out. To undo a split-brain-choice that has been set, use the above setfattr command with none as the value for the extended attribute.
    For example,
    # setfattr -n replica.split-brain-choice -v none file1
    Now performing cat operation on the file will again result in input/output error, as before.
    # cat file1
    cat: file1: Input/output error
    After you decide which brick to use as a source for resolving the split-brain, it must be set for the healing to be done.
    # setfattr -n replica.split-brain-heal-finalize -v <heal-choice> <path-to-file>
    Example
    # setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1
    The above process can be used to resolve data and/or meta-data split-brain on all the files.
    Setting the split-brain-choice on the file
    After setting the split-brain-choice on the file, the file can be analyzed only for five minutes. If the duration of analyzing the file needs to be increased, use the following command and set the required time in the timeout-in-minutes argument.
    # setfattr -n replica.split-brain-choice-timeout -v <timeout-in-minutes> <mount_point/file>
    This is a global timeout and is applicable to all files as long as the mount exists. The timeout does not need to be set each time a file is inspected, but it must be set again the first time on a new mount. This option becomes invalid if operations such as add-brick or remove-brick are performed.
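    For example, to extend the inspection window to 10 minutes, you could run the following; the mount path and file name are illustrative:
    # setfattr -n replica.split-brain-choice-timeout -v 10 /mnt/glusterfs/file1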

    Note

    If the fopen-keep-cache FUSE mount option is disabled, then the inode must be invalidated each time before selecting a new replica.split-brain-choice to inspect a file, using the following command:
    # setfattr -n inode-invalidate -v 0 <path-to-file>

11.15.2.2. Recovering File Split-brain from the gluster CLI

You can resolve the split-brain from the gluster CLI in one of the following ways:
  • Use bigger-file as source
  • Use the file with latest mtime as source
  • Use one replica as source for a particular file
  • Use one replica as source for all files

    Note

    The entry/type-mismatch split-brain resolution is not supported using CLI. For information on resolving entry/type-mismatch split-brain, see Chapter 23, Manually Recovering File Split-brain .
Selecting the bigger-file as source

This method is useful for per-file healing, where you can decide that the file with the bigger size is to be considered the source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by running the stat and md5sum commands on the file from the bricks. The following is the stat and md5sum output of a file:
    On brick b1:
    # stat b1/dir/file1
      File: ‘b1/dir/file1’
      Size: 17              Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919362      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:55:40.149897333 +0530
    Modify: 2015-03-06 13:55:37.206880347 +0530
    Change: 2015-03-06 13:55:37.206880347 +0530
     Birth: -
    
    # md5sum b1/dir/file1
    040751929ceabf77c3c0b3b662f341a8  b1/dir/file1
    
    On brick b2:
    # stat b2/dir/file1
      File: ‘b2/dir/file1’
      Size: 13              Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919365      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:54:22.974451898 +0530
    Modify: 2015-03-06 13:52:22.910758923 +0530
    Change: 2015-03-06 13:52:22.910758923 +0530
     Birth: -
    
    # md5sum b2/dir/file1
    cb11635a45d45668a403145059c2a0d5  b2/dir/file1
    You can notice the differences in the file size and md5 checksums.
  2. Execute the following command with either the full file name as seen from the root of the volume, or the gfid-string representation of the file, as displayed in the heal info command's output.
    # gluster volume heal <VOLNAME> split-brain bigger-file <FILE>
    For example,
    # gluster volume heal test-volume split-brain bigger-file /dir/file1
    Healed /dir/file1.
After the healing is complete, the md5sum and file size on both bricks must be the same. The following is a sample output of the stat and md5sum commands after the file is healed.
On brick b1:
# stat b1/dir/file1
  File: ‘b1/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919362      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:27.752429505 +0530
Modify: 2015-03-06 13:55:37.206880347 +0530
Change: 2015-03-06 14:17:12.880343950 +0530
 Birth: -

# md5sum b1/dir/file1
040751929ceabf77c3c0b3b662f341a8  b1/dir/file1

On brick b2:
# stat b2/dir/file1
  File: ‘b2/dir/file1’
  Size: 17              Blocks: 16         IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 919365      Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-06 14:17:23.249403600 +0530
Modify: 2015-03-06 13:55:37.206880000 +0530
Change: 2015-03-06 14:17:12.881343955 +0530
 Birth: -

# md5sum b2/dir/file1
040751929ceabf77c3c0b3b662f341a8  b2/dir/file1
Selecting the file with latest mtime as source

This method is useful for per-file healing, when you want the file with the latest mtime to be considered the source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by running the stat and md5sum commands on the file from the bricks. The following is the stat and md5sum output of a file:
    On brick b1:
    
    # stat b1/file4
      File: ‘b1/file4’
        Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:53:19.417085062 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 13:53:19.426085114 +0530
     Birth: -
    
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    
    # stat b2/file4
      File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:52:35.761833096 +0530
    Modify: 2015-03-06 13:52:35.769833142 +0530
    Change: 2015-03-06 13:52:35.769833142 +0530
     Birth: -
    
    
    # md5sum b2/file4
    0bee89b07a248e27c83fc3d5951213c1  b2/file4
    You can notice the differences in the md5 checksums, and the modify time.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>
    In this command, FILE can be either the full file name as seen from the root of the volume or the gfid-string representation of the file.
    For example,
    # gluster volume heal test-volume split-brain latest-mtime /file4
    Healed /file4
    
    After the healing is complete, the md5 checksum, file size, and modify time on both bricks must be the same. The following is a sample output of the stat and md5sum commands after the file is healed. You can notice that the file has been healed using the brick having the latest mtime (brick b1, in this example) as the source.
    On brick b1:
    # stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609863 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 14:27:15.058927962 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    # stat b2/file4
     File: ‘b2/file4’
       Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609000 +0530
    Modify: 2015-03-06 13:53:19.426085000 +0530
    Change: 2015-03-06 14:27:15.059927968 +0530
     Birth:
    
    # md5sum b2/file4
    b6273b589df2dfdbd8fe35b1011e3183  b2/file4
Selecting one replica as source for a particular file

This method is useful if you know which file is to be considered as source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    Brick <hostname:brickpath-b1>
    <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2>
    <gfid:39f301ae-4038-48c2-a889-7dac143e82dd>
    <gfid:c3c94de2-232d-4083-b534-5da17fc476ac>
    Number of entries in split-brain: 3
    
    Brick <hostname:brickpath-b2>
    /dir/file1
    /dir
    /file4
    Number of entries in split-brain: 3
    From the command output, identify the files that are in split-brain.
    You can find the differences in the file size and md5 checksums by running the stat and md5sum commands on the file from the bricks. The following is the stat and md5sum output of a file:
    On brick b1:
    
    # stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:53:19.417085062 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 13:53:19.426085114 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    
    # stat b2/file4
      File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 13:52:35.761833096 +0530
    Modify: 2015-03-06 13:52:35.769833142 +0530
    Change: 2015-03-06 13:52:35.769833142 +0530
     Birth: -
    
    # md5sum b2/file4
    0bee89b07a248e27c83fc3d5951213c1  b2/file4
    You can notice the differences in the file size and md5 checksums.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>
    In this command, FILE present in <HOSTNAME:BRICKNAME> is taken as source for healing.
    For example,
    # gluster volume heal test-volume split-brain source-brick test-host:b1 /file4
    Healed /file4
    After the healing is complete, the md5 checksum and file size on both bricks must be the same. The following is a sample output of the stat and md5sum commands after the file is healed.
    On brick b1:
    # stat b1/file4
      File: ‘b1/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919356      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609863 +0530
    Modify: 2015-03-06 13:53:19.426085114 +0530
    Change: 2015-03-06 14:27:15.058927962 +0530
     Birth: -
    
    # md5sum b1/file4
    b6273b589df2dfdbd8fe35b1011e3183  b1/file4
    
    On brick b2:
    # stat b2/file4
     File: ‘b2/file4’
      Size: 4               Blocks: 16         IO Block: 4096   regular file
    Device: fd03h/64771d    Inode: 919358      Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2015-03-06 14:23:38.944609000 +0530
    Modify: 2015-03-06 13:53:19.426085000 +0530
    Change: 2015-03-06 14:27:15.059927968 +0530
     Birth: -
    
    # md5sum b2/file4
    b6273b589df2dfdbd8fe35b1011e3183  b2/file4
Selecting one replica as source for all files

This method is useful if you want to use a particular brick as the source for all the split-brain files in that replica pair.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    From the command output, identify the files that are in split-brain.
  2. Execute the following command
    # gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>
    In this command, for all the files that are in split-brain in this replica, <HOSTNAME:BRICKNAME> is taken as source for healing.
    For example,
    # gluster volume heal test-volume split-brain source-brick test-host:b1
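    After the heal completes, you can re-run the split-brain listing to confirm that no entries remain. The following output is an illustrative sketch for this example:
    # gluster volume heal test-volume info split-brain
    Brick test-host:b1
    Number of entries in split-brain: 0
    
    Brick test-host:b2
    Number of entries in split-brain: 0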

11.15.3. Recovering GFID Split-brain from the gluster CLI

With this release, Red Hat Gluster Storage allows you to resolve GFID split-brain from the gluster CLI.
You can use one of the following policies to resolve GFID split-brain:
  • Use bigger-file as source
  • Use the file with latest mtime as source
  • Use one replica as source for a particular file

Note

The entry/type-mismatch split-brain resolution is not supported using CLI. For information on resolving entry/type-mismatch split-brain, see Chapter 23, Manually Recovering File Split-brain .
Selecting the bigger-file as source

This method is useful for per-file healing, where you can decide that the file with the bigger size is to be considered the source.

  1. Run the following command to obtain the path of the file that is in split-brain:
      # gluster volume heal VOLNAME info split-brain
    From the output, identify the files for which file operations performed from the client failed with input/output error.
    For example,
    # gluster volume heal 12 info split-brain
    Brick 10.70.47.45:/bricks/brick2/b0
    /f5
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    Brick 10.70.47.144:/bricks/brick2/b1
    /f5
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    In the above command, 12 is the volume name, b0 and b1 are the bricks.
  2. Execute the following command on the brick to determine whether a file is in GFID split-brain. The getfattr command is used to obtain and verify the AFR changelog extended attributes of the files.
        # getfattr -d -e hex -m. <path-to-file>
    For example,
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f5
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f5
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-1=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0xce0a9956928e40afb78e95f78defd64f
    trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6635
    
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f5
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f5
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-0=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0x9563544118653550e888ab38c232e0c
    trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6635
    
    You can notice the difference in GFID for the file f5 in both the bricks.
    You can find the differences in the file size by executing the stat command on the file from the bricks. The following is the output for the file f5 on bricks b0 and b1:
    On brick /b0
    
    # stat /bricks/brick2/b0/f5
    File: ‘/bricks/brick2/b0/f5’
    Size: 15            Blocks: 8          IO Block: 4096   regular file
    Device: fd15h/64789d    Inode: 67113350    Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Context: system_u:object_r:glusterd_brick_t:s0
    Access: 2018-08-29 20:46:26.353751073 +0530
    Modify: 2018-08-29 20:46:26.361751203 +0530
    Change: 2018-08-29 20:47:16.363751236 +0530
    Birth: -
    
    
    
    On brick /b1
    
    # stat /bricks/brick2/b1/f5
    File: ‘/bricks/brick2/b1/f5’
    Size: 2             Blocks: 8          IO Block: 4096   regular file
    Device: fd15h/64789d    Inode: 67111750    Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Context: system_u:object_r:glusterd_brick_t:s0
    Access: 2018-08-29 20:44:56.153301616 +0530
    Modify: 2018-08-29 20:44:56.161301745 +0530
    Change: 2018-08-29 20:44:56.162301761 +0530
    Birth: -
    
  3. Execute the following command along with the full filename as seen from the root of the volume which is displayed in the heal info command's output:
    # gluster volume heal VOLNAME split-brain bigger-file FILE
    For example,
    # gluster volume heal 12 split-brain bigger-file /f5
    GFID split-brain resolved for file /f5
    
    After the healing is complete, the file size on both bricks must be the same as that of the file which had the bigger size. The following is a sample output of the getfattr command after completion of healing the file.
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f5
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f5
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xce0a9956928e40afb78e95f78defd64f
    trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6635
    
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f5
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f5
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xce0a9956928e40afb78e95f78defd64f
    trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6635
    
Selecting the file with latest mtime as source

This method is useful for per-file healing, when you want the file with the latest mtime to be considered the source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    From the output, identify the files for which file operations performed from the client failed with input/output error.
    For example,
    # gluster volume heal 12 info split-brain
    Brick 10.70.47.45:/bricks/brick2/b0
    /f4
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    Brick 10.70.47.144:/bricks/brick2/b1
    /f4
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    In the above command, 12 is the volume name, b0 and b1 are the bricks.
  2. The following command, executed on the backend brick, indicates whether a file is in GFID split-brain.
    # getfattr -d -e hex -m. <path-to-file>
    For example,
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f4
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f4
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-1=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0xb66b66d07b315f3c9cffac2fb6422a28
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f4
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f4
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-0=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0x87242f808c6e56a007ef7d49d197acff
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    You can notice the difference in GFID for the file f4 in both the bricks.
    You can find the difference in the modify time by executing the stat command on the file from the bricks. The following is the output for the file f4 on bricks b0 and b1:
    On brick /b0
    
    # stat /bricks/brick2/b0/f4
    File: ‘/bricks/brick2/b0/f4’
    Size: 14            Blocks: 8          IO Block: 4096   regular file
    Device: fd15h/64789d    Inode: 67113349    Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Context: system_u:object_r:glusterd_brick_t:s0
    Access: 2018-08-29 20:57:38.913629991 +0530
    Modify: 2018-08-29 20:57:38.921630122 +0530
    Change: 2018-08-29 20:57:38.923630154 +0530
    Birth: -
    
    
    
    On brick /b1
    
    # stat /bricks/brick2/b1/f4
    File: ‘/bricks/brick2/b1/f4’
    Size: 2             Blocks: 8          IO Block: 4096   regular file
    Device: fd15h/64789d    Inode: 67111749    Links: 2
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Context: system_u:object_r:glusterd_brick_t:s0
    Access: 2018-08-24 20:54:50.953217256 +0530
    Modify: 2018-08-24 20:54:50.961217385 +0530
    Change: 2018-08-24 20:54:50.962217402 +0530
    Birth: -
    
  3. Execute the following command:
    # gluster volume heal VOLNAME split-brain latest-mtime FILE
    For example,
    # gluster volume heal 12 split-brain latest-mtime /f4
    GFID split-brain resolved for file /f4
    
    After the healing is complete, the GFID of the files on both bricks must be the same. The following is a sample output of the getfattr command after completion of healing the file. You can notice that the file has been healed using the brick having the latest mtime as the source.
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f4
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f4
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xb66b66d07b315f3c9cffac2fb6422a28
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f4
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f4
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xb66b66d07b315f3c9cffac2fb6422a28
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
Selecting one replica as source for a particular file

This method is useful if you know which file is to be considered as source.

  1. Run the following command to obtain the list of files that are in split-brain:
    # gluster volume heal VOLNAME info split-brain
    From the output, identify the files for which file operations performed from the client failed with input/output error.
    For example,
    # gluster volume heal 12 info split-brain
    Brick 10.70.47.45:/bricks/brick2/b0
    /f3
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    Brick 10.70.47.144:/bricks/brick2/b1
    /f3
    / - Is in split-brain
    
    Status: Connected
    Number of entries: 2
    
    In the above command, 12 is the volume name, b0 and b1 are the bricks.

    Note

    With the one replica as source option, there is no way to resolve all GFID split-brain entries in one shot by omitting the file path in the CLI, as can be done for data and metadata split-brain resolution.
    For each file in GFID split-brain, you have to run the heal command separately.
  2. The following command, executed on the backend brick, indicates whether a file is in GFID split-brain.
     # getfattr -d -e hex -m. <path-to-file>
    For example,
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f3
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f3
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-1=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0x9d542fb1b3b15837a2f7f9dcdf5d6ee8
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f3
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f3
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.afr.12-client-1=0x000000020000000100000000
    trusted.afr.dirty=0x000000000000000000000000
    trusted.gfid=0xc90d9b0f65f6530b95b9f3f8334033df
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    You can notice the difference in GFID for the file f3 in both the bricks.
  3. Execute the following command:
    # gluster volume heal VOLNAME split-brain source-brick HOSTNAME:export-directory-absolute-path FILE
    In this command, FILE present in HOSTNAME:export-directory-absolute-path is taken as source for healing.
    For example,
    # gluster volume heal 12 split-brain source-brick 10.70.47.144:/bricks/brick2/b1 /f3
    GFID split-brain resolved for file /f3
    
    After the healing is complete, the GFID of the file on both the bricks must be the same as that of the file on the brick that was used as the source for healing. The following is a sample output of the getfattr command after the file is healed.
    On brick /b0
    
    # getfattr -d -m . -e hex /bricks/brick2/b0/f3
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b0/f3
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xc90d9b0f65f6530b95b9f3f8334033df
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    
    
    On brick /b1
    
    # getfattr -d -m . -e hex /bricks/brick2/b1/f3
    getfattr: Removing leading '/' from absolute path names
    # file: bricks/brick2/b1/f3
    security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
    trusted.gfid=0xc90d9b0f65f6530b95b9f3f8334033df
    trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
    
    

    Note

    You cannot use the GFID of the file as an argument with any of the CLI options to resolve GFID split-brain. It should be the absolute path, as seen from the mount point, of the file considered as source.
    With the source-brick option there is no way to resolve all GFID split-brain entries in one shot by omitting the file path in the CLI, as can be done while resolving data or metadata split-brain. For each file in GFID split-brain, run the CLI with the policy you want to use.
    Resolving directory GFID split-brain using the CLI with the source-brick option in a distributed-replicated volume needs to be done on all subvolumes explicitly. Because directories are created on all the subvolumes, using one particular brick as the source for a directory GFID split-brain heals the directories only for that subvolume. The other subvolumes must then be healed using a brick that has the same GFID as the brick that was used as the source for healing the first subvolume. For information on resolving entry/type-mismatch split-brain, see Chapter 23, Manually Recovering File Split-brain.

11.15.4. Triggering Self-Healing on Replicated Volumes

For replicated volumes, when a brick goes offline and comes back online, self-healing is required to re-sync all the replicas. There is a self-heal daemon which runs in the background, and automatically initiates self-healing every 10 minutes on any files which require healing.
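To verify that the self-heal daemon is running, you can check the volume status; the volume name below is illustrative. The output lists a Self-heal Daemon entry for each node alongside the brick processes:
# gluster volume status test-volume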
Multithreaded Self-heal

The self-heal daemon has the capability to handle multiple heals in parallel and is supported on Replicate and Distributed-Replicate volumes. However, increasing the number of parallel heals has an impact on I/O performance, so the following options are provided. The cluster.shd-max-threads volume option controls the number of entries that the self-heal daemon can heal in parallel on each replica. Using the cluster.shd-wait-qlength volume option, you can configure the number of entries that are kept in the queue for self-heal daemon threads to take up as soon as any of the threads are free to heal.

For more information on cluster.shd-max-threads and cluster.shd-wait-qlength volume set options, see Section 11.1, “Configuring Volume Options”.
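For example, you might increase the number of parallel heals and the queue length for a volume as follows; the values shown are illustrative and should be tuned with the I/O impact noted above in mind:
# gluster volume set VOLNAME cluster.shd-max-threads 4
# gluster volume set VOLNAME cluster.shd-wait-qlength 2048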
There are various commands that can be used to check the healing status of volumes and files, or to manually initiate healing:
  • To view the list of files that need healing:
    # gluster volume heal VOLNAME info
    For example, to view the list of files on test-volume that need healing:
    # gluster volume heal test-volume info
    Brick server1:/gfs/test-volume_0
    Number of entries: 0
    
    Brick server2:/gfs/test-volume_1
    /95.txt
    /32.txt
    /66.txt
    /35.txt
    /18.txt
    /26.txt - Possibly undergoing heal
    /47.txt
    /55.txt
    /85.txt - Possibly undergoing heal
    ...
    Number of entries: 101
  • To trigger self-healing only on the files which require healing:
    # gluster volume heal VOLNAME
    For example, to trigger self-healing on files which require healing on test-volume:
    # gluster volume heal test-volume
    Heal operation on volume test-volume has been successful
  • To trigger self-healing on all the files on a volume:
    # gluster volume heal VOLNAME full
    For example, to trigger self-heal on all the files on test-volume:
    # gluster volume heal test-volume full
    Heal operation on volume test-volume has been successful
  • To view the list of files on a volume that are in a split-brain state:
    # gluster volume heal VOLNAME info split-brain
    For example, to view the list of files on test-volume that are in a split-brain state:
    # gluster volume heal test-volume info split-brain
    Brick server1:/gfs/test-volume_2
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    Brick server2:/gfs/test-volume_2
    Number of entries: 12
    at                   path on brick
    ----------------------------------
    2012-06-13 04:02:05  /dir/file.83
    2012-06-13 04:02:05  /dir/file.28
    2012-06-13 04:02:05  /dir/file.69
    ...