13.8. Configuring IPoIB
13.8.1. Understanding the role of IPoIB
Most networks are IP networks. InfiniBand is not. The role of IPoIB is to provide an IP network emulation layer on top of InfiniBand RDMA networks. This allows existing applications to run over InfiniBand networks unmodified. However, the performance of those applications is considerably lower than if the application were written to use RDMA communication natively. Most InfiniBand networks carry some applications that must get all of the performance they can out of the network, and others for which degraded performance is acceptable if it means that the application does not need to be modified to use RDMA communications. IPoIB allows those less critical applications to run on the network as they are.
Because iWARP and RoCE/IBoE networks are actually IP networks with RDMA layered on top of their IP link layer, they have no need of IPoIB. As a result, the kernel will refuse to create any IPoIB devices on top of iWARP or RoCE/IBoE RDMA devices.
13.8.2. Understanding IPoIB communication modes
In datagram mode, the IPoIB layer adds a 4-byte IPoIB header on top of the IP packet being transmitted. As a result, the IPoIB MTU must be 4 bytes less than the InfiniBand link-layer MTU. As 2048 is a common InfiniBand link-layer MTU, the common IPoIB device MTU in datagram mode is 2044.
In connected mode, the limit comes from IP itself: an IP packet only has a 16-bit size field, and is therefore limited to 65535 as the maximum byte count. The maximum allowed MTU is actually smaller than that because the various TCP/IP headers must also fit in that size. As a result, the IPoIB MTU in connected mode is capped at 65520 to make sure there is sufficient room for all needed TCP headers.
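The MTU rules for the two modes can be sketched as a small shell function. The name ipoib_mtu is hypothetical and exists only for this illustration; the 4-byte difference and the 65520 cap are the values given above.

```shell
# Hypothetical helper: print the IPoIB MTU for a given mode.
# Usage: ipoib_mtu datagram <link-layer MTU>
#        ipoib_mtu connected
ipoib_mtu() {
    case "$1" in
        datagram)
            # Datagram mode: 4 bytes less than the InfiniBand link-layer MTU.
            echo $(( ${2:-2048} - 4 ))
            ;;
        connected)
            # Connected mode: capped at 65520, independent of the link-layer MTU.
            echo 65520
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

ipoib_mtu datagram 2048
ipoib_mtu connected
```

With a link-layer MTU of 4096, the same function would report 4092 for datagram mode.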
13.8.3. Understanding IPoIB hardware addresses
An IPoIB hardware address is 20 bytes long. The first 4 bytes hold flags and the queue pair number, and the next 8 bytes are the subnet prefix, which is initially 0xfe:80:00:00:00:00:00:00. The device will use the default subnet prefix (0xfe80000000000000) until it makes contact with the subnet manager, at which point it will reset the subnet prefix to match what the subnet manager has configured it to be. The final 8 bytes are the GUID address of the InfiniBand port that the IPoIB device is attached to. Because both the first 4 bytes and the next 8 bytes can change from time to time, they are not used or matched against when specifying the hardware address for an IPoIB interface. Section 13.5.2, “Usage of 70-persistent-ipoib.rules” explains how to derive the address by leaving the first 12 bytes out of the ATTR{address} field in the udev rules file so that device matching will happen reliably. When configuring IPoIB interfaces, the HWADDR field of the configuration file can contain all 20 bytes, but only the last 8 bytes are actually used to match against and find the hardware specified by a configuration file. However, if the TYPE=InfiniBand entry is not spelled correctly in the device configuration file, and as a result ifup-ib is not the script used to open the IPoIB interface, an error is issued about the system being unable to find the hardware specified by the configuration. For IPoIB interfaces, the TYPE= field of the configuration file must be either InfiniBand or infiniband (the entry is case sensitive, but the scripts will accept these two specific spellings).
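As an illustration of which part of the address matters for matching, the following sketch prints the last 8 bytes of a 20-byte address. The function name ipoib_guid is hypothetical, and the input is assumed to be written as colon-separated hexadecimal bytes, as in the HWADDR field.

```shell
# Hypothetical helper: print the last 8 bytes (the port GUID) of a
# 20-byte IPoIB hardware address given as colon-separated hex bytes.
ipoib_guid() {
    # Fields 13 through 20 are the final 8 of the 20 bytes.
    printf '%s\n' "$1" | cut -d: -f13-20
}

ipoib_guid 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a3
```

Two addresses that differ only in their first 12 bytes produce the same result, which is why those bytes can be left out when matching a device.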
13.8.4. Understanding InfiniBand P_Key subnets
An InfiniBand fabric can be logically segmented into P_Key subnets. This is highly analogous to using VLANs on Ethernet interfaces. All switches and hosts must be a member of the default P_Key subnet, but administrators can create additional subnets and limit members of those subnets to subsets of the hosts or switches in the fabric. A P_Key subnet must be defined by the subnet manager before a host can use it. See Section 13.6.4, “Creating a P_Key definition” for information on how to define a P_Key subnet using the opensm subnet manager. For IPoIB interfaces, once a P_Key subnet has been created, additional IPoIB configuration files can be created specifically for those P_Key subnets. Just like VLAN interfaces on Ethernet devices, each IPoIB interface will behave as though it were on a completely different fabric from other IPoIB interfaces that share the same link but have different P_Key values.
One detail needs care when configuring P_Key interfaces. All IPoIB P_Keys range from 0x0000 to 0x7fff, and the high bit, 0x8000, denotes that membership in a P_Key is full membership instead of partial membership. The Linux kernel’s IPoIB driver only supports full membership in P_Key subnets, so for any subnet that Linux can connect to, the high bit of the P_Key number will always be set. That means that if a Linux computer joins P_Key 0x0002, its actual P_Key number once joined will be 0x8002, denoting full membership of P_Key 0x0002. For this reason, when creating a P_Key definition in an opensm partitions.conf file as depicted in Section 13.6.4, “Creating a P_Key definition”, specify the P_Key value without 0x8000, but when defining the P_Key IPoIB interfaces on the Linux clients, add the 0x8000 value to the base P_Key value.
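The relationship between the base value and the full-membership value is plain bit arithmetic, sketched below. The function names pkey_full and pkey_base are hypothetical and used only for illustration.

```shell
# Hypothetical helpers for converting between the two P_Key forms.

# Add the 0x8000 full-membership bit (the form used on Linux clients).
pkey_full() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

# Strip the membership bit (the form written without 0x8000).
pkey_base() {
    printf '0x%04x\n' $(( $1 & 0x7fff ))
}

pkey_full 0x0002
pkey_base 0x8002
```

Applying pkey_full to a value that already has the high bit set leaves it unchanged, so it is safe to call on either form.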
13.8.5. Configure InfiniBand Using the Text User Interface, nmtui
~]$ nmtui
The text user interface appears. Any invalid command prints a usage message.
Figure 13.1. The NetworkManager Text User Interface Add an InfiniBand Connection menu
Figure 13.2. The NetworkManager Text User Interface Configuring an InfiniBand Connection menu
13.8.6. Configure IPoIB using the command-line tool, nmcli
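If the IPoIB devices were renamed using udev rules, the interfaces can be forced to pick up the new names without a reboot by removing the ib_ipoib kernel module and then reloading it as follows:
~]$ rmmod ib_ipoib
~]$ modprobe ib_ipoib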
Example 13.3. Creating and modifying IPoIB in two separate commands.
~]$ nmcli con add type infiniband con-name mlx4_ib0 ifname mlx4_ib0 transport-mode connected mtu 65520
Connection 'mlx4_ib0' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully added.
~]$ nmcli con edit mlx4_ib0
===| nmcli interactive connection editor |===
Editing existing 'infiniband' connection: 'mlx4_ib0'
Type 'help' or '?' for available commands.
Type 'describe [<setting>.<prop>]' for detailed property description.
You may edit the following settings: connection, infiniband, ipv4, ipv6
nmcli> set infiniband.mac-address 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a3
nmcli> save
Connection 'mlx4_ib0' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully updated.
nmcli> quit
Alternatively, run nmcli c add and nmcli c modify in one command, as follows:
Example 13.4. Creating and modifying IPoIB in one command.
~]$ nmcli con add type infiniband con-name mlx4_ib0 ifname mlx4_ib0 transport-mode connected mtu 65520 infiniband.mac-address 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a3
At this point, an IPoIB interface named mlx4_ib0 has been created and set to use connected mode, with the maximum connected mode MTU and DHCP for IPv4 and IPv6. If IPoIB interfaces are used for cluster traffic and an Ethernet interface for out-of-cluster communications, default routes and any default name server on the IPoIB interfaces will probably need to be disabled. This can be done as follows:
~]$ nmcli con edit mlx4_ib0
===| nmcli interactive connection editor |===
Editing existing 'infiniband' connection: 'mlx4_ib0'
Type 'help' or '?' for available commands.
Type 'describe [<setting>.<prop>]' for detailed property description.
You may edit the following settings: connection, infiniband, ipv4, ipv6
nmcli> set ipv4.ignore-auto-dns yes
nmcli> set ipv4.ignore-auto-routes yes
nmcli> set ipv4.never-default true
nmcli> set ipv6.ignore-auto-dns yes
nmcli> set ipv6.ignore-auto-routes yes
nmcli> set ipv6.never-default true
nmcli> save
Connection 'mlx4_ib0' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully updated.
nmcli> quit
If a P_Key interface is required, create one using nmcli as follows:
~]$ nmcli con add type infiniband con-name mlx4_ib0.8002 ifname mlx4_ib0.8002 parent mlx4_ib0 p-key 0x8002
Connection 'mlx4_ib0.8002' (4a9f5509-7bd9-4e89-87e9-77751a1c54b4) successfully added.
~]$ nmcli con modify mlx4_ib0.8002 infiniband.mtu 65520 infiniband.transport-mode connected ipv4.ignore-auto-dns yes ipv4.ignore-auto-routes yes ipv4.never-default true ipv6.ignore-auto-dns yes ipv6.ignore-auto-routes yes ipv6.never-default true
13.8.7. Configure IPoIB Using the command line
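If the IPoIB devices were renamed using udev rules, the interfaces can be forced to pick up the new names without a reboot by removing the ib_ipoib kernel module and then reloading it as follows:
~]$ rmmod ib_ipoib
~]$ modprobe ib_ipoib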
Administrators can then create ifcfg files with their preferred editor to control the devices. A typical IPoIB configuration file with static IPv4 addressing looks as follows:
~]$ more ifcfg-mlx4_ib0
DEVICE=mlx4_ib0
TYPE=InfiniBand
ONBOOT=yes
HWADDR=80:00:00:4c:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1
BOOTPROTO=none
IPADDR=172.31.0.254
PREFIX=24
NETWORK=172.31.0.0
BROADCAST=172.31.0.255
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
MTU=65520
CONNECTED_MODE=yes
NAME=mlx4_ib0
The DEVICE field must match the custom name created in any udev renaming rules. The NAME entry need not match the device name. If the GUI connection editor is started, the NAME field is what is used to present a name for this connection to the user. The TYPE field must be InfiniBand in order for InfiniBand options to be processed properly. CONNECTED_MODE is either yes or no, where yes will use connected mode and no will use datagram mode for communications (see Section 13.8.2, “Understanding IPoIB communication modes”).
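A quick sanity check of these fields can be sketched in shell. This is not part of the network scripts: the function name check_ipoib_ifcfg is hypothetical, and the sketch assumes the file contains only plain KEY=value lines that a shell can read as variable assignments. It checks the two accepted spellings of TYPE and the connected mode MTU cap of 65520 described in this section.

```shell
# Hypothetical check (not part of initscripts): validate the TYPE and MTU
# fields of an IPoIB ifcfg file. Pass a path that contains a slash.
check_ipoib_ifcfg() {
    (
        # Work in a subshell so the file's variables do not leak out.
        unset TYPE CONNECTED_MODE MTU
        . "$1" || exit 1
        case "$TYPE" in
            InfiniBand|infiniband) ;;
            *) echo "TYPE must be InfiniBand or infiniband" >&2; exit 1 ;;
        esac
        if [ "$CONNECTED_MODE" = yes ] && [ "${MTU:-0}" -gt 65520 ]; then
            echo "MTU exceeds the connected mode cap of 65520" >&2
            exit 1
        fi
        echo "ok"
    )
}

# Demonstration with a temporary file holding a few fields from the example.
tmp=$(mktemp)
printf 'TYPE=InfiniBand\nMTU=65520\nCONNECTED_MODE=yes\n' > "$tmp"
check_ipoib_ifcfg "$tmp"
rm -f "$tmp"
```

The function prints ok and returns success when both checks pass, and returns a non-zero status with a message on standard error otherwise.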
For P_Key interfaces, this is a typical configuration file:
~]$ more ifcfg-mlx4_ib0.8002
DEVICE=mlx4_ib0.8002
PHYSDEV=mlx4_ib0
PKEY=yes
PKEY_ID=2
TYPE=InfiniBand
ONBOOT=yes
HWADDR=80:00:00:4c:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1
BOOTPROTO=none
IPADDR=172.31.2.254
PREFIX=24
NETWORK=172.31.2.0
BROADCAST=172.31.2.255
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
MTU=65520
CONNECTED_MODE=yes
NAME=mlx4_ib0.8002
For all P_Key interface files, the PHYSDEV directive is required and must be the name of the parent device. The PKEY directive must be set to yes, and PKEY_ID must be the number of the interface (either with or without the 0x8000 membership bit added in). The device name, however, must be the four-digit hexadecimal representation of the PKEY_ID combined with the 0x8000 membership bit using the bitwise OR operator as follows:
NAME=${PHYSDEV}.$((0x8000 | $PKEY_ID))
By default, the PKEY_ID in the file is treated as a decimal number and converted to hexadecimal and then combined using the bitwise OR operator with 0x8000 to arrive at the proper name for the device, but users may specify the PKEY_ID in hexadecimal by prepending the standard 0x prefix to the number.
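The naming rule can be tried out in a shell. Note that shell arithmetic expansion on its own prints a decimal result (32770 for a PKEY_ID of 2), so this sketch uses printf with %04x to produce the four hexadecimal digits. The function name pkey_devname is hypothetical, and the code is an illustration of the rule, not the code the network scripts run.

```shell
# Hypothetical helper: derive the P_Key device name.
#   $1 = parent device (PHYSDEV)
#   $2 = PKEY_ID, decimal or hexadecimal with a 0x prefix
pkey_devname() {
    # OR in the 0x8000 membership bit, then format as four hex digits.
    printf '%s.%04x\n' "$1" $(( 0x8000 | $2 ))
}

pkey_devname mlx4_ib0 2
pkey_devname mlx4_ib0 0x8002
```

Both calls yield the same device name, matching the statement that PKEY_ID may be given with or without the membership bit.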
13.8.8. Testing an RDMA network after IPoIB is configured
Once IPoIB is configured, it is possible to use IP addresses to specify RDMA devices. Because IP addresses and host names are so widely used to specify machines, most RDMA applications use them as their preferred, or in some cases only, way of specifying remote machines or local devices to connect to.
To test the IPoIB layer, use any standard IP network test tool and provide the IP address of the IPoIB devices to be tested. For example, the ping command between the IP addresses of the IPoIB devices should now work.
Even when the remote host is specified using the IP address or host name of the IPoIB device, the test application is allowed to connect through a different RDMA interface. This is because the process of converting from the host name or IP address to an RDMA address allows any valid RDMA address pair between the two machines to be used. If there are multiple ways for the client to connect to the server, the programs might choose a different path if there is a problem with the path specified. For example, if there are two ports on each machine connected to the same InfiniBand subnet, and an IP address for the second port on each machine is given, it is likely that the program will find that the first port on each machine is a valid connection method and use them instead. In this case, command-line options to any of the perftest programs can be used to tell them which card and port to bind to, as was done with ibping in Section 13.7, “Testing Early InfiniBand RDMA operation”, to ensure that testing occurs over the specific ports required.
For qperf, the method of binding to ports is slightly different. The qperf program operates as a server on one machine, listening on all devices (including non-RDMA devices). The client may connect to qperf using any valid IP address or host name for the server. qperf will first attempt to open a data connection and run the requested test(s) over the IP address or host name given on the client command line, but if there is any problem using that address, qperf will fall back to attempting to run the test on any valid path between the client and server. For this reason, use the -loc_id and -rem_id options to the qperf client to force the test to run on a specific link.
13.8.9. Configure IPoIB Using a GUI
Procedure 13.4. Adding a New InfiniBand Connection Using nm-connection-editor
- Enter nm-connection-editor in a terminal:
~]$ nm-connection-editor
- Click the Add button. The Choose a Connection Type window appears. Select InfiniBand and click Create. The Editing InfiniBand connection 1 window appears.
- On the InfiniBand tab, select the transport mode you want to use for the InfiniBand connection from the drop-down list.
- Enter the InfiniBand MAC address.
- Review and confirm the settings and then click the Save button.
- Edit the InfiniBand-specific settings by referring to Section 13.8.9.1, “Configuring the InfiniBand Tab”.
Procedure 13.5. Editing an Existing InfiniBand Connection
- Enter nm-connection-editor in a terminal:
~]$ nm-connection-editor
- Select the connection you want to edit and click the Edit button.
- Select the General tab.
- Configure the connection name, auto-connect behavior, and availability settings. Five settings in the Editing dialog are common to all connection types; see the General tab:
- Connection name — Enter a descriptive name for your network connection. This name will be used to list this connection in the menu of the Network window.
- Automatically connect to this network when it is available — Select this box if you want NetworkManager to auto-connect to this connection when it is available. See the section called “Editing an Existing Connection with control-center” for more information.
- All users may connect to this network — Select this box to create a connection available to all users on the system. Changing this setting may require root privileges. See Section 3.4.5, “Managing System-wide and Private Connection Profiles with a GUI” for details.
- Automatically connect to VPN when using this connection — Select this box if you want NetworkManager to auto-connect to a VPN connection when it is available. Select the VPN from the drop-down menu.
- Firewall Zone — Select the Firewall Zone from the drop-down menu. See the Red Hat Enterprise Linux 7 Security Guide for more information on Firewall Zones.
- Edit the InfiniBand-specific settings by referring to Section 13.8.9.1, “Configuring the InfiniBand Tab”.
Saving Your New (or Modified) Connection and Making Further Configurations
To configure:
- IPv4 settings for the connection, click the IPv4 Settings tab and proceed to Section 5.4, “Configuring IPv4 Settings”; or
- IPv6 settings for the connection, click the IPv6 Settings tab and proceed to Section 5.5, “Configuring IPv6 Settings”.
13.8.9.1. Configuring the InfiniBand Tab
- Transport mode
- Datagram or Connected mode can be selected from the drop-down list. Select the same mode the rest of your IPoIB network is using.
- Device MAC address
- The MAC address of the InfiniBand-capable device to be used for the InfiniBand network traffic. This hardware address field will be pre-filled if you have InfiniBand hardware installed.
- MTU
- Optionally sets a Maximum Transmission Unit (MTU) size to be used for packets to be sent over the InfiniBand connection.
13.8.10. Additional Resources
Installed Documentation
/usr/share/doc/initscripts-version/sysconfig.txt
— Describes configuration files and their directives.
Online Documentation
- https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt
- A description of the IPoIB driver. Includes references to relevant RFCs.