5.10. RDMA 接続の検証
システム間で、特にレガシー Single Root I/O Virtualization (SR-IOV) イーサネットに関して、Remote Direct Memory Access (RDMA) 接続が機能していることを確認します。
手順
次のコマンドを使用して、各
rdma-workload-client
Pod に接続します。oc rsh -n default rdma-sriov-32-workload
$ oc rsh -n default rdma-sriov-32-workload
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
sh-5.1#
sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 次のコマンドを使用して、最初のワークロード Pod に割り当てられた IP アドレスを確認します。この例では、最初のワークロード Pod は RDMA テストサーバーです。
ip a
sh-5.1# ip a
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if3970: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.167/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:2a7/64 scope link valid_lft forever preferred_lft forever 3843: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 26:34:fd:53:a6:ec brd ff:ff:ff:ff:ff:ff altname enp55s0f0v5 inet 192.168.4.225/28 brd 192.168.4.239 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::2434:fdff:fe53:a6ec/64 scope link valid_lft forever preferred_lft forever sh-5.1#
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if3970: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.167/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:2a7/64 scope link valid_lft forever preferred_lft forever 3843: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 26:34:fd:53:a6:ec brd ff:ff:ff:ff:ff:ff altname enp55s0f0v5 inet 192.168.4.225/28 brd 192.168.4.239 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::2434:fdff:fe53:a6ec/64 scope link valid_lft forever preferred_lft forever sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow この Pod に割り当てられた RDMA サーバーの IP アドレスは、
net1
インターフェイスです。この例では、IP アドレスは192.168.4.225
です。ibstatus
コマンドを実行して、各 RDMA デバイスmlx5_x
に関連付けられているlink_layer
タイプ (Ethernet または Infiniband) を取得します。出力でstate
フィールドを確認することで、すべての RDMA デバイスのステータスもわかります。フィールドにはACTIVE
またはDOWN
のいずれかが表示されます。ibstatus
sh-5.1# ibstatus
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
Infiniband device 'mlx5_0' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_1' port 1 status: default gid: fe80:0000:0000:0000:e8eb:d303:0072:1415 base lid: 0xc sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: InfiniBand Infiniband device 'mlx5_2' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_3' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_4' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_5' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_6' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_7' port 1 status: default gid: fe80:0000:0000:0000:2434:fdff:fe53:a6ec base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_8' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_9' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet sh-5.1#
Infiniband device 'mlx5_0' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_1' port 1 status: default gid: fe80:0000:0000:0000:e8eb:d303:0072:1415 base lid: 0xc sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: InfiniBand Infiniband device 'mlx5_2' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_3' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_4' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_5' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_6' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_7' port 1 status: default gid: fe80:0000:0000:0000:2434:fdff:fe53:a6ec base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_8' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_9' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow ワーカーノード上の各 RDMA
mlx5
デバイスのlink_layer
を取得するために、ibstat
コマンドを実行します。ibstat | egrep "Port|Base|Link"
sh-5.1# ibstat | egrep "Port|Base|Link"
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Physical state: LinkUp Base lid: 12 Port GUID: 0xe8ebd30300721415 Link layer: InfiniBand Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x2434fdfffe53a6ec Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet sh-5.1#
Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Physical state: LinkUp Base lid: 12 Port GUID: 0xe8ebd30300721415 Link layer: InfiniBand Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x2434fdfffe53a6ec Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet Port 1: Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow RDMA 共有デバイスまたはホストデバイス用のワークロード Pod では、
mlx5_x
という名前の RDMA デバイスがすでに認識されており、通常はmlx5_0
またはmlx5_1
です。RDMA レガシー SR-IOV ワークロード Pod では、どの RDMA デバイスがどの Virtual Function (VF) サブインターフェイスに関連付けられているかを確認する必要があります。この情報は次のコマンドを使用して表示します。rdma link show
sh-5.1# rdma link show
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
link mlx5_0/1 state ACTIVE physical_state LINK_UP link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 12 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP link mlx5_2/1 state DOWN physical_state DISABLED link mlx5_3/1 state DOWN physical_state DISABLED link mlx5_4/1 state DOWN physical_state DISABLED link mlx5_5/1 state DOWN physical_state DISABLED link mlx5_6/1 state DOWN physical_state DISABLED link mlx5_7/1 state ACTIVE physical_state LINK_UP netdev net1 link mlx5_8/1 state DOWN physical_state DISABLED link mlx5_9/1 state DOWN physical_state DISABLED
link mlx5_0/1 state ACTIVE physical_state LINK_UP link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 12 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP link mlx5_2/1 state DOWN physical_state DISABLED link mlx5_3/1 state DOWN physical_state DISABLED link mlx5_4/1 state DOWN physical_state DISABLED link mlx5_5/1 state DOWN physical_state DISABLED link mlx5_6/1 state DOWN physical_state DISABLED link mlx5_7/1 state ACTIVE physical_state LINK_UP netdev net1 link mlx5_8/1 state DOWN physical_state DISABLED link mlx5_9/1 state DOWN physical_state DISABLED
Copy to Clipboard Copied! Toggle word wrap Toggle overflow この例では、RDMA デバイス名
mlx5_7
がnet1
インターフェイスに関連付けられています。この出力は、次のコマンドで RDMA 帯域幅テストを実行するために使用します。このテストでは、ワーカーノード間の RDMA 接続も検証します。次の
ib_write_bw
RDMA 帯域幅テストコマンドを実行します。/root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_7 -p 10000 --source_ip 192.168.4.225 --use_cuda=0 --use_cuda_dmabuf
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_7 -p 10000 --source_ip 192.168.4.225 --use_cuda=0 --use_cuda_dmabuf
Copy to Clipboard Copied! Toggle word wrap Toggle overflow ここでは、以下のようになります。
-
mlx5_7
RDMA デバイスを-d
スイッチで渡します。 -
RDMA サーバーを起動するための送信元 IP アドレスは
192.168.4.225
です。 -
--use_cuda=0
、--use_cuda_dmabuf
スイッチは、GPUDirect RDMA の使用を示します。
出力例
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************
Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
別のターミナルウィンドウを開き、RDMA テストクライアント Pod として機能する 2 番目のワークロード Pod で、
oc rsh
コマンドを実行します。oc rsh -n default rdma-sriov-33-workload
$ oc rsh -n default rdma-sriov-33-workload
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
sh-5.1#
sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 次のコマンドを使用して、
net1
インターフェイスから RDMA テストクライアント Pod の IP アドレスを取得します。ip a
sh-5.1# ip a
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if4139: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:83:01:d5 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.1.213/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:1d5/64 scope link valid_lft forever preferred_lft forever 4076: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 56:6c:59:41:ae:4a brd ff:ff:ff:ff:ff:ff altname enp55s0f0v0 inet 192.168.4.226/28 brd 192.168.4.239 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::546c:59ff:fe41:ae4a/64 scope link valid_lft forever preferred_lft forever sh-5.1#
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if4139: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:83:01:d5 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.1.213/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:1d5/64 scope link valid_lft forever preferred_lft forever 4076: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 56:6c:59:41:ae:4a brd ff:ff:ff:ff:ff:ff altname enp55s0f0v0 inet 192.168.4.226/28 brd 192.168.4.239 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::546c:59ff:fe41:ae4a/64 scope link valid_lft forever preferred_lft forever sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 次のコマンドを使用して、各 RDMA デバイス
mlx5_x
に関連付けられているlink_layer
タイプを取得します。ibstatus
sh-5.1# ibstatus
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
Infiniband device 'mlx5_0' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_1' port 1 status: default gid: fe80:0000:0000:0000:e8eb:d303:0072:09f5 base lid: 0xd sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: InfiniBand Infiniband device 'mlx5_2' port 1 status: default gid: fe80:0000:0000:0000:546c:59ff:fe41:ae4a base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_3' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_4' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_5' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_6' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_7' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_8' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_9' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet
Infiniband device 'mlx5_0' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_1' port 1 status: default gid: fe80:0000:0000:0000:e8eb:d303:0072:09f5 base lid: 0xd sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: InfiniBand Infiniband device 'mlx5_2' port 1 status: default gid: fe80:0000:0000:0000:546c:59ff:fe41:ae4a base lid: 0x0 sm lid: 0x0 state: 4: ACTIVE phys state: 5: LinkUp rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_3' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_4' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_5' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_6' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_7' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_8' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet Infiniband device 'mlx5_9' port 1 status: default gid: 0000:0000:0000:0000:0000:0000:0000:0000 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 200 Gb/sec (4X HDR) link_layer: Ethernet
Copy to Clipboard Copied! Toggle word wrap Toggle overflow オプション:
ibstat
コマンドを使用して、Mellanox カードのファームウェアバージョンを取得します。ibstat
sh-5.1# ibstat
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
CA 'mlx5_0' CA type: MT4123 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xe8ebd303007209f4 System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_1' CA type: MT4123 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xe8ebd303007209f5 System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 13 LMC: 0 SM lid: 1 Capability mask: 0xa651e848 Port GUID: 0xe8ebd303007209f5 Link layer: InfiniBand CA 'mlx5_2' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x566c59fffe41ae4a System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x546c59fffe41ae4a Link layer: Ethernet CA 'mlx5_3' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xb2ae4bfffe8f3d02 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_4' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x2a9967fffe8bf272 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_5' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x5aff2ffffe2e17e8 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_6' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x121bf1fffe074419 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_7' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xb22b16fffed03dd7 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_8' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x523800fffe16d105 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_9' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xd2b4a1fffebdc4a9 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet sh-5.1#
CA 'mlx5_0' CA type: MT4123 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xe8ebd303007209f4 System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_1' CA type: MT4123 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xe8ebd303007209f5 System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 13 LMC: 0 SM lid: 1 Capability mask: 0xa651e848 Port GUID: 0xe8ebd303007209f5 Link layer: InfiniBand CA 'mlx5_2' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x566c59fffe41ae4a System image GUID: 0xe8ebd303007209f4 Port 1: State: Active Physical state: LinkUp Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x546c59fffe41ae4a Link layer: Ethernet CA 'mlx5_3' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xb2ae4bfffe8f3d02 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_4' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x2a9967fffe8bf272 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_5' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x5aff2ffffe2e17e8 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_6' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x121bf1fffe074419 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_7' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xb22b16fffed03dd7 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_8' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0x523800fffe16d105 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet CA 'mlx5_9' CA type: MT4124 Number of ports: 1 Firmware version: 20.43.1014 Hardware version: 0 Node GUID: 0xd2b4a1fffebdc4a9 System image GUID: 0xe8ebd303007209f4 Port 1: State: Down Physical state: Disabled Rate: 200 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00010000 Port GUID: 0x0000000000000000 Link layer: Ethernet sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow クライアントワークロード Pod が使用する Virtual Function サブインターフェイスに関連付けられている RDMA デバイスを特定するために、次のコマンドを実行します。この例では、
net1
インターフェイスは RDMA デバイスmlx5_2
を使用しています。rdma link show
sh-5.1# rdma link show
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 出力例
link mlx5_0/1 state ACTIVE physical_state LINK_UP link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 13 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev net1 link mlx5_3/1 state DOWN physical_state DISABLED link mlx5_4/1 state DOWN physical_state DISABLED link mlx5_5/1 state DOWN physical_state DISABLED link mlx5_6/1 state DOWN physical_state DISABLED link mlx5_7/1 state DOWN physical_state DISABLED link mlx5_8/1 state DOWN physical_state DISABLED link mlx5_9/1 state DOWN physical_state DISABLED sh-5.1#
link mlx5_0/1 state ACTIVE physical_state LINK_UP link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 13 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev net1 link mlx5_3/1 state DOWN physical_state DISABLED link mlx5_4/1 state DOWN physical_state DISABLED link mlx5_5/1 state DOWN physical_state DISABLED link mlx5_6/1 state DOWN physical_state DISABLED link mlx5_7/1 state DOWN physical_state DISABLED link mlx5_8/1 state DOWN physical_state DISABLED link mlx5_9/1 state DOWN physical_state DISABLED sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow 次の
ib_write_bw
RDMA 帯域幅テストコマンドを実行します。/root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_2 -p 10000 --source_ip 192.168.4.226 --use_cuda=0 --use_cuda_dmabuf 192.168.4.225
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_2 -p 10000 --source_ip 192.168.4.226 --use_cuda=0 --use_cuda_dmabuf 192.168.4.225
Copy to Clipboard Copied! Toggle word wrap Toggle overflow ここでは、以下のようになります。
-
mlx5_2
RDMA デバイスを-d
スイッチで渡します。 -
送信元 IP アドレスは
192.168.4.226
で、RDMA サーバーの送信先 IP アドレスは192.168.4.225
です。 --use_cuda=0
、--use_cuda_dmabuf
スイッチは、GPUDirect RDMA の使用を示します。出力例
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 8909, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 using DMA-BUF for GPU buffer address at 0x7f8738600000 aligned at 0x7f8738600000 with aligned size 2097152 allocated GPU buffer of a 2097152 address at 0x23a7420 for type CUDA_MEM_DEVICE Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f8738600000, fd=40) for QP #0 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_2 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x012d PSN 0x3cb6d7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x012e PSN 0x90e0ac GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x012f PSN 0x153f50 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0130 PSN 0x5e0128 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0131 PSN 0xd89752 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0132 PSN 0xe5fc16 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0133 PSN 0x236787 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0134 PSN 0xd9273e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0135 PSN 0x37cfd4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0136 PSN 0x3bff8f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0137 PSN 0x81f2bd GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0138 PSN 0x575c43 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0139 PSN 0x6cf53d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013a PSN 0xcaaf6f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013b PSN 0x346437 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013c PSN 0xcc5865 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x026d PSN 0x359409 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x026e PSN 0xe387bf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x026f PSN 0x5be79d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0270 PSN 0x1b4b28 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0271 PSN 0x76a61b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0272 PSN 0x3d50e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0273 PSN 0x1b572c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0274 PSN 0x4ae1b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0275 PSN 0x5591b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0276 PSN 0xfa2593 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0277 PSN 0xd9473b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0278 PSN 0x2116b2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0279 PSN 0x9b83b6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027a PSN 0xa0822b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027b PSN 0x6d930d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027c PSN 0xb1a4d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10329004 0.00 180.47 0.344228 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f8738600000 destroying current CUDA Ctx sh-5.1#
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 8909, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 using DMA-BUF for GPU buffer address at 0x7f8738600000 aligned at 0x7f8738600000 with aligned size 2097152 allocated GPU buffer of a 2097152 address at 0x23a7420 for type CUDA_MEM_DEVICE Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f8738600000, fd=40) for QP #0 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_2 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x012d PSN 0x3cb6d7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x012e PSN 0x90e0ac GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x012f PSN 0x153f50 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0130 PSN 0x5e0128 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0131 PSN 0xd89752 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0132 PSN 0xe5fc16 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0133 PSN 0x236787 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0134 PSN 0xd9273e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0135 PSN 0x37cfd4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0136 PSN 0x3bff8f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0137 PSN 0x81f2bd GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0138 PSN 0x575c43 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x0139 PSN 0x6cf53d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013a PSN 0xcaaf6f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013b PSN 0x346437 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 local address: LID 0000 QPN 0x013c PSN 0xcc5865 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x026d PSN 0x359409 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x026e PSN 0xe387bf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x026f PSN 0x5be79d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0270 PSN 0x1b4b28 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0271 PSN 0x76a61b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0272 PSN 0x3d50e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0273 PSN 0x1b572c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0274 PSN 0x4ae1b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0275 PSN 0x5591b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0276 PSN 0xfa2593 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0277 PSN 0xd9473b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0278 PSN 0x2116b2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x0279 PSN 0x9b83b6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027a PSN 0xa0822b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027b PSN 0x6d930d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x027c PSN 0xb1a4d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10329004 0.00 180.47 0.344228 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f8738600000 destroying current CUDA Ctx sh-5.1#
Copy to Clipboard Copied! Toggle word wrap Toggle overflow テストが正常な場合は、期待される BW average と Mpps 単位の MsgRate が表示されます。
ib_write_bw
コマンドが完了すると、サーバー側の出力もサーバー Pod に表示されます。以下の例を参照してください。出力例
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************ Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 9226, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 using DMA-BUF for GPU buffer address at 0x7f447a600000 aligned at 0x7f447a600000 with aligned size 2097152 allocated GPU buffer of a 2097152 address at 0x2406400 for type CUDA_MEM_DEVICE Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f447a600000, fd=40) for QP #0 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_7 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x026d PSN 0x359409 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x026e PSN 0xe387bf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x026f PSN 0x5be79d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0270 PSN 0x1b4b28 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0271 PSN 0x76a61b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0272 PSN 0x3d50e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0273 PSN 0x1b572c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0274 PSN 0x4ae1b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0275 PSN 0x5591b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0276 PSN 0xfa2593 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0277 PSN 0xd9473b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0278 PSN 0x2116b2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0279 PSN 0x9b83b6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027a PSN 0xa0822b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027b PSN 0x6d930d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027c PSN 0xb1a4d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x012d PSN 0x3cb6d7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x012e PSN 0x90e0ac GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x012f PSN 0x153f50 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0130 PSN 0x5e0128 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0131 PSN 0xd89752 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0132 PSN 0xe5fc16 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0133 PSN 0x236787 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0134 PSN 0xd9273e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0135 PSN 0x37cfd4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0136 PSN 0x3bff8f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0137 PSN 0x81f2bd GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0138 PSN 0x575c43 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0139 PSN 0x6cf53d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013a PSN 0xcaaf6f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013b PSN 0x346437 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013c PSN 0xcc5865 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10329004 0.00 180.47 0.344228 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f447a600000 destroying current CUDA Ctx
WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************ Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 9226, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 using DMA-BUF for GPU buffer address at 0x7f447a600000 aligned at 0x7f447a600000 with aligned size 2097152 allocated GPU buffer of a 2097152 address at 0x2406400 for type CUDA_MEM_DEVICE Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f447a600000, fd=40) for QP #0 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_7 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x026d PSN 0x359409 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x026e PSN 0xe387bf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x026f PSN 0x5be79d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0270 PSN 0x1b4b28 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0271 PSN 0x76a61b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0272 PSN 0x3d50e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0273 PSN 0x1b572c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0274 PSN 0x4ae1b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0275 PSN 0x5591b5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0276 PSN 0xfa2593 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0277 PSN 0xd9473b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0278 PSN 0x2116b2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x0279 PSN 0x9b83b6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027a PSN 0xa0822b GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027b PSN 0x6d930d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 local address: LID 0000 QPN 0x027c PSN 0xb1a4d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225 remote address: LID 0000 QPN 0x012d PSN 0x3cb6d7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x012e PSN 0x90e0ac GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x012f PSN 0x153f50 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0130 PSN 0x5e0128 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0131 PSN 0xd89752 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0132 PSN 0xe5fc16 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0133 PSN 0x236787 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0134 PSN 0xd9273e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0135 PSN 0x37cfd4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0136 PSN 0x3bff8f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0137 PSN 0x81f2bd GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0138 PSN 0x575c43 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x0139 PSN 0x6cf53d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013a PSN 0xcaaf6f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013b PSN 0x346437 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 remote address: LID 0000 QPN 0x013c PSN 0xcc5865 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10329004 0.00 180.47 0.344228 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f447a600000 destroying current CUDA Ctx
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
-