Chapter 7. Troubleshooting the Ceph iSCSI gateway (Limited Availability)

7.1. Prerequisites

A running Red Hat Ceph Storage cluster.
A running Ceph iSCSI gateway.
Verify the network connections.

7.2. Gathering information for lost connections causing storage failures on VMware ESXi

Collecting system and disk information helps determine which iSCSI target has lost a connection and is possibly causing storage failures. If needed, you can also provide this information to Red Hat Global Support Services to help troubleshoot any Ceph iSCSI gateway issues.

Prerequisites

A running Red Hat Ceph Storage cluster.
A running Ceph iSCSI gateway, the iSCSI target.
A running VMware ESXi environment, the iSCSI initiator.
Root-level access to the VMware ESXi node.

Procedure

On the VMware ESXi node, open the kernel log:
```
[root@esx:~]# more /var/log/vmkernel.log
```

Gather information from the following error messages in the VMware ESXi kernel log:

Example

2022-05-30T11:07:07.570Z cpu32:66506)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000005 TARGET:
iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]

From this message, make a note of the ISID number, the TARGET name, and the Target Portal Group Tag (TPGT) number. For this example, we have the following:

ISID: 00023d000005
TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
TPGT: 3
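
As a sketch of this step, the following shell function pulls the ISID, TARGET, and TPGT values out of every session failure message in a log file. The function name list_failed_sessions is hypothetical, and the pattern assumes that each message occupies a single line in the actual log file, unlike the wrapped example above. It uses only sed and sort, which are assumed to be available in the ESXi shell.

```shell
# Print "ISID TARGET TPGT" once for each distinct failed session.
# list_failed_sessions is a hypothetical helper name, not an ESXi command.
# Usage: list_failed_sessions /var/log/vmkernel.log
list_failed_sessions() {
    log_file="${1:-/var/log/vmkernel.log}"
    # Capture the three values that follow the ISID:, TARGET:, and TPGT: labels.
    sed -n 's/.*iscsivmk_ConnRxNotifyFailure: Sess \[ISID: *\([^ ]*\) *TARGET: *\([^ ]*\) *TPGT: *\([0-9]*\).*/\1 \2 \3/p' "$log_file" | sort -u
}
```

Repeated messages for the same session collapse into one line, which keeps the list short when a connection has been failing for a while.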

Example

2022-05-30T11:07:07.570Z cpu32:66506)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure: vmhba64:CH:4 T:0 CN:0: Connection rx
notifying failure: Failed to Receive. State=Bound

From this message, make a note of the adapter channel (CH) number. For this example, we have the following:

vmhba64:CH:4 T:0

Find the remote address of the Ceph iSCSI gateway node:

[root@esx:~]# esxcli iscsi session connection list

Example

...
vmhba64,iqn.2017-12.com.redhat.iscsi-gw:ceph-igw,00023d000003,0
   Adapter: vmhba64
   Target: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw 
   ISID: 00023d000003 
   CID: 0
   DataDigest: NONE
   HeaderDigest: NONE
   IFMarker: false
   IFMarkerInterval: 0
   MaxRecvDataSegmentLength: 131072
   MaxTransmitDataSegmentLength: 262144
   OFMarker: false
   OFMarkerInterval: 0
   ConnectionAddress: 10.2.132.2
   RemoteAddress: 10.2.132.2 
   LocalAddress: 10.2.128.77
   SessionCreateTime: 03/28/18 21:45:19
   ConnectionCreateTime: 03/28/18 21:45:19
   ConnectionStartTime: 03/28/18 21:45:19
   State: xpt_wait
...

From the command output, match the ISID value and the TARGET name value gathered previously, then make a note of the RemoteAddress value. From this example, we have the following:

Target: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
ISID: 00023d000003
RemoteAddress: 10.2.132.2
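
The lookup above can be sketched as a small awk filter that reads the connection list from standard input. The function name find_remote_address is hypothetical, and the filter assumes the field order shown in the example output, where the Target and ISID lines of each connection appear before its RemoteAddress line.

```shell
# Print the RemoteAddress of every connection that matches a target and ISID.
# find_remote_address is a hypothetical helper name, not an esxcli subcommand.
# Usage: esxcli iscsi session connection list | find_remote_address TARGET ISID
find_remote_address() {
    target="$1"
    isid="$2"
    awk -v target="$target" -v isid="$isid" '
        $1 == "Target:"        { t = $2 }   # remember the current target name
        $1 == "ISID:"          { i = $2 }   # remember the current session ID
        $1 == "RemoteAddress:" { if (t == target && i == isid) print $2 }
    '
}
```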

Now, you can collect more information from the Ceph iSCSI gateway node to further troubleshoot the issue.

On the Ceph iSCSI gateway node identified by the RemoteAddress value, run an sosreport to gather system information:
```
[root@igw ~]# sosreport
```

Find a disk that went into a dead state:

[root@esx:~]# esxcli storage nmp device list

Example

...
iqn.1998-01.com.vmware:d04-nmgjd-pa-zyc-sv039-rh2288h-xnh-732d78fd-00023d000004,iqn.2017-12.com.redhat.iscsi-gw:ceph-igw,t,3-naa.60014054a5d46697f85498e9a257567c
   Runtime Name: vmhba64:C4:T0:L4 
   Device: naa.60014054a5d46697f85498e9a257567c 
   Device Display Name: LIO-ORG iSCSI Disk
(naa.60014054a5d46697f85498e9a257567c)
   Group State: dead 
   Array Priority: 0
   Storage Array Type Path Config:
{TPG_id=3,TPG_state=ANO,RTP_id=3,RTP_health=DOWN} 
   Path Selection Policy Path Config: {non-current path; rank: 0}
...

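From the command output, match the CH number and the TPGT number gathered previously, then make a note of the Device value. In the Runtime Name field, the channel appears as C4, which corresponds to CH:4 in the kernel log, and TPG_id corresponds to the TPGT number. For this example, we have the following: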

vmhba64:C4:T0
Device: naa.60014054a5d46697f85498e9a257567c
TPG_id=3
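
The matching described above can be sketched as a small awk filter. The function name list_dead_paths is hypothetical, and the filter assumes the field layout shown in the example output, where Runtime Name and Device appear before Group State. It prints the runtime name and device of every entry whose group state is dead, so that you can compare the channel and target against the values noted from the kernel log.

```shell
# Print "RUNTIME_NAME DEVICE" for every entry whose Group State is dead.
# list_dead_paths is a hypothetical helper name, not an esxcli subcommand.
# Usage: esxcli storage nmp device list | list_dead_paths
list_dead_paths() {
    awk '
        $1 == "Runtime" && $2 == "Name:" { runtime = $3 }
        $1 == "Device:"                  { device = $2 }
        $1 == "Group" && $2 == "State:" && $3 == "dead" { print runtime, device }
    '
}
```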

With the device name, you can gather some additional information on each iSCSI disk in a dead state.

Gather more information on the iSCSI disk:

Syntax

esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt
esxcli storage core device list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_core_device_list.txt

Example

[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt
[root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt

Gather additional information on the VMware ESXi environment:

[root@esx:~]# esxcli storage vmfs extent list > /tmp/esxcli_storage_vmfs_extent_list.txt
[root@esx:~]# esxcli storage filesystem list > /tmp/esxcli_storage_filesystem_list.txt
[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt
[root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt

Check for potential iSCSI login issues:
- Was the iSCSI login data not sent?
- Did the iSCSI login time out or fail to find a portal group?

Additional Resources

See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on How to open a Red Hat support case on the Customer Portal? for more information.

7.3. Checking iSCSI login failures because data was not sent

On the iSCSI gateway node, you might see generic login negotiation failure messages in the system log, by default /var/log/messages.

Example

Apr  2 23:17:05 osd1 kernel: rx_data returned 0, expecting 48.
Apr  2 23:17:05 osd1 kernel: iSCSI Login negotiation failed.

While the system is in this state, start collecting system information as suggested in this procedure.
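
To confirm that the gateway is still logging these failures before you start collecting data, a filter along the following lines might help. The function name recent_login_failures is hypothetical, and the default log path is the /var/log/messages location mentioned above.

```shell
# Show the most recent login negotiation failure messages from the system log.
# recent_login_failures is a hypothetical helper name.
# Usage: recent_login_failures [LOG_FILE] [MAX_LINES]
recent_login_failures() {
    log_file="${1:-/var/log/messages}"
    max_lines="${2:-10}"
    grep -E 'rx_data returned 0, expecting [0-9]+|iSCSI Login negotiation failed' "$log_file" | tail -n "$max_lines"
}
```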

Prerequisites

A running Red Hat Ceph Storage cluster.
A running Ceph iSCSI gateway, the iSCSI target.
A running VMware ESXi environment, the iSCSI initiator.
Root-level access to the Ceph iSCSI gateway node.
Root-level access to the VMware ESXi node.

Procedure

On the Ceph iSCSI gateway node, enable additional logging:

[root@igw ~]# echo "module iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "module target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control

Wait a couple of minutes for the extra debugging information to populate the system log.

Disable the additional logging:

[root@igw ~]# echo "module iscsi_target_mod -p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "module target_core_mod -p" > /sys/kernel/debug/dynamic_debug/control

Run an sosreport to gather system information:
```
[root@igw ~]# sosreport
```
Capture network traffic for the Ceph iSCSI gateway and the VMware ESXi nodes simultaneously:
Syntax
```
tcpdump -s0 -i NETWORK_INTERFACE -w OUTPUT_FILE_PATH
```
Example
```
[root@igw ~]# tcpdump -s 0 -i eth0 -w /tmp/igw-eth0-tcpdump.pcap
```
Note
Look for traffic on port 3260.
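
If a full capture is too large to handle, one option is to restrict the capture to iSCSI traffic with a capture filter. The following is a minimal sketch, not a replacement for the full capture shown above: the function name capture_iscsi and the DRY_RUN variable are hypothetical, while tcp port 3260 is a standard tcpdump filter expression.

```shell
# Capture only iSCSI traffic (TCP port 3260) on one network interface.
# capture_iscsi and DRY_RUN are hypothetical names used for this sketch.
# Set DRY_RUN=1 to print the tcpdump command instead of running it.
# Usage: capture_iscsi NETWORK_INTERFACE OUTPUT_FILE_PATH
capture_iscsi() {
    iface="$1"
    out_file="$2"
    # Build the command as positional parameters so each argument stays intact.
    set -- tcpdump -s 0 -i "$iface" -w "$out_file" tcp port 3260
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

The same filter expression can be applied when reading an existing capture file with the tcpdump -r option.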
Network packet capture files can be large, so compress the tcpdump output from the iSCSI target and initiators before uploading any files to Red Hat Global Support Services:

Syntax

gzip OUTPUT_FILE_PATH

Example

[root@igw ~]# gzip /tmp/igw-eth0-tcpdump.pcap

Gather additional information on the VMware ESXi environment:

[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt
[root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt

[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt
[root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt

Copy to Clipboard

Toggle word wrap

List and collect more information on each iSCSI disk:

Syntax

esxcli storage nmp device list
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt
esxcli storage core device list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_core_device_list.txt

Example

[root@esx:~]# esxcli storage nmp device list
[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt
[root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt

Additional Resources

See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on How to capture network packets with tcpdump? for more information.
See Red Hat’s Knowledgebase solution on How to open a Red Hat support case on the Customer Portal? for more information.

7.4. Checking iSCSI login failures because of a timeout or not able to find a portal group

On the iSCSI gateway node, you might see login timeout messages, or messages about being unable to locate a target portal group, in the system log, by default /var/log/messages.

Example

Mar 28 00:29:01 osd2 kernel: iSCSI Login timeout on Network Portal 10.2.132.2:3260

or

Example

Mar 23 20:25:39 osd1 kernel: Unable to locate Target Portal Group on iqn.2017-12.com.redhat.iscsi-gw:ceph-igw

While the system is in this state, start collecting system information as suggested in this procedure.

Prerequisites

A running Red Hat Ceph Storage cluster.
A running Ceph iSCSI gateway.
Root-level access to the Ceph iSCSI gateway node.

Procedure

Dump the waiting tasks to the kernel log buffer and write them to a file:

[root@igw ~]# dmesg -c ; echo w > /proc/sysrq-trigger ; dmesg -c > /tmp/waiting-tasks.txt

Review the list of waiting tasks for the following messages:
- iscsit_tpg_disable_portal_group
- core_tmr_abort_task
- transport_generic_free_cmd
If any of these messages appear in the waiting task list, something has likely gone wrong with the tcmu-runner service: it might not have been restarted properly, or it might have crashed.
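
The review can be sketched as a short filter over the file written in the previous step. The function name check_waiting_tasks is hypothetical, and the default path is the /tmp/waiting-tasks.txt file used above. It counts how often each of the three names appears and prints nothing when none of them is present.

```shell
# Count occurrences of the three function names in the waiting task dump.
# check_waiting_tasks is a hypothetical helper name.
# Usage: check_waiting_tasks [TASKS_FILE]
check_waiting_tasks() {
    tasks_file="${1:-/tmp/waiting-tasks.txt}"
    grep -o -E 'iscsit_tpg_disable_portal_group|core_tmr_abort_task|transport_generic_free_cmd' "$tasks_file" \
        | sort | uniq -c | awk '{ print $2, $1 }'   # print "NAME COUNT"
}
```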
Check whether the tcmu-runner service is running:
```
[root@igw ~]# systemctl status tcmu-runner
```
If the tcmu-runner service is not running, then stop the rbd-target-gw service before restarting the tcmu-runner service:

[root@igw ~]# systemctl stop rbd-target-gw
[root@igw ~]# systemctl stop tcmu-runner
[root@igw ~]# systemctl start tcmu-runner
[root@igw ~]# systemctl start rbd-target-gw

Important

Stopping the Ceph iSCSI gateway first prevents IOs from getting stuck while the tcmu-runner service is down.

If the tcmu-runner service is running, then this might be a new bug. Open a new Red Hat support case.

Additional Resources

See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
See Red Hat’s Knowledgebase solution on How to open a Red Hat support case on the Customer Portal? for more information.

7.5. Timeout command errors

The Ceph iSCSI gateway might report command timeout errors in the system log when a SCSI command has failed.

Example

Mar 23 20:03:14 igw tcmu-runner: 2018-03-23 20:03:14.052 2513 [ERROR] tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.gw1lun011: Timing out cmd.

or

Example

Mar 23 20:03:14 igw tcmu-runner: tcmu_notify_conn_lost:176 rbd/rbd.gw1lun011: Handler connection lost (lock state 1)

What This Means

Other stuck tasks might be waiting to be processed, causing the SCSI command to time out because a response was not received in a timely manner. These error messages can also result from an unhealthy Red Hat Ceph Storage cluster.

To Troubleshoot This Problem

Check to see if there are waiting tasks that might be holding things up.
Check the health of the Red Hat Ceph Storage cluster.
Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
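
When narrowing down the problem, it can help to see which RBD images the timeouts affect. The following is a minimal sketch that counts the Timing out cmd messages per image, assuming the message format shown in the first example above and a system log at /var/log/messages. The function name count_timeouts_by_image is hypothetical, and the Handler connection lost messages are not counted.

```shell
# Count tcmu-runner "Timing out cmd" messages for each pool/image name.
# count_timeouts_by_image is a hypothetical helper name.
# Usage: count_timeouts_by_image [LOG_FILE]
count_timeouts_by_image() {
    log_file="${1:-/var/log/messages}"
    # Extract the pool/image token that precedes ": Timing out cmd".
    sed -n 's/.*tcmu_rbd_handle_timedout_cmd:[0-9]* \([^ ]*\): Timing out cmd.*/\1/p' "$log_file" \
        | sort | uniq -c | awk '{ print $2, $1 }'   # print "IMAGE COUNT"
}
```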

Additional Resources

See the Checking iSCSI login failures because of a timeout or not able to find a portal group section of the Red Hat Ceph Storage Troubleshooting Guide for more details on how to view waiting tasks.
See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.

7.6. Abort task errors

The Ceph iSCSI gateway might report abort task errors in the system log.

Example

Apr  1 14:23:58 igw kernel: ABORT_TASK: Found referenced iSCSI task_tag: 1085531

What This Means

It is possible that a network disruption, such as a failed switch or bad port, is causing this type of error message. Another possibility is an unhealthy Red Hat Ceph Storage cluster.

To Troubleshoot This Problem

Check for any network disruptions in the environment.
Check the health of the Red Hat Ceph Storage cluster.
Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
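
To line up abort task errors with known network events, it can be useful to see when they cluster. The following is a minimal sketch that counts ABORT_TASK messages per minute, assuming the syslog timestamp format shown in the example above. The function name abort_tasks_per_minute is hypothetical, and the output is sorted as text rather than chronologically.

```shell
# Count ABORT_TASK messages per minute in a syslog-style file.
# abort_tasks_per_minute is a hypothetical helper name.
# Usage: abort_tasks_per_minute [LOG_FILE]
abort_tasks_per_minute() {
    log_file="${1:-/var/log/messages}"
    awk '
        # Key on month, day, and HH:MM taken from the timestamp fields.
        /ABORT_TASK:/ { count[$1 " " $2 " " substr($3, 1, 5)]++ }
        END { for (k in count) print k, count[k] }
    ' "$log_file" | sort
}
```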

Additional Resources

See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.

7.7. Additional Resources

See the Red Hat Ceph Storage Block Device Guide for more details on the Ceph iSCSI gateway.
See Chapter 3, Troubleshooting networking issues for details.
