Chapter 7. Troubleshooting the Ceph iSCSI gateway (Limited Availability)
As a storage administrator, you can troubleshoot most of the common errors that occur when using the Ceph iSCSI gateway. These are some of the common errors that you might encounter:
- iSCSI login issues.
- VMware ESXi reporting various connection failures.
- Timeout errors.
This technology is Limited Availability. See the Deprecated functionality chapter for additional information.
7.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway.
- Verify the network connections.
7.2. Gathering information for lost connections causing storage failures on VMware ESXi
Collecting system and disk information helps determine which iSCSI target has lost a connection and is possibly causing storage failures. If needed, this information can also be provided to Red Hat Global Support Services to aid in troubleshooting any Ceph iSCSI gateway issues.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway, the iSCSI target.
- A running VMware ESXi environment, the iSCSI initiator.
- Root-level access to the VMware ESXi node.
Procedure
1. On the VMware ESXi node, open the kernel log:
[root@esx:~]# more /var/log/vmkernel.log
2. Gather information from the following error messages in the VMware ESXi kernel log:
Example
2022-05-30T11:07:07.570Z cpu32:66506)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000005 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]
From this message, make a note of the ISID number, the TARGET name, and the Target Portal Group Tag (TPGT) number. For this example, we have the following:

ISID: 00023d000005 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3

Example
2022-05-30T11:07:07.570Z cpu32:66506)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:4 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
From this message, make a note of the adapter channel (CH) number. For this example, we have the following:

vmhba64:CH:4 T:0

3. To find the remote address of the Ceph iSCSI gateway node:
[root@esx:~]# esxcli iscsi session connection list
From the command output, match the ISID value and the TARGET name value gathered previously, then make a note of the RemoteAddress value. For this example, we have the following:

Target: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw ISID: 00023d000003 RemoteAddress: 10.2.132.2
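If many sessions are listed, you can filter the saved output for the ISID noted earlier. This is a minimal sketch; it assumes the BusyBox grep available in the ESXi shell, and the exact field layout of the output can vary between ESXi releases:

[root@esx:~]# esxcli iscsi session connection list | grep -A 10 "00023d000003"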
Now, you can collect more information from the Ceph iSCSI gateway node to further troubleshoot the issue.

4. On the Ceph iSCSI gateway node mentioned by the RemoteAddress value, run an sosreport to gather system information:

[root@igw ~]# sosreport
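On systems where the sos package ships the newer command-line interface, the same collection can be started with the subcommand form. This is an assumption that depends on the installed sos version:

[root@igw ~]# sos report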
5. To find a disk that went into a dead state:

[root@esx:~]# esxcli storage nmp device list
From the command output, match the CH number and the TPGT number gathered previously, then make a note of the Device value. For this example, we have the following:

vmhba64:C4:T0 Device: naa.60014054a5d46697f85498e9a257567c TPG_id=3

With the device name, you can gather additional information on each iSCSI disk in a dead state.

6. Gather more information on the iSCSI disk:
Syntax
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt
esxcli storage core device list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_core_device_list.txt
Example
[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt
[root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt
7. Gather additional information on the VMware ESXi environment:
[root@esx:~]# esxcli storage vmfs extent list > /tmp/esxcli_storage_vmfs_extent_list.txt
[root@esx:~]# esxcli storage filesystem list > /tmp/esxcli_storage_filesystem_list.txt
[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt
[root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt
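Optionally, bundle the collected text files into one compressed archive before attaching them to a support case. This is a sketch; the archive name is illustrative, and it assumes the BusyBox tar in the ESXi shell supports gzip compression:

[root@esx:~]# tar czf /tmp/esx-iscsi-info.tgz /tmp/esxcli_*.txt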
8. Check for potential iSCSI login issues. See the following sections in this chapter on checking iSCSI login failures.
Additional Resources
- See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
- See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- How to open a Red Hat support case on the Customer Portal?
7.3. Checking iSCSI login failures because data was not sent
On the iSCSI gateway node, you might see generic login negotiation failure messages in the system log, by default /var/log/messages.
Example
Apr 2 23:17:05 osd1 kernel: rx_data returned 0, expecting 48.
Apr 2 23:17:05 osd1 kernel: iSCSI Login negotiation failed.
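To see how often the failure occurs, you can search the system log for the negotiation failure message. This is a minimal check; the log path matches the default noted above:

[root@igw ~]# grep -i "login negotiation failed" /var/log/messages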
While the system is in this state, start collecting system information as suggested in this procedure.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway, the iSCSI target.
- A running VMware ESXi environment, the iSCSI initiator.
- Root-level access to the Ceph iSCSI gateway node.
- Root-level access to the VMware ESXi node.
Procedure
1. Enable additional logging:
echo "iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control echo "target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control [root@igw ~]# echo "target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control
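You can optionally confirm that the debug statements were enabled by inspecting the dynamic debug control file, where enabled call sites carry the =p flag. This is a minimal check, assuming debugfs is mounted at its default location:

[root@igw ~]# grep "\[iscsi_target_mod\]" /sys/kernel/debug/dynamic_debug/control | head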
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Wait a couple of minutes for the extra debugging information to populate the system log.
3. Disable the additional logging:
echo "iscsi_target_mod -p" > /sys/kernel/debug/dynamic_debug/control echo "target_core_mod -p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "iscsi_target_mod -p" > /sys/kernel/debug/dynamic_debug/control [root@igw ~]# echo "target_core_mod -p" > /sys/kernel/debug/dynamic_debug/control
4. Run an sosreport to gather system information:

[root@igw ~]# sosreport
5. Capture network traffic on the Ceph iSCSI gateway and the VMware ESXi nodes simultaneously:
Syntax
tcpdump -s 0 -i NETWORK_INTERFACE -w OUTPUT_FILE_PATH
Example
[root@igw ~]# tcpdump -s 0 -i eth0 -w /tmp/igw-eth0-tcpdump.pcap
Note: Look for traffic on port 3260.
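To keep the capture small, you can restrict it to iSCSI traffic by appending a pcap filter for port 3260. This is a sketch; the interface and output file names are illustrative:

[root@igw ~]# tcpdump -s 0 -i eth0 -w /tmp/igw-eth0-iscsi.pcap port 3260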
Network packet capture files can be large, so compress the tcpdump output from the iSCSI target and initiators before uploading any files to Red Hat Global Support Services:

Syntax
gzip OUTPUT_FILE_PATH
Example
[root@igw ~]# gzip /tmp/igw-eth0-tcpdump.pcap
6. Gather additional information on the VMware ESXi environment:
[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt
[root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt
7. List and collect more information on each iSCSI disk:
Syntax
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt
Example
[root@esx:~]# esxcli storage nmp device list
[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt
[root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt
Additional Resources
- See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
- See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- See Red Hat’s Knowledgebase solution on How to capture network packets with tcpdump? for more information.
- How to open a Red Hat support case on the Customer Portal?
7.4. Checking iSCSI login failures because of a timeout or not able to find a portal group
On the iSCSI gateway node, you might see timeout or unable to locate a target portal group messages in the system log, by default /var/log/messages.
Example
Mar 28 00:29:01 osd2 kernel: iSCSI Login timeout on Network Portal 10.2.132.2:3260
or
Example
Mar 23 20:25:39 osd1 kernel: Unable to locate Target Portal Group on iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
While the system is in this state, start collecting system information as suggested in this procedure.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway.
- Root-level access to the Ceph iSCSI gateway node.
Procedure
1. Dump the list of waiting tasks and write it to a file:
[root@igw ~]# dmesg -c ; echo w > /proc/sysrq-trigger ; dmesg -c > /tmp/waiting-tasks.txt
2. Review the list of waiting tasks for the following messages (see the example search after this list):
- iscsit_tpg_disable_portal_group
- core_tmr_abort_task
- transport_generic_free_cmd
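For example, you can search the saved file for all three strings at once. This is a minimal sketch; the file path matches the command in the previous step:

[root@igw ~]# grep -E 'iscsit_tpg_disable_portal_group|core_tmr_abort_task|transport_generic_free_cmd' /tmp/waiting-tasks.txt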
If any of these messages appear in the waiting task list, it indicates that something went wrong with the tcmu-runner service. Perhaps the tcmu-runner service was not restarted properly, or the tcmu-runner service crashed.

3. Verify if the tcmu-runner service is running:

[root@igw ~]# systemctl status tcmu-runner
4. If the tcmu-runner service is not running, stop the rbd-target-gw service before restarting the tcmu-runner service:

[root@igw ~]# systemctl stop rbd-target-gw
[root@igw ~]# systemctl stop tcmu-runner
[root@igw ~]# systemctl start tcmu-runner
[root@igw ~]# systemctl start rbd-target-gw

Important: Stopping the Ceph iSCSI gateway first prevents I/Os from getting stuck while the tcmu-runner service is down.
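After the restart, you can confirm that both services came back up. This is a quick check; systemctl is-active prints one state per unit:

[root@igw ~]# systemctl is-active tcmu-runner rbd-target-gw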
5. If the tcmu-runner service is running, this might be a new bug. Open a new Red Hat support case.
Additional Resources
- See Red Hat’s Knowledgebase solution on creating an sosreport for Red Hat Global Support Services.
- See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- How to open a Red Hat support case on the Customer Portal?
7.5. Timeout command errors
The Ceph iSCSI gateway might report command timeout errors in the system log when a SCSI command fails.
Example
Mar 23 20:03:14 igw tcmu-runner: 2018-03-23 20:03:14.052 2513 [ERROR] tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.gw1lun011: Timing out cmd.
or
Example
Mar 23 20:03:14 igw tcmu-runner: tcmu_notify_conn_lost:176 rbd/rbd.gw1lun011: Handler connection lost (lock state 1)
What This Means
It is possible that there are other stuck tasks waiting to be processed, causing the SCSI command to time out because a response was not received in a timely manner. Another cause of these error messages can be an unhealthy Red Hat Ceph Storage cluster.
To Troubleshoot This Problem
- Check to see if there are waiting tasks that might be holding things up.
- Check the health of the Red Hat Ceph Storage cluster (see the example after this list).
- Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
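For the cluster health check, a status query from a Ceph node is usually enough. This is a minimal sketch; it assumes the node has a client admin keyring, or that you run the command inside a cephadm shell on containerized deployments:

[root@igw ~]# ceph health detail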
Additional Resources
- See the Checking iSCSI login failures because of a timeout or not able to find a portal group section of the Red Hat Ceph Storage Troubleshooting Guide for more details on how to view waiting tasks.
- See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
- See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.
7.6. Abort task errors
The Ceph iSCSI gateway might report abort task errors in the system log.
Example
Apr 1 14:23:58 igw kernel: ABORT_TASK: Found referenced iSCSI task_tag: 1085531
What This Means
It is possible that a network disruption, such as a failed switch or a bad port, is causing this type of error message. Another possibility is an unhealthy Red Hat Ceph Storage cluster.
To Troubleshoot This Problem
- Check for any network disruptions in the environment (see the example after this list).
- Check the health of the Red Hat Ceph Storage cluster.
- Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
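As one starting point for the network check, you can inspect the NIC error and drop counters on the gateway node. This is a sketch; eth0 is an assumed interface name, and the counter names vary by driver:

[root@igw ~]# ethtool -S eth0 | grep -iE 'err|drop'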
Additional Resources
- See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
- See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.
7.7. Additional Resources
- See the Red Hat Ceph Storage Block Device Guide for more details on the Ceph iSCSI gateway.
- See Chapter 3, Troubleshooting networking issues for details.