All of lore.kernel.org
 help / color / mirror / Atom feed
* Unable to recover from DataOut timeout while in ERL=0
@ 2022-07-13 19:04 Nick Couchman
  2022-07-13 20:40 ` Dmitry Bogdanov
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Couchman @ 2022-07-13 19:04 UTC (permalink / raw)
  To: target-devel

(Apologies if this ends up as a double-post, re-sending in Plain Text Mode)

Hello, everyone,
Hopefully this is the correct place to ask a general
usage/troubleshooting question regarding the Linux Target iSCSI
system.

I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
are configured with DRBD to synchronize data between two ESXi hosts,
and then present that disk back to the ESXi hosts via iSCSI. Basically
I'm attempting to achieve a vSAN-like configuration, where I have
"shared storage" backed by the underlying physical storage of the
individual hosts.

It's worth noting that, at present, I'm not using an Active/Active
configuration (DRBD dual-primary), but each of the VMs has the DRBD
configuration and iSCSI configuration, and I can fail the primary and
iSCSI service back and forth between the nodes.

I'm running into a situation where, once I get the system under
moderate I/O load (installing Linux in another VM, for example), I
start seeing the following errors in dmesg and/or journalctl on the
active node:

Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
connection for I_T Nexus
iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01

This gets repeated a couple of dozen or so times, and then I/O to the
iSCSI LUN from the ESXi host halts, the path to the LUN shows as
"Dead", and I have to reboot the active node and fail over to the
other node, at which point VMware picks back up and continues.

I've searched around the web to try to find assistance with this
error, but it doesn't seem all that common - in one case it appears to
be a bug from several years ago that was patched, and beyond that not
much relevant has turned up. Based on the error message, it almost
seems as if the target system is trying to say that it couldn't write
its data out to the disk in a timely fashion (which might be because
DRBD can't sync as quickly as is expected?), but it isn't all that
clear from the error.

I'm wondering if anyone can provide tips as to how to best mitigate
this - any tuning that can be done to change the time out, or throttle
the iSCSI traffic, or is it indicative of a lack of available RAM for
buffering (I'm not seeing a lot of RAM pressure, but possible I'm just
not catching it)?

Environment:
* CentOS 8 Stream
* Kernel: 4.18.0-394.el8.x86_64
* DRBD: 9.1.7
* 2 CPU, 4GB of RAM per VM
* Shared block devices is 1 TB

Thanks - Nick

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Unable to recover from DataOut timeout while in ERL=0
@ 2024-03-05 10:36 Holger Amberg
  2024-03-05 16:55 ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Holger Amberg @ 2024-03-05 10:36 UTC (permalink / raw)
  To: target-devel

Hello, everyone, 

i am using a pair of Ubuntu 22.04 AMD EPYC 9124 servers with DRBD Primary/Secondary and iscsi target configuration to serve VMWare ESXi hosts. Everything was working a few month, but i have had two complete freezes recently within 24hrs caused by stuck iscsi targets. Only a complete reboot or failover to the other node can fix this issue. The problem is starting with „Unable to recover from DataOut timeout“ and then causing a set of many other issues errors, hung_task_timeouts, kernel stack traces at the end. Any idea what might cause this problem? 

Environment:
* Ubuntu 22.04
* Kernel: 5.15.0-94-generic x86_64
* DRBD: 9.2.7
* 1 CPU EPYC 9124, 32GB of RAM
* Asus RS500A-E12-RS12U/K14PA-U24
* 4 targets with 5 TB each backend by DRBD and LVM

Mar  4 18:23:08 km38033 kernel: [32853.081205] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:km38112-6f3c301b,i,0x00023d000001,iqn.2001-09.com.customer:target2,t,0x01

Mar  4 18:23:13 km38033 kernel: [32857.897105] ABORT_TASK: Found referenced iSCSI task_tag: 756708339
Mar  4 18:23:13 km38033 kernel: [32857.897144] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 756708339
Mar  4 18:23:14 km38033 kernel: [32859.382095] Detected MISCOMPARE at offset 20
Mar  4 18:23:21 km38033 kernel: [32866.464749] Detected MISCOMPARE at offset 56
Mar  4 18:23:22 km38033 kernel: [32867.481244] Detected MISCOMPARE at offset 13
Mar  4 18:23:23 km38033 kernel: [32868.000291] Detected MISCOMPARE at offset 13
Mar  4 18:23:23 km38033 kernel: [32868.335822] Detected MISCOMPARE at offset 13
Mar  4 18:23:24 km38033 kernel: [32869.176568] Detected MISCOMPARE at offset 13
Mar  4 18:23:28 km38033 kernel: [32873.047201] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 157801725
Mar  4 18:23:29 km38033 kernel: [32873.901843] Detected MISCOMPARE at offset 13
Mar  4 18:23:35 km38033 kernel: [32880.495731] Detected MISCOMPARE at offset 28
Mar  4 18:23:36 km38033 kernel: [32881.012692] Detected MISCOMPARE at offset 56
Mar  4 18:23:36 km38033 kernel: [32881.013909] Detected MISCOMPARE at offset 14
Mar  4 18:23:38 km38033 kernel: [32882.886672] Detected MISCOMPARE at offset 14
Mar  4 18:27:46 km38033 kernel: [33130.878094] ABORT_TASK: Found referenced iSCSI task_tag: 263906787
Mar  4 18:27:46 km38033 kernel: [33130.878118] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 263906787
Mar  4 18:27:47 km38033 kernel: [33131.879914] ABORT_TASK: Found referenced iSCSI task_tag: 756717268
Mar  4 18:27:47 km38033 kernel: [33131.879943] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 756717268
Mar  4 18:27:47 km38033 kernel: [33131.880389] ABORT_TASK: Found referenced iSCSI task_tag: 756717269
Mar  4 18:27:47 km38033 kernel: [33132.179400] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 756717269
Mar  4 18:27:47 km38033 kernel: [33132.180525] ABORT_TASK: Found referenced iSCSI task_tag: 756717304
Mar  4 18:27:47 km38033 kernel: [33132.180748] ABORT_TASK: Found referenced iSCSI task_tag: 209408054
Mar  4 18:27:47 km38033 kernel: [33132.180783] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 209408054
Mar  4 18:27:47 km38033 kernel: [33132.181353] ABORT_TASK: Found referenced iSCSI task_tag: 209408082
Mar  4 18:27:48 km38033 kernel: [33133.234828] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 756717304
Mar  4 18:27:48 km38033 kernel: [33133.234932] Detected MISCOMPARE at offset 20
Mar  4 18:27:49 km38033 kernel: [33133.568484] ABORT_TASK: Found referenced iSCSI task_tag: 18127
Mar  4 18:27:49 km38033 kernel: [33133.568507] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 18127
Mar  4 18:27:49 km38033 kernel: [33133.575026] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 209408082
Mar  4 18:27:49 km38033 kernel: [33133.845133] ABORT_TASK: Found referenced iSCSI task_tag: 1434712138
Mar  4 18:27:49 km38033 kernel: [33133.845159] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1434712138
Mar  4 18:27:49 km38033 kernel: [33133.845273] ABORT_TASK: Found referenced iSCSI task_tag: 157803718
Mar  4 18:27:49 km38033 kernel: [33133.845307] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 157803718
Mar  4 18:27:49 km38033 kernel: [33133.845344] ABORT_TASK: Found referenced iSCSI task_tag: 286437689
Mar  4 18:27:49 km38033 kernel: [33133.845522] ABORT_TASK: Found referenced iSCSI task_tag: 157803719
Mar  4 18:27:49 km38033 kernel: [33133.845531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 286437689
Mar  4 18:27:49 km38033 kernel: [33133.845967] ABORT_TASK: Found referenced iSCSI task_tag: 286437691
Mar  4 18:27:51 km38033 kernel: [33135.654498] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 286437691
Mar  4 18:27:51 km38033 kernel: [33135.654987] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 157803719
Mar  4 18:27:51 km38033 kernel: [33135.655087] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 286437697
Mar  4 18:27:52 km38033 kernel: [33137.235524] ABORT_TASK: Found referenced iSCSI task_tag: 263906810
Mar  4 18:27:52 km38033 kernel: [33137.408856] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 263906810
Mar  4 18:27:52 km38033 kernel: [33137.409000] Detected MISCOMPARE at offset 56
Mar  4 18:27:52 km38033 kernel: [33137.409320] ABORT_TASK: Found referenced iSCSI task_tag: 263906811
Mar  4 18:27:52 km38033 kernel: [33137.409522] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 263906811
Mar  4 18:27:53 km38033 kernel: [33137.584808] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33137.584843] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33137.584877] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289566] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289721] Detected MISCOMPARE at offset 56
Mar  4 18:27:53 km38033 kernel: [33138.289903] Detected MISCOMPARE at offset 56
Mar  4 18:27:57 km38033 kernel: [33141.720127] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1434712401
Mar  4 18:27:58 km38033 kernel: [33142.940133] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.323760] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.323965] Detected MISCOMPARE at offset 56
Mar  4 18:27:59 km38033 kernel: [33144.498569] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.244787] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.246893] Detected MISCOMPARE at offset 56
Mar  4 18:28:01 km38033 kernel: [33146.246931] Detected MISCOMPARE at offset 56
Mar  4 18:28:02 km38033 kernel: [33147.476180] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:km38116-369efe55,i,0x00023d000001,iqn.2001-09.com.customer:target1,t,0x01
Mar  4 18:39:28 km38033 kernel: [33833.544517] INFO: task iscsi_np:27427 blocked for more than 241 seconds.
Mar  4 18:39:28 km38033 kernel: [33833.544952]       Tainted: G           OE     5.15.0-94-generic #104-Ubuntu
Mar  4 18:39:28 km38033 kernel: [33833.545254] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 18:39:28 km38033 kernel: [33833.545547] task:iscsi_np        state:D stack:    0 pid:27427 ppid:     2 flags:0x00004004
Mar  4 18:39:28 km38033 kernel: [33833.545550] Call Trace:
Mar  4 18:39:28 km38033 kernel: [33833.545552]  <TASK>
Mar  4 18:39:28 km38033 kernel: [33833.545554]  __schedule+0x24e/0x590
Mar  4 18:39:28 km38033 kernel: [33833.545560]  schedule+0x69/0x110
Mar  4 18:39:28 km38033 kernel: [33833.545562]  schedule_timeout+0x105/0x140
Mar  4 18:39:28 km38033 kernel: [33833.545564]  ? __kmalloc_track_caller+0x181/0x340
Mar  4 18:39:28 km38033 kernel: [33833.545569]  ? iscsi_update_param_value+0x28/0x70 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545578]  __wait_for_common+0xae/0x150
Mar  4 18:39:28 km38033 kernel: [33833.545579]  ? usleep_range_state+0x90/0x90
Mar  4 18:39:28 km38033 kernel: [33833.545581]  wait_for_completion+0x24/0x30
Mar  4 18:39:28 km38033 kernel: [33833.545582]  iscsit_stop_session+0x18f/0x1b0 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545589]  iscsi_check_for_session_reinstatement+0x1df/0x280 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545596]  iscsi_target_handle_csg_one+0x128/0x150 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545603]  iscsi_target_do_login+0x71/0xd0 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545609]  iscsi_target_start_negotiation+0x56/0x100 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545615]  __iscsi_target_login_thread+0x2f0/0x620 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545621]  iscsi_target_login_thread+0x24/0x60 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545627]  ? __iscsi_target_login_thread+0x620/0x620 [iscsi_target_mod]
Mar  4 18:39:28 km38033 kernel: [33833.545633]  kthread+0x12a/0x150
Mar  4 18:39:28 km38033 kernel: [33833.545636]  ? set_kthread_struct+0x50/0x50
Mar  4 18:39:28 km38033 kernel: [33833.545638]  ret_from_fork+0x22/0x30
Mar  4 18:39:28 km38033 kernel: [33833.545641]  </TASK>
Mar  4 18:41:29 km38033 kernel: [33954.376305] INFO: task iscsi_trx:588744 blocked for more than 120 seconds.
Mar  4 18:41:29 km38033 kernel: [33954.376560]       Tainted: G           OE     5.15.0-94-generic #104-Ubuntu
Mar  4 18:41:29 km38033 kernel: [33954.376811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 18:41:29 km38033 kernel: [33954.377065] task:iscsi_trx       state:D stack:    0 pid:588744 ppid:     2 flags:0x00004004
Mar  4 18:41:29 km38033 kernel: [33954.377067] Call Trace:
Mar  4 18:41:29 km38033 kernel: [33954.377067]  <TASK>
Mar  4 18:41:29 km38033 kernel: [33954.377068]  __schedule+0x24e/0x590
Mar  4 18:41:29 km38033 kernel: [33954.377070]  schedule+0x69/0x110
Mar  4 18:41:29 km38033 kernel: [33954.377071]  schedule_timeout+0x87/0x140
Mar  4 18:41:29 km38033 kernel: [33954.377072]  ? __bpf_trace_tick_stop+0x20/0x20
Mar  4 18:41:29 km38033 kernel: [33954.377075]  __wait_for_common+0xae/0x150
Mar  4 18:41:29 km38033 kernel: [33954.377076]  ? usleep_range_state+0x90/0x90
Mar  4 18:41:29 km38033 kernel: [33954.377078]  wait_for_completion_timeout+0x1d/0x30
Mar  4 18:41:29 km38033 kernel: [33954.377079]  __transport_wait_for_tasks+0xd3/0x140 [target_core_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377088]  transport_generic_free_cmd+0x14c/0x190 [target_core_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377096]  iscsit_free_cmd+0x58/0xd0 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377103]  iscsit_release_commands_from_conn+0x1cd/0x210 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377109]  iscsit_close_connection+0x3bf/0x720 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377115]  iscsit_take_action_for_connection_exit+0x86/0x110 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377122]  iscsi_target_rx_thread+0xba/0x110 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377128]  ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
Mar  4 18:41:29 km38033 kernel: [33954.377134]  kthread+0x12a/0x150
Mar  4 18:41:29 km38033 kernel: [33954.377135]  ? set_kthread_struct+0x50/0x50
Mar  4 18:41:29 km38033 kernel: [33954.377137]  ret_from_fork+0x22/0x30
Mar  4 18:41:29 km38033 kernel: [33954.377139]  </TASK>

Many thanks in advance.

Best regards,
Holger Amberg


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-03-05 16:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-07-13 19:04 Unable to recover from DataOut timeout while in ERL=0 Nick Couchman
2022-07-13 20:40 ` Dmitry Bogdanov
2022-07-14  1:47   ` Nick Couchman
2022-07-26 11:12     ` Dmitry Bogdanov
2022-07-26 11:21       ` Nick Couchman
  -- strict thread matches above, loose matches on Subject: below --
2024-03-05 10:36 Holger Amberg
2024-03-05 16:55 ` Mike Christie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.