Re: Unable to recover from DataOut timeout while in ERL=0

From: Dmitry Bogdanov <d.bogdanov@yadro.com>
To: Nick Couchman <nick.e.couchman@gmail.com>
Cc: <target-devel@vger.kernel.org>
Subject: Re: Unable to recover from DataOut timeout while in ERL=0
Date: Wed, 13 Jul 2022 23:40:05 +0300
Message-ID: <20220713204005.GA6045@yadro.com>
In-Reply-To: <CAFjj603YVVF8jK9RS_Pe5d0YTEUkCWZ5EwdXsVGgjSQWNfU_Lw@mail.gmail.com>

Hi Nick,

On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> 
> (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> 
> Hello, everyone,
> Hopefully this is the correct place to ask a general
> usage/troubleshooting question regarding the Linux Target iSCSI
> system.
> 
> I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> are configured with DRBD to synchronize data between two ESXi hosts,
> and then present that disk back to the ESXi hosts via iSCSI. Basically
> I'm attempting to achieve a vSAN-like configuration, where I have
> "shared storage" backed by the underlying physical storage of the
> individual hosts.
> 
> It's worth noting that, at present, I'm not using an Active/Active
> configuration (DRBD dual-primary), but each of the VMs has the DRBD
> configuration and iSCSI configuration, and I can fail the primary and
> iSCSI service back and forth between the nodes.
> 
> I'm running into a situation where, once I get the system under
> moderate I/O load (installing Linux in another VM, for example), I
> start seeing the following errors in dmesg and/or journalctl on the
> active node:
> 
> Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> connection for I_T Nexus
> iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> 
> This gets repeated a couple of dozen or so times, and then I/O to the
> iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> "Dead", and I have to reboot the active node and fail over to the
> other node, at which point VMware picks back up and continues.
> 
> I've searched around the web to try to find assistance with this
> error, but it doesn't seem all that common - in one case it appears to
> be a bug from several years ago that was patched, and beyond that not
> much relevant has turned up. Based on the error message, it almost
> seems as if the target system is trying to say that it couldn't write
> its data out to the disk in a timely fashion (which might be because
> DRBD can't sync as quickly as is expected?), but it isn't all that
> clear from the error.
We have been encountering the same issue with ESXi. For some reason, it
sometimes does not send the I/O data (iSCSI DataOut PDUs) for a SCSI
WRITE command that it has already sent. Instead, it sends an ABORT for
that command. The Linux target core does not abort a SCSI command until
it has collected all of the command's I/O data. The iSCSI DataOut timer
then times out and triggers connection reinstatement.
But during that reinstatement, iSCSI hangs waiting for the aborted WRITE
command to complete. The unfinished logout prevents a new login from
the same initiator.
That condition can be resolved only by rebooting the target.
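
To make the sequence above easier to follow, here is a toy model of it in
Python. It is only a sketch of the behaviour as described in this message,
not the kernel code: the class, its fields, and the helper function are all
hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteCommand:
    """Hypothetical stand-in for a SCSI WRITE tracked by the target."""
    expected_bytes: int          # data length announced by the WRITE
    received_bytes: int = 0      # data collected so far via DataOut PDUs
    abort_requested: bool = False
    completed: bool = False

    def data_complete(self) -> bool:
        return self.received_bytes >= self.expected_bytes

    def request_abort(self) -> bool:
        """Record an ABORT; return True only if it takes effect now.

        Models the described behaviour: the abort is deferred while
        the command is still waiting for its data.
        """
        self.abort_requested = True
        if self.data_complete():
            self.completed = True
        return self.completed


def reinstatement_can_finish(commands) -> bool:
    """Reinstatement (in this model) waits for every command to complete."""
    return all(cmd.completed for cmd in commands)


# ESXi sends the WRITE, never sends the data, then sends an ABORT.
cmd = WriteCommand(expected_bytes=65536)
aborted_now = cmd.request_abort()

# The DataOut timer expires and reinstatement starts, but the WRITE
# never completes, so the old session cannot be torn down.
stuck = not reinstatement_can_finish([cmd])
```

In this model `aborted_now` is False and `stuck` is True, which corresponds
to the hang: the old session cannot be torn down, so the initiator cannot
log in again.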

> 
> I'm wondering if anyone can provide tips as to how to best mitigate
> this - any tuning that can be done to change the time out, or throttle
> the iSCSI traffic, or is it indicative of a lack of available RAM for
> buffering (I'm not seeing a lot of RAM pressure, but possible I'm just
> not catching it)?
> 
I can just send you a patch for the target that fixes the hang. With it,
ESXi reconnects to the target and continues working with it without a
reboot.
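
On the question of where the timeout lives: as far as I know, the DataOut
timer settings are exposed per initiator ACL in configfs. The sketch below
only reads them. It assumes configfs is mounted at /sys/kernel/config, that
explicit node ACLs exist (dynamically generated ACLs have no directory
there), and that the attribute names match your kernel. The function name is
my own. Changing these values would not by itself address the hang described
above.

```python
from pathlib import Path

# Assumed default location of the LIO iSCSI fabric in configfs.
CONFIGFS_ISCSI = Path("/sys/kernel/config/target/iscsi")

# Per-ACL attributes related to the DataOut timer and error recovery level.
ATTRS = ("dataout_timeout", "dataout_timeout_retries", "default_erl")


def read_acl_attribs(root=CONFIGFS_ISCSI):
    """Return {relative attrib dir: {attribute name: value}} for every ACL.

    Expected layout: <root>/<target iqn>/tpgt_<n>/acls/<initiator iqn>/attrib
    Attributes that are missing on a given kernel are simply skipped.
    """
    root = Path(root)
    result = {}
    for attrib_dir in sorted(root.glob("*/tpgt_*/acls/*/attrib")):
        values = {}
        for name in ATTRS:
            attr_file = attrib_dir / name
            if attr_file.is_file():
                values[name] = attr_file.read_text().strip()
        result[str(attrib_dir.relative_to(root))] = values
    return result
```

Run as root on the target, `read_acl_attribs()` returns one entry per
initiator ACL. The `root` parameter is there only so the function can be
pointed at a different directory tree for testing.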

> Environment:
> * CentOS 8 Stream
> * Kernel: 4.18.0-394.el8.x86_64
> * DRBD: 9.1.7
> * 2 CPU, 4GB of RAM per VM
> * Shared block device is 1 TB
> 
> Thanks - Nick


