Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout

From: James Smart <james.smart@broadcom.com>
To: Bart Van Assche <bvanassche@acm.org>,
	"Singh, Balbir" <sblbir@amazon.com>,
	"kbusch@kernel.org" <kbusch@kernel.org>
Cc: "axboe@fb.com" <axboe@fb.com>,
	"sblbir@amzn.com" <sblbir@amzn.com>, "hch@lst.de" <hch@lst.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"sagi@grimberg.me" <sagi@grimberg.me>
Subject: Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
Date: Tue, 17 Sep 2019 10:21:44 -0700
Message-ID: <79e20170-e7c2-279a-9787-41eed3d053fd@broadcom.com>
In-Reply-To: <ee9a3b5b-2572-5022-0322-e1e2ae38904b@acm.org>

On 9/16/2019 8:17 PM, Bart Van Assche wrote:
> On 9/16/19 7:56 PM, Singh, Balbir wrote:
>> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
>> [ ... ]
>> In my case I was doing a simple mirror (using mdadm across two nvme
>> devices) and when I get timeouts on one, I need to wait up to 128
>> seconds before switching over. Ideally I want this to be really fast
>> and drop the slow broken controller.
>>
>>> If you really need the driver to do spec non-compliant behavior, we
>>> have quirks for that.
>>
>> I like the quirks approach, but it assumes the timeout value is not
>> variable, but rather fixed by the quirk. I was attempting to really have
>> mirrored IO timeout quickly on a bad device.
>
> Other Linux kernel storage transports (FC, SRP) decouple the failover 
> timeout from the I/O timeout. See also the output of git grep -nH 
> fast_io_fail for the kernel source tree. See also the documentation of 
> fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. Maybe 
> we need something similar for NVMe?
>

Well, it's not really fast_io_fail that needs to be replicated. In fact,
when I looked at nvme-fc, I saw no need for fast_io_fail because it
didn't apply.

To understand:
With SCSI, we had the device "blocked" once a loss of connectivity to
the device was detected. This blocked state did not terminate i/o: we
let any i/o completions trickling in continue to finish, but we
certainly stopped new i/o from being started. I/O could continue to
time out, but in most cases a timeout during the loss of connectivity
just meant the timeout was rescheduled. The blocked state lasted as long
as the "device loss" timeout was running. Outstanding i/o wouldn't be
terminated by the LLDD until the final point when we gave up on the
device and tore it down, i.e. when the device loss timeout expired.

With multipathing, waiting for the device loss timeout was too long, so
we invented the fast-io-fail timeout. It starts at the same point as
devloss, inherently expires before devloss, and terminates all i/o to
the device when it expires. This allowed multipath to get the i/o back
faster than the actual device failure.
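
As a rough illustration of the SCSI timer relationship described above,
here is a minimal Python sketch. It only models the ordering of events
after connectivity is lost at t=0. The function name, parameter names,
and dictionary keys are hypothetical and are not kernel interfaces.

```python
from typing import Optional


def scsi_fc_events(dev_loss_tmo: int,
                   fast_io_fail_tmo: Optional[int] = None) -> dict:
    """Model event times (seconds) after connectivity loss at t=0.

    Hypothetical helper for illustration only; not a kernel interface.
    """
    # Assumption taken from the text: fast-io-fail expires before devloss.
    if fast_io_fail_tmo is not None and fast_io_fail_tmo >= dev_loss_tmo:
        raise ValueError("fast_io_fail_tmo must expire before dev_loss_tmo")

    # Without fast-io-fail, outstanding i/o is only terminated at devloss.
    if fast_io_fail_tmo is None:
        io_terminated_at = dev_loss_tmo
    else:
        io_terminated_at = fast_io_fail_tmo

    return {
        "new_io_blocked_at": 0,
        "outstanding_io_terminated_at": io_terminated_at,
        "device_torn_down_at": dev_loss_tmo,
    }
```

With only a device loss timeout, multipath gets its i/o back when the
device is torn down. With a shorter fast-io-fail timeout, it gets the
i/o back earlier while the device itself stays around until devloss.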

With NVME-FC, there is a similar behavior to the "blocked" state, which
is the reconnecting state. When loss of connectivity is determined, the
controller goes through an implicit reset, which terminates all
outstanding i/o, and then enters a reconnect loop that retries
connections until an overall timer expires. That timer is known
generically as ctrl_loss_tmo; it applies to all fabric types and
defaults to 60s. FC additionally adds in the "device loss" tmo known by
SCSI (the FC device may be both SCSI and NVME and should use the same
value) and expires on the minimum of those two timeout values. The fact
that the controller reset terminates all outstanding i/o, which is true
on any fabric transport, means the fast_io_fail timeout isn't needed.
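
To make the NVME-FC case concrete, here is a minimal Python sketch of
the timing described above. It assumes, as a simplification, that the
implicit reset happens at t=0, the moment connectivity loss is
determined. The function name and dictionary keys are hypothetical and
are not driver interfaces.

```python
def nvme_fc_events(ctrl_loss_tmo: int, dev_loss_tmo: int) -> dict:
    """Model event times (seconds) after connectivity loss at t=0.

    Hypothetical helper for illustration only; not a driver interface.
    """
    return {
        # The implicit controller reset terminates outstanding i/o
        # immediately, so no separate fast-io-fail timer is involved.
        "outstanding_io_terminated_at": 0,
        # FC stops reconnecting at the smaller of the two timeouts.
        "reconnect_gives_up_at": min(ctrl_loss_tmo, dev_loss_tmo),
    }
```

For example, nvme_fc_events(60, 30) gives up reconnecting at 30 seconds,
yet outstanding i/o was already returned at the reset.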

So what seems to be talked about in this thread is how the fabric
detects device connectivity loss. FC has its nameserver, so detection
is automatic. The other transports don't have such a thing, unless it's
TCP connection timeout failures or similar. Detecting connectivity loss
is supposed to be the job of the keep alive timeout, so I would look at
that area to see how it should be manipulated.

-- james
