From: James Smart <james.smart@broadcom.com>
To: Bart Van Assche <bvanassche@acm.org>,
"Singh, Balbir" <sblbir@amazon.com>,
"kbusch@kernel.org" <kbusch@kernel.org>
Cc: "axboe@fb.com" <axboe@fb.com>,
"sblbir@amzn.com" <sblbir@amzn.com>, "hch@lst.de" <hch@lst.de>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
"sagi@grimberg.me" <sagi@grimberg.me>
Subject: Re: [PATCH v2 2/2] nvme/host/core: Allow overriding of wait_ready timeout
Date: Tue, 17 Sep 2019 10:21:44 -0700
Message-ID: <79e20170-e7c2-279a-9787-41eed3d053fd@broadcom.com>
In-Reply-To: <ee9a3b5b-2572-5022-0322-e1e2ae38904b@acm.org>
On 9/16/2019 8:17 PM, Bart Van Assche wrote:
> On 9/16/19 7:56 PM, Singh, Balbir wrote:
>> On Mon, 2019-09-16 at 19:14 -0600, Keith Busch wrote:
>> [ ... ]
>> In my case I was doing a simple mirror (using mdadm across two nvme
>> devices) and when I get timeouts on one, I need to wait up to 128
>> seconds
>> before switching over. Ideally I want this to be really fast and drop
>> the slow broken controller.
>>
>>> If you really need the driver to do spec non-compliant behavior, we
>>> have
>>> quirks for that.
>>
>> I like the quirks approach, but it assumes the timeout value is not
>> variable, but rather fixed by the quirk. I was attempting to really have
>> mirrored IO timeout quickly on a bad device.
>
> Other Linux kernel storage transports (FC, SRP) decouple the failover
> timeout from the I/O timeout. See also the output of git grep -nH
> fast_io_fail for the kernel source tree. See also the documentation of
> fast_io_fail_tmo in https://linux.die.net/man/5/multipath.conf. Maybe
> we need something similar for NVMe?
>
Well, it's not really fast io fail that needs to be replicated. In
fact, when I looked at nvme-fc, I saw no need for fast_io_fail, as it
didn't apply.

To understand:

With SCSI, the device was "blocked" once a loss of connectivity to it
was detected. The blocked state did not terminate i/o: any completions
still trickling in were allowed to finish, but no new i/o was started.
I/O could still time out, but in most cases a timeout during the
loss-of-connectivity window just meant the timeout was rescheduled. The
blocked state lasted as long as the "device loss" timeout was running.
Outstanding i/o wasn't terminated by the lldd until the final point when
we gave up on the device and tore it down - the device loss timeout
expiration. With multipathing, waiting for the device loss timeout was
too long, so we invented the fast-io-fail timeout. It starts at the same
point as devloss, inherently expires before devloss, and terminates all
i/o to the device. This lets multipath get the io back sooner than
actual device failure.
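
The SCSI timer relationship described above can be sketched as a toy
model. This is only an illustration, not kernel code: the class name,
the state names, and the 60s/5s values in the usage note are
hypothetical, and dev_loss_tmo is used here as a label for the "device
loss" timeout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockedPort:
    """Toy model of a SCSI device that lost connectivity at t = 0.

    Both timers start at the moment connectivity loss is detected.
    All names and states here are hypothetical, for illustration only.
    """

    dev_loss_tmo: float                       # seconds until teardown
    fast_io_fail_tmo: Optional[float] = None  # seconds until i/o is failed

    def __post_init__(self) -> None:
        # The text says fast-io-fail inherently expires before devloss.
        if (self.fast_io_fail_tmo is not None
                and self.fast_io_fail_tmo >= self.dev_loss_tmo):
            raise ValueError("fast_io_fail_tmo must expire before dev_loss_tmo")

    def io_terminated_at(self) -> float:
        """Time at which outstanding i/o is handed back to the upper layer."""
        if self.fast_io_fail_tmo is not None:
            return self.fast_io_fail_tmo
        # Without fast-io-fail, i/o is only terminated at device teardown.
        return self.dev_loss_tmo

    def state(self, t: float) -> str:
        """State of the device t seconds after connectivity loss."""
        if t >= self.dev_loss_tmo:
            return "removed"
        if t >= self.io_terminated_at():
            return "io-failed"
        return "blocked"
```

With a hypothetical dev_loss_tmo of 60 and fast_io_fail_tmo of 5,
io_terminated_at() returns 5.0, so multipath gets its i/o back long
before the device is torn down. Without fast_io_fail_tmo it returns the
full dev_loss_tmo.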

With NVME-FC, the behavior similar to the "blocked" state is the
reconnecting state. When loss of connectivity is determined, the
controller goes through an implicit reset, which terminates all
outstanding io, and then enters a reconnect phase that retries
connections until an overall timer expires. That timer is known
generically as ctrl_loss_tmo; it applies to all fabric types and
defaults to 60s. FC additionally brings in the "device loss" tmo known
from SCSI (the FC device may be both SCSI and NVME and should use the
same value) and expires on the minimum of those two timeout values.
Because the controller reset terminates all outstanding i/o, which is
true on any fabric transport, the fast_io_fail timeout isn't needed.
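
As a rough sketch of the NVME-FC timing described above: outstanding
i/o is terminated at the reset itself (t = 0), and reconnects are
retried until the smaller of the two timers expires. The function
names are hypothetical, and the fixed reconnect_delay retry interval
is an assumption of this sketch rather than something stated here.

```python
from typing import List, Optional

# The implicit controller reset terminates outstanding i/o right away,
# so in this model the i/o comes back at the moment loss is determined.
OUTSTANDING_IO_TERMINATED_AT = 0


def reconnect_deadline(ctrl_loss_tmo: int,
                       dev_loss_tmo: Optional[int] = None) -> int:
    """Seconds after connectivity loss at which reconnecting gives up.

    On FC the "device loss" timeout is also in play and the smaller
    value wins. Other fabrics only have ctrl_loss_tmo.
    """
    if dev_loss_tmo is None:
        return ctrl_loss_tmo
    return min(ctrl_loss_tmo, dev_loss_tmo)


def reconnect_attempt_times(deadline: int, reconnect_delay: int) -> List[int]:
    """Times of reconnect attempts made before the deadline.

    Assumes (hypothetically) one attempt every reconnect_delay seconds,
    with the first attempt one delay after connectivity loss.
    """
    if reconnect_delay <= 0:
        raise ValueError("reconnect_delay must be positive")
    times = []
    t = reconnect_delay
    while t < deadline:
        times.append(t)
        t += reconnect_delay
    return times
```

For example, reconnect_deadline(60, 30) picks the 30s device loss value
over a 60s ctrl_loss_tmo, and reconnect_attempt_times(30, 10) gives
attempts at 10s and 20s. In neither case does the upper layer wait for
its outstanding i/o, which is why a separate fast_io_fail timer adds
nothing here.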

So what this thread seems to be about is how the fabric detects loss of
device connectivity. FC has its nameserver, so detection is automatic.
The other transports don't have such a thing, apart from TCP connection
timeout failures or similar. Detecting connectivity loss is supposed to
be the job of the keep alive timeout, so I would look at that area to
see how it should be manipulated.

-- james