From: Ming Lei <ming.lei@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>,
linux-nvme@lists.infradead.org, Yi Zhang <yi.zhang@redhat.com>,
Chao Leng <lengchao@huawei.com>, Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH] nvme: don't wait freeze during resetting
Date: Thu, 22 Sep 2022 16:36:51 +0800 [thread overview]
Message-ID: <YyweoxKabKafs/K+@T590> (raw)
In-Reply-To: <8aed1d8f-3bd7-e86a-837a-b2fc6d316d8b@grimberg.me>
On Wed, Sep 21, 2022 at 11:19:21AM +0300, Sagi Grimberg wrote:
>
>
> On 9/21/22 04:25, Ming Lei wrote:
> > On Tue, Sep 20, 2022 at 11:18:33AM +0300, Sagi Grimberg wrote:
> > >
> > > > First, it isn't necessary to call nvme_wait_freeze during reset.
> > > > For nvme-pci, if the tagset isn't allocated, there can't be any
> > > > inflight I/Os; otherwise blk_mq_update_nr_hw_queues can freeze and
> > > > wait on the queues.
> > > >
> > > > Second, since commit bdd6316094e0 ("block: Allow unfreezing of a queue
> > > > while requests are in progress"), it is fine to unfreeze queue without
> > > > draining inflight IOs.
> > > >
> > > > Also, the timeout handlers of both nvme-rdma and nvme-tcp provide
> > > > forward progress if the controller state isn't LIVE, so it is fine
> > > > to drop the nvme_wait_freeze_timeout() call.
> > >
> > > The rdma/tcp should probably be split to separate patches.
> > >
> > > >
> > > > Cc: Sagi Grimberg <sagi@grimberg.me>
> > > > Cc: Chao Leng <lengchao@huawei.com>
> > > > Cc: Keith Busch <kbusch@kernel.org>
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > drivers/nvme/host/apple.c | 1 -
> > > > drivers/nvme/host/pci.c | 1 -
> > > > drivers/nvme/host/rdma.c | 13 -------------
> > > > drivers/nvme/host/tcp.c | 13 -------------
> > > > 4 files changed, 28 deletions(-)
> > > >
> > > > diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> > > > index 5fc5ea196b40..9cd02b57fc85 100644
> > > > --- a/drivers/nvme/host/apple.c
> > > > +++ b/drivers/nvme/host/apple.c
> > > > @@ -1126,7 +1126,6 @@ static void apple_nvme_reset_work(struct work_struct *work)
> > > > anv->ctrl.queue_count = nr_io_queues + 1;
> > > > nvme_start_queues(&anv->ctrl);
> > > > - nvme_wait_freeze(&anv->ctrl);
> > > > blk_mq_update_nr_hw_queues(&anv->tagset, 1);
> > > > nvme_unfreeze(&anv->ctrl);
> > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > > index 98864b853eef..985b216907fc 100644
> > > > --- a/drivers/nvme/host/pci.c
> > > > +++ b/drivers/nvme/host/pci.c
> > > > @@ -2910,7 +2910,6 @@ static void nvme_reset_work(struct work_struct *work)
> > > > nvme_free_tagset(dev);
> > > > } else {
> > > > nvme_start_queues(&dev->ctrl);
> > > > - nvme_wait_freeze(&dev->ctrl);
> > > > if (!dev->ctrl.tagset)
> > > > nvme_pci_alloc_tag_set(dev);
> > > > else
> > > > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> > > > index 3100643be299..beb0d1a6a84d 100644
> > > > --- a/drivers/nvme/host/rdma.c
> > > > +++ b/drivers/nvme/host/rdma.c
> > > > @@ -986,15 +986,6 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
> > > > if (!new) {
> > > > nvme_start_queues(&ctrl->ctrl);
> > > > - if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
> > > > - /*
> > > > - * If we timed out waiting for freeze we are likely to
> > > > - * be stuck. Fail the controller initialization just
> > > > - * to be safe.
> > > > - */
> > > > - ret = -ENODEV;
> > > > - goto out_wait_freeze_timed_out;
> > > > - }
> > >
> > > So here is the description from the patch that introduced this:
> > > --
> > > nvme-rdma: fix reset hang if controller died in the middle of a reset
> > >
> > > If the controller becomes unresponsive in the middle of a reset, we
> > > will hang because we are waiting for the freeze to complete, but that
> > > cannot happen since we have commands that are inflight holding the
> > > q_usage_counter, and we can't blindly fail requests that times out.
> > >
> > > So give a timeout, and if we cannot wait for queue freeze before
> > > unfreezing, fail and have the error handling take care of how to
> > > proceed (either schedule a reconnect or remove the controller).
> > > --
> > >
> > > So if the controller becomes non-responsive between
> > > nvme_start_queues() and the freeze (with a full wait) that is done
> > > in blk_mq_update_nr_hw_queues(), we may hang blocking on I/O that
> > > was pending and requeued after nvme_start_queues().
> > >
> > > The problem is, that we cannot do any error recovery because the
> > > controller is in the middle of a reset/reconnect...
> > > So the code that you deleted was designed to detect this state, and
> > > reschedule another reconnect if the controller became non responsive.
> > >
> > > What is preventing this from happening now?
> >
> > Please see nvme_rdma_timeout() and nvme_tcp_timeout(): if the
> > controller state isn't LIVE, the request will be aborted.
>
> I agree with you. However, non-mpath devices will most likely retry
> the command rather than fail it as in the multipath case (see
> nvme_decide_disposition), which will cause the I/O to block.
>
> While it is arguable if non-mpath fabrics devices are important in any
> capacity, the design was that IO is not completed until the controller
> either successfully reconnects (and retried), or it disconnects
> (failed), or fast_io_fail_tmo expires.
>
> Hence for non-mpath controllers, the request(s) will time out and be
> aborted, but nvme will opt to retry them instead of completing them
> with a failure (at least until fast_io_fail_tmo expires, which can be
> arbitrarily long).
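To make that concrete, the retry-vs-fail decision described above can be
sketched roughly as follows. This is a simplified userspace model, not
the kernel code; the struct fields and the decide() helper are
illustrative stand-ins for the logic in nvme_decide_disposition():

```c
#include <stdbool.h>

/* Illustrative sketch only: names and fields are simplified from the
 * kernel's nvme_decide_disposition(); this is not the kernel code. */

enum disposition { COMPLETE, RETRY };

struct req_ctx {
	bool error;             /* request completed with an error status */
	bool mpath;             /* namespace is multipath-capable */
	bool fail_fast_expired; /* fast_io_fail_tmo has elapsed */
};

/* A failed request on a non-mpath controller is retried until
 * fast_io_fail_tmo expires, so the I/O stays blocked meanwhile. */
static enum disposition decide(const struct req_ctx *rq)
{
	if (!rq->error)
		return COMPLETE;        /* success: complete normally */
	if (rq->mpath)
		return COMPLETE;        /* mpath: fail over immediately */
	if (rq->fail_fast_expired)
		return COMPLETE;        /* give up, fail the request */
	return RETRY;                   /* non-mpath: keep retrying */
}
```

This is why aborting the request in the timeout handler alone is not
enough for non-mpath controllers: the abort feeds back into a retry.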
OK, I think it is better to change the behavior for non-mpath rdma/tcp;
I will remove it in the next version.
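For the freeze side, the behaviour guaranteed by commit bdd6316094e0
can be modeled roughly like this. It is a toy single-threaded sketch of
the q_usage_counter protocol (the real implementation uses a percpu
refcount and wait queues; the names here are invented for illustration):

```c
#include <stdbool.h>

/* Toy model of the queue freeze protocol, illustrating why it is
 * legal to unfreeze a queue while requests are still in flight. */

struct queue {
	int usage;   /* in-flight requests holding q_usage_counter */
	bool frozen; /* freeze started: new entries are blocked */
};

static bool queue_enter(struct queue *q)
{
	if (q->frozen)
		return false;   /* new I/O blocked until unfreeze */
	q->usage++;
	return true;
}

static void queue_exit(struct queue *q)
{
	q->usage--;
}

static void freeze_start(struct queue *q)
{
	q->frozen = true;       /* stop admitting new requests */
}

/* nvme_wait_freeze would block until usage reaches 0;
 * here we just report whether the queue has drained. */
static bool freeze_drained(const struct queue *q)
{
	return q->usage == 0;
}

/* Unfreeze does not require the queue to be drained first:
 * in-flight requests keep their references and complete later. */
static void unfreeze(struct queue *q)
{
	q->frozen = false;
}
```

So dropping the wait before unfreeze leaves the in-flight requests to
finish on their own once the queues are started again.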
thanks,
Ming
Thread overview: 7+ messages
2022-09-20 1:57 [PATCH] nvme: don't wait freeze during resetting Ming Lei
2022-09-20 8:18 ` Sagi Grimberg
2022-09-21 1:25 ` Ming Lei
2022-09-21 8:19 ` Sagi Grimberg
2022-09-22 8:36 ` Ming Lei [this message]
2022-09-22 14:22 ` Christoph Hellwig