[PATCH 0/1] nvme: Ensure forward progress during Admin passthru

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

From: keith.busch@linux.intel.com (Keith Busch)
Subject: [PATCH 0/1] nvme: Ensure forward progress during Admin passthru
Date: Wed, 27 Jun 2018 13:08:03 -0600	[thread overview]
Message-ID: <20180627190803.GD9361@localhost.localdomain> (raw)
In-Reply-To: <8263513b-a714-b669-b485-28e894106607@grimberg.me>

On Sun, Jun 24, 2018@08:38:12PM +0300, Sagi Grimberg wrote:
> 
> > This small patch is an attempt to fix a rare boundary condition.
> > The only downside of the patch is scan work cannot happen concurrently,
> > but I don't think that's possible outside of someone issuing a sysctl.
> > 
> > If the controller goes down during the admin command, the command will
> > timeout, if the controller is in a poor enough state the reset commands
> > will time out. After the original admin command timeout, we unblock userland
> > which can start to revalidate the namespaces. In order todo that we
> > aquire the controller namespace sempahore as WRITE. We then issue an admin
> > command which times out, triggers a reset where we attempt to stop queues
> > in dev_disable(). Stopping queues attempts to down_read on the semaphore,
> > which is currently being held on a down write, waiting for this I/O to timeout.
> > This I/O cannot timeout until the semaphore is dropped from down_write, and
> > now we're deadlocked.
> > 
> > The patch does what nvme_remove_namespaces does, it takes over the controllers
> > namespace list under the down_write. Then modifies the private list, potentially
> > removing namespaces, and resplicing it once revalidation has occured. The whole
> > goal is to *not* hold a down write while issuing admin commands.
> 
> Keith, can you explain why we need to disable the device from the
> timeout handler instead of simply schedule a reset and return
> BLK_EH_RESET_TIMER?
> 
> The only place where I *think* can't schedule a reset and need to
> return BLK_EH_DONE is when the controller is already resetting, in which
> case we'd simply need to cancel the timed out command.
> 
> What am I missing?

The reset, removing, and dead states.

It was done this way for a few reasons, some of which have changed. We
didn't actualy have the nvme state machine back then, so we didn't know
what state we were in, which forced the remove case to have lots of odd
work-queue syncs.

When an IO times out because a controller is stuck, we'll often time
them out in batches. By disabling the controller inline with the timeout
handler, we handle every single timed out command in a single call to
the nvme timeout handler instead of 1000's of them, and since we were
already in a work-queue, we are conveniently in a context that can handle
a timeout inline with the notification.

The other reason was how the blk-mq timeout handler worked. It was
creating restrictions on when a driver can complete a request, and
returning BLK_EH_HANDLED was the only way we could deterministically
get the request actually completed.

We might be able to revisit the nvme-pci behavior now that blk-wq
timeout works very differently, but I'll have to wait to see if the
current behavior is settled or not.

next prev parent reply	other threads:[~2018-06-27 19:08 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-22 19:59 [PATCH 0/1] nvme: Ensure forward progress during Admin passthru Scott Bauer
2018-06-22 19:59 ` [PATCH 1/1] " Scott Bauer
2018-06-27 19:12   ` Keith Busch
2018-06-27 19:01     ` Scott Bauer
2018-06-27 20:27       ` Keith Busch
2018-06-27 20:49         ` Keith Busch
2018-06-24 17:38 ` [PATCH 0/1] " Sagi Grimberg
2018-06-27 19:08   ` Keith Busch [this message]
2018-06-28 17:10 ` [PATCH v2 1/1] " Scott Bauer
2018-06-28 19:16   ` Keith Busch
2018-06-28 19:19     ` Scott Bauer
2018-06-28 19:54       ` Keith Busch
2018-06-29 19:03 ` [PATCH v3 " Scott Bauer
2018-06-29 20:23   ` Keith Busch
2018-07-16 22:09     ` Keith Busch
2018-07-17 12:42       ` Christoph Hellwig
2018-07-18 11:26         ` Sagi Grimberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180627190803.GD9361@localhost.localdomain \
    --to=keith.busch@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).