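From: hch@lst.de (Christoph Hellwig)
To: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	Keith Busch <keith.busch@intel.com>,
	James Smart <james.smart@broadcom.com>,
	Hannes Reinecke <hare@suse.de>, Ewan Milne <emilne@redhat.com>,
	Max Gurtovoy <maxg@mellanox.com>,
	Linux NVMe Mailinglist <linux-nvme@lists.infradead.org>,
	Linux Kernel Mailinglist <linux-kernel@vger.kernel.org>
Subject: [PATCH 0/4] Rework NVMe abort handling
Date: Thu, 19 Jul 2018 16:23:55 +0200
Message-ID: <20180719142355.GA18800@lst.de>
In-Reply-To: <20180719141025.yveza2svhvc2r4lw@linux-x5ow.site>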

On Thu, Jul 19, 2018 at 04:10:25PM +0200, Johannes Thumshirn wrote:
> The problem I'm trying to solve here is really just single commands
> timing out because of i.e. a bad switch in between which causes frame
> loss somewhere.

And that is exactly the case where NVMe abort does not actually work
in any sensible way.

Remember that while NVMe guarantees ordered delivery inside a given
queue, it does not guarantee anything between multiple queues.

So now you have your buggy FC setup where an I/O command times out
because your switch delayed it for two hours due to a firmware bug.

After 30 seconds we send an abort over the admin queue, which happens
to pass through just fine.  The controller will tell you: no command
found as it has never seen it.
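
The race described above can be reproduced with a toy model. This is
only a sketch under stated assumptions, not kernel or controller code:
`deliver`, `Controller`, the queue names and the "abort:<name>" command
format are all hypothetical, and the delays are made-up numbers. Each
queue is modelled as a FIFO, and nothing couples the arrival times of
different queues.

```python
def deliver(events):
    """Return (arrival_time, queue, name) tuples in arrival order.

    events: list of (send_time, queue, fabric_delay, name).
    Within one queue a command never arrives before an earlier command
    sent on that same queue; across queues there is no such constraint.
    """
    last_arrival = {}
    arrivals = []
    for send_time, queue, delay, name in sorted(events):
        # Enforce FIFO order per queue only.
        arrival = max(send_time + delay, last_arrival.get(queue, 0))
        last_arrival[queue] = arrival
        arrivals.append((arrival, queue, name))
    return sorted(arrivals)


class Controller:
    """Hypothetical controller that only knows commands it has received."""

    def __init__(self):
        self.seen = set()

    def receive(self, name):
        if name.startswith("abort:"):
            target = name.split(":", 1)[1]
            return "aborted" if target in self.seen else "not found"
        self.seen.add(name)
        return "accepted"


# The I/O command is stuck in the switch for two hours (7200 s); the
# abort is sent 30 s later on the admin queue and passes straight through.
events = [
    (0, "io1", 7200, "write-42"),
    (30, "admin", 0, "abort:write-42"),
]
ctrl = Controller()
log = [(name, ctrl.receive(name)) for _, _, name in deliver(events)]
```

In this model the abort reaches the controller first and is answered
with "not found"; the write it was meant to cancel is accepted two
hours later.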

Now with the code following what we have in PCIe, that just means
we'll eventually reset the controller after the I/O command times out
the second time, as we still won't have seen a completion for it.
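
The two-step escalation can be sketched as a small state machine. This
is a simplified illustration, not the driver's actual timeout handler:
`Request`, `on_timeout` and the returned strings are hypothetical names
chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tag: int
    abort_sent: bool = False
    completed: bool = False


def on_timeout(req):
    """Decide what to do when the timer for req expires."""
    if req.completed:
        return "done"
    if req.abort_sent:
        # An abort was already tried and still no completion arrived:
        # the only remaining way to get a known state is a reset.
        return "reset-controller"
    req.abort_sent = True
    return "send-abort-and-rearm-timer"


req = Request(tag=1)
first = on_timeout(req)   # first expiry: try an abort
second = on_timeout(req)  # second expiry: escalate
```

Whether the abort itself was answered with "no command found" does not
matter in this model; what triggers the reset is that the original
command still has no completion when the timer fires again.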

If you incorrectly just continue and resend the command, we'll actually
get the command sent twice, and thus a potential bug once the delayed
original command finally gets sent along.
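
One way such a bug could show up, as an assumption for illustration
rather than something stated above, is a stale write landing on top of
newer data. The sketch below uses a hypothetical helper
`apply_in_arrival_order` and made-up arrival times.

```python
def apply_in_arrival_order(writes):
    """Apply (arrival_time, lba, data) writes and return final contents."""
    disk = {}
    for _, lba, data in sorted(writes):
        disk[lba] = data
    return disk


writes = [
    (7200.0, 0, "A"),  # original write, held back by the switch
    (31.0, 0, "A"),    # resend issued after the timeout
    (60.0, 0, "B"),    # later update that the host expects to win
]
final = apply_in_arrival_order(writes)
```

Here the host believes LBA 0 holds "B", but the delayed original write
arrives last and leaves "A" on the device.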
