All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@lst.de>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Keith Busch <keith.busch@intel.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
Subject: Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
Date: Mon, 28 Aug 2017 11:06:20 +0200	[thread overview]
Message-ID: <20170828090620.GA4529@lst.de> (raw)
In-Reply-To: <84429b00-300f-a36b-c4ac-b7d930ca6c7e@grimberg.me>

On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote:
> This is really taking a lot into the nvme driver.

This patch adds about 100 lines of code, out of which 1/4th is the
parsing of status codes.  I don't think it's all that much..

> I'm not sure if
> this approach will be used in other block driver, but would it
> make sense to place the block_device node creation,

I really want to reduce the amount of boilerplate code for creating
a block queue and I have some ideas for that.  But that's not in
any way related to this multi-path code.

> the make_request

That basically just does a lookup (in a data structure we need for
tracking the siblings inside nvme anyway) and the submits the
I/O again.  The generic code all is in generic_make_request_fast.
There are two more things which could move into common code, one
reasonable, the other not:

 (1) the whole requeue_list list logic.  I thought about moving this
     to the block layer as I'd also have some other use cases for it,
     but decided to wait until those materialize to see if I'd really
     need it.
 (2) turning the logic inside out, e.g. providing a generic make_request
     and supplying a find_path callback to it.  This would require (1)
     but even then we'd save basically no code, add an indirect call
     in the fast path and make things harder to read.  This actually
     is how I started out and didn't like it.

> and failover logic

The big thing about the failover logic is looking a the detailed error
codes and changing the controller state, all of which is fundamentally
driver specific.  The only thing that could reasonably be common is
the requeue_list handling as mentioned above.  Note that this will
become even more integrated with the nvme driver once ANA support
lands.

> and maybe the path selection in the block layer
> leaving just the construction of the path mappings in nvme?

The point is that path selection fundamentally depends on your
actual protocol.  E.g. for NVMe we now look at the controller state
as that's what our keep alive work on, and what reflects that status
in PCIe. We could try to communicate this up, but it would just lead
to more data structure that get out of sync.  And this will become
much worse with ANA where we have even more fine grained state.
In the end path selection right now is less than 20 lines of code.

Path selection is complicated when you want non-trivial path selector
algorithms like round robin or service time, but I'd really avoid having
those on nvme - we already have multiple queues to go beyong the limits
of a single queue and use blk-mq for that, and once we're beyond the
limits of a single transport path for performance reasons I'd much
rather rely on ANA, numa nodes or static partitioning of namepsaces
than trying to do dynamic decicions in the fast path.  But if I don't
get what I want there and we'll need more complicated algorithms they
absolutely should be in common code.  In fact one of my earlier protoypes
moved the dm-mpath path selector algorithms to core block code and
tried to use them - it just turned out enabling them was pretty much
pointless.

WARNING: multiple messages have this Message-ID (diff)
From: hch@lst.de (Christoph Hellwig)
Subject: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
Date: Mon, 28 Aug 2017 11:06:20 +0200	[thread overview]
Message-ID: <20170828090620.GA4529@lst.de> (raw)
In-Reply-To: <84429b00-300f-a36b-c4ac-b7d930ca6c7e@grimberg.me>

On Mon, Aug 28, 2017@10:23:33AM +0300, Sagi Grimberg wrote:
> This is really taking a lot into the nvme driver.

This patch adds about 100 lines of code, out of which 1/4th is the
parsing of status codes.  I don't think it's all that much..

> I'm not sure if
> this approach will be used in other block driver, but would it
> make sense to place the block_device node creation,

I really want to reduce the amount of boilerplate code for creating
a block queue and I have some ideas for that.  But that's not in
any way related to this multi-path code.

> the make_request

That basically just does a lookup (in a data structure we need for
tracking the siblings inside nvme anyway) and the submits the
I/O again.  The generic code all is in generic_make_request_fast.
There are two more things which could move into common code, one
reasonable, the other not:

 (1) the whole requeue_list list logic.  I thought about moving this
     to the block layer as I'd also have some other use cases for it,
     but decided to wait until those materialize to see if I'd really
     need it.
 (2) turning the logic inside out, e.g. providing a generic make_request
     and supplying a find_path callback to it.  This would require (1)
     but even then we'd save basically no code, add an indirect call
     in the fast path and make things harder to read.  This actually
     is how I started out and didn't like it.

> and failover logic

The big thing about the failover logic is looking a the detailed error
codes and changing the controller state, all of which is fundamentally
driver specific.  The only thing that could reasonably be common is
the requeue_list handling as mentioned above.  Note that this will
become even more integrated with the nvme driver once ANA support
lands.

> and maybe the path selection in the block layer
> leaving just the construction of the path mappings in nvme?

The point is that path selection fundamentally depends on your
actual protocol.  E.g. for NVMe we now look at the controller state
as that's what our keep alive work on, and what reflects that status
in PCIe. We could try to communicate this up, but it would just lead
to more data structure that get out of sync.  And this will become
much worse with ANA where we have even more fine grained state.
In the end path selection right now is less than 20 lines of code.

Path selection is complicated when you want non-trivial path selector
algorithms like round robin or service time, but I'd really avoid having
those on nvme - we already have multiple queues to go beyong the limits
of a single queue and use blk-mq for that, and once we're beyond the
limits of a single transport path for performance reasons I'd much
rather rely on ANA, numa nodes or static partitioning of namepsaces
than trying to do dynamic decicions in the fast path.  But if I don't
get what I want there and we'll need more complicated algorithms they
absolutely should be in common code.  In fact one of my earlier protoypes
moved the dm-mpath path selector algorithms to core block code and
tried to use them - it just turned out enabling them was pretty much
pointless.

  reply	other threads:[~2017-08-28  9:06 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-23 17:58 RFC: nvme multipath support Christoph Hellwig
2017-08-23 17:58 ` Christoph Hellwig
2017-08-23 17:58 ` [PATCH 01/10] nvme: report more detailed status codes to the block layer Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:06   ` Sagi Grimberg
2017-08-28  6:06     ` Sagi Grimberg
2017-08-28 18:50   ` Keith Busch
2017-08-28 18:50     ` Keith Busch
2017-08-23 17:58 ` [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:06   ` Sagi Grimberg
2017-08-28  6:06     ` Sagi Grimberg
2017-08-28 18:50   ` Keith Busch
2017-08-28 18:50     ` Keith Busch
2017-08-23 17:58 ` [PATCH 03/10] nvme: remove unused struct nvme_ns fields Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:07   ` Sagi Grimberg
2017-08-28  6:07     ` Sagi Grimberg
2017-08-28 19:13   ` Keith Busch
2017-08-28 19:13     ` Keith Busch
2017-08-23 17:58 ` [PATCH 04/10] nvme: remove nvme_revalidate_ns Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:12   ` Sagi Grimberg
2017-08-28  6:12     ` Sagi Grimberg
2017-08-28 19:14   ` Keith Busch
2017-08-28 19:14     ` Keith Busch
2017-08-23 17:58 ` [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:17   ` Sagi Grimberg
2017-08-28  6:17     ` Sagi Grimberg
2017-08-28  6:23     ` Christoph Hellwig
2017-08-28  6:23       ` Christoph Hellwig
2017-08-28  6:32       ` Sagi Grimberg
2017-08-28  6:32         ` Sagi Grimberg
2017-08-28 19:15   ` Keith Busch
2017-08-28 19:15     ` Keith Busch
2017-08-23 17:58 ` [PATCH 06/10] nvme: track subsystems Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-23 22:04   ` Keith Busch
2017-08-23 22:04     ` Keith Busch
2017-08-24  8:52     ` Christoph Hellwig
2017-08-24  8:52       ` Christoph Hellwig
2017-08-28  6:22   ` Sagi Grimberg
2017-08-28  6:22     ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 07/10] nvme: track shared namespaces Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  6:51   ` Sagi Grimberg
2017-08-28  6:51     ` Sagi Grimberg
2017-08-28  8:50     ` Christoph Hellwig
2017-08-28  8:50       ` Christoph Hellwig
2017-08-28 20:21     ` J Freyensee
2017-08-28 20:21       ` J Freyensee
2017-08-29  8:25       ` Christoph Hellwig
2017-08-29  8:25         ` Christoph Hellwig
2017-08-29  6:54     ` Guan Junxiong
2017-08-29  6:54       ` Guan Junxiong
2017-08-28 12:04   ` javigon
2017-08-28 12:04     ` javigon
2017-08-28 12:41   ` Guan Junxiong
2017-08-28 12:41     ` Guan Junxiong
2017-08-28 14:30     ` Christoph Hellwig
2017-08-28 14:30       ` Christoph Hellwig
2017-08-29  2:42       ` Guan Junxiong
2017-08-29  2:42         ` Guan Junxiong
2017-08-29  8:30         ` Christoph Hellwig
2017-08-29  8:30           ` Christoph Hellwig
2017-08-29  8:29     ` Christoph Hellwig
2017-08-29  8:29       ` Christoph Hellwig
2017-08-28 19:18   ` Keith Busch
2017-08-28 19:18     ` Keith Busch
2017-08-23 17:58 ` [PATCH 08/10] block: provide a generic_make_request_fast helper Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  7:00   ` Sagi Grimberg
2017-08-28  7:00     ` Sagi Grimberg
2017-08-28  8:54     ` Christoph Hellwig
2017-08-28  8:54       ` Christoph Hellwig
2017-08-28 11:01       ` Sagi Grimberg
2017-08-28 11:01         ` Sagi Grimberg
2017-08-28 11:54         ` Christoph Hellwig
2017-08-28 11:54           ` Christoph Hellwig
2017-08-28 12:38           ` Sagi Grimberg
2017-08-28 12:38             ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 09/10] blk-mq: add a blk_steal_bios helper Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-28  7:04   ` Sagi Grimberg
2017-08-28  7:04     ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 10/10] nvme: implement multipath access to nvme subsystems Christoph Hellwig
2017-08-23 17:58   ` Christoph Hellwig
2017-08-23 18:21   ` Bart Van Assche
2017-08-23 18:21     ` Bart Van Assche
2017-08-24  8:59     ` hch
2017-08-24  8:59       ` hch
2017-08-24 20:17       ` Bart Van Assche
2017-08-24 20:17         ` Bart Van Assche
2017-09-05 11:53         ` Christoph Hellwig
2017-09-05 11:53           ` Christoph Hellwig
2017-09-11  6:34           ` Tony Yang
2017-08-23 22:53   ` Keith Busch
2017-08-23 22:53     ` Keith Busch
2017-08-24  8:52     ` Christoph Hellwig
2017-08-24  8:52       ` Christoph Hellwig
2017-08-28  7:23   ` Sagi Grimberg
2017-08-28  7:23     ` Sagi Grimberg
2017-08-28  9:06     ` Christoph Hellwig [this message]
2017-08-28  9:06       ` Christoph Hellwig
2017-08-28 13:40       ` Sagi Grimberg
2017-08-28 13:40         ` Sagi Grimberg
2017-08-28 14:24         ` Christoph Hellwig
2017-08-28 14:24           ` Christoph Hellwig
2017-09-07 15:17       ` Tony Yang
2017-08-29 10:22   ` Guan Junxiong
2017-08-29 10:22     ` Guan Junxiong
2017-08-29 14:51     ` Christoph Hellwig
2017-08-29 14:51       ` Christoph Hellwig
2017-08-29 14:54   ` Keith Busch
2017-08-29 14:54     ` Keith Busch
2017-08-29 14:55     ` Christoph Hellwig
2017-08-29 14:55       ` Christoph Hellwig
2017-08-29 15:41       ` Keith Busch
2017-08-29 15:41         ` Keith Busch
2017-09-18  0:17         ` Christoph Hellwig
2017-09-18  0:17           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170828090620.GA4529@lst.de \
    --to=hch@lst.de \
    --cc=axboe@kernel.dk \
    --cc=keith.busch@intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.