All of lore.kernel.org
 help / color / mirror / Atom feed
From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] NVMe-oF Target Library
Date: Wed, 19 Jul 2017 17:10:28 +0000	[thread overview]
Message-ID: <1500484227.3169.1.camel@intel.com> (raw)
In-Reply-To: 37B08312E007AE46A00101F43DA919DD79AE02F2@FMSMSX105.amr.corp.intel.com

[-- Attachment #1: Type: text/plain, Size: 6334 bytes --]

I'm reviving this old thread because I'm back to working in this area again. The
challenge for changes this large is to figure out how to do them in small
pieces. Some of the major changes to the threading model will need to be done in
one go, but I think there are a few incremental steps we can take to improve the
code base and prepare it for the big transition.

I think the first of those changes is to remove the "mode" parameter from
subsystems. Today, NVMe-oF subsystems can be in either direct (I/O routed to
nvme library) or virtual (I/O routed to bdev library) mode. Recently, a new bdev
command was added by the wider community (thanks for the patch!) that adds an
NVMe passthrough command to the bdev layer. That allows us to send an NVMe
command through the regular bdev stack. The commands are generally only
interpreted by the NVMe bdev module - the other backing devices don't report
support for the NVMe passthrough command - but that's good enough. Given that
new capability, we can do anything in virtual mode that we previously did in
direct mode.

The only reason we didn't remove direct mode immediately after the addition of
NVMe passthrough was because we wanted to do a full performance evaluation to
verify the bdev layer doesn't have a measurable amount of overhead. I'm glad to
report those results have come in and the overhead of routing I/O through the
bdev library instead of nvme isn't measurable for any hardware set up we were
able to build.

I wrote up a patch here: https://review.gerrithub.io/#/c/369496/

The next big step is probably to make some changes to the transport API to
accommodate the new ideas in my previous email.

Discussion and requests are always welcome!

Thanks,
Ben

On Fri, 2017-07-14 at 17:16 +0000, Walker, Benjamin wrote:
> 
> -----Original Message-----
> From: Walker, Benjamin 
> Sent: Wednesday, April 26, 2017 2:06 PM
> To: spdk(a)lists.01.org
> Subject: NVMe-oF Target Library
> 
> Hi all,
> 
> I was hoping to start a bit of a design discussion about the future of the
> NVMe-oF target library (lib/nvmf). The NVMe-oF target was originally created
> as
> part of a skunkworks project and was very much an application. It wasn't
> divided into a library and an app as it is today. Right before we released it,
> I decided to attempt to break it up into a library and an application, but I
> never really finished that task. I'd like to resume that work now, but let the
> entire community weigh in on what the library looks like.
> 
> First, libraries in SPDK (most things that live in lib/) shouldn't enforce a
> threading model. They should, as much as possible, be entirely passive C
> libraries with as few dependencies as we can manage. Applications in SPDK
> (things that live in app/), on the other hand, necessarily must choose a
> particular threading model. We universally use our application/event framework
> (lib/event) for apps, which spawns one thread per core, etc. We'll continue
> this model for NVMe-oF where app/nvmf_tgt will be a full application with a
> threading model dictated by the application/event framework, while lib/nvmf
> will be a passive C library that will depend only on other passive C
> libraries.
> I don't think this distinction is at all reality today, but let's work to make
> it so.
> 
> The other major issue with the NVMe-oF target implementation is that it has
> quite a few baked in assumptions about what the backing storage device looks
> like. In particular, it was written assuming that it was talking directly to
> an
> NVMe device (Direct mode), and the ability to route I/O to the bdev layer
> (Virtual mode) was added much later and isn't entirely fleshed out yet. One of
> these assumptions is that real NVMe devices don't benefit from multiple queues
> - you can get the full performance from an NVMe device using just one queue
> pair. That isn't necessarily true for bdevs, which may be arbitrarily
> complex virtualized devices. Given that assumption, the NVMe-oF target
> today only creates a single queue pair to the backing storage device and only
> uses a single thread to route I/O to it. We're definitely going to need to
> break that assumption.
> 
> The first discussion that I want to have is around what the high level
> concepts
> should be. We clearly need to expose things like "subsystem", "queue
> pair/connection", "namespace", and "port". We should probably have an object
> that represents the entire target too, maybe "nvmf_tgt". However, in order to
> separate the threading model from the library I think we'll need at least two
> more concepts.
> 
> First, some thread has to be in charge of polling for new connections. We
> typically refer to this as the "acceptor" thread today. Maybe the best way to
> handle this is to add an "accept" function that takes the nvmf_tgt object as
> an
> argument. This function can only be called one a single thread at a time and
> is
> repeatedly called to discover new connections. I think the user will end up
> passing a callback in to this function that will be called for each new
> connection discovered.
> 
> Second, once a new connection is discovered, we need to hand it off to some
> collection that a dedicated thread can poll. This collection of connections
> would be tied specifically to that dedicated thread, but it wouldn't
> necessarily be tied to a subsystem or a particular storage device. I don't
> really know what to call this thing - right now I'm kind of thing
> "io_handler".
> 
> So the general flow for an application would be to construct a target, add
> subsystems, namespaces, and ports as needed, and then poll the target for
> incoming connections. For each new connection, the application would assign it
> to an io_handler (using whatever algorithm it wanted) and then poll the
> io_handlers to actually handle I/O on the connections. Does this seem like a
> reasonable design at a very high level? Feedback is very much welcome and
> encouraged.
> 
> If I don't hear back with a bunch of "you're wrong!" or "that's stupid!" type
> replies over the next few days, the next step will be to write up a new header
> file for the library that we can discuss in more detail.
> 
> Thanks,
> Ben

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3274 bytes --]

             reply	other threads:[~2017-07-19 17:10 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-19 17:10 Walker, Benjamin [this message]
  -- strict thread matches above, loose matches on Subject: below --
2017-04-26 21:06 [SPDK] NVMe-oF Target Library Walker, Benjamin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1500484227.3169.1.camel@intel.com \
    --to=spdk@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.