Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver

From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
Date: Mon, 18 Jul 2016 23:06:02 +0000
Message-ID: <1468883161.5999.436.camel@intel.com>
In-Reply-To: CANvN+ekzEk7Zhth5FOb-N59-aatxeoergfo=aSpCn6ArT5dGdw@mail.gmail.com


On Sun, 2016-07-17 at 14:23 +0000, Andrey Kuzmin wrote:
> 
> 
> On Thu, Jul 14, 2016, 20:55 Walker, Benjamin <benjamin.walker(a)intel.com> wrote:
> > 
> > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > Hello Ben
> > >
> > > I have a use case where I want to attach one namespace of an NVMe device to the SPDK driver
> > > and use the other namespace as a kernel block device to create a regular filesystem. The
> > > current implementation of SPDK requires the device to be unbound completely from the native
> > > kernel driver. I was wondering if this is at all possible and, if yes, can this be
> > > accomplished with the current SPDK implementation?
> > 
> > Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask.
> > I haven't written down my standard response on the mailing list yet, so I'm going to take this
> > opportunity to lay out our position for all to see and discuss.
> > 
> > From a purely technical standpoint, it is impossible to load both the SPDK driver as it exists
> > today and the kernel driver against the same PCI device. The registers exposed by the PCI device
> > contain global state, so there can only be a single "owner". There is an established hardware
> > mechanism, SR-IOV, for creating multiple virtual PCI devices from a single physical device, each
> > of which can load its own driver. It is typically used by NICs today, and I'm not aware of any
> > NVMe SSDs that support it currently. In the long term, though, SR-IOV is the right solution for
> > sharing the device in the way you outline.
> > 
> > In the short term, it would be technically possible to create some kernel patches that add
> > entries to sysfs or provide ioctls that allow a user space process to claim an NVMe hardware
> > queue for a device that the kernel is managing. You could then run the SPDK driver's I/O path
> > against that queue. Unfortunately, there are two insurmountable issues with this strategy.
> > First, NVMe hardware queues can write to any namespace on the device. Therefore, you couldn't
> > enforce that the queue can only write to the namespace you are intending. You couldn't even
> > enforce that the queue is only used for reads - you basically just have to trust the application
> > to only do reasonable things. Second, the device is owned by the kernel and therefore is not in
> > an IOMMU protection domain with this strategy. The device can directly access the DMA engine,
> > and with a small amount of work, you could hijack that DMA engine to copy data to wherever you
> > wanted on the system. For these two reasons, patches of this nature would never be accepted into
> > the mainline kernel. The SPDK team can't be in the business of supporting patches that have been
> > rejected by the kernel community.
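
To illustrate the first issue: in the NVMe command format, the namespace ID is just a field that the
submitter fills in on every command, so nothing ties a queue to one namespace. The following is a
minimal sketch of the 64-byte submission queue entry layout for a read or write, not SPDK code. The
helper names (build_rw_command, nsid_of) are made up for illustration, and the PRP fields are left
at zero because no real DMA buffer is involved here.

```python
import struct

NVME_OPC_WRITE = 0x01
NVME_OPC_READ = 0x02

# 64-byte submission queue entry, little-endian:
# opcode, flags, command ID, NSID, CDW2-3, MPTR, PRP1, PRP2, CDW10..CDW15
SQE = struct.Struct("<BBHIQQQQIIIIII")


def build_rw_command(opcode, cid, nsid, slba, nblocks, prp1=0):
    """Pack a read/write command; helper name is hypothetical."""
    cdw10 = slba & 0xFFFFFFFF            # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF    # starting LBA, high 32 bits
    cdw12 = (nblocks - 1) & 0xFFFF       # block count is zero-based
    return SQE.pack(opcode, 0, cid, nsid, 0, 0, prp1, 0,
                    cdw10, cdw11, cdw12, 0, 0, 0)


def nsid_of(cmd):
    """Read back the namespace ID stored in bytes 4..7 of the entry."""
    return struct.unpack_from("<I", cmd, 4)[0]


if __name__ == "__main__":
    # The same queue could carry both of these; only the NSID field differs.
    to_ns1 = build_rw_command(NVME_OPC_WRITE, cid=1, nsid=1, slba=0, nblocks=8)
    to_ns2 = build_rw_command(NVME_OPC_WRITE, cid=2, nsid=2, slba=0, nblocks=8)
    print(nsid_of(to_ns1), nsid_of(to_ns2))
```

Whoever owns the queue chooses both the opcode and the NSID on each entry, which is why handing a
raw queue to an application amounts to trusting it with the whole device.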
> > 
> > Clearly, lots of people have requested to share a device between the kernel and SPDK, so I've
> > been trying to uncover all of the reasons they may want to do that. So far, in every case, it
> > boils down to not having a filesystem for use with SPDK. I'm hoping to steer the community to
> > solve the problem of not having a filesystem rather than trying to share the device.
> 
> An intriguing perspective, in this regard, is provided by the upcoming NVMoF support in SPDK. Once
> this stabilizes, remote user-space SPDK target could be exposed via a local kernel nvmf host,
> whereupon any standard filesystem may be mounted atop such a remote NVMe target. This nicely
> addresses the root cause of not having a filesystem atop SPDK w/o any blobstore development which,
> IMO, just doesn't belong here.
> 
> I'm not a networking guy, so maybe someone else on the list can opine if the above approach to
> access a remote SPDK target might work locally via any sort of loop mount. If possible, that would
> address the root cause in both remote and local settings.

Loopback with RDMA generally works as you'd expect - that's how we do the majority of our testing on
the NVMf target today. You can indeed use the Linux kernel initiator to connect to the SPDK NVMf
target and that's again how we do all of our testing. The two pieces of code were developed
together, right alongside development of the specification. The SPDK NVMf target does not share code
with the Linux kernel for licensing reasons and we were silo'd during development from a code
standpoint, so the code in SPDK is clean BSD-licensed code.
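
As a rough sketch of the loopback setup described above, the kernel initiator is usually attached
with nvme-cli's "nvme connect" command. The snippet below only builds that command line and does
not run it. The address and NQN are placeholders, 4420 is assumed as the service ID, and the helper
name nvme_connect_argv is made up for illustration; adjust all of them to match the target
configuration.

```python
import shlex


def nvme_connect_argv(traddr, nqn, trsvcid="4420", transport="rdma"):
    """Build the nvme-cli argument list for the kernel NVMf initiator.

    Helper name and defaults are illustrative assumptions.
    """
    return [
        "nvme", "connect",
        "-t", transport,      # transport type
        "-a", traddr,         # target address (a local RDMA-capable IP for loopback)
        "-s", str(trsvcid),   # transport service ID (port)
        "-n", nqn,            # NQN of the subsystem exported by the target
    ]


if __name__ == "__main__":
    # Placeholder address and NQN; neither comes from a real setup.
    argv = nvme_connect_argv("192.168.100.8", "nqn.2016-06.io.spdk:cnode1")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Once connected, the remote namespace shows up as an ordinary kernel block device, which is what
makes mounting a standard filesystem on top of it possible.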

I'm not sure using the SPDK NVMf target and connecting the Linux kernel initiator via loopback has
any use case outside of testing, though. The purpose of using SPDK locally is to bypass the kernel
for better performance. If you are using the Linux kernel initiator, you're going through the whole
kernel stack and then additionally through the userspace stack, so you're losing all of your
performance benefit. If you do that, it's probably faster to just use the kernel to access your
local NVMe device and not use SPDK at all.

To be clear, while it makes no sense to use the Linux kernel NVMf initiator to connect to the
SPDK NVMf target in loopback, it can still make sense to use the kernel NVMf initiator to
connect to a remote SPDK NVMf target. Any single client will of course go through its local kernel
and pay the penalty, but the target itself should be able to service many times more clients for a
given amount of compute using the SPDK NVMf target compared to the Linux kernel target.

> 
> Regards,
> Andrey
> 
> > I'm not advocating for writing a (mostly) POSIX compliant filesystem, but I do think there is a
> > small core of functionality that most databases or storage applications all require. These are
> > things like allocating blocks into some unit (I've been calling it a blob) that has a name and
> > is persistent and rediscoverable across reboots. Writing this layer requires some serious
> > thought - SPDK is fast in no small part because it is purely asynchronous, polled, and lockless -
> > so this layer would need to preserve those characteristics.
> > 
> > Sorry for the very long response, but I wanted to document my current thoughts on the mailing
> > list for all to see.
> > 
> > >
> > > --Tyc
> > >
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/spdk
> > 
> -- 
> Regards,
> Andrey