From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============4014661847363965278=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
Date: Mon, 18 Jul 2016 23:06:02 +0000
Message-ID: <1468883161.5999.436.camel@intel.com>
In-Reply-To: CANvN+ekzEk7Zhth5FOb-N59-aatxeoergfo=aSpCn6ArT5dGdw@mail.gmail.com
List-ID:
To: spdk@lists.01.org

--===============4014661847363965278==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Sun, 2016-07-17 at 14:23 +0000, Andrey Kuzmin wrote:
>
> On Thu, Jul 14, 2016, 20:55 Walker, Benjamin wrote:
> >
> > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > Hello Ben
> > >
> > > I have a use case where I want to attach one namespace of an NVMe
> > > device to the SPDK driver and use the other namespace as a kernel
> > > block device to create a regular filesystem. The current
> > > implementation of SPDK requires the device to be unbound completely
> > > from the native kernel driver. I was wondering if this is at all
> > > possible and, if yes, can this be accomplished with the current SPDK
> > > implementation?
> >
> > Your request is one we get every few days or so, and it is a perfectly
> > reasonable thing to ask. I haven't written down my standard response on
> > the mailing list yet, so I'm going to take this opportunity to lay out
> > our position for all to see and discuss.
> >
> > From a purely technical standpoint, it is impossible to load both the
> > SPDK driver as it exists today and the kernel driver against the same
> > PCI device. The registers exposed by the PCI device contain global
> > state, so there can only be a single "owner". There is an established
> > hardware mechanism, SR-IOV, for creating multiple virtual PCI devices
> > from a single physical device, each of which can load its own driver.
> > This is typically used by NICs today and I'm not aware of any NVMe SSDs
> > that support it currently. SR-IOV is the right solution for sharing the
> > device like you outline in the long term, though.
> >
> > In the short term, it would be technically possible to create some
> > kernel patches that add entries to sysfs or provide ioctls that allow a
> > user space process to claim an NVMe hardware queue for a device that
> > the kernel is managing. You could then run the SPDK driver's I/O path
> > against that queue. Unfortunately, there are two insurmountable issues
> > with this strategy. First, NVMe hardware queues can write to any
> > namespace on the device. Therefore, you couldn't enforce that the queue
> > can only write to the namespace you are intending. You couldn't even
> > enforce that the queue is only used for reads - you basically just have
> > to trust the application to only do reasonable things. Second, the
> > device is owned by the kernel and therefore is not in an IOMMU
> > protection domain with this strategy. The device can directly access
> > the DMA engine, and with a small amount of work, you could hijack that
> > DMA engine to copy data to wherever you wanted on the system. For these
> > two reasons, patches of this nature would never be accepted into the
> > mainline kernel. The SPDK team can't be in the business of supporting
> > patches that have been rejected by the kernel community.
> >
> > Clearly, lots of people have requested to share a device between the
> > kernel and SPDK, so I've been trying to uncover all of the reasons they
> > may want to do that. So far, in every case, it boils down to not having
> > a filesystem for use with SPDK. I'm hoping to steer the community to
> > solve the problem of not having a filesystem rather than trying to
> > share the device.
>
> An intriguing perspective, in this regard, is provided by the upcoming
> NVMoF support in SPDK.
> Once this stabilizes, a remote user-space SPDK target could be exposed
> via a local kernel nvmf host, whereupon any standard filesystem may be
> mounted atop such a remote NVMe target. This nicely addresses the root
> cause of not having a filesystem atop SPDK w/o any blobstore development
> which, IMO, just doesn't belong here.
>
> I'm not a networking guy, so maybe someone else on the list can opine if
> the above approach to access a remote SPDK target might work locally via
> any sort of loop mount. If possible, that would address the root cause in
> both remote and local settings.

Loopback with RDMA generally works as you'd expect - that's how we do the
majority of our testing on the NVMf target today. You can indeed use the
Linux kernel initiator to connect to the SPDK NVMf target and that's again
how we do all of our testing. The two pieces of code were developed
together, right alongside development of the specification. The SPDK NVMf
target does not share code with the Linux kernel for licensing reasons and
we were silo'd during development from a code standpoint, so the code in
SPDK is clean BSD-licensed code.

I'm not sure using the SPDK NVMf target and connecting the Linux kernel
initiator via loopback has any use case outside of testing though. The
purpose of using SPDK locally is to avoid the kernel to get better
performance. If you are using the Linux kernel initiator, you're going
through the whole kernel stack and then additionally through the userspace
stack, so you're losing all of your performance benefit. If you do that,
it's probably faster to just use the kernel to access your local NVMe
device and not use SPDK at all.

To be clear, just because it makes no sense to use the Linux kernel NVMf
initiator to connect to the SPDK NVMf target in loopback doesn't mean it
doesn't make sense to use the kernel NVMf initiator to connect to a remote
SPDK NVMf target.
Any single client will of course go through its local kernel and pay the
penalty, but the target itself should be able to service many times more
clients for a given amount of compute using the SPDK NVMf target compared
to the Linux kernel target.

>
> Regards,
> Andrey
>
> > I'm not advocating for writing a (mostly) POSIX compliant filesystem,
> > but I do think there is a small core of functionality that most
> > databases or storage applications all require. These are things like
> > allocating blocks into some unit (I've been calling it a blob) that has
> > a name and is persistent and rediscoverable across reboots. Writing
> > this layer requires some serious thought - SPDK is fast in no small
> > part because it is purely asynchronous, polled, and lockless - so this
> > layer would need to preserve those characteristics.
> >
> > Sorry for the very long response, but I wanted to document my current
> > thoughts on the mailing list for all to see.
> >
> > >
> > > --Tyc
> > >
> > > _______________________________________________
> > > SPDK mailing list
> > > SPDK(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/spdk
>
> --
> Regards,
> Andrey

--===============4014661847363965278==--