From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============2829176466894282327=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
Date: Thu, 14 Jul 2016 20:18:01 +0000
Message-ID: <1468527479.5999.390.camel@intel.com>
In-Reply-To: CABSNBDEYab+gyZPCmKq=_ugP_+cSfxtB6GaNCeLmrbfCiSJmzQ@mail.gmail.com
List-ID:
To: spdk@lists.01.org

--===============2829176466894282327==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0

On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
>
> Thank you Ben for the detailed reply.
> A filesystem which can make use of SPDK is precisely the requirement. Everything else is a way to
> get around that. In my specific use case I wish to have a single nvme device which will have a
> rootfs as well. So such a filesystem will need to handle that as well (probably I am being too
> ambitious here).

It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you share
the device between the kernel (for booting) and your application, if and/or when that exists.

> The only other "filesystem" that I am aware of is Ceph's bluefs which is very minimal and specific
> to Rocksdb backend.

This is the only one that I'm aware of currently as well, and it has a number of features that make
it not particularly suitable for use with SPDK (even though it does work with SPDK). The biggest
problems are around synchronous I/O operations and lack of memory pre-registration, forcing copies
on every I/O.

> On a side note if I had more than one nvme device on a system, do all the nvme devices need to
> be unbound from the kernel driver?

Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the kernel
at the same time with no conflict.
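
To make the per-device point concrete, here is a minimal sketch (not taken from the SPDK setup
scripts) of releasing just one device from the kernel nvme driver by writing its PCI address to the
driver's standard sysfs unbind file. The function name, the sysfs_root parameter, and the example
address 0000:01:00.0 are assumptions made up for this illustration.

```python
import os


def unbind_from_kernel_nvme(pci_addr, sysfs_root="/sys"):
    """Ask the kernel nvme driver to release a single PCI device.

    pci_addr is a full PCI address such as "0000:01:00.0" (hypothetical).
    sysfs_root is a parameter only so the sketch can be tried against a
    scratch directory instead of the real /sys.
    """
    unbind_file = os.path.join(
        sysfs_root, "bus", "pci", "drivers", "nvme", "unbind"
    )
    # Writing the address to this file detaches only that device;
    # every other NVMe device stays bound to the kernel driver.
    with open(unbind_file, "w") as f:
        f.write(pci_addr)
    return unbind_file
```

Against the real /sys this needs root, and the device still has to be handed to a user space capable
driver afterwards before SPDK can use it.
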
Our setup scripts do either bind or unbind all of them at once, but that's just for convenience.

>
> --Tyc
>
> > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> >
> > >On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > >> Hello Ben
> > >>
> > >> I have a use case where I want to attach one namespace of a nvme device to spdk driver and
> > >> use the other namespace as a kernel block device to create a regular filesystem. Current
> > >> implementation of spdk requires the device to be unbound completely from the native kernel
> > >> driver. I was wondering if this is at all possible and if yes can this be accomplished with
> > >> the current spdk implementation?
> > >
> > >Your request is one we get every few days or so, and it is a perfectly reasonable thing to
> > >ask. I haven't written down my standard response on the mailing list yet, so I'm going to
> > >take this opportunity to lay out our position for all to see and discuss.
> > >
> > >From a purely technical standpoint, it is impossible to load both the SPDK driver as it exists
> > >today and the kernel driver against the same PCI device. The registers exposed by the PCI
> > >device contain global state and so there can only be a single "owner". There is an established
> > >hardware mechanism, called SR-IOV, for creating multiple virtual PCI devices from a single
> > >physical device, each of which can load its own driver. This is typically used by NICs today
> > >and I'm not aware of any NVMe SSDs that support it currently. SR-IOV is the right solution for
> > >sharing the device like you outline in the long term, though.
> > >
> > >In the short term, it would be technically possible to create some kernel patches that add
> > >entries to sysfs or provide ioctls that allow a user space process to claim an NVMe hardware
> > >queue for a device that the kernel is managing. You could then run the SPDK driver's I/O path
> > >against that queue. Unfortunately, there are two insurmountable issues with this strategy.
> > >First, NVMe hardware queues can write to any namespace on the device. Therefore, you couldn't
> > >enforce that the queue can only write to the namespace you are intending. You couldn't even
> > >enforce that the queue is only used for reads - you basically just have to trust the
> > >application to only do reasonable things. Second, the device is owned by the kernel and
> > >therefore is not in an IOMMU protection domain with this strategy. The device can directly
> > >access the DMA engine, and with a small amount of work, you could hijack that DMA engine to
> > >copy data to wherever you wanted on the system. For these two reasons, patches of this nature
> > >would never be accepted into the mainline kernel. The SPDK team can't be in the business of
> > >supporting patches that have been rejected by the kernel community.
> > >
> > >Clearly, lots of people have requested to share a device between the kernel and SPDK, so I've
> > >been trying to uncover all of the reasons they may want to do that. So far, in every case, it
> > >boils down to not having a filesystem for use with SPDK. I'm hoping to steer the community to
> > >solve the problem of not having a filesystem rather than trying to share the device. I'm not
> > >advocating for writing a (mostly) POSIX compliant filesystem, but I do think there is a small
> > >core of functionality that most databases or storage applications all require.
> > >These are things like allocating blocks into some unit (I've been calling it a blob) that has
> > >a name and is persistent and rediscoverable across reboots. Writing this layer requires some
> > >serious thought - SPDK is fast in no small part because it is purely asynchronous, polled, and
> > >lockless - so this layer would need to preserve those characteristics.
> > >
> > >Sorry for the very long response, but I wanted to document my current thoughts on the mailing
> > >list for all to see.
> > >
> > >>
> > >> --Tyc
> > >>
> > >> _______________________________________________
> > >> SPDK mailing list
> > >> SPDK(a)lists.01.org
> > >> https://lists.01.org/mailman/listinfo/spdk
--===============2829176466894282327==--