From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============6017580835562205713=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
Date: Thu, 14 Jul 2016 17:55:47 +0000
Message-ID: <1468518945.5999.384.camel@intel.com>
In-Reply-To: CABSNBDEeFdxDJbXsaWscz_i0+7nanc2j4eZb6YEJOLMt7wDMPg@mail.gmail.com
List-ID:
To: spdk@lists.01.org

--===============6017580835562205713==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> Hello Ben
>
> I have a use case where I want to attach one namespace of a nvme device to spdk driver and use the
> other namespace as a kernel block device to create a regular filesystem. Current implementation of
> spdk requires the device to be unbound completely from the native kernel driver. I was wondering
> if this is at all possible and if yes can this be accomplished with the current spdk
> implementation?

Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask.
I haven't written down my standard response on the mailing list yet, so I'm going to take this
opportunity to lay out our position for all to see and discuss.

Purely from a technical standpoint, it is impossible to load both the SPDK driver as it exists today
and the kernel driver against the same PCI device. The registers exposed by the PCI device contain
global state, so there can only be a single "owner". There is an established hardware mechanism,
SR-IOV, for creating multiple virtual PCI devices from a single physical device, each of which can
load its own driver. It is typically used by NICs today, and I'm not aware of any NVMe SSDs that
currently support it. In the long term, though, SR-IOV is the right solution for sharing the device
in the way you outline.

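As a small illustration of the single-owner point: on Linux, sysfs shows at most one bound driver
per PCI device as a `driver` symlink, and SR-IOV capable devices expose a `sriov_totalvfs`
attribute. The sketch below is only that, a sketch in Python. The helper names `bound_driver` and
`sriov_total_vfs` are made up for this example and are not part of SPDK, and the `sysfs_root`
parameter exists only so the helpers can be pointed at a test directory.

```python
import os


def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    `bdf` is the PCI address string (domain:bus:device.function).
    A PCI device has at most one `driver` symlink, which is why the
    kernel NVMe driver and a user space driver cannot both own it.
    """
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        return None  # no driver is currently bound
    return os.path.basename(os.readlink(link))


def sriov_total_vfs(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return how many SR-IOV virtual functions the device advertises.

    Returns 0 when the `sriov_totalvfs` attribute is absent, which is
    treated here as "the device does not support SR-IOV".
    """
    path = os.path.join(sysfs_root, bdf, "sriov_totalvfs")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0
```

Calling `bound_driver()` with the address of an NVMe SSD would be expected to report the kernel
`nvme` driver while the kernel owns the device, and a different driver once the device has been
rebound for user space use. The address to pass depends on the system.
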
In the short term, it would be technically possible to create kernel patches that add sysfs entries
or ioctls allowing a user space process to claim an NVMe hardware queue on a device that the kernel
is managing. You could then run the SPDK driver's I/O path against that queue. Unfortunately, there
are two insurmountable issues with this strategy. First, NVMe hardware queues can write to any
namespace on the device, so you couldn't enforce that the queue only writes to the namespace you
intend. You couldn't even enforce that the queue is only used for reads; you basically just have to
trust the application to do only reasonable things. Second, the device is owned by the kernel and is
therefore not in an IOMMU protection domain with this strategy. The device can directly access the
DMA engine, and with a small amount of work you could hijack that DMA engine to copy data to
wherever you wanted on the system. For these two reasons, patches of this nature would never be
accepted into the mainline kernel. The SPDK team can't be in the business of supporting patches that
have been rejected by the kernel community.

Clearly, lots of people have asked to share a device between the kernel and SPDK, so I've been
trying to uncover all of the reasons they may want to do that. So far, in every case, it boils down
to not having a filesystem for use with SPDK. I'm hoping to steer the community toward solving the
problem of not having a filesystem rather than trying to share the device. I'm not advocating for
writing a (mostly) POSIX-compliant filesystem, but I do think there is a small core of functionality
that most databases and storage applications require. These are things like allocating blocks into
some unit (I've been calling it a blob) that has a name and is persistent and rediscoverable across
reboots.
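
To make the blob idea a little more concrete, here is a rough sketch in Python. Everything in it is
an assumption made for illustration and not an existing SPDK interface: the `BlobStore` class, the
`BLOCK_SIZE` and `METADATA_BLOCKS` constants, the JSON table stored in the first blocks, and the
`bytearray` standing in for the device. It is also synchronous and single-threaded, so it shows only
the naming, allocation, and rediscovery part and says nothing about performance.

```python
import json

BLOCK_SIZE = 512       # assumed block size for this sketch
METADATA_BLOCKS = 8    # blocks reserved at the start of the device for the blob table


class BlobStore:
    """Toy blob allocator over a bytearray standing in for a block device."""

    def __init__(self, device):
        self.device = device
        self.total_blocks = len(device) // BLOCK_SIZE
        self.blobs = {}  # blob name -> list of block indices
        self._load()

    def _load(self):
        # Rediscover blobs from the on-device table, if one was written earlier.
        area = METADATA_BLOCKS * BLOCK_SIZE
        raw = bytes(self.device[:area]).rstrip(b"\x00")
        if raw:
            self.blobs = json.loads(raw.decode("utf-8"))

    def _persist(self):
        # Write the whole table back into the reserved metadata area.
        area = METADATA_BLOCKS * BLOCK_SIZE
        raw = json.dumps(self.blobs).encode("utf-8")
        if len(raw) > area:
            raise RuntimeError("blob table no longer fits in the metadata area")
        self.device[:area] = raw.ljust(area, b"\x00")

    def create(self, name, num_blocks):
        """Allocate num_blocks free blocks under a persistent name."""
        if name in self.blobs:
            raise KeyError(f"blob {name!r} already exists")
        used = {b for blocks in self.blobs.values() for b in blocks}
        free = [b for b in range(METADATA_BLOCKS, self.total_blocks) if b not in used]
        if len(free) < num_blocks:
            raise RuntimeError("not enough free blocks")
        self.blobs[name] = free[:num_blocks]
        self._persist()

    def write_block(self, name, index, data):
        """Write up to one block of data to the index-th block of a blob."""
        if len(data) > BLOCK_SIZE:
            raise ValueError("data is larger than one block")
        start = self.blobs[name][index] * BLOCK_SIZE
        self.device[start:start + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, name, index):
        """Read back the index-th block of a blob, zero padding included."""
        start = self.blobs[name][index] * BLOCK_SIZE
        return bytes(self.device[start:start + BLOCK_SIZE])
```

Constructing a second `BlobStore` over the same `bytearray` plays the role of a reboot: the new
instance finds the blobs created earlier by reading the table back from the reserved blocks.
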
Writing this layer requires some serious thought - SPDK is fast in no small part because it is
purely asynchronous, polled, and lockless - so this layer would need to preserve those
characteristics.

Sorry for the very long response, but I wanted to document my current thoughts on the mailing list
for all to see.

>
> --Tyc
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

--===============6017580835562205713==--