From: Walker, Benjamin
Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
Date: Fri, 15 Jul 2016 20:10:30 +0000
Message-ID: <1468613428.5999.404.camel@intel.com>
In-Reply-To: FEB349DD341C264E8FE626BE236D26ED1C48DF7B@SSIEXCH-MB3.ssi.samsung.com
To: spdk@lists.01.org

On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> Hi Ben,
>
> Yes, I too agree that one of the most important requirement is to have a filesystem with SPDK.
>
> What are all the known challenges to develop a filesystem with SPDK?  Are the interface/APIs
> provided by SPDK good enough?

SPDK provides only block-level access to storage devices today. At a minimum, a "filesystem" on top
of SPDK would need to provide a mechanism to dynamically allocate discontiguous physical blocks and
present them as a contiguous space that can be written to or read from in some unit (4k? 1 byte?).
I've been choosing to call that a "blob" to differentiate it from a file in the Unix sense, and I've
been calling the whole thing a "blobstore" as opposed to a filesystem. The blobstore would ensure
blobs are persistent and rediscoverable across reboots.

>
> It would be good to list down the known challenges in the mailing-list, so that community may try
> to address/discuss about them?

Beyond the very basic requirements above, I think the additional requirements depend on the
application that is using it. Some applications can tolerate only being allowed to write and read in
sector-size chunks, for instance, which is important if the blobstore wishes to implement zero copy.
Other applications need finer granularity.
Many databases don't need directories either - they can live with a flat namespace in which to place
their blobs. I think file-level permissions aren't needed either.

Functional requirements aside, in order to get the best performance possible, the blobstore would
need to be asynchronous, lockless, and polled mode. That's a real challenge due to shared metadata,
although I have a number of ideas in this area.

>
> Thanks,
>
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Thursday, July 14, 2016 1:18 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel
> driver
>
> On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > > Thank you Ben for the detailed reply.
> > A filesystem which can make use of SPDK is precisely the requirement.
> > Everything else is a way to get around that. In my specific use case I
> > wish to have a single nvme device which will have a rootfs as well. So
> > such a filesystem will need to handle that as well (probably I am being too ambitious here).
>
> It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you
> share the device between the kernel (for booting) and your application, if and/or when that
> exists.
>
> > The only other "filesystem" that I am aware of is Ceph's bluefs which
> > is very minimal and specific to Rocksdb backend.
>
> This is the only one that I'm aware of currently as well, and it has a number of features that
> make it not particularly suitable for use with SPDK (even though it does work with SPDK). The
> biggest problems are around synchronous I/O operations and lack of memory pre-registration,
> forcing copies on every I/O.
>
> > On a side note if I had more than one nvme device on a system, do
> > all the nvme devices need to be unbound from the kernel driver?
>
> Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the
> kernel at the same time with no conflict. Our setup scripts do either bind or unbind all of them
> at once, but that's just for convenience.
>
> >
> > --Tyc
> >
> > > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin"
> > > wrote:
> > >
> > > >
> > > >
> > > > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > > > Hello Ben
> > > > >
> > > > > I have a use case where I want to attach one namespace of a nvme device to spdk driver
> > > > > and use the other namespace as a kernel block device to create a regular filesystem.
> > > > > Current implementation of spdk requires the device to be unbound completely from the
> > > > > native kernel driver. I was wondering if this is at all possible and if yes can this be
> > > > > accomplished with the current spdk implementation?
> > > >
> > > > Your request is one we get every few days or so, and it is a perfectly reasonable thing to
> > > > ask. I haven't written down my standard response on the mailing list yet, so I'm going to
> > > > take this opportunity to lay out our position for all to see and discuss.
> > > >
> > > > From a purely technical standpoint, it is impossible to both load the SPDK driver as it
> > > > exists today and the kernel driver against the same PCI device. The registers exposed by
> > > > the PCI device contain global state and so there can only be a single "owner".
> > > > There is an established hardware mechanism, called SR-IOV, for creating multiple virtual
> > > > PCI devices from a single physical device, each of which can load its own driver. This is
> > > > typically used by NICs today and I'm not aware of any NVMe SSDs that support it currently.
> > > > SR-IOV is the right solution for sharing the device like you outline in the long term,
> > > > though.
> > > >
> > > > In the short term, it would be technically possible to create some kernel patches that add
> > > > entries to sysfs or provide ioctls that allow a user space process to claim an NVMe
> > > > hardware queue for a device that the kernel is managing. You could then run the SPDK
> > > > driver's I/O path against that queue. Unfortunately, there are two insurmountable issues
> > > > with this strategy. First, NVMe hardware queues can write to any namespace on the device.
> > > > Therefore, you couldn't enforce that the queue can only write to the namespace you are
> > > > intending. You couldn't even enforce that the queue is only used for reads - you basically
> > > > just have to trust the application to only do reasonable things. Second, the device is
> > > > owned by the kernel and therefore is not in an IOMMU protection domain with this strategy.
> > > > The device can directly access the DMA engine, and with a small amount of work, you could
> > > > hijack that DMA engine to copy data to wherever you wanted on the system. For these two
> > > > reasons, patches of this nature would never be accepted into the mainline kernel. The SPDK
> > > > team can't be in the business of supporting patches that have been rejected by the kernel
> > > > community.
> > > >
> > > > Clearly, lots of people have requested to share a device between the kernel and SPDK, so
> > > > I've been trying to uncover all of the reasons they may want to do that. So far, in every
> > > > case, it boils down to not having a filesystem for use with SPDK. I'm hoping to steer the
> > > > community to solve the problem of not having a filesystem rather than trying to share the
> > > > device. I'm not advocating for writing a (mostly) POSIX compliant filesystem, but I do
> > > > think there is a small core of functionality that most databases or storage applications
> > > > all require. These are things like allocating blocks into some unit (I've been calling it
> > > > a blob) that has a name and is persistent and rediscoverable across reboots. Writing this
> > > > layer requires some serious thought - SPDK is fast in no small part because it is purely
> > > > asynchronous, polled, and lockless - so this layer would need to preserve those
> > > > characteristics.
> > > >
> > > > Sorry for the very long response, but I wanted to document my current thoughts on the
> > > > mailing list for all to see.
> > > >
> > > > >
> > > > > --Tyc
> > > > >
> > > > > _______________________________________________
> > > > > SPDK mailing list
> > > > > SPDK(a)lists.01.org
> > > > > https://lists.01.org/mailman/listinfo/spdk
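
To make the "blob" idea from the top of this message a little more concrete, here is a minimal
sketch in Python of mapping a contiguous logical space onto discontiguous physical blocks. It is
only an illustration under stated assumptions: the names Blobstore, Blob, CLUSTER_BLOCKS, create,
resize, and physical_block are hypothetical and are not SPDK APIs, the cluster size is picked
arbitrarily, and everything lives in memory. Persistence across reboots and the asynchronous,
lockless, polled-mode requirements are deliberately not modeled.

```python
from dataclasses import dataclass, field

# Assumption for illustration: space is allocated in clusters of 8 device blocks.
CLUSTER_BLOCKS = 8


@dataclass
class Blob:
    """A named, logically contiguous space backed by scattered physical clusters."""
    name: str
    # Physical cluster indices, stored in logical order.
    clusters: list = field(default_factory=list)

    def physical_block(self, logical_block: int) -> int:
        """Translate a block offset inside the blob to a block on the device."""
        index, offset = divmod(logical_block, CLUSTER_BLOCKS)
        if logical_block < 0 or index >= len(self.clusters):
            raise IndexError("logical block is outside the blob")
        return self.clusters[index] * CLUSTER_BLOCKS + offset


class Blobstore:
    """Toy allocator with a flat namespace (no directories, no permissions)."""

    def __init__(self, total_clusters: int):
        self.free = set(range(total_clusters))
        self.blobs = {}

    def create(self, name: str) -> Blob:
        if name in self.blobs:
            raise KeyError(f"blob {name!r} already exists")
        blob = Blob(name)
        self.blobs[name] = blob
        return blob

    def resize(self, blob: Blob, num_clusters: int) -> None:
        """Grow the blob to num_clusters, taking the lowest free cluster each time."""
        while len(blob.clusters) < num_clusters:
            if not self.free:
                raise MemoryError("no free clusters left")
            cluster = min(self.free)
            self.free.remove(cluster)
            blob.clusters.append(cluster)


if __name__ == "__main__":
    store = Blobstore(total_clusters=4)
    a = store.create("a")
    store.resize(a, 1)
    b = store.create("b")
    store.resize(b, 1)
    store.resize(a, 2)  # blob "a" now spans two non-adjacent clusters
    print(a.clusters, [a.physical_block(i) for i in (0, 7, 8)])
```

Because blob "b" is allocated between the two growth steps of blob "a", the second cluster of "a"
is not adjacent to its first, yet callers still address "a" by a contiguous logical block number.
A real implementation would also have to store the cluster lists on the device itself so blobs can
be rediscovered after a reboot.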