From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============3662346224120187532=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace
Date: Mon, 18 Jul 2016 20:49:32 +0000
Message-ID: <1468874969.5999.424.camel@intel.com>
In-Reply-To: CANvN+ekx-qUskB=qv9Fjg63QWMQhawirFUg8e2nnMh_HRg08ow@mail.gmail.com
List-ID:
To: spdk@lists.01.org

--===============3662346224120187532==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Fri, 2016-07-15 at 21:07 +0000, Andrey Kuzmin wrote:
>
>
> On Fri, Jul 15, 2016, 23:10 Walker, Benjamin wrote:
> > On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> > > Hi Ben,
> > >
> > > Yes, I too agree that one of the most important requirement is to have
> > > a filesystem with SPDK.
> > >
> > > What are all the known challenges to develop a filesystem with SPDK?
> > > Are the interface/APIs provided by SPDK good enough?
> >
> > SPDK provides only block-level access to storage devices today.
>
>
> I don't think SPDK does even that today as it's very much (naturally)
> NVMe-centric, exposing a wealth of details no block device provides. In a
> kernel I/O stack, SPDK would be a protocol driver, with the block layer
> above to be added.

I mean that SPDK allows for direct access to logical blocks on the device,
not that we implement an equivalent of the Linux kernel's block layer which
is responsible for command translation and queueing. I may start using the
term "sector" instead of block so that it's clear I'm not referring to the
architecture of Linux.

>
> > At a minimum, a "filesystem" on top of SPDK would need to provide a
> > mechanism to dynamically allocate discontiguous physical blocks and
> > present them as a contiguous space that can be written to or read from
> > in some unit (4k? 1 byte?).
> > I've been choosing to call that a "blob" to differentiate it from a file
> > in the Unix sense, and I've been calling the whole thing a "blobstore"
> > as opposed to a filesystem. The blobstore would ensure blobs are
> > persistent and rediscoverable across reboots.
>
> It sounds very much like a key-value store.

If by key-value store you mean something along the lines of RocksDB or the
numerous other similar projects, then no - I'm not talking about a key-value
store. I think a true key-value store on top of SPDK is a great long term
goal, but there are a number of simpler intermediate layers that need to be
written first. RocksDB, for example, runs on top of a POSIX filesystem by
default. It happens to use a very minimal set of features from the
filesystem, but it uses functionality in the filesystem nonetheless, and the
layer I'm talking about would replace that functionality. The key is to keep
the layers simple - key-value stores do lots of complicated things to keep
I/O sequential and batched, do compression automatically, insert CRCs, etc.
that I wouldn't recommend doing in the "blobstore" itself.

>
> Regards,
> Andrey
>
> > >
> > > It would be good to list down the known challenges in the
> > > mailing-list, so that community may try to address/discuss about them?
> >
> > Beyond the very basic requirements above, I think the additional
> > requirements depend on the application that is using it. Some
> > applications can tolerate only being allowed to write and read in
> > sector-size chunks, for instance, which is important if the blobstore
> > wishes to implement zero copy. Other applications need finer
> > granularity. Many databases don't need directories either - they can
> > live with a flat namespace in which to place their blobs. I think
> > file-level permissions aren't needed either.
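
To make the "blob" idea above a little more concrete, here is a rough
in-memory sketch of the bookkeeping it implies: named blobs in a flat
namespace, each one a contiguous logical space backed by whatever physical
sectors happen to be free. Every name in it (ToyBlobstore, Blob, SECTOR_SIZE,
translate) is made up for illustration and is not an SPDK interface. The 4k
unit is only an assumption, and persistence across reboots is left out.

```python
from dataclasses import dataclass, field

SECTOR_SIZE = 4096  # assumed I/O unit; the thread leaves the unit open


@dataclass
class Blob:
    """A named run of logical sectors backed by scattered physical sectors."""
    name: str
    sectors: list = field(default_factory=list)  # logical index -> physical sector


class ToyBlobstore:
    """Hypothetical in-memory stand-in; nothing here is persisted."""

    def __init__(self, total_sectors):
        self.free = list(range(total_sectors))  # free physical sectors
        self.blobs = {}                         # flat namespace: name -> Blob

    def create(self, name):
        if name in self.blobs:
            raise KeyError(f"blob {name!r} already exists")
        self.blobs[name] = Blob(name)
        return self.blobs[name]

    def grow(self, name, count):
        """Append `count` sectors, taking whichever physical sectors are free."""
        if count > len(self.free):
            raise MemoryError("out of space")
        blob = self.blobs[name]
        for _ in range(count):
            blob.sectors.append(self.free.pop(0))

    def delete(self, name):
        """Drop a blob and return its sectors to the free list."""
        self.free.extend(self.blobs.pop(name).sectors)

    def translate(self, name, byte_offset):
        """Map a blob-relative byte offset to (physical sector, offset in sector)."""
        logical, within = divmod(byte_offset, SECTOR_SIZE)
        return self.blobs[name].sectors[logical], within


store = ToyBlobstore(total_sectors=8)
store.create("a")
store.grow("a", 2)
store.create("b")
store.grow("b", 2)
store.delete("a")    # frees the two sectors in front of "b"
store.grow("b", 5)   # "b" now wraps around onto a freed sector
```

After the last call, "b" is contiguous from the caller's point of view but
its final logical sector sits on a physical sector that used to belong to
"a". A real layer would also have to write this map to the device so the
blobs can be rediscovered after a reboot.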
> >
> > Functional requirements aside, in order to get the best performance
> > possible, the blobstore would need to be asynchronous, lockless, and
> > polled mode. That's a real challenge due to shared metadata, although I
> > have a number of ideas in this area.
> >
> > >
> > > Thanks,
> > >
> > >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> > > Sent: Thursday, July 14, 2016 1:18 PM
> > > To: spdk(a)lists.01.org
> > > Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk
> > > userspace and native kernel driver
> > >
> > > On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > > > > Thank you Ben for the detailed reply.
> > > > A filesystem which can make use of SPDK is precisely the requirement.
> > > > Everything else is a way to get around that. In my specific use case
> > > > I wish to have a single nvme device which will have a rootfs as
> > > > well. So such a filesystem will need to handle that as well (probably
> > > > I am being too ambitious here).
> > >
> > > It won't ever be possible to use SPDK as the driver for your boot
> > > device. SR-IOV would let you share the device between the kernel (for
> > > booting) and your application, if and/or when that exists.
> > >
> > > > The only other "filesystem" that I am aware of is Ceph's bluefs
> > > > which is very minimal and specific to Rocksdb backend.
> > >
> > > This is the only one that I'm aware of currently as well, and it has a
> > > number of features that make it not particularly suitable for use with
> > > SPDK (even though it does work with SPDK). The biggest problems are
> > > around synchronous I/O operations and lack of memory pre-registration,
> > > forcing copies on every I/O.
> > >
> > > > On a side note if I had more than one nvme device on a system, do
> > > > all the nvme devices need to be unbound from the kernel driver?
> > >
> > > Each NVMe device is independent. You can use some NVMe devices with
> > > SPDK and others with the kernel at the same time with no conflict. Our
> > > setup scripts do either bind or unbind all of them at once, but that's
> > > just for convenience.
> > >
> > > >
> > > > --Tyc
> > > >
> > > > > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" wrote:
> > > > >
> > > > > >
> > > > > > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > > > > > Hello Ben
> > > > > > >
> > > > > > > I have a use case where I want to attach one namespace of a
> > > > > > > nvme device to spdk driver and use the other namespace as a
> > > > > > > kernel block device to create a regular filesystem. Current
> > > > > > > implementation of spdk requires the device to be unbound
> > > > > > > completely from the native kernel driver. I was wondering if
> > > > > > > this is at all possible and if yes can this be accomplished
> > > > > > > with the current spdk implementation?
> > > > > >
> > > > > > Your request is one we get every few days or so, and it is a
> > > > > > perfectly reasonable thing to ask. I haven't written down my
> > > > > > standard response on the mailing list yet, so I'm going to take
> > > > > > this opportunity to lay out our position for all to see and
> > > > > > discuss.
> > > > > >
> > > > > > From a purely technical standpoint, it is impossible to both load
> > > > > > the SPDK driver as it exists today and the kernel driver against
> > > > > > the same PCI device. The registers exposed by the PCI device
> > > > > > contain global state and so there can only be a single "owner".
> > > > > > There is an established hardware mechanism for creating multiple
> > > > > > virtual PCI devices from a single physical device that each can
> > > > > > load their own driver called SR-IOV. This is typically used by
> > > > > > NICs today and I'm not aware of any NVMe SSDs that support it
> > > > > > currently. SR-IOV is the right solution for sharing the device
> > > > > > like you outline in the long term, though.
> > > > > >
> > > > > > In the short term, it would be technically possible to create
> > > > > > some kernel patches that add entries to sysfs or provide ioctls
> > > > > > that allow a user space process to claim an NVMe hardware queue
> > > > > > for a device that the kernel is managing. You could then run the
> > > > > > SPDK driver's I/O path against that queue. Unfortunately, there
> > > > > > are two insurmountable issues with this strategy. First, NVMe
> > > > > > hardware queues can write to any namespace on the device.
> > > > > > Therefore, you couldn't enforce that the queue can only write to
> > > > > > the namespace you are intending. You couldn't even enforce that
> > > > > > the queue is only used for reads - you basically just have to
> > > > > > trust the application to only do reasonable things. Second, the
> > > > > > device is owned by the kernel and therefore is not in an IOMMU
> > > > > > protection domain with this strategy. The device can directly
> > > > > > access the DMA engine, and with a small amount of work, you could
> > > > > > hijack that DMA engine to copy data to wherever you wanted on the
> > > > > > system. For these two reasons, patches of this nature would never
> > > > > > be accepted into the mainline kernel.
> > > > > > The SPDK team can't be in the business of supporting patches that
> > > > > > have been rejected by the kernel community.
> > > > > >
> > > > > > Clearly, lots of people have requested to share a device between
> > > > > > the kernel and SPDK, so I've been trying to uncover all of the
> > > > > > reasons they may want to do that. So far, in every case, it boils
> > > > > > down to not having a filesystem for use with SPDK. I'm hoping to
> > > > > > steer the community to solve the problem of not having a
> > > > > > filesystem rather than trying to share the device. I'm not
> > > > > > advocating for writing a (mostly) POSIX compliant filesystem, but
> > > > > > I do think there is a small core of functionality that most
> > > > > > databases or storage applications all require. These are things
> > > > > > like allocating blocks into some unit (I've been calling it a
> > > > > > blob) that has a name and is persistent and rediscoverable across
> > > > > > reboots. Writing this layer requires some serious thought - SPDK
> > > > > > is fast in no small part because it is purely asynchronous,
> > > > > > polled, and lockless - so this layer would need to preserve those
> > > > > > characteristics.
> > > > > >
> > > > > > Sorry for the very long response, but I wanted to document my
> > > > > > current thoughts on the mailing list for all to see.
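
As a loose illustration of what "asynchronous, polled, and lockless" means
for a caller, the toy below submits reads that return immediately, then
spins on a poll call that runs completion callbacks. It is only a sketch
under stated assumptions: ToyQueuePair, submit_read and poll are
hypothetical names rather than SPDK functions, the "device" completes
everything instantly and in order, and the lack of locks comes from the
assumption that each queue is touched by exactly one thread.

```python
from collections import deque


class ToyQueuePair:
    """Hypothetical stand-in for one I/O queue owned by exactly one thread."""

    def __init__(self):
        self._inflight = deque()  # (callback, data) pairs awaiting completion

    def submit_read(self, sector, callback):
        """Queue a read and return immediately; nothing blocks here."""
        data = f"data@{sector}"  # pretend the device produced this
        self._inflight.append((callback, data))

    def poll(self, max_completions=32):
        """Run callbacks for finished I/O and return how many completed."""
        done = 0
        while self._inflight and done < max_completions:
            callback, data = self._inflight.popleft()
            callback(data)
            done += 1
        return done


results = []
qp = ToyQueuePair()              # one queue per thread, so no locks are taken
for sector in (7, 3, 9):
    qp.submit_read(sector, results.append)

while len(results) < 3:          # the application polls instead of sleeping
    qp.poll(max_completions=2)
```

The hard part mentioned in the thread, shared metadata, is exactly what this
toy avoids: as soon as two threads need to update the same allocation map,
keeping the design lock-free takes real thought.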
> > > > > >
> > > > > > >
> > > > > > > --Tyc
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > SPDK mailing list
> > > > > > > SPDK(a)lists.01.org
> > > > > > > https://lists.01.org/mailman/listinfo/spdk
> >
> --
> Regards,
> Andrey
>

--===============3662346224120187532==--