* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace
@ 2016-07-18 20:49 Walker, Benjamin
  0 siblings, 0 replies; 2+ messages in thread
From: Walker, Benjamin @ 2016-07-18 20:49 UTC (permalink / raw)
  To: spdk


On Fri, 2016-07-15 at 21:07 +0000, Andrey Kuzmin wrote:
> 
> 
> On Fri, Jul 15, 2016, 23:10 Walker, Benjamin <benjamin.walker(a)intel.com> wrote:
> > On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> > > Hi Ben,
> > >
> > > Yes, I too agree that one of the most important requirements is to have a filesystem with SPDK.
> > >
> > > What are all the known challenges to develop a filesystem with SPDK?  Are the interface/APIs
> > > provided by SPDK good enough?
> > 
> > SPDK provides only block-level access to storage devices today.
> 
> 
> I don't think SPDK does even that today as it's very much (naturally) NVMe-centric, exposing a
> wealth of details no block device provides. In a kernel I/O stack, SPDK would be a protocol
> driver, with the block layer above to be added.

I mean that SPDK allows for direct access to logical blocks on the device, not that we implement an
equivalent of the Linux kernel's block layer which is responsible for command translation and
queueing. I may start using the term "sector" instead of block so that it's clear I'm not referring
to the architecture of Linux.

> 
> > At a minimum, a "filesystem" on top of SPDK would need to provide a mechanism to dynamically
> > allocate discontiguous physical blocks and present them as a contiguous space that can be
> > written to or read from in some unit (4k? 1 byte?). I've been choosing to call that a "blob" to
> > differentiate it from a file in the Unix sense, and I've been calling the whole thing a
> > "blobstore" as opposed to a filesystem. The blobstore would ensure blobs are persistent and
> > rediscoverable across reboots.
> 
> It sounds very much like a key-value store.

If by key-value store you mean something along the lines of RocksDB or the numerous other similar
projects, then no - I'm not talking about a key-value store. I think a true key-value store on top
of SPDK is a great long term goal, but there are a number of simpler intermediate layers that need
to be written first. RocksDB, for example, runs on top of a POSIX filesystem by default. It happens
to use a very minimal set of features from the filesystem, but it uses functionality in the
filesystem nonetheless, and the layer I'm talking about would replace that functionality. The key is
to keep the layers simple - key-value stores do lots of complicated things to keep I/O sequential
and batched, do compression automatically, insert CRCs, etc. that I wouldn't recommend doing in the
"blobstore" itself.

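As an illustration of the "blob" idea discussed above (discontiguous physical blocks presented as one contiguous space), here is a minimal in-memory sketch. It is not SPDK code: the names `Extent`, `Blob`, and `translate` are hypothetical, the sector-granular unit is an assumption (the thread leaves the unit open), and persistence across reboots is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    start: int   # first physical sector of the run (hypothetical layout)
    length: int  # number of consecutive sectors in the run


@dataclass
class Blob:
    # Ordered list of extents; together they form one contiguous logical space.
    extents: list = field(default_factory=list)

    def size_in_sectors(self):
        return sum(e.length for e in self.extents)

    def translate(self, logical_sector):
        """Map a logical sector within the blob to a physical sector."""
        if logical_sector < 0:
            raise IndexError("negative logical sector")
        remaining = logical_sector
        for extent in self.extents:
            if remaining < extent.length:
                return extent.start + remaining
            remaining -= extent.length
        raise IndexError("logical sector beyond end of blob")


# Three discontiguous runs on the device appear as logical sectors 0..8.
blob = Blob([Extent(100, 4), Extent(10, 2), Extent(500, 3)])
first_of_second_run = blob.translate(4)
last_sector = blob.translate(8)
```

A real implementation would also need to store this extent list on the device itself, so that the blob can be rediscovered after a reboot.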

> 
> Regards,
> Andrey
> 
> > >
> > > It would be good to list the known challenges on the mailing list, so that the community may
> > > try to address and discuss them?
> > 
> > Beyond the very basic requirements above, I think the additional requirements depend on the
> > application that is using it. Some applications can tolerate only being allowed to write and
> > read in sector-size chunks, for instance, which is important if the blobstore wishes to
> > implement zero copy. Other applications need finer granularity. Many databases don't need
> > directories either - they can live with a flat namespace in which to place their blobs. I think
> > file-level permissions aren't needed either.
> > 
> > Functional requirements aside, in order to get the best performance possible, the blobstore
> > would need to be asynchronous, lockless, and polled mode. That's a real challenge due to shared
> > metadata, although I have a number of ideas in this area.
> > 
> > >
> > > Thanks,
> > >
> > >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> > > Sent: Thursday, July 14, 2016 1:18 PM
> > > To: spdk(a)lists.01.org
> > > Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel
> > > driver
> > >
> > > On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> > > > > Thank you Ben for the detailed reply.
> > > > A filesystem which can make use of SPDK is precisely the requirement. 
> > > > Everything else is a way to get around that. In my specific use case I 
> > > > wish to have a single nvme device which will have a rootfs as well. So 
> > > > such a filesystem will need to handle that as well (probably I am being too ambitious here).
> > >
> > > It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you
> > > share the device between the kernel (for booting) and your application, if and/or when that
> > > exists.
> > >
> > > > The only other "filesystem" that I am aware of is Ceph's bluefs which 
> > > > is very minimal and specific to Rocksdb backend.
> > >
> > > This is the only one that I'm aware of currently as well, and it has a number of features that
> > > make it not particularly suitable for use with SPDK (even though it does work with SPDK). The
> > > biggest problems are around synchronous I/O operations and lack of memory pre-registration,
> > > forcing copies on every I/O.
> > >
> > > > > On a side note, if I had more than one nvme device on a system, do
> > > > > all the nvme devices need to be unbound from the kernel driver?
> > >
> > > > Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the
> > > > kernel at the same time with no conflict. Our setup scripts do either bind or unbind all of
> > > > them at once, but that's just for convenience.
> > >
> > > >
> > > > --Tyc
> > > >  
> > > > > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" 
> > > > > <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > > > > > > Hello Ben
> > > > > > >
> > > > > > > > I have a use case where I want to attach one namespace of a nvme device to spdk
> > > > > > > > driver and use the other namespace as a kernel block device to create a regular
> > > > > > > > filesystem. Current implementation of spdk requires the device to be unbound
> > > > > > > > completely from the native kernel driver. I was wondering if this is at all
> > > > > > > > possible and if yes can this be accomplished with the current spdk implementation?
> > > > > > >
> > > > > > > Your request is one we get every few days or so, and it is a perfectly reasonable thing
> > > > > > > to ask. I haven't written down my standard response on the mailing list yet, so I'm
> > > > > > > going to take this opportunity to lay out our position for all to see and discuss.
> > > > > > >
> > > > > > > From a purely technical standpoint, it is impossible to both load the SPDK driver as it
> > > > > > > exists today and the kernel driver against the same PCI device. The registers exposed by
> > > > > > > the PCI device contain global state and so there can only be a single "owner". There is
> > > > > > > an established hardware mechanism, called SR-IOV, for creating multiple virtual PCI
> > > > > > > devices from a single physical device that each can load their own driver. This is
> > > > > > > typically used by NICs today and I'm not aware of any NVMe SSDs that support it
> > > > > > > currently. SR-IOV is the right solution for sharing the device like you outline in the
> > > > > > > long term, though.
> > > > > > >
> > > > > > > In the short term, it would be technically possible to create some kernel patches that
> > > > > > > add entries to sysfs or provide ioctls that allow a user space process to claim an NVMe
> > > > > > > hardware queue for a device that the kernel is managing. You could then run the SPDK
> > > > > > > driver's I/O path against that queue. Unfortunately, there are two insurmountable
> > > > > > > issues with this strategy. First, NVMe hardware queues can write to any namespace on
> > > > > > > the device. Therefore, you couldn't enforce that the queue can only write to the
> > > > > > > namespace you are intending. You couldn't even enforce that the queue is only used for
> > > > > > > reads - you basically just have to trust the application to only do reasonable things.
> > > > > > > Second, the device is owned by the kernel and therefore is not in an IOMMU protection
> > > > > > > domain with this strategy. The device can directly access the DMA engine, and with a
> > > > > > > small amount of work, you could hijack that DMA engine to copy data to wherever you
> > > > > > > wanted on the system. For these two reasons, patches of this nature would never be
> > > > > > > accepted into the mainline kernel. The SPDK team can't be in the business of supporting
> > > > > > > patches that have been rejected by the kernel community.
> > > > > > >
> > > > > > > Clearly, lots of people have requested to share a device between the kernel and SPDK,
> > > > > > > so I've been trying to uncover all of the reasons they may want to do that. So far, in
> > > > > > > every case, it boils down to not having a filesystem for use with SPDK. I'm hoping to
> > > > > > > steer the community to solve the problem of not having a filesystem rather than trying
> > > > > > > to share the device. I'm not advocating for writing a (mostly) POSIX compliant
> > > > > > > filesystem, but I do think there is a small core of functionality that most databases
> > > > > > > or storage applications all require. These are things like allocating blocks into some
> > > > > > > unit (I've been calling it a blob) that has a name and is persistent and rediscoverable
> > > > > > > across reboots. Writing this layer requires some serious thought - SPDK is fast in no
> > > > > > > small part because it is purely asynchronous, polled, and lockless - so this layer
> > > > > > > would need to preserve those characteristics.
> > > > > > >
> > > > > > > Sorry for the very long response, but I wanted to document my current thoughts on the
> > > > > > > mailing list for all to see.
> > > > > >
> > > > > > >
> > > > > > > --Tyc
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > SPDK mailing list
> > > > > > > SPDK(a)lists.01.org
> > > > > > > https://lists.01.org/mailman/listinfo/spdk
> -- 
> Regards,
> Andrey
> 

* Re: [SPDK] FW: sharing of single NVMe device between spdk userspace
@ 2016-07-19  8:59 Andrey Kuzmin
  0 siblings, 0 replies; 2+ messages in thread
From: Andrey Kuzmin @ 2016-07-19  8:59 UTC (permalink / raw)
  To: spdk


On Mon, Jul 18, 2016, 23:49 Walker, Benjamin <benjamin.walker(a)intel.com> wrote:

> On Fri, 2016-07-15 at 21:07 +0000, Andrey Kuzmin wrote:
> >
> >
> > On Fri, Jul 15, 2016, 23:10 Walker, Benjamin <benjamin.walker(a)intel.com> wrote:
> > > On Thu, 2016-07-14 at 21:34 +0000, Raj (Rajinikanth) Pandurangan wrote:
> > > > Hi Ben,
> > > >
> > > > Yes, I too agree that one of the most important requirements is to have a filesystem with
> > > > SPDK.
> > > >
> > > > What are all the known challenges to develop a filesystem with SPDK?  Are the interface/APIs
> > > > provided by SPDK good enough?
> > >
> > > SPDK provides only block-level access to storage devices today.
> >
> >
> > I don't think SPDK does even that today as it's very much (naturally) NVMe-centric, exposing a
> > wealth of details no block device provides. In a kernel I/O stack, SPDK would be a protocol
> > driver, with the block layer above to be added.
>
> I mean that SPDK allows for direct access to logical blocks on the device, not that we implement
> an equivalent of the Linux kernel's block layer which is responsible for command translation and
> queueing. I may start using the term "sector" instead of block so that it's clear I'm not
> referring to the architecture of Linux.
>
> >
> > > At a minimum, a "filesystem" on top of SPDK would need to provide a mechanism to dynamically
> > > allocate discontiguous physical blocks and present them as a contiguous space that can be
> > > written to or read from in some unit (4k? 1 byte?). I've been choosing to call that a "blob"
> > > to differentiate it from a file in the Unix sense, and I've been calling the whole thing a
> > > "blobstore" as opposed to a filesystem. The blobstore would ensure blobs are persistent and
> > > rediscoverable across reboots.
> >
> > It sounds very much like a key-value store.
>
> If by key-value store you mean something along the lines of RocksDB or the numerous other similar
> projects, then no - I'm not talking about a key-value store.


No, I was referring to the OSD-like block device interface alternatives,
e.g. what Seagate did with Kinetic. OSD is essentially your blobstore:
arbitrary size space chunks addressed by a unique key, a flat filesystem if
you wish.
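
To make the "flat filesystem" reading concrete (arbitrary-size chunks of space addressed by a unique key), here is a small in-memory sketch. It models neither Kinetic, OSD, nor any SPDK API: the class `FlatBlobStore` and its methods are hypothetical, allocation is a naive free list of sector numbers, and nothing is persisted.

```python
class FlatBlobStore:
    """In-memory model of a flat namespace: unique key -> list of sectors."""

    def __init__(self, total_sectors):
        self.free = list(range(total_sectors))  # free sector numbers
        self.blobs = {}                         # key -> allocated sectors

    def create(self, key, num_sectors):
        if key in self.blobs:
            raise KeyError(f"blob {key!r} already exists")
        if num_sectors > len(self.free):
            raise ValueError("not enough free sectors")
        allocated = self.free[:num_sectors]
        self.free = self.free[num_sectors:]
        self.blobs[key] = allocated
        return allocated

    def delete(self, key):
        # Recycled sectors go to the end of the free list.
        self.free.extend(self.blobs.pop(key))

    def lookup(self, key):
        return self.blobs[key]


store = FlatBlobStore(total_sectors=8)
store.create("a", 3)      # takes sectors 0, 1, 2
store.create("b", 2)      # takes sectors 3, 4
store.delete("a")         # sectors 0, 1, 2 become free again
c = store.create("c", 4)  # spans the tail of the device and recycled sectors
```

After a delete, a new blob can end up on discontiguous sectors, which is why the per-blob mapping is needed at all. There are no directories and no permissions here, only keys.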

Regards,
Andrey

> I think a true key-value store on top of SPDK is a great long term goal, but there are a number of
> simpler intermediate layers that need to be written first. RocksDB, for example, runs on top of a
> POSIX filesystem by default. It happens to use a very minimal set of features from the filesystem,
> but it uses functionality in the filesystem nonetheless, and the layer I'm talking about would
> replace that functionality. The key is to keep the layers simple - key-value stores do lots of
> complicated things to keep I/O sequential and batched, do compression automatically, insert CRCs,
> etc. that I wouldn't recommend doing in the "blobstore" itself.
-- 

Regards,
Andrey
