* [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-13 19:59 txcy uio
0 siblings, 0 replies; 5+ messages in thread
From: txcy uio @ 2016-07-13 19:59 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 424 bytes --]
Hello Ben
I have a use case where I want to attach one namespace of an NVMe device to
the SPDK driver and use the other namespace as a kernel block device to
create a regular filesystem. The current implementation of SPDK requires the
device to be unbound completely from the native kernel driver. I was
wondering whether this is possible at all and, if so, whether it can be
accomplished with the current SPDK implementation.
--Tyc
* Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-14 17:55 Walker, Benjamin
0 siblings, 0 replies; 5+ messages in thread
From: Walker, Benjamin @ 2016-07-14 17:55 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 3741 bytes --]
On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> Hello Ben
>
> I have a use case where I want to attach one namespace of a nvme device to spdk driver and use the
> other namespace as a kernel block device to create a regular filesystem. Current implementation of
> spdk requires the device to be unbound completely from the native kernel driver. I was wondering
> if this is at all possible and if yes can this be accomplished with the current spdk
> implementation?
Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask. I
haven't written down my standard response on the mailing list yet, so I'm going to take this
opportunity to lay out our position for all to see and discuss.
From a purely technical standpoint, it is impossible to load both the SPDK driver as it exists today
and the kernel driver against the same PCI device. The registers exposed by the PCI device contain
global state, so there can only be a single "owner". There is an established hardware mechanism,
SR-IOV, for creating multiple virtual PCI devices from a single physical device, each of which can
load its own driver. This is typically used by NICs today, and I'm not aware of any NVMe SSDs that
support it currently. In the long term, though, SR-IOV is the right solution for sharing the device
the way you outline.
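To illustrate the single-owner point: on Linux, a PCI function is bound to at most one driver at a
time, and that binding shows up as a `driver` symlink in the device's sysfs directory. The following
is a minimal Python sketch, not part of SPDK, that reports which driver currently owns a device. The
helper name `bound_driver`, the `sysfs_root` parameter, and the example PCI address are assumptions
made purely for illustration.

```python
import os
from typing import Optional


def bound_driver(bdf: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the name of the driver bound to PCI device `bdf`, or None.

    `bdf` is a full PCI address such as "0000:01:00.0" (hypothetical here).
    Linux exposes the binding as a symlink named `driver` inside the
    device's sysfs directory; an unbound device has no such link.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "driver")
    if not os.path.islink(link):
        return None
    # The link target ends with the driver name, e.g. ".../drivers/nvme".
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Hypothetical address; substitute one reported by `lspci -D`.
    print(bound_driver("0000:01:00.0"))
```

A device owned by the kernel NVMe driver would report `nvme`. Once it has been rebound for user
space use, it would typically report a driver such as `uio_pci_generic` or `vfio-pci` instead, and
never both at once.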
In the short term, it would be technically possible to create some kernel patches that add entries
to sysfs or provide ioctls that allow a user space process to claim an NVMe hardware queue for a
device that the kernel is managing. You could then run the SPDK driver's I/O path against that
queue. Unfortunately, there are two insurmountable issues with this strategy. First, a command
submitted on any NVMe hardware queue can address any namespace on the device. Therefore, you couldn't
enforce that the queue only writes to the namespace you intend. You couldn't even enforce that the
queue is only used for reads - you basically just have to trust the application to only do
reasonable things. Second, the device is owned by the kernel and therefore is not in an IOMMU
protection domain with this strategy. The process that owns the queue hands DMA addresses directly
to the device, and with a small amount of work, you could hijack the device's DMA engine to copy
data to wherever you wanted on the system. For these two reasons, patches of this nature would never
be accepted into the mainline kernel. The SPDK team can't be in the business of supporting patches
that have been rejected by the kernel community.
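The first issue follows from the command format itself: the namespace ID is a field in every 64-byte
submission queue entry, not a property of the queue. Here is a rough Python sketch that packs an NVM
Read command following the field layout in the NVMe specification. The function name
`build_read_sqe` and all of the sample values are made up for illustration, and this is not SPDK
code.

```python
import struct

# 64-byte NVMe submission queue entry, little-endian:
# opcode, flags (FUSE/PSDT), command ID, NSID, reserved CDW2-3,
# metadata pointer, PRP1, PRP2, CDW10..CDW15.
SQE_FORMAT = "<BBHIQQQQ6I"
NVM_OPC_READ = 0x02


def build_read_sqe(cid: int, nsid: int, slba: int, nblocks: int, prp1: int) -> bytes:
    """Pack an NVM Read command for `nblocks` blocks starting at `slba`."""
    cdw10 = slba & 0xFFFFFFFF          # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF  # starting LBA, high 32 bits
    cdw12 = (nblocks - 1) & 0xFFFF     # number of blocks is zero-based
    return struct.pack(SQE_FORMAT, NVM_OPC_READ, 0, cid, nsid, 0, 0, prp1, 0,
                       cdw10, cdw11, cdw12, 0, 0, 0)


# The same queue could carry either command; only the NSID bytes differ.
to_ns1 = build_read_sqe(cid=1, nsid=1, slba=0, nblocks=8, prp1=0x1000)
to_ns2 = build_read_sqe(cid=1, nsid=2, slba=0, nblocks=8, prp1=0x1000)
```

Nothing in the queue restricts which NSID a command may carry, and `prp1` is likewise just a
number supplied by whoever builds the command, which is where the second issue comes from.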
Clearly, lots of people have requested to share a device between the kernel and SPDK, so I've been
trying to uncover all of the reasons they may want to do that. So far, in every case, it boils down
to not having a filesystem for use with SPDK. I'm hoping to steer the community to solve the problem
of not having a filesystem rather than trying to share the device. I'm not advocating for writing a
(mostly) POSIX compliant filesystem, but I do think there is a small core of functionality that most
databases and storage applications require. These are things like allocating blocks into some
unit (I've been calling it a blob) that has a name and is persistent and rediscoverable across
reboots. Writing this layer requires some serious thought - SPDK is fast in no small part because it
is purely asynchronous, polled, and lockless - so this layer would need to preserve those
characteristics.
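To make the "blob" idea a bit more concrete, here is a toy Python sketch of the bookkeeping only:
named units made of fixed-size clusters, with metadata that can be serialized and reloaded.
Everything in it (the `ToyBlobStore` class, its methods, the JSON metadata format) is hypothetical
and is not an SPDK API. It also ignores the asynchronous, polled, and lockless requirements, which
are the hard part.

```python
import json


class ToyBlobStore:
    """Tracks which fixed-size clusters of a device belong to which named blob."""

    def __init__(self, num_clusters: int):
        self.num_clusters = num_clusters
        self.blobs = {}  # blob name -> list of cluster indices

    def _free_clusters(self):
        used = {c for clusters in self.blobs.values() for c in clusters}
        return [c for c in range(self.num_clusters) if c not in used]

    def create(self, name: str, clusters_needed: int):
        """Allocate `clusters_needed` clusters to a new blob called `name`."""
        if name in self.blobs:
            raise ValueError(f"blob {name!r} already exists")
        free = self._free_clusters()
        if len(free) < clusters_needed:
            raise RuntimeError("not enough free clusters")
        self.blobs[name] = free[:clusters_needed]
        return self.blobs[name]

    def delete(self, name: str):
        """Release all clusters held by blob `name`."""
        del self.blobs[name]

    def serialize(self) -> bytes:
        """Metadata that would be written to a reserved region of the device."""
        meta = {"num_clusters": self.num_clusters, "blobs": self.blobs}
        return json.dumps(meta, sort_keys=True).encode()

    @classmethod
    def load(cls, raw: bytes) -> "ToyBlobStore":
        """Rediscover blobs after a restart from previously written metadata."""
        meta = json.loads(raw.decode())
        store = cls(meta["num_clusters"])
        store.blobs = {name: list(c) for name, c in meta["blobs"].items()}
        return store


store = ToyBlobStore(num_clusters=16)
store.create("wal", 4)
store.create("table-0", 8)
reloaded = ToyBlobStore.load(store.serialize())
```

The point of the sketch is the shape of the interface: an application asks for a named unit of
space and can find it again by name after a reboot, without needing directories, permissions, or
the rest of POSIX.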
Sorry for the very long response, but I wanted to document my current thoughts on the mailing list
for all to see.
>
> --Tyc
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
* Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-17 14:23 Andrey Kuzmin
0 siblings, 0 replies; 5+ messages in thread
From: Andrey Kuzmin @ 2016-07-17 14:23 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 4906 bytes --]
On Thu, Jul 14, 2016, 20:55 Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:
> Clearly, lots of people have requested to share a device between the
> kernel and SPDK, so I've been trying to uncover all of the reasons they may
> want to do that. So far, in every case, it boils down to not having a
> filesystem for use with SPDK. I'm hoping to steer the community to solve
> the problem of not having a filesystem rather than trying to share the
> device.
An intriguing perspective, in this regard, is provided by the upcoming
NVMoF support in SPDK. Once this stabilizes, a remote user-space SPDK target
could be exposed via a local kernel NVMf host, whereupon any standard
filesystem may be mounted atop such a remote NVMe target. This nicely
addresses the root cause of not having a filesystem atop SPDK without any
blobstore development which, IMO, just doesn't belong here.
I'm not a networking guy, so maybe someone else on the list can say whether
the above approach to accessing a remote SPDK target might also work locally
via some sort of loop mount. If possible, that would address the root cause
in both remote and local settings.
Regards,
Andrey
> I'm not advocating for writing a (mostly) POSIX compliant filesystem, but I
> do think there is a small core of functionality that most databases or
> storage applications all require. These are things like allocating blocks
> into some unit (I've been calling it a blob) that has a name and is
> persistent and rediscoverable across reboots. Writing this layer requires
> some serious thought - SPDK is fast in no small part because it is purely
> asynchronous, polled, and lockless - so this layer would need to preserve
> those characteristics.
>
> Sorry for the very long response, but I wanted to document my current
> thoughts on the mailing list for all to see.
>
--
Regards,
Andrey
* Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-18 23:06 Walker, Benjamin
0 siblings, 0 replies; 5+ messages in thread
From: Walker, Benjamin @ 2016-07-18 23:06 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 6879 bytes --]
On Sun, 2016-07-17 at 14:23 +0000, Andrey Kuzmin wrote:
>
> An intriguing perspective, in this regard, is provided by the upcoming NVMoF support in SPDK. Once
> this stabilizes, remote user-space SPDK target could be exposed via a local kernel nvmf host,
> whereupon any standard filesystem may be mounted atop such a remote NVMe target. This nicely
> addresses the root cause of not having a filesystem atop SPDK w/o any blobstore development which,
> IMO, just doesn't belong here.
>
> I'm not a networking guy, so maybe someone else on the list can opine if the above approach to
> access a remote SPDK target might work locally via any sort of loop mount. If possible, that would
> address the root cause in both remote and local settings.
Loopback with RDMA generally works as you'd expect - that's how we do the majority of our testing on
the NVMf target today. You can indeed use the Linux kernel initiator to connect to the SPDK NVMf
target and that's again how we do all of our testing. The two pieces of code were developed
together, right alongside development of the specification. The SPDK NVMf target does not share code
with the Linux kernel for licensing reasons and we were silo'd during development from a code
standpoint, so the code in SPDK is clean BSD-licensed code.
I'm not sure using the SPDK NVMf target and connecting the Linux kernel initiator via loopback has
any use case outside of testing, though. The purpose of using SPDK locally is to avoid the kernel to
get better performance. If you are using the Linux kernel initiator, you're going through the whole
kernel stack and then additionally through the userspace stack, so you're losing all of your
performance benefit. If you do that, it's probably faster to just use the kernel to access your
local NVMe device and not use SPDK at all.
To be clear, just because it makes no sense to use the Linux kernel NVMf initiator to connect to the
SPDK NVMf target in loopback doesn't mean it doesn't make sense to use the kernel NVMf initiator to
connect to a remote SPDK NVMf target. Any single client will of course go through its local kernel
and pay the penalty, but the target itself should be able to service many times more clients for a
given amount of compute using the SPDK NVMf target compared to the Linux kernel target.
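For anyone who wants to try the kernel-initiator-to-SPDK-target path, the connection from the
initiator side is normally made with the `connect` command of nvme-cli. The Python sketch below only
assembles that command line and does not run it. The helper name `nvme_connect_argv`, the address,
and the NQN are placeholder assumptions, the target must already be listening at that address, and
the flags should be checked against your nvme-cli version.

```python
import shlex


def nvme_connect_argv(traddr: str, nqn: str, trsvcid: str = "4420",
                      transport: str = "rdma") -> list:
    """Build an nvme-cli command line for connecting to an NVMe-oF target.

    -t transport type, -a target address, -s service ID (port), -n subsystem NQN.
    """
    return ["nvme", "connect", "-t", transport, "-a", traddr,
            "-s", trsvcid, "-n", nqn]


# Placeholder target address and subsystem NQN, for illustration only.
argv = nvme_connect_argv("192.168.0.10", "nqn.2016-06.io.spdk:cnode1")
command_line = " ".join(shlex.quote(arg) for arg in argv)
```

After a successful connect, the remote namespace would appear on the client as an ordinary kernel
block device, on which a standard filesystem can be created.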
>
> Regards,
> Andrey
>
> > I'm not advocating for writing a (mostly) POSIX compliant filesystem, but I do think there is a
> > small core of functionality that most databases or storage applications all require. These are
> > things like allocating blocks into some unit (I've been calling it a blob) that has a name and
> > is persistent and rediscoverable across reboots. Writing this layer requires some serious
> > thought - SPDK is fast in no small part because it is purely asynchronous, polled, and lockless
> > - so this layer would need to preserve those characteristics.
> >
> > Sorry for the very long response, but I wanted to document my current thoughts on the mailing
> > list for all to see.
> >
* Re: [SPDK] sharing of single NVMe device between spdk userspace and native kernel driver
@ 2016-07-19 8:41 Andrey Kuzmin
0 siblings, 0 replies; 5+ messages in thread
From: Andrey Kuzmin @ 2016-07-19 8:41 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 7922 bytes --]
On Tue, Jul 19, 2016, 02:06 Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:
>
> Loopback with RDMA generally works as you'd expect - that's how we do the
> majority of our testing on the NVMf target today. You can indeed use the
> Linux kernel initiator to connect to the SPDK NVMf target and that's again
> how we do all of our testing. The two pieces of code were developed
> together, right alongside development of the specification. The SPDK NVMf
> target does not share code with the Linux kernel for licensing reasons and
> we were silo'd during development from a code standpoint, so the code in
> SPDK is clean BSD-licensed code.
>
> I'm not sure using the SPDK NVMf target and connecting the Linux kernel
> initiator via loopback has any use case outside of testing though. The
> purpose of using SPDK locally is to avoid the kernel to get better
> performance.
Totally agree. But, to get there, the user app has to talk directly to SPDK.
While that's exactly what, I believe, will happen longer term, with
developments like MongoDB's extensible storage layer, in the short term the
loopback gives those looking for a quick path toward evaluation with a
kernel filesystem a free option, with no extra development involved.
Regards,
Andrey
> If you are using the Linux kernel initiator, you're going through the whole
> kernel stack and then additionally through the userspace stack, so you're
> losing all of your performance benefit. If you do that, it's probably faster
> to just use the kernel to access your local NVMe device and not use SPDK at
> all.
>
> To be clear, just because it makes no sense to use the Linux kernel NVMf
> initiator to connect to the SPDK NVMf target in loopback doesn't mean it
> doesn't make sense to use the kernel NVMf initiator to connect to a remote
> SPDK NVMf target. Any single client will of course go through its local
> kernel and pay the penalty, but the target itself should be able to service
> many times more clients for a given amount of compute using the SPDK NVMf
> target compared to the Linux kernel target.
>
--
Regards,
Andrey