From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] FW: sharing of single NVMe device between spdk userspace and native kernel driver
Date: Thu, 14 Jul 2016 20:18:01 +0000
Message-ID: <1468527479.5999.390.camel@intel.com>
In-Reply-To: CABSNBDEYab+gyZPCmKq=_ugP_+cSfxtB6GaNCeLmrbfCiSJmzQ@mail.gmail.com
On Thu, 2016-07-14 at 11:59 -0700, txcy uio wrote:
> Thank you Ben for the detailed reply.
> A filesystem which can make use of SPDK is precisely the requirement. Everything else is a way to
> get around that. In my specific use case I wish to have a single nvme device which will have a
> rootfs as well. So such a filesystem will need to handle that as well (probably I am being too
> ambitious here).
It won't ever be possible to use SPDK as the driver for your boot device. SR-IOV would let you share
the device between the kernel (for booting) and your application, if and when NVMe devices support it.
> The only other "filesystem" that I am aware of is Ceph's bluefs which is very minimal and specific
> to Rocksdb backend.
This is the only one that I'm aware of currently as well, and it has a number of features that make
it not particularly suitable for use with SPDK (even though it does work with SPDK). The biggest
problems are around synchronous I/O operations and lack of memory pre-registration, forcing copies
on every I/O.
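
To make the contrast with synchronous I/O concrete, here is a toy Python sketch of the submit-then-poll
model. It is only an illustration under stated assumptions: PolledQueue, submit, and poll are
hypothetical names, not SPDK or bluefs APIs, and completions are simulated in memory rather than coming
from a device. The point is that submit returns immediately, the caller's buffer is handed over by
reference instead of being copied, and nothing completes until the caller polls.

```python
from collections import deque


class PolledQueue:
    """Toy stand-in for an I/O queue: submit never blocks, poll reaps completions."""

    def __init__(self):
        self._inflight = deque()

    def submit(self, buf, on_done):
        # Queue the request and return immediately: no waiting, no locks.
        # The caller's buffer is stored by reference, not copied.
        self._inflight.append((buf, on_done))

    def poll(self, max_completions=32):
        # Reap up to max_completions requests and run their callbacks.
        # (Hypothetical: a real driver would check a hardware completion queue.)
        done = 0
        while self._inflight and done < max_completions:
            buf, on_done = self._inflight.popleft()
            on_done(buf)
            done += 1
        return done


completed = []
q = PolledQueue()
payload = bytearray(b"block-0")
q.submit(payload, completed.append)
q.submit(bytearray(b"block-1"), completed.append)
pending_before_poll = len(completed)  # callbacks have not run yet
reaped = q.poll()
```

A synchronous interface would instead block inside the submit call until the request finished, which
is the behavior described above as a poor fit.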
> On a side note, if I had more than one nvme device on a system, do all the nvme devices need to
> be unbound from the kernel driver?
Each NVMe device is independent. You can use some NVMe devices with SPDK and others with the kernel
at the same time with no conflict. Our setup scripts bind or unbind all of them at once, but that's
just for convenience.
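
As a rough illustration of handling one device at a time, the Python sketch below lists the sysfs
writes that would move a single PCI device to a different driver using the kernel's generic
unbind / driver_override / drivers_probe files. This is not what the SPDK setup scripts do internally.
The PCI address 0000:01:00.0 and the choice of uio_pci_generic as the target driver are assumptions for
the example, and the helper names rebind_steps and apply_steps are hypothetical.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")


def rebind_steps(bdf, new_driver="uio_pci_generic"):
    """Return the (path, value) sysfs writes that move one PCI device to new_driver.

    Nothing is written here; the caller decides whether to apply the steps.
    Assumes the device is currently bound to some driver and that the
    new_driver module is already loaded.
    """
    dev = SYSFS_PCI / "devices" / bdf
    return [
        # Detach only this device from the driver that currently owns it.
        (dev / "driver" / "unbind", bdf),
        # Restrict the device so that only new_driver may claim it.
        (dev / "driver_override", new_driver),
        # Ask the PCI core to probe the device again, which binds new_driver.
        (SYSFS_PCI / "drivers_probe", bdf),
    ]


def apply_steps(steps):
    """Perform the writes. Requires root and changes the live system."""
    for path, value in steps:
        path.write_text(value)


if __name__ == "__main__":
    # Dry run: print the equivalent shell commands for one hypothetical device.
    for path, value in rebind_steps("0000:01:00.0"):
        print(f"echo {value} > {path}")
```

Other NVMe devices on the system are untouched by these writes and stay with the kernel driver.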
>
> --Tyc
>
> > On 7/14/16, 10:55 AM, "SPDK on behalf of Walker, Benjamin" <spdk-bounces(a)lists.01.org on behalf
> > of benjamin.walker(a)intel.com> wrote:
> >
> > >
> > >
> > >On Wed, 2016-07-13 at 12:59 -0700, txcy uio wrote:
> > >> Hello Ben
> > >>
> > >> I have a use case where I want to attach one namespace of a nvme device to spdk driver and use the
> > >> other namespace as a kernel block device to create a regular filesystem. Current implementation of
> > >> spdk requires the device to be unbound completely from the native kernel driver. I was wondering
> > >> if this is at all possible and if yes can this be accomplished with the current spdk
> > >> implementation?
> > >
> > >Your request is one we get every few days or so, and it is a perfectly reasonable thing to ask. I
> > >haven't written down my standard response on the mailing list yet, so I'm going to take this
> > >opportunity to lay out our position for all to see and discuss.
> > >
> > >From a purely technical standpoint, it is impossible to both load the SPDK driver as it exists today
> > >and the kernel driver against the same PCI device. The registers exposed by the PCI device contain
> > >global state and so there can only be a single "owner". There is an established hardware mechanism,
> > >SR-IOV, for creating multiple virtual PCI devices from a single physical device, each of which can
> > >load its own driver. This is typically used by NICs today and I'm not aware of any NVMe SSDs
> > >that support it currently. SR-IOV is the right solution for sharing the device like you outline in
> > >the long term, though.
> > >
> > >In the short term, it would be technically possible to create some kernel patches that add entries
> > >to sysfs or provide ioctls that allow a user space process to claim an NVMe hardware queue for a
> > >device that the kernel is managing. You could then run the SPDK driver's I/O path against that
> > >queue. Unfortunately, there are two insurmountable issues with this strategy. First, NVMe hardware
> > >queues can write to any namespace on the device. Therefore, you couldn't enforce that the queue can
> > >only write to the namespace you are intending. You couldn't even enforce that the queue is only used
> > >for reads - you basically just have to trust the application to only do reasonable things. Second,
> > >the device is owned by the kernel and therefore is not in an IOMMU protection domain with this
> > >strategy. The device can directly access the DMA engine, and with a small amount of work, you could
> > >hijack that DMA engine to copy data to wherever you wanted on the system. For these two reasons,
> > >patches of this nature would never be accepted into the mainline kernel. The SPDK team can't be in
> > >the business of supporting patches that have been rejected by the kernel community.
> > >
> > >Clearly, lots of people have requested to share a device between the kernel and SPDK, so I've been
> > >trying to uncover all of the reasons they may want to do that. So far, in every case, it boils down
> > >to not having a filesystem for use with SPDK. I'm hoping to steer the community to solve the problem
> > >of not having a filesystem rather than trying to share the device. I'm not advocating for writing a
> > >(mostly) POSIX compliant filesystem, but I do think there is a small core of functionality that most
> > >databases or storage applications all require. These are things like allocating blocks into some
> > >unit (I've been calling it a blob) that has a name and is persistent and rediscoverable across
> > >reboots. Writing this layer requires some serious thought - SPDK is fast in no small part because it
> > >is purely asynchronous, polled, and lockless - so this layer would need to preserve those
> > >characteristics.
> > >
> > >Sorry for the very long response, but I wanted to document my current thoughts on the mailing list
> > >for all to see.
> > >
> > >>
> > >> --Tyc
> > >>
> > >> _______________________________________________
> > >> SPDK mailing list
> > >> SPDK(a)lists.01.org
> > >> https://lists.01.org/mailman/listinfo/spdk