* No subject
@ 2014-11-10 3:11 Libo Chen
0 siblings, 0 replies; 19+ messages in thread
From: Libo Chen @ 2014-11-10 3:11 UTC (permalink / raw)
* No subject
@ 2014-11-10 6:39 Libo Chen
0 siblings, 0 replies; 19+ messages in thread
From: Libo Chen @ 2014-11-10 6:39 UTC (permalink / raw)
* No subject
@ 2017-02-07 0:22 Scott Bauer
2017-02-07 0:46 ` Jens Axboe
0 siblings, 1 reply; 19+ messages in thread
From: Scott Bauer @ 2017-02-07 0:22 UTC (permalink / raw)
I screwed up and had size_t's in the uapi structures which of course
differ in size on 32 and 64 bit platforms. This caused issues running
32 bit userland on a 64 bit kernel. We're hoping to sneak this
patch in so we don't have to maintain a compat layer.
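Roughly, the kind of change involved looks like this (struct and field names
below are made up for illustration, not the actual uapi structures from the patch):

#include <linux/types.h>

/* Before: size_t is 4 bytes for 32-bit userland but 8 bytes in a 64-bit
 * kernel, so the layout seen across the ioctl boundary differs. */
struct opal_key_bad {
	size_t	key_len;	/* variable width: breaks 32-bit compat */
	__u8	key[256];
};

/* After: fixed-width types keep the layout identical everywhere. */
struct opal_key_fixed {
	__u32	key_len;
	__u8	key[256];
};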
* No subject
2017-02-07 0:22 Scott Bauer
@ 2017-02-07 0:46 ` Jens Axboe
0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2017-02-07 0:46 UTC (permalink / raw)
On 02/06/2017 05:22 PM, Scott Bauer wrote:
> I screwed up and had size_t's in the uapi structures which of course
> differ in size on 32 and 64 bit platforms. This caused issues running
> 32 bit userland on a 64 bit kernel. We're hoping to sneak this
> patch in so we don't have to maintain a compat layer.
I'll apply it - but you forgot to add a Fixes tag; always add one when a patch
fixes something that a specific commit introduced.
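For reference, the tag goes in the commit message and follows this format
(placeholders below, since the offending commit isn't named here):

Fixes: <first 12+ characters of the sha1> ("the offending commit's one-line subject")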
--
Jens Axboe
* No subject
@ 2017-04-21 23:23 Sandeep Mann
0 siblings, 0 replies; 19+ messages in thread
From: Sandeep Mann @ 2017-04-21 23:23 UTC (permalink / raw)
Adding linux-nvme at lists.infradead.org
* No subject
@ 2017-09-13 18:15 unmesh rathi
0 siblings, 0 replies; 19+ messages in thread
From: unmesh rathi @ 2017-09-13 18:15 UTC (permalink / raw)
* No subject
@ 2018-02-02 6:54 Jianchao Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jianchao Wang @ 2018-02-02 6:54 UTC (permalink / raw)
Hi Christoph, Keith and Sagi,
Please consider and comment on the following patchset;
that would be really appreciated.
There is a complicated relationship between nvme_timeout and nvme_dev_disable.
- nvme_timeout has to invoke nvme_dev_disable to stop the
controller from doing DMA access before freeing the request.
- nvme_dev_disable has to depend on nvme_timeout to complete
adminq requests (set HMB, delete sq/cq) when the controller
has no response.
- nvme_dev_disable will race with nvme_timeout when it cancels the
outstanding requests.
We have found some issues introduced by this; please refer to the following links:
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html
Even so, we cannot ensure there are no other issues.
The best way to fix them is to break up the relationship between the two.
With this patchset, we avoid having nvme_dev_disable invoked
by nvme_timeout and eliminate the race between nvme_timeout and
nvme_dev_disable on outstanding requests.
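To make the entanglement concrete, here is a simplified sketch of the control
flow (illustrative pseudocode only; signatures and bodies are heavily reduced,
this is not the actual kernel code):

static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
	/* The controller may still be doing DMA into this request, so it
	 * must be disabled before the request can be completed/freed ... */
	nvme_dev_disable(dev, false);
	return BLK_EH_HANDLED;
}

static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
	/* ... but disabling sends adminq commands (delete sq/cq, set HMB)
	 * that rely on nvme_timeout() when the controller has no response ... */
	nvme_disable_io_queues(dev);
	nvme_set_host_mem(dev, 0);

	/* ... and cancelling the outstanding requests here races with
	 * nvme_timeout() running on those very same requests. */
	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
}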
There are 6 patches:
The 1st to 3rd patches do some preparation for the 4th one.
The 4th avoids having nvme_dev_disable invoked by nvme_timeout and implements
the synchronization between them. For more details, please refer to the comment
of that patch.
The 5th fixes a bug introduced by the 4th patch: it ensures nvme_delete_io_queues
can only be woken up by the completion path.
The 6th fixes a bug found during testing; it is not related to the 4th patch.
This patchset has been tested with a debug patch for some days,
and some bugfixes have been made.
The debug patch and the other patches are available in the following git branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_test
Jianchao Wang (6)
0001-nvme-pci-move-clearing-host-mem-behind-stopping-queu.patch
0002-nvme-pci-fix-the-freeze-and-quiesce-for-shutdown-and.patch
0003-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0004-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0005-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
0006-nvme-pci-suspend-queues-based-on-online_queues.patch
The diffstat follows:
block/blk-mq.c | 3 +-
drivers/nvme/host/pci.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
include/linux/blk-mq.h | 1 +
3 files changed, 169 insertions(+), 60 deletions(-)
Thanks
Jianchao
* No subject
@ 2018-10-05 13:39 Christoph Hellwig
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-10-05 13:39 UTC (permalink / raw)
Subject: [GIT PULL] nvme updates for 4.20
A relatively boring merge window:
- better AEN tracing (Chaitanya)
- NUMA aware PCIe multipathing (me)
- RDMA workqueue fixes (Sagi)
- better bio usage in the target (Sagi)
- FC rework for target removal (James)
- better multipath handling of ->queue_rq failures (James)
- various cleanups (Milan)
The following changes since commit c0aac682fa6590cb660cb083dbc09f55e799d2d2:
Merge tag 'v4.19-rc6' into for-4.20/block (2018-10-01 08:58:57 -0600)
are available in the Git repository at:
git://git.infradead.org/nvme.git nvme-4.20
for you to fetch changes up to 2acf70ade79d26b97611a8df52eb22aa33814cd4:
nvmet-rdma: use a private workqueue for delete (2018-10-05 09:25:18 +0200)
----------------------------------------------------------------
Chaitanya Kulkarni (2):
nvmet: remove redundant module prefix
nvme-core: add async event trace helper
Christoph Hellwig (1):
nvme: take node locality into account when selecting a path
James Smart (3):
nvmet_fc: support target port removal with nvmet layer
nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
Milan P. Gandhi (2):
nvme: fix typo in nvme_identify_ns_descs
nvme-fc: fix for a minor typos
Sagi Grimberg (2):
nvmet: don't split large I/Os unconditionally
nvmet-rdma: use a private workqueue for delete
drivers/nvme/host/core.c | 20 ++++--
drivers/nvme/host/fabrics.c | 7 +-
drivers/nvme/host/fc.c | 108 +++++++++++++++++++++++++++----
drivers/nvme/host/multipath.c | 57 +++++++++++++----
drivers/nvme/host/nvme.h | 25 +++-----
drivers/nvme/host/trace.h | 28 ++++++++
drivers/nvme/target/admin-cmd.c | 2 +-
drivers/nvme/target/fc.c | 130 +++++++++++++++++++++++++++++++++++---
drivers/nvme/target/io-cmd-bdev.c | 9 ++-
drivers/nvme/target/nvmet.h | 1 +
drivers/nvme/target/rdma.c | 19 ++++--
include/linux/nvme.h | 1 +
12 files changed, 347 insertions(+), 60 deletions(-)
* No subject
@ 2019-03-19 14:41 Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2019-03-19 14:41 UTC (permalink / raw)
Date: Tue, 19 Mar 2019 14:45:45 +0200
Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
Hi everyone!
In this patch series, I would like to introduce my take on the problem of
making storage virtualization as fast as possible, with an emphasis on low latency.
I implemented a kernel, vfio-based mediated device that
allows the user to pass through a partition and/or a whole namespace to a guest.
The idea behind this driver is based on the paper you can find at
https://www.usenix.org/conference/atc18/presentation/peng,
although note that I started the development independently,
prior to reading this paper.
In addition, the implementation is not based on the code used in the paper, as
I was not able to get access to its source at that time.
***Key points about the implementation:***
* A polling kernel thread is used. Polling is stopped after a
predefined timeout (1/2 sec by default).
Support for a fully interrupt-driven mode is planned, and it shows promising results.
* The guest sees a standard NVMe device - this allows running guests with
unmodified drivers, for example Windows guests.
* The NVMe device is shared between host and guest.
That means that even a single namespace can be split between host
and guest based on different partitions.
* Simple configuration
*** Performance ***
Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2,
and both latency and throughput are very similar to SPDK.
Soon I will test this on a better server and NVMe device and provide
more formal performance numbers.
Latency numbers:
~80ms - spdk with fio plugin on the host
~84ms - nvme driver on the host
~87ms - mdev-nvme + nvme driver in the guest
Throughput followed a similar pattern as well.
* Configuration example
$ modprobe nvme mdev_queues=4
$ modprobe nvme-mdev
$ UUID=$(uuidgen)
$ DEVICE='device pci address'
$ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
$ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace #attach host namespace 1, partition 3
$ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu #pin the io thread to cpu 11
Afterwards, boot qemu with:
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
Zero configuration on the guest.
*** FAQ ***
* Why do this in the kernel? Why is this better than SPDK?
-> Reuse the existing nvme kernel driver in the host. No new drivers in the guest.
-> Share the NVMe device between host and guest.
Even in fully virtualized configurations,
some partitions of an NVMe device could be used by guests as block devices
while others are passed through with nvme-mdev, to achieve a balance between
all the features of full IO stack emulation and performance.
-> NVME-MDEV is a bit faster because the in-kernel driver
can send interrupts to the guest directly, without a context
switch that can be expensive due to Meltdown mitigation.
-> It is able to utilize interrupts to get reasonable performance.
This is only implemented
as a proof of concept and not included in the patches,
but the interrupt-driven mode shows reasonable performance.
-> This is a framework that can later be used to support NVMe devices
with more of the IO virtualization built in
(an IOMMU with PASID support coupled with a device that supports it).
* Why attach directly to the nvme-pci driver and not use block layer IO?
-> The direct attachment allows for better performance, but I will
check the possibility of using block IO, especially for fabrics drivers.
*** Implementation notes ***
* All guest memory is mapped into the physical NVMe device,
but not 1:1 as vfio-pci would do.
This allows very efficient DMA.
To support this, patch 2 adds the ability for an mdev device to listen to the
guest's memory map events.
Any such memory is immediately pinned and then DMA mapped.
(Support for fabric drivers where this is not possible exists too;
in that case the fabric driver will do its own DMA mapping.)
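Very roughly, and with hypothetical helper and field names (this is only a
sketch, not the code in the patches), the handling of such a map event is:

static int nvme_mdev_map_region(struct nvme_mdev_viommu *viommu,
				u64 guest_iova, unsigned long user_va,
				unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page;
		dma_addr_t dma;

		/* Pin the userspace page backing this chunk of guest RAM
		 * (pin_one_guest_page() stands in for get_user_pages here). */
		page = pin_one_guest_page(user_va + i * PAGE_SIZE);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* DMA map it so the physical controller can reach it. */
		dma = dma_map_page(viommu->dma_dev, page, 0, PAGE_SIZE,
				   DMA_BIDIRECTIONAL);

		/* Remember guest IOVA -> DMA address for later translation
		 * of PRPs in guest-submitted commands. */
		viommu_add_translation(viommu, guest_iova + i * PAGE_SIZE, dma);
	}
	return 0;
}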
* The nvme core driver is modified to announce the appearance
and disappearance of nvme controllers and namespaces;
the nvme-mdev driver subscribes to these notifications.
* The nvme-pci driver is modified to expose a raw interface for attaching to
and sending/polling the IO queues.
This allows the mdev driver to submit/poll for IO very efficiently.
By default one host queue is used per mediated device.
(Support for other fabric-based host drivers is planned.)
* nvme-mdev doesn't assume the presence of KVM, thus any VFIO user, including
SPDK, a qemu running with TCG, etc., can use this virtual device.
*** Testing ***
The device was tested with stock QEMU 3.0 on the host;
the host was running a 5.0 kernel with nvme-mdev added, on the following hardware:
* QEMU nvme virtual device (with nested guest)
* Intel DC P3700 on Xeon E5-2620 v2 server
* Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
* Lenovo NVME device found in my laptop
The guest was tested with kernels 4.16, 4.18, 4.20 and
the same custom-compiled 5.0 kernel.
A Windows 10 guest was tested too, with both Microsoft's inbox driver and the
open source community NVMe driver
(https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)
Testing was mostly done on x86_64, but a 32-bit host/guest combination
was lightly tested too.
In addition to that, the virtual device was tested with a nested guest,
by passing the virtual device to it
using pci passthrough, the qemu userspace nvme driver, and spdk.
PS: I used to contribute to the kernel as a hobby using the
maximlevitsky at gmail.com address
Maxim Levitsky (9):
vfio/mdev: add .request callback
nvme/core: add some more values from the spec
nvme/core: add NVME_CTRL_SUSPENDED controller state
nvme/pci: use the NVME_CTRL_SUSPENDED state
nvme/pci: add known admin effects to augument admin effects log page
nvme/pci: init shadow doorbell after each reset
nvme/core: add mdev interfaces
nvme/core: add nvme-mdev core driver
nvme/pci: implement the mdev external queue allocation interface
MAINTAINERS | 5 +
drivers/nvme/Kconfig | 1 +
drivers/nvme/Makefile | 1 +
drivers/nvme/host/core.c | 149 +++++-
drivers/nvme/host/nvme.h | 55 ++-
drivers/nvme/host/pci.c | 385 ++++++++++++++-
drivers/nvme/mdev/Kconfig | 16 +
drivers/nvme/mdev/Makefile | 5 +
drivers/nvme/mdev/adm.c | 873 ++++++++++++++++++++++++++++++++++
drivers/nvme/mdev/events.c | 142 ++++++
drivers/nvme/mdev/host.c | 491 +++++++++++++++++++
drivers/nvme/mdev/instance.c | 802 +++++++++++++++++++++++++++++++
drivers/nvme/mdev/io.c | 563 ++++++++++++++++++++++
drivers/nvme/mdev/irq.c | 264 ++++++++++
drivers/nvme/mdev/mdev.h | 56 +++
drivers/nvme/mdev/mmio.c | 591 +++++++++++++++++++++++
drivers/nvme/mdev/pci.c | 247 ++++++++++
drivers/nvme/mdev/priv.h | 700 +++++++++++++++++++++++++++
drivers/nvme/mdev/udata.c | 390 +++++++++++++++
drivers/nvme/mdev/vcq.c | 207 ++++++++
drivers/nvme/mdev/vctrl.c | 514 ++++++++++++++++++++
drivers/nvme/mdev/viommu.c | 322 +++++++++++++
drivers/nvme/mdev/vns.c | 356 ++++++++++++++
drivers/nvme/mdev/vsq.c | 178 +++++++
drivers/vfio/mdev/vfio_mdev.c | 11 +
include/linux/mdev.h | 4 +
include/linux/nvme.h | 88 +++-
27 files changed, 7375 insertions(+), 41 deletions(-)
create mode 100644 drivers/nvme/mdev/Kconfig
create mode 100644 drivers/nvme/mdev/Makefile
create mode 100644 drivers/nvme/mdev/adm.c
create mode 100644 drivers/nvme/mdev/events.c
create mode 100644 drivers/nvme/mdev/host.c
create mode 100644 drivers/nvme/mdev/instance.c
create mode 100644 drivers/nvme/mdev/io.c
create mode 100644 drivers/nvme/mdev/irq.c
create mode 100644 drivers/nvme/mdev/mdev.h
create mode 100644 drivers/nvme/mdev/mmio.c
create mode 100644 drivers/nvme/mdev/pci.c
create mode 100644 drivers/nvme/mdev/priv.h
create mode 100644 drivers/nvme/mdev/udata.c
create mode 100644 drivers/nvme/mdev/vcq.c
create mode 100644 drivers/nvme/mdev/vctrl.c
create mode 100644 drivers/nvme/mdev/viommu.c
create mode 100644 drivers/nvme/mdev/vns.c
create mode 100644 drivers/nvme/mdev/vsq.c
--
2.17.2
* No subject
2019-03-19 14:41 Maxim Levitsky
@ 2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08 ` Maxim Levitsky
0 siblings, 1 reply; 19+ messages in thread
From: Felipe Franciosi @ 2019-03-20 11:03 UTC (permalink / raw)
> On Mar 19, 2019,@2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
>
> Hi everyone!
>
> In this patch series, I would like to introduce my take on the problem of doing
> as fast as possible virtualization of storage with emphasis on low latency.
>
> In this patch series I implemented a kernel vfio based, mediated device that
> allows the user to pass through a partition and/or whole namespace to a guest.
Hey Maxim!
I'm really excited to see this series, as it aligns to some extent with what we discussed in last year's KVM Forum VFIO BoF.
There's no arguing that we need a better story to efficiently virtualise NVMe devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best attempt at that. However, I seem to recall there was some pushback from qemu-devel in the sense that they would rather see investment in virtio-blk. I'm not sure what's the latest on that work and what are the next steps.
The pushback drove the discussion towards pursuing an mdev approach, which is why I'm excited to see your patches.
What I'm thinking is that passing through namespaces or partitions is very restrictive. It leaves no room to implement more elaborate virtualisation stacks like replicating data across multiple devices (local or remote), storage migration, software-managed thin provisioning, encryption, deduplication, compression, etc. In summary, anything that requires software intervention in the datapath. (Worth noting: vhost-user-nvme allows all of that to be easily done in SPDK's bdev layer.)
These complicated stacks should probably not be implemented in the kernel, though. So I'm wondering whether we could talk about mechanisms to allow efficient and performant userspace datapath intervention in your approach or pursue a mechanism to completely offload the device emulation to userspace (and align with what SPDK has to offer).
Thoughts welcome!
Felipe
* No subject
2019-03-20 11:03 ` Felipe Franciosi
@ 2019-03-20 19:08 ` Maxim Levitsky
2019-03-21 16:12 ` Stefan Hajnoczi
0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2019-03-20 19:08 UTC (permalink / raw)
On Wed, 2019-03-20@11:03 +0000, Felipe Franciosi wrote:
> > On Mar 19, 2019,@2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> >
> > Hi everyone!
> >
> > In this patch series, I would like to introduce my take on the problem of
> > doing
> > as fast as possible virtualization of storage with emphasis on low latency.
> >
> > In this patch series I implemented a kernel vfio based, mediated device
> > that
> > allows the user to pass through a partition and/or whole namespace to a
> > guest.
>
> Hey Maxim!
>
> I'm really excited to see this series, as it aligns to some extent with what
> we discussed in last year's KVM Forum VFIO BoF.
>
> There's no arguing that we need a better story to efficiently virtualise NVMe
> devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best
> attempt at that. However, I seem to recall there was some pushback from qemu-
> devel in the sense that they would rather see investment in virtio-blk. I'm
> not sure what's the latest on that work and what are the next steps.
I agree with that. All my benchmarks were against his vhost-user-nvme driver, and
I am able to get pretty much the same throughput and latency.
The SSD I tested on died just recently (Murphy's law), not due to a bug in my driver
but to some internal fault (even though most of my tests were reads, plus
occasional 'nvme format's).
We are in the process of buying a replacement.
>
> The pushback drove the discussion towards pursuing an mdev approach, which is
> why I'm excited to see your patches.
>
> What I'm thinking is that passing through namespaces or partitions is very
> restrictive. It leaves no room to implement more elaborate virtualisation
> stacks like replicating data across multiple devices (local or remote),
> storage migration, software-managed thin provisioning, encryption,
> deduplication, compression, etc. In summary, anything that requires software
> intervention in the datapath. (Worth noting: vhost-user-nvme allows all of
> that to be easily done in SPDK's bdev layer.)
Hi Felipe!
I guess that my driver is not geared toward the more complicated use cases you
mentioned; instead it is focused on getting the fastest possible performance for
the common case.
One thing that I can do which would solve several of the above problems is to
accept a map between virtual and real logical blocks, pretty much in exactly
the same way as EPT does it (see the sketch below).
Then userspace can map any portion of the device anywhere, while still keeping
the dataplane in the kernel, and having minimal overhead.
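Conceptually (a hypothetical sketch, not something in the current patches),
such a map could be as simple as:

/* One extent: guest LBAs [virt_slba, virt_slba + len) are backed by host
 * LBAs starting at phys_slba. Userspace installs an array of extents. */
struct lba_extent {
	u64 virt_slba;
	u64 phys_slba;
	u64 len;
};

/* The kernel dataplane then only does a lookup plus an offset add per IO,
 * which keeps the fast path cheap. */
static u64 translate_lba(const struct lba_extent *map, int n, u64 virt_lba)
{
	int i;

	for (i = 0; i < n; i++)
		if (virt_lba >= map[i].virt_slba &&
		    virt_lba < map[i].virt_slba + map[i].len)
			return map[i].phys_slba + (virt_lba - map[i].virt_slba);
	return ~0ULL; /* not mapped - reject the IO */
}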
On top of that, note that the direction of IO virtualization is to do the dataplane
in hardware, which will probably give you even worse partition granularity /
features but will be the fastest option available,
like for instance SR-IOV, which already exists and just allows splitting by
namespaces without any more fine-grained control.
Think of nvme-mdev as a very low level driver, which currently uses polling, but
eventually will use a PASID-based IOMMU to provide the guest with a raw PCI device.
The userspace / qemu can build on top of that with various software layers.
On top of that, I am thinking of solving the problem of migration in Qemu by
creating a 'vfio-nvme' driver which would bind vfio to the device exposed by
the kernel, and would pass through all the doorbells and queues to the guest,
while intercepting the admin queue. Such a driver, I think, can be made to support
migration while being able to run on top of an SR-IOV device, on top of my
nvme-mdev (albeit with double admin queue emulation - it's a bit ugly but won't
affect performance at all), and even on top of a regular NVMe device assigned to
the guest with vfio.
Best regards,
Maxim Levitsky
>
> These complicated stacks should probably not be implemented in the kernel,
> though. So I'm wondering whether we could talk about mechanisms to allow
> efficient and performant userspace datapath intervention in your approach or
> pursue a mechanism to completely offload the device emulation to userspace
> (and align with what SPDK has to offer).
>
> Thoughts welcome!
> Felipe
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
* No subject
2019-03-20 19:08 ` Maxim Levitsky
@ 2019-03-21 16:12 ` Stefan Hajnoczi
2019-03-21 16:21 ` Keith Busch
0 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2019-03-21 16:12 UTC (permalink / raw)
On Wed, Mar 20, 2019@09:08:37PM +0200, Maxim Levitsky wrote:
> On Wed, 2019-03-20@11:03 +0000, Felipe Franciosi wrote:
> > > On Mar 19, 2019,@2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > >
> > > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> > >
> > > Hi everyone!
> > >
> > > In this patch series, I would like to introduce my take on the problem of
> > > doing
> > > as fast as possible virtualization of storage with emphasis on low latency.
> > >
> > > In this patch series I implemented a kernel vfio based, mediated device
> > > that
> > > allows the user to pass through a partition and/or whole namespace to a
> > > guest.
> >
> > Hey Maxim!
> >
> > I'm really excited to see this series, as it aligns to some extent with what
> > we discussed in last year's KVM Forum VFIO BoF.
> >
> > There's no arguing that we need a better story to efficiently virtualise NVMe
> > devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best
> > attempt at that. However, I seem to recall there was some pushback from qemu-
> > devel in the sense that they would rather see investment in virtio-blk. I'm
> > not sure what's the latest on that work and what are the next steps.
> I agree with that. All my benchmarks were against his vhost-user-nvme driver, and
> I am able to get pretty much the same throughput and latency.
>
> The SSD I tested on died just recently (Murphy's law), not due to a bug in my driver
> but to some internal fault (even though most of my tests were reads, plus
> occasional 'nvme format's).
> We are in the process of buying a replacement.
>
> >
> > The pushback drove the discussion towards pursuing an mdev approach, which is
> > why I'm excited to see your patches.
> >
> > What I'm thinking is that passing through namespaces or partitions is very
> > restrictive. It leaves no room to implement more elaborate virtualisation
> > stacks like replicating data across multiple devices (local or remote),
> > storage migration, software-managed thin provisioning, encryption,
> > deduplication, compression, etc. In summary, anything that requires software
> > intervention in the datapath. (Worth noting: vhost-user-nvme allows all of
> > that to be easily done in SPDK's bdev layer.)
>
> Hi Felipe!
>
> I guess that my driver is not geared toward the more complicated use cases you
> mentioned; instead it is focused on getting the fastest possible performance for
> the common case.
>
> One thing that I can do which would solve several of the above problems is to
> accept a map between virtual and real logical blocks, pretty much in exactly
> the same way as EPT does it.
> Then userspace can map any portion of the device anywhere, while still keeping
> the dataplane in the kernel, and having minimal overhead.
>
> On top of that, note that the direction of IO virtualization is to do the
> dataplane in hardware, which will probably give you even worse partition
> granularity / features but will be the fastest option available,
> like for instance SR-IOV, which already exists and just allows splitting by
> namespaces without any more fine-grained control.
>
> Think of nvme-mdev as a very low level driver, which currently uses polling, but
> eventually will use a PASID-based IOMMU to provide the guest with a raw PCI device.
> The userspace / qemu can build on top of that with various software layers.
>
> On top of that, I am thinking of solving the problem of migration in Qemu by
> creating a 'vfio-nvme' driver which would bind vfio to the device exposed by
> the kernel, and would pass through all the doorbells and queues to the guest,
> while intercepting the admin queue. Such a driver, I think, can be made to support
> migration while being able to run on top of an SR-IOV device, on top of my
> nvme-mdev (albeit with double admin queue emulation - it's a bit ugly but won't
> affect performance at all), and even on top of a regular NVMe device assigned
> to the guest with vfio.
mdev-nvme seems like a duplication of SPDK. The performance is not
better and the features are more limited, so why focus on this approach?
One argument might be that the kernel NVMe subsystem wants to offer this
functionality and loading the kernel module is more convenient than
managing SPDK to some users.
Thoughts?
Stefan
* No subject
2019-03-21 16:12 ` Stefan Hajnoczi
@ 2019-03-21 16:21 ` Keith Busch
2019-03-21 16:41 ` Felipe Franciosi
0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2019-03-21 16:21 UTC (permalink / raw)
On Thu, Mar 21, 2019@04:12:39PM +0000, Stefan Hajnoczi wrote:
> mdev-nvme seems like a duplication of SPDK. The performance is not
> better and the features are more limited, so why focus on this approach?
>
> One argument might be that the kernel NVMe subsystem wants to offer this
> functionality and loading the kernel module is more convenient than
> managing SPDK to some users.
>
> Thoughts?
Doesn't SPDK bind a controller to a single process? mdev binds to
namespaces (or their partitions), so you could have many mdev's assigned
to many VMs accessing a single controller.
* No subject
2019-03-21 16:21 ` Keith Busch
@ 2019-03-21 16:41 ` Felipe Franciosi
2019-03-21 17:04 ` Maxim Levitsky
0 siblings, 1 reply; 19+ messages in thread
From: Felipe Franciosi @ 2019-03-21 16:41 UTC (permalink / raw)
> On Mar 21, 2019,@4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Mar 21, 2019@04:12:39PM +0000, Stefan Hajnoczi wrote:
>> mdev-nvme seems like a duplication of SPDK. The performance is not
>> better and the features are more limited, so why focus on this approach?
>>
>> One argument might be that the kernel NVMe subsystem wants to offer this
>> functionality and loading the kernel module is more convenient than
>> managing SPDK to some users.
>>
>> Thoughts?
>
> Doesn't SPDK bind a controller to a single process? mdev binds to
> namespaces (or their partitions), so you could have many mdev's assigned
> to many VMs accessing a single controller.
Yes, it binds to a single process which can drive the datapath of multiple virtual controllers for multiple VMs (similar to what you described for mdev). You can therefore efficiently poll multiple VM submission queues (and multiple device completion queues) from a single physical CPU.
The same could be done in the kernel, but the code gets complicated as you add more functionality to it. As this is a direct interface with an untrusted front-end (the guest), it's also arguably safer to do in userspace.
Worth noting: you can eventually have a single physical core polling all sorts of virtual devices (eg. virtual storage or network controllers) very efficiently. And this is quite configurable, too. In the interest of fairness, performance or efficiency, you can choose to dynamically add or remove queues to the poll thread or spawn more threads and redistribute the work.
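As a purely illustrative sketch (hypothetical names, neither SPDK nor kernel
code), such a shared poll loop is conceptually just:

struct vqueue {
	struct vqueue *next;
	/* virtual submission queue ring, doorbells, owning VM, ... */
};

/* One physical core walks the virtual queues assigned to it; queues can be
 * added or removed at runtime to rebalance work between poll threads. */
static void poll_thread(struct vqueue *queues)
{
	while (!stop_requested()) {			/* hypothetical */
		struct vqueue *q;

		for (q = queues; q; q = q->next) {
			process_guest_submissions(q);	/* hypothetical */
			reap_device_completions(q);	/* hypothetical */
		}
	}
}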
F.
* No subject
2019-03-21 16:41 ` Felipe Franciosi
@ 2019-03-21 17:04 ` Maxim Levitsky
2019-03-22 7:54 ` Felipe Franciosi
0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2019-03-21 17:04 UTC (permalink / raw)
On Thu, 2019-03-21@16:41 +0000, Felipe Franciosi wrote:
> > On Mar 21, 2019,@4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Thu, Mar 21, 2019@04:12:39PM +0000, Stefan Hajnoczi wrote:
> > > mdev-nvme seems like a duplication of SPDK. The performance is not
> > > better and the features are more limited, so why focus on this approach?
> > >
> > > One argument might be that the kernel NVMe subsystem wants to offer this
> > > functionality and loading the kernel module is more convenient than
> > > managing SPDK to some users.
> > >
> > > Thoughts?
> >
> > Doesn't SPDK bind a controller to a single process? mdev binds to
> > namespaces (or their partitions), so you could have many mdev's assigned
> > to many VMs accessing a single controller.
>
> Yes, it binds to a single process which can drive the datapath of multiple
> virtual controllers for multiple VMs (similar to what you described for mdev).
> You can therefore efficiently poll multiple VM submission queues (and multiple
> device completion queues) from a single physical CPU.
>
> The same could be done in the kernel, but the code gets complicated as you add
> more functionality to it. As this is a direct interface with an untrusted
> front-end (the guest), it's also arguably safer to do in userspace.
>
> Worth noting: you can eventually have a single physical core polling all sorts
> of virtual devices (eg. virtual storage or network controllers) very
> efficiently. And this is quite configurable, too. In the interest of fairness,
> performance or efficiency, you can choose to dynamically add or remove queues
> to the poll thread or spawn more threads and redistribute the work.
>
> F.
Note though that SPDK doesn't support sharing the device between the host and the
guests: it takes over the nvme device, which makes the kernel nvme driver
unbind from it.
My driver creates a polling thread per guest, but it's trivial to add an option to
use the same polling thread for many guests if there is a need for that.
Best regards,
Maxim Levitsky
* No subject
2019-03-21 17:04 ` Maxim Levitsky
@ 2019-03-22 7:54 ` Felipe Franciosi
2019-03-22 10:32 ` Maxim Levitsky
2019-03-22 15:30 ` Keith Busch
0 siblings, 2 replies; 19+ messages in thread
From: Felipe Franciosi @ 2019-03-22 7:54 UTC (permalink / raw)
> On Mar 21, 2019,@5:04 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Thu, 2019-03-21@16:41 +0000, Felipe Franciosi wrote:
>>> On Mar 21, 2019,@4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
>>>
>>> On Thu, Mar 21, 2019@04:12:39PM +0000, Stefan Hajnoczi wrote:
>>>> mdev-nvme seems like a duplication of SPDK. The performance is not
>>>> better and the features are more limited, so why focus on this approach?
>>>>
>>>> One argument might be that the kernel NVMe subsystem wants to offer this
>>>> functionality and loading the kernel module is more convenient than
>>>> managing SPDK to some users.
>>>>
>>>> Thoughts?
>>>
>>> Doesn't SPDK bind a controller to a single process? mdev binds to
>>> namespaces (or their partitions), so you could have many mdev's assigned
>>> to many VMs accessing a single controller.
>>
>> Yes, it binds to a single process which can drive the datapath of multiple
>> virtual controllers for multiple VMs (similar to what you described for mdev).
>> You can therefore efficiently poll multiple VM submission queues (and multiple
>> device completion queues) from a single physical CPU.
>>
>> The same could be done in the kernel, but the code gets complicated as you add
>> more functionality to it. As this is a direct interface with an untrusted
>> front-end (the guest), it's also arguably safer to do in userspace.
>>
>> Worth noting: you can eventually have a single physical core polling all sorts
>> of virtual devices (eg. virtual storage or network controllers) very
>> efficiently. And this is quite configurable, too. In the interest of fairness,
>> performance or efficiency, you can choose to dynamically add or remove queues
>> to the poll thread or spawn more threads and redistribute the work.
>>
>> F.
>
> Note though that SPDK doesn't support sharing the device between host and the
> guests, it takes over the nvme device, thus it makes the kernel nvme driver
> unbind from it.
That is absolutely true. However, I find it not to be a problem in practice.
Hypervisor products, especially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (e.g. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
For scenarios where the device must be shared and such fine grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
Cheers,
Felipe
* No subject
2019-03-22 7:54 ` Felipe Franciosi
@ 2019-03-22 10:32 ` Maxim Levitsky
2019-03-22 15:30 ` Keith Busch
1 sibling, 0 replies; 19+ messages in thread
From: Maxim Levitsky @ 2019-03-22 10:32 UTC (permalink / raw)
On Fri, 2019-03-22@07:54 +0000, Felipe Franciosi wrote:
> > On Mar 21, 2019,@5:04 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > On Thu, 2019-03-21@16:41 +0000, Felipe Franciosi wrote:
> > > > On Mar 21, 2019,@4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
> > > >
> > > > On Thu, Mar 21, 2019@04:12:39PM +0000, Stefan Hajnoczi wrote:
> > > > > mdev-nvme seems like a duplication of SPDK. The performance is not
> > > > > better and the features are more limited, so why focus on this
> > > > > approach?
> > > > >
> > > > > One argument might be that the kernel NVMe subsystem wants to offer
> > > > > this
> > > > > functionality and loading the kernel module is more convenient than
> > > > > managing SPDK to some users.
> > > > >
> > > > > Thoughts?
> > > >
> > > > Doesn't SPDK bind a controller to a single process? mdev binds to
> > > > namespaces (or their partitions), so you could have many mdev's assigned
> > > > to many VMs accessing a single controller.
> > >
> > > Yes, it binds to a single process which can drive the datapath of multiple
> > > virtual controllers for multiple VMs (similar to what you described for
> > > mdev).
> > > You can therefore efficiently poll multiple VM submission queues (and
> > > multiple
> > > device completion queues) from a single physical CPU.
> > >
> > > The same could be done in the kernel, but the code gets complicated as you
> > > add
> > > more functionality to it. As this is a direct interface with an untrusted
> > > front-end (the guest), it's also arguably safer to do in userspace.
> > >
> > > Worth noting: you can eventually have a single physical core polling all
> > > sorts
> > > of virtual devices (eg. virtual storage or network controllers) very
> > > efficiently. And this is quite configurable, too. In the interest of
> > > fairness,
> > > performance or efficiency, you can choose to dynamically add or remove
> > > queues
> > > to the poll thread or spawn more threads and redistribute the work.
> > >
> > > F.
> >
> > Note though that SPDK doesn't support sharing the device between host and
> > the
> > guests, it takes over the nvme device, thus it makes the kernel nvme driver
> > unbind from it.
>
> That is absolutely true. However, I find it not to be a problem in practice.
>
> Hypervisor products, specially those caring about performance, efficiency and
> fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk
> storage, cache, metadata) and will not share these devices for other use
> cases. That's because these products want to deterministically control the
> performance aspects of the device, which you just cannot do if you are sharing
> the device with a subsystem you do not control.
>
> For scenarios where the device must be shared and such fine grained control is
> not required, it looks like using the kernel driver with io_uring offers very
> good performance with flexibility
I see the host/guest partition in the following way:
The guest-assigned partitions are for guests that need the lowest possible latency,
and between these guests it is possible to guarantee a good enough level of
fairness in my driver.
For example, in the current implementation of my driver, each guest gets its own
host submission queue.
On the other hand, the host assigned partitions are for significantly higher
latency IO, with no guarantees, and/or for guests that need all the more
advanced features of full IO virtualization, for instance snapshots, thin
provisioning, replication/backup over network, etc.
io_uring can be used here to speed things up but it won't reach the nvme-mdev
levels of latency.
Furthermore, on NVMe drives that support WRRU, it's possible to let the queues of
the guest-assigned partitions belong to the high priority class and let the
host queues use the regular medium/low priority class.
For drives that don't support WRRU, the IO throttling can be done in software on
the host queues.
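For reference, the priority class is chosen per submission queue when it is
created; assuming the flag names from include/linux/nvme.h, the relevant part of
the Create I/O SQ command would be built roughly like this (WRR arbitration
itself also has to be enabled via CC.AMS at controller init):

#include <linux/nvme.h>

static void set_sq_priority(struct nvme_command *c, bool guest_queue)
{
	c->create_sq.opcode = nvme_admin_create_sq;
	/* Guest-assigned partitions get the high class, host queues medium. */
	c->create_sq.sq_flags = cpu_to_le16(NVME_QUEUE_PHYS_CONTIG |
			(guest_queue ? NVME_SQ_PRIO_HIGH : NVME_SQ_PRIO_MEDIUM));
}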
Host-assigned partitions also don't need polling, which allows polling to be
used only for guests that actually need low latency IO.
This reduces the number of cores that would otherwise be lost to polling:
the less work the polling core does, the less latency it contributes to the
overall latency, so with fewer users you can use fewer cores to achieve the
same levels of latency.
As for Stefan's argument, we can look at it in a slightly different way too:
while nvme-mdev can be seen as a duplication of SPDK, SPDK can also be
seen as a duplication of existing kernel functionality which nvme-mdev can
reuse for free.
Best regards,
Maxim Levitsky
* No subject
2019-03-22 7:54 ` Felipe Franciosi
2019-03-22 10:32 ` Maxim Levitsky
@ 2019-03-22 15:30 ` Keith Busch
2019-03-25 15:44 ` Felipe Franciosi
1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2019-03-22 15:30 UTC (permalink / raw)
On Fri, Mar 22, 2019@07:54:50AM +0000, Felipe Franciosi wrote:
> >
> > Note though that SPDK doesn't support sharing the device between host and the
> > guests, it takes over the nvme device, thus it makes the kernel nvme driver
> > unbind from it.
>
> That is absolutely true. However, I find it not to be a problem in practice.
>
> Hypervisor products, specially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
I don't know, it sounds like you've traded kernel syscalls for IPC,
and I don't think one performs better than the other.
> For scenarios where the device must be shared and such fine grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
NVMe's IO Determinism features provide fine grained control for shared
devices. It's still uncommon to find hardware supporting that, though.
* No subject
2019-03-22 15:30 ` Keith Busch
@ 2019-03-25 15:44 ` Felipe Franciosi
0 siblings, 0 replies; 19+ messages in thread
From: Felipe Franciosi @ 2019-03-25 15:44 UTC (permalink / raw)
Hi Keith,
> On Mar 22, 2019,@3:30 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Mar 22, 2019@07:54:50AM +0000, Felipe Franciosi wrote:
>>>
>>> Note though that SPDK doesn't support sharing the device between host and the
>>> guests, it takes over the nvme device, thus it makes the kernel nvme driver
>>> unbind from it.
>>
>> That is absolutely true. However, I find it not to be a problem in practice.
>>
>> Hypervisor products, specially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
>
> I don't know, it sounds like you've traded kernel syscalls for IPC,
> and I don't think one performs better than the other.
Sorry, I'm not sure I understand. My point is that if you are packaging a distro to be a hypervisor and you want to use a storage device for VM data, you _most likely_ won't be using that device for anything else. To that end, driving the device directly from your application definitely gives you more deterministic control.
>
>> For scenarios where the device must be shared and such fine grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
>
> NVMe's IO Determinism features provide fine grained control for shared
> devices. It's still uncommon to find hardware supporting that, though.
Sure, but then your hypervisor needs to certify devices that support that. This will limit your HCL. Moreover, unless the feature is solid, well-established and works reliably on all devices you support, it's arguably preferable to have an architecture which gives you that control in software.
Cheers,
Felipe