Linux-NVME Archive on lore.kernel.org
* No subject
@ 2014-11-10  3:11 Libo Chen
  0 siblings, 0 replies; 19+ messages in thread
From: Libo Chen @ 2014-11-10  3:11 UTC (permalink / raw)




* No subject
@ 2014-11-10  6:39 Libo Chen
  0 siblings, 0 replies; 19+ messages in thread
From: Libo Chen @ 2014-11-10  6:39 UTC (permalink / raw)




* No subject
@ 2017-02-07  0:22 Scott Bauer
  2017-02-07  0:46 ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Bauer @ 2017-02-07  0:22 UTC (permalink / raw)


I screwed up and had size_t's in the uapi structures, which of course
differ in size between 32-bit and 64-bit platforms. This caused issues when
running a 32-bit userland on a 64-bit kernel. We're hoping to sneak this
patch in so we don't have to maintain a compat layer.
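
For illustration, a minimal standalone sketch (the struct and field names
here are hypothetical, not the actual uapi definitions): a size_t field
changes the structure's size and padding between ABIs, so a 32-bit userland
and a 64-bit kernel disagree about the ioctl argument layout, while
fixed-width types keep the layout identical everywhere.

  /*
   * Hypothetical illustration, not the actual patch: sizeof(size_t) is 4 on
   * 32-bit ABIs and 8 on 64-bit ABIs, so the first struct below changes
   * layout across architectures while the second does not.
   */
  #include <stdint.h>
  #include <stddef.h>
  #include <stdio.h>

  struct bad_ioctl_arg {
          size_t   key_len;       /* 4 or 8 bytes depending on the ABI */
          uint8_t  key[32];
  };

  struct good_ioctl_arg {
          uint32_t key_len;       /* fixed width on every ABI */
          uint8_t  key[32];
  };

  int main(void)
  {
          /* Prints "bad=40 good=36" on x86-64 and "bad=36 good=36" on i386,
           * so a 32-bit process and a 64-bit kernel disagree on the first
           * layout but agree on the second. */
          printf("bad=%zu good=%zu\n",
                 sizeof(struct bad_ioctl_arg), sizeof(struct good_ioctl_arg));
          return 0;
  }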

* No subject
@ 2017-04-21 23:23 Sandeep Mann
  0 siblings, 0 replies; 19+ messages in thread
From: Sandeep Mann @ 2017-04-21 23:23 UTC (permalink / raw)




Adding linux-nvme@lists.infradead.org

* No subject
@ 2017-09-13 18:15 unmesh rathi
  0 siblings, 0 replies; 19+ messages in thread
From: unmesh rathi @ 2017-09-13 18:15 UTC (permalink / raw)




* No subject
@ 2018-02-02  6:54 Jianchao Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Jianchao Wang @ 2018-02-02  6:54 UTC (permalink / raw)


Hi Christoph, Keith and Sagi,

Please consider and comment on the following patchset; that would be
really appreciated.

There is a complicated relationship between nvme_timeout and nvme_dev_disable
(sketched in the code below):
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller's DMA access before the request is freed.
 - nvme_dev_disable depends on nvme_timeout to complete
   adminq requests that set the HMB or delete SQs/CQs when the
   controller gives no response.
 - nvme_dev_disable races with nvme_timeout when it cancels the
   outstanding requests.
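
A simplified sketch of this entanglement (illustration only, not the actual
code in drivers/nvme/host/pci.c; locking and error handling are omitted):

  /* (1) A timed-out request may still be the target of controller DMA, so
   *     the timeout handler must disable the whole device before blk-mq
   *     is allowed to free the request. */
  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
  {
          struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
          struct nvme_dev *dev = iod->nvmeq->dev;

          nvme_dev_disable(dev, false);
          return BLK_EH_HANDLED;
  }

  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
  {
          /* (2) Disabling issues admin commands (free the HMB, delete the
           *     SQs/CQs).  When the controller gives no response, those
           *     commands can only complete through nvme_timeout -- the
           *     dependency is circular. */

          /* (3) Cancelling the outstanding requests here races with any
           *     nvme_timeout instances working on the same requests. */
          blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
  }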
We have found several issues caused by this entanglement; please refer to
the following links:

http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html

and we cannot be sure there are no others.

The best way to fix these issues is to break up the relationship between the
two functions. With this patchset, nvme_dev_disable is no longer invoked by
nvme_timeout, which eliminates the race between nvme_timeout and
nvme_dev_disable on outstanding requests.


There are 6 patches:

Patches 1 ~ 3 do some preparation for the 4th one.
Patch 4 avoids invoking nvme_dev_disable from nvme_timeout and implements
the synchronization between the two. For more details, please refer to the
comments in that patch.
Patch 5 fixes a bug introduced by the 4th patch: it ensures that
nvme_delete_io_queues can only be woken up by the completion path.
Patch 6 fixes a bug found during testing; it is unrelated to the 4th patch.

This patchset was tested with a debug patch for several days, and some bug
fixes have been folded in. The debug patch and the other patches are
available in the following git branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_test

Jianchao Wang (6):
0001-nvme-pci-move-clearing-host-mem-behind-stopping-queu.patch
0002-nvme-pci-fix-the-freeze-and-quiesce-for-shutdown-and.patch
0003-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0004-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0005-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
0006-nvme-pci-suspend-queues-based-on-online_queues.patch

Diffstat:
 block/blk-mq.c          |   3 +-
 drivers/nvme/host/pci.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 include/linux/blk-mq.h  |   1 +
 3 files changed, 169 insertions(+), 60 deletions(-)

Thanks
Jianchao

* No subject
@ 2018-10-05 13:39 Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-10-05 13:39 UTC (permalink / raw)


Subject: [GIT PULL] nvme updates for 4.20

A relatively boring merge window:

 - better AEN tracing (Chaitanya)
 - NUMA aware PCIe multipathing (me)
 - RDMA workqueue fixes (Sagi)
 - better bio usage in the target (Sagi)
 - FC rework for target removal (James)
 - better multipath handling of ->queue_rq failures (James)
 - various cleanups (Milan)

The following changes since commit c0aac682fa6590cb660cb083dbc09f55e799d2d2:

  Merge tag 'v4.19-rc6' into for-4.20/block (2018-10-01 08:58:57 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.20

for you to fetch changes up to 2acf70ade79d26b97611a8df52eb22aa33814cd4:

  nvmet-rdma: use a private workqueue for delete (2018-10-05 09:25:18 +0200)

----------------------------------------------------------------
Chaitanya Kulkarni (2):
      nvmet: remove redundant module prefix
      nvme-core: add async event trace helper

Christoph Hellwig (1):
      nvme: take node locality into account when selecting a path

James Smart (3):
      nvmet_fc: support target port removal with nvmet layer
      nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
      nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O

Milan P. Gandhi (2):
      nvme: fix typo in nvme_identify_ns_descs
      nvme-fc: fix for a minor typos

Sagi Grimberg (2):
      nvmet: don't split large I/Os unconditionally
      nvmet-rdma: use a private workqueue for delete

 drivers/nvme/host/core.c          |  20 ++++--
 drivers/nvme/host/fabrics.c       |   7 +-
 drivers/nvme/host/fc.c            | 108 +++++++++++++++++++++++++++----
 drivers/nvme/host/multipath.c     |  57 +++++++++++++----
 drivers/nvme/host/nvme.h          |  25 +++-----
 drivers/nvme/host/trace.h         |  28 ++++++++
 drivers/nvme/target/admin-cmd.c   |   2 +-
 drivers/nvme/target/fc.c          | 130 +++++++++++++++++++++++++++++++++++---
 drivers/nvme/target/io-cmd-bdev.c |   9 ++-
 drivers/nvme/target/nvmet.h       |   1 +
 drivers/nvme/target/rdma.c        |  19 ++++--
 include/linux/nvme.h              |   1 +
 12 files changed, 347 insertions(+), 60 deletions(-)

* No subject
@ 2019-03-19 14:41 Maxim Levitsky
  2019-03-20 11:03 ` Felipe Franciosi
  0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2019-03-19 14:41 UTC (permalink / raw)


Date: Tue, 19 Mar 2019 14:45:45 +0200
Subject: [PATCH 0/9] RFC: NVME VFIO mediated device

Hi everyone!

In this patch series, I would like to introduce my take on the problem of
virtualizing storage as fast as possible, with an emphasis on low latency.

It implements a kernel VFIO-based mediated device that allows the user to
pass a partition and/or a whole namespace through to a guest.

The idea behind this driver is based on the paper you can find at
https://www.usenix.org/conference/atc18/presentation/peng,
although note that I started the development independently, prior to
reading this paper.

In addition, the implementation is not based on the code used in the paper,
as I was unable to obtain the source at the time.

***Key points about the implementation:***

* A polling kernel thread is used. Polling is stopped after a predefined
idle timeout (1/2 sec by default); a minimal sketch of this model follows
this list. A fully interrupt-driven mode is planned, and a proof of concept
already shows promising results.

* The guest sees a standard NVMe device. This allows running guests with
unmodified drivers, for example Windows guests.

* The NVMe device is shared between host and guest.
That means that even a single namespace can be split between host
and guest along partition boundaries.

* Simple configuration
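
A minimal sketch of the polling model mentioned in the first bullet above
(the instance type and poll helper are hypothetical; kthread_should_stop(),
time_after() and cond_resched() are real kernel APIs):

  static int nvme_mdev_poll_thread(void *arg)
  {
          struct nvme_mdev_inst *inst = arg;        /* hypothetical type */
          unsigned long idle_deadline = jiffies + HZ / 2;

          while (!kthread_should_stop()) {
                  if (nvme_mdev_poll_once(inst)) {  /* hypothetical helper */
                          /* Work was found: re-arm the idle timeout. */
                          idle_deadline = jiffies + HZ / 2;
                  } else if (time_after(jiffies, idle_deadline)) {
                          /* Idle for the whole 1/2 sec: stop polling. */
                          break;
                  }
                  cond_resched();
          }
          return 0;
  }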

*** Performance ***

Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2;
both latency and throughput are very similar to SPDK.

Soon I will test this on a better server and NVMe device and provide
more formal performance numbers.

Latency numbers:
~80µs - SPDK with fio plugin on the host
~84µs - nvme driver on the host
~87µs - mdev-nvme + nvme driver in the guest

Throughput followed a similar pattern.

* Configuration example
  $ modprobe nvme mdev_queues=4
  $ modprobe nvme-mdev

  $ UUID=$(uuidgen)
  $ DEVICE='device pci address'
  $ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
  $ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace #attach host namespace 1 partition 3
  $ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu #pin the io thread to cpu 11

  Afterwards, boot qemu with
  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
  
  Zero configuration on the guest.
  
*** FAQ ***

* Why do this in the kernel? Why is this better than SPDK?

  -> Reuses the existing nvme kernel driver on the host; no new drivers in the guest.
  
  -> Shares the NVMe device between host and guest.
     Even in fully virtualized configurations,
     some partitions of the NVMe device can be used by guests as block devices
     while others are passed through with nvme-mdev, striking a balance between
     the full IO-stack emulation feature set and performance.
  
  -> NVME-MDEV is a bit faster because the in-kernel driver
     can send interrupts to the guest directly, without a context
     switch that can be expensive due to the Meltdown mitigations.

  -> It is able to use interrupts to achieve reasonable performance.
     This is only implemented as a proof of concept and is not included
     in these patches, but the interrupt-driven mode already shows
     reasonable performance.
     
  -> This is a framework that can later be used to support NVMe devices
     with more of the IO virtualization built in
     (an IOMMU with PASID support coupled with a device that supports it).

* Why attach directly to the nvme-pci driver instead of using block-layer IO?
  -> Direct attachment allows for better performance, but I will
     check the possibility of using block IO, especially for the fabrics drivers.
  
*** Implementation notes ***

*  All guest memory is mapped into the physical NVMe device,
   though not 1:1 as vfio-pci would do it.
   This allows very efficient DMA.
   To support this, patch 2 adds the ability for an mdev device to listen
   for the guest's memory-map events.
   Any such memory is immediately pinned and then DMA mapped; a minimal
   sketch of this step follows these notes.
   (Support also exists for fabric drivers where this is not possible,
    in which case the fabric driver does its own DMA mapping.)

*  The nvme core driver is modified to announce the appearance
   and disappearance of NVMe controllers and namespaces,
   to which the nvme-mdev driver subscribes.
 
*  The nvme-pci driver is modified to expose a raw interface for attaching
   to, and submitting/polling, the IO queues.
   This lets the mdev driver submit and poll IO very efficiently.
   By default, one host queue is used per mediated device.
   (Support for other fabric-based host drivers is planned.)

* The nvme-mdev driver doesn't assume the presence of KVM, so any VFIO user,
  including SPDK or a qemu running with TCG, can use this virtual device.
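
A minimal sketch of the pin-and-map step from the first note above (the
function itself is hypothetical; get_user_pages_fast(), dma_map_page() and
dma_mapping_error() are real kernel APIs, and the IOVA bookkeeping is
omitted):

  static int nvme_mdev_map_guest_page(struct device *dma_dev,
                                      unsigned long guest_va,
                                      dma_addr_t *iova)
  {
          struct page *page;
          int ret;

          /* Pin the guest page so it cannot be swapped out or migrated
           * while the physical NVMe device may DMA into it. */
          ret = get_user_pages_fast(guest_va, 1, FOLL_WRITE, &page);
          if (ret != 1)
                  return ret < 0 ? ret : -EFAULT;

          /* Map the pinned page for DMA by the physical controller. */
          *iova = dma_map_page(dma_dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
          if (dma_mapping_error(dma_dev, *iova)) {
                  put_page(page);
                  return -ENOMEM;
          }
          return 0;
  }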

*** Testing ***

The device was tested with stock QEMU 3.0 on the host, with the host
running a 5.0 kernel with nvme-mdev added, on the following hardware:
 * QEMU nvme virtual device (with nested guest)
 * Intel DC P3700 on Xeon E5-2620 v2 server
 * Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
 * Lenovo NVME device found in my laptop

The guest was tested with kernels 4.16, 4.18, 4.20 and
the same custom-compiled 5.0 kernel.
A Windows 10 guest was tested too, with both Microsoft's inbox driver and
the open-source community NVMe driver
(https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html).

Testing was mostly done on x86_64, but a 32-bit host/guest combination
was lightly tested too.

In addition, the virtual device was tested with a nested guest,
by passing the virtual device to it
using PCI passthrough, the qemu userspace NVMe driver, and SPDK.


PS: I used to contribute to the kernel as a hobby using the
    maximlevitsky@gmail.com address

Maxim Levitsky (9):
  vfio/mdev: add .request callback
  nvme/core: add some more values from the spec
  nvme/core: add NVME_CTRL_SUSPENDED controller state
  nvme/pci: use the NVME_CTRL_SUSPENDED state
  nvme/pci: add known admin effects to augument admin effects log page
  nvme/pci: init shadow doorbell after each reset
  nvme/core: add mdev interfaces
  nvme/core: add nvme-mdev core driver
  nvme/pci: implement the mdev external queue allocation interface

 MAINTAINERS                   |   5 +
 drivers/nvme/Kconfig          |   1 +
 drivers/nvme/Makefile         |   1 +
 drivers/nvme/host/core.c      | 149 +++++-
 drivers/nvme/host/nvme.h      |  55 ++-
 drivers/nvme/host/pci.c       | 385 ++++++++++++++-
 drivers/nvme/mdev/Kconfig     |  16 +
 drivers/nvme/mdev/Makefile    |   5 +
 drivers/nvme/mdev/adm.c       | 873 ++++++++++++++++++++++++++++++++++
 drivers/nvme/mdev/events.c    | 142 ++++++
 drivers/nvme/mdev/host.c      | 491 +++++++++++++++++++
 drivers/nvme/mdev/instance.c  | 802 +++++++++++++++++++++++++++++++
 drivers/nvme/mdev/io.c        | 563 ++++++++++++++++++++++
 drivers/nvme/mdev/irq.c       | 264 ++++++++++
 drivers/nvme/mdev/mdev.h      |  56 +++
 drivers/nvme/mdev/mmio.c      | 591 +++++++++++++++++++++++
 drivers/nvme/mdev/pci.c       | 247 ++++++++++
 drivers/nvme/mdev/priv.h      | 700 +++++++++++++++++++++++++++
 drivers/nvme/mdev/udata.c     | 390 +++++++++++++++
 drivers/nvme/mdev/vcq.c       | 207 ++++++++
 drivers/nvme/mdev/vctrl.c     | 514 ++++++++++++++++++++
 drivers/nvme/mdev/viommu.c    | 322 +++++++++++++
 drivers/nvme/mdev/vns.c       | 356 ++++++++++++++
 drivers/nvme/mdev/vsq.c       | 178 +++++++
 drivers/vfio/mdev/vfio_mdev.c |  11 +
 include/linux/mdev.h          |   4 +
 include/linux/nvme.h          |  88 +++-
 27 files changed, 7375 insertions(+), 41 deletions(-)
 create mode 100644 drivers/nvme/mdev/Kconfig
 create mode 100644 drivers/nvme/mdev/Makefile
 create mode 100644 drivers/nvme/mdev/adm.c
 create mode 100644 drivers/nvme/mdev/events.c
 create mode 100644 drivers/nvme/mdev/host.c
 create mode 100644 drivers/nvme/mdev/instance.c
 create mode 100644 drivers/nvme/mdev/io.c
 create mode 100644 drivers/nvme/mdev/irq.c
 create mode 100644 drivers/nvme/mdev/mdev.h
 create mode 100644 drivers/nvme/mdev/mmio.c
 create mode 100644 drivers/nvme/mdev/pci.c
 create mode 100644 drivers/nvme/mdev/priv.h
 create mode 100644 drivers/nvme/mdev/udata.c
 create mode 100644 drivers/nvme/mdev/vcq.c
 create mode 100644 drivers/nvme/mdev/vctrl.c
 create mode 100644 drivers/nvme/mdev/viommu.c
 create mode 100644 drivers/nvme/mdev/vns.c
 create mode 100644 drivers/nvme/mdev/vsq.c

-- 
2.17.2


Thread overview: 19+ messages
2014-11-10  3:11 No subject Libo Chen
2014-11-10  6:39 Libo Chen
2017-02-07  0:22 Scott Bauer
2017-02-07  0:46 ` Jens Axboe
2017-04-21 23:23 Sandeep Mann
2017-09-13 18:15 unmesh rathi
2018-02-02  6:54 Jianchao Wang
2018-10-05 13:39 Christoph Hellwig
2019-03-19 14:41 Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08   ` Maxim Levitsky
2019-03-21 16:12     ` Stefan Hajnoczi
2019-03-21 16:21       ` Keith Busch
2019-03-21 16:41         ` Felipe Franciosi
2019-03-21 17:04           ` Maxim Levitsky
2019-03-22  7:54             ` Felipe Franciosi
2019-03-22 10:32               ` Maxim Levitsky
2019-03-22 15:30               ` Keith Busch
2019-03-25 15:44                 ` Felipe Franciosi
