From: Maxim Levitsky <mlevitsk@redhat.com>
To: Felipe Franciosi <felipe@nutanix.com>
Cc: Fam Zheng <fam@euphon.net>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Wolfram Sang <wsa@the-dreams.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Keith Busch <keith.busch@intel.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>,
	"Paul E . McKenney" <paulmck@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	"Harris, James R" <james.r.harris@intel.com>,
	Liang Cunming <cunming.liang@intel.com>,
	Jens Axboe <axboe@fb.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>,
	John Ferlan <jferlan@redhat.com>
Subject: Re:
Date: Wed, 20 Mar 2019 21:08:37 +0200	[thread overview]
Message-ID: <42f444d22363bc747f4ad75e9f0c27b40a810631.camel@redhat.com> (raw)
In-Reply-To: <488768D7-1396-4DD1-A648-C86E5CF7DB2F@nutanix.com>

On Wed, 2019-03-20 at 11:03 +0000, Felipe Franciosi wrote:
> > On Mar 19, 2019, at 2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> > 
> > Hi everyone!
> > 
> > In this patch series, I would like to introduce my take on the problem
> > of virtualizing storage as fast as possible, with an emphasis on low
> > latency.
> > 
> > In this patch series I implemented a kernel VFIO-based mediated device
> > that allows the user to pass through a partition and/or a whole
> > namespace to a guest.
> 
> Hey Maxim!
> 
> I'm really excited to see this series, as it aligns to some extent with what
> we discussed in last year's KVM Forum VFIO BoF.
> 
> There's no arguing that we need a better story to efficiently virtualise NVMe
> devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best
> attempt at that. However, I seem to recall there was some pushback from qemu-
> devel in the sense that they would rather see investment in virtio-blk. I'm
> not sure what the latest is on that work or what the next steps are.

I agree with that. All my benchmarks were against his vhost-user-nvme
driver, and I am able to get pretty much the same throughput and latency.

The SSD I tested on died just recently (Murphy's law), not due to a bug in
my driver but to some internal fault (even though most of my tests were
reads, plus the occasional 'nvme format').
We are in the process of buying a replacement.

> 
> The pushback drove the discussion towards pursuing an mdev approach, which is
> why I'm excited to see your patches.
> 
> What I'm thinking is that passing through namespaces or partitions is very
> restrictive. It leaves no room to implement more elaborate virtualisation
> stacks like replicating data across multiple devices (local or remote),
> storage migration, software-managed thin provisioning, encryption,
> deduplication, compression, etc. In summary, anything that requires software
> intervention in the datapath. (Worth noting: vhost-user-nvme allows all of
> that to be easily done in SPDK's bdev layer.)

Hi Felipe!

I guess that my driver is not geared toward more complicated use cases like
the ones you mentioned; instead it is focused on getting the best possible
performance for the common case.

One thing I could do that would solve several of the above problems is to
accept a map between virtual and real logical blocks, in pretty much the
same way EPT does it.
Then userspace could map any portion of the device anywhere, while still
keeping the dataplane in the kernel and having minimal overhead.
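
To illustrate, such a map could be as simple as a set of extents that the
fast path consults for every command. This is only a sketch (none of these
names exist in the series, and a real implementation would probably use an
interval tree rather than a linear scan):

struct lba_extent {
	u64 virt_start;  /* first virtual LBA of the extent */
	u64 phys_start;  /* backing LBA on the real namespace */
	u64 nr_blocks;   /* length of the extent in blocks */
};

/*
 * Translate one virtual LBA; returns true and fills *phys_lba on a hit,
 * false when the block is unmapped (e.g. a hole that userspace keeps
 * around for thin provisioning).
 */
static bool lba_translate(const struct lba_extent *map, unsigned int nr,
			  u64 virt_lba, u64 *phys_lba)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (virt_lba >= map[i].virt_start &&
		    virt_lba - map[i].virt_start < map[i].nr_blocks) {
			*phys_lba = map[i].phys_start +
				    (virt_lba - map[i].virt_start);
			return true;
		}
	}
	return false;
}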

On top of that, note that the direction of I/O virtualization is to do the
dataplane in hardware, which will probably give you even worse partition
granularity / features, but will be the fastest option available; for
instance, SR-IOV already exists and only allows splitting by namespace,
without any finer-grained control.

Think of nvme-mdev as a very low level driver, which currently uses polling,
but will eventually use a PASID-based IOMMU to provide the guest with a raw
PCI device. Userspace / qemu can build on top of that with various software
layers.

On top of that, I am thinking of solving the problem of migration in Qemu
by creating a 'vfio-nvme' driver, which would bind to the VFIO device
exposed by the kernel and would pass all the doorbells and queues through
to the guest while intercepting the admin queue. I think such a driver can
be made to support migration while being able to run on top of an SR-IOV
device, on top of my mdev device (albeit with double admin queue emulation;
it's a bit ugly but won't affect performance at all), and even on top of a
regular NVMe device assigned to the guest with VFIO.
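
Roughly, the admin queue interception could look like the sketch below.
This is purely hypothetical code: the device structure and the helpers are
made up, only struct nvme_command and the admin opcodes come from the
existing NVMe headers. The point is that only the admin submission queue
doorbell is trapped, while the I/O doorbells stay mapped into the guest:

static void admin_doorbell_write(struct vfio_nvme_dev *dev, u32 new_tail)
{
	/* consume every command the guest queued since the last write */
	while (dev->asq_head != new_tail) {
		struct nvme_command *cmd = &dev->asq[dev->asq_head];

		switch (cmd->common.opcode) {
		case nvme_admin_create_sq:
		case nvme_admin_create_cq:
			/* remap the guest queue memory, then forward */
			forward_queue_creation(dev, cmd);
			break;
		default:
			/* emulate, filter or forward as appropriate */
			handle_admin_cmd(dev, cmd);
			break;
		}
		dev->asq_head = (dev->asq_head + 1) % dev->asq_depth;
	}
}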


Best regards,
	Maxim Levitsky

> 
> These complicated stacks should probably not be implemented in the kernel,
> though. So I'm wondering whether we could talk about mechanisms to allow
> efficient and performant userspace datapath intervention in your approach or
> pursue a mechanism to completely offload the device emulation to userspace
> (and align with what SPDK has to offer).
> 
> Thoughts welcome!
> Felipe

Thread overview: 66+ messages
2019-03-19 14:41 (unknown) Maxim Levitsky
2019-03-19 14:41 ` [PATCH 1/9] vfio/mdev: add .request callback Maxim Levitsky
2019-03-19 14:41 ` [PATCH 2/9] nvme/core: add some more values from the spec Maxim Levitsky
2019-03-19 14:41 ` [PATCH 3/9] nvme/core: add NVME_CTRL_SUSPENDED controller state Maxim Levitsky
2019-03-19 14:41 ` [PATCH 4/9] nvme/pci: use the NVME_CTRL_SUSPENDED state Maxim Levitsky
2019-03-20  2:54   ` Fam Zheng
2019-03-19 14:41 ` [PATCH 5/9] nvme/pci: add known admin effects to augument admin effects log page Maxim Levitsky
2019-03-19 14:41 ` [PATCH 6/9] nvme/pci: init shadow doorbell after each reset Maxim Levitsky
2019-03-19 14:41 ` [PATCH 7/9] nvme/core: add mdev interfaces Maxim Levitsky
2019-03-20 11:46   ` Stefan Hajnoczi
2019-03-20 12:50     ` Maxim Levitsky
2019-03-19 14:41 ` [PATCH 8/9] nvme/core: add nvme-mdev core driver Maxim Levitsky
2019-03-19 14:41 ` [PATCH 9/9] nvme/pci: implement the mdev external queue allocation interface Maxim Levitsky
2019-03-19 14:58 ` [PATCH 0/9] RFC: NVME VFIO mediated device Maxim Levitsky
2019-03-25 18:52   ` [PATCH 0/9] RFC: NVME VFIO mediated device [BENCHMARKS] Maxim Levitsky
2019-03-26  9:38     ` Stefan Hajnoczi
2019-03-26  9:50       ` Maxim Levitsky
2019-03-19 15:22 ` your mail Keith Busch
2019-03-19 23:49   ` Chaitanya Kulkarni
2019-03-20 16:44     ` Maxim Levitsky
2019-03-20 16:30   ` Maxim Levitsky
2019-03-20 17:03     ` Keith Busch
2019-03-20 17:33       ` Maxim Levitsky
2019-04-08 10:04   ` Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08   ` Maxim Levitsky [this message]
2019-03-21 16:12     ` Re: Stefan Hajnoczi
2019-03-21 16:21       ` Re: Keith Busch
2019-03-21 16:41         ` Re: Felipe Franciosi
2019-03-21 17:04           ` Re: Maxim Levitsky
2019-03-22  7:54             ` Re: Felipe Franciosi
2019-03-22 10:32               ` Re: Maxim Levitsky
2019-03-22 15:30               ` Re: Keith Busch
2019-03-25 15:44                 ` Re: Felipe Franciosi
2019-03-20 15:08 ` [PATCH 0/9] RFC: NVME VFIO mediated device Bart Van Assche
2019-03-20 16:48   ` Maxim Levitsky
2019-03-20 15:28 ` Bart Van Assche
2019-03-20 16:42   ` Maxim Levitsky
2019-03-20 17:03     ` Alex Williamson
2019-03-21 16:13 ` your mail Stefan Hajnoczi
2019-03-21 17:07   ` Maxim Levitsky
2019-03-25 16:46     ` Stefan Hajnoczi
