All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Yuval Shaia <yuval.shaia@oracle.com>
Cc: Marcel Apfelbaum <marcel@redhat.com>,
	qemu-devel@nongnu.org, ehabkost@redhat.com, imammedo@redhat.com,
	pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Date: Wed, 20 Dec 2017 20:05:40 +0200	[thread overview]
Message-ID: <20171220200159-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20171220175646.GA2729@yuvallap>

On Wed, Dec 20, 2017 at 07:56:47PM +0200, Yuval Shaia wrote:
> On Tue, Dec 19, 2017 at 08:05:18PM +0200, Michael S. Tsirkin wrote:
> > On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> > > RFC -> V2:
> > >  - Full implementation of the pvrdma device
> > >  - Backend is an ibdevice interface, no need for the KDBR module
> > > 
> > > General description
> > > ===================
> > > PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> > > It works with its Linux Kernel driver AS IS, no need for any special guest
> > > modifications.
> > > 
> > > While it complies with the VMware device, it can also communicate with bare
> > > metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > > can work with Soft-RoCE (rxe).
> > > 
> > > It does not require the whole guest RAM to be pinned
> > 
> > What happens if guest attempts to register all its memory?
> > 
> > > allowing memory
> > > over-commit
> > > and, even if not implemented yet, migration support will be
> > > possible with some HW assistance.
> > 
> > What does "HW assistance" mean here?
> > Can it work with any existing hardware?
> > 
> > > 
> > >  Design
> > >  ======
> > >  - Follows the behavior of VMware's pvrdma device, however is not tightly
> > >    coupled with it
> > 
> > Everything seems to be in pvrdma. Since it's not coupled, could you
> > split code to pvrdma specific and generic parts?
> > 
> > > and most of the code can be reused if we decide to
> > >    continue to a Virtio based RDMA device.
> 
> The current design takes into account a future code reuse with virtio-rdma
> device although not sure it is 100%.
> 
> We divided it to four software layers:
> - Front-end interface with PCI:
> 	- pvrdma_main.c
> - Front-end interface with pvrdma driver:
> 	- pvrdma_cmd.c
> 	- pvrdma_qp_ops.c
> 	- pvrdma_dev_ring.c
> 	- pvrdma_utils.c
> - Device emulation:
> 	- pvrdma_rm.c
> - Back-end interface:
> 	- pvrdma_backend.c
> 
> So in the future, when starting to work on virtio-rdma device we will move
> the generic code to generic directory.
> 
> Any reason why we want to split it now, when we have only one device?

To make it easier for me to ignore pvrdma stuff and review the generic stuff.

> > 
> > I suspect that without virtio we won't be able to do any future
> > extensions.
> 
> As i see it these are two different issues, virtio RDMA device is on our
> plate but the contribution of VMWare pvrdma device to QEMU is no doubt a
> real advantage that will allow customers that runs ESX to easy move to QEMU.

I don't have anything against it but I'm not really interested in
reviewing it either.

> > 
> > >  - It exposes 3 BARs:
> > >     BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
> > >             completions
> > >     BAR 1 - Configuration of registers
> > 
> > What does this mean?
> 
> Device control operations:
> 	- Setting of interrupt mask.
> 	- Setup of Device/Driver shared configuration area.
> 	- Reset device, activate device etc.
> 	- Device commands such as create QP, create MR etc.
> 
> > 
> > >     BAR 2 - UAR, used to pass HW commands from driver.
> > 
> > A detailed description of above belongs in documentation.
> 
> Will do.
> 
> > 
> > >  - The device performs internal management of the RDMA
> > >    resources (PDs, CQs, QPs, ...), meaning the objects
> > >    are not directly coupled to a physical RDMA device resources.
> > 
> > I am wondering how do you make connections? QP#s are exposed on
> > the wire during connection management.
> 
> QP#s that guest sees are the QP#s that are used on the wire.
> The meaning of "internal management of the RDMA resources" is that we keep
> context of internal QP in device (ex rings).

The question was that you need to parse CM/MAD etc messages if you
need to change QP#s on the fly, and that code does not seem
to be there. I guess the answer is that
a bunch of this stuff is just broken or non-spec compliant.


> > 
> > > The pvrdma backend is an ibdevice interface that can be exposed
> > > either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> > > or an HCA SRIOV function(VF/PF).
> > > Note that ibdevice interfaces can't be shared between pvrdma devices,
> > > each one requiring a separate instance (rxe or SRIOV VF).
> > 
> > So what's the advantage of this over pass-through then?
> > 
> > 
> > > 
> > > Tests and performance
> > > =====================
> > > Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> > > and Mellanox ConnectX4 HCAs with:
> > >   - VMs in the same host
> > >   - VMs in different hosts 
> > >   - VMs to bare metal.
> > > 
> > > The best performance achieved with ConnectX HCAs and buffer size
> > > bigger than 1MB which was the line rate ~ 50Gb/s.
> > > The conclusion is that using the PVRDMA device there are no
> > > actual performance penalties compared to bare metal for big enough
> > > buffers (which is quite common when using RDMA), while allowing
> > > memory overcommit.
> > > 
> > > Marcel Apfelbaum (3):
> > >   mem: add share parameter to memory-backend-ram
> > >   docs: add pvrdma device documentation.
> > >   MAINTAINERS: add entry for hw/net/pvrdma
> > > 
> > > Yuval Shaia (2):
> > >   pci/shpc: Move function to generic header file
> > >   pvrdma: initial implementation
> > > 
> > >  MAINTAINERS                         |   7 +
> > >  Makefile.objs                       |   1 +
> > >  backends/hostmem-file.c             |  25 +-
> > >  backends/hostmem-ram.c              |   4 +-
> > >  backends/hostmem.c                  |  21 +
> > >  configure                           |   9 +-
> > >  default-configs/arm-softmmu.mak     |   2 +
> > >  default-configs/i386-softmmu.mak    |   1 +
> > >  default-configs/x86_64-softmmu.mak  |   1 +
> > >  docs/pvrdma.txt                     | 145 ++++++
> > >  exec.c                              |  26 +-
> > >  hw/net/Makefile.objs                |   7 +
> > >  hw/net/pvrdma/pvrdma.h              | 179 +++++++
> > >  hw/net/pvrdma/pvrdma_backend.c      | 986 ++++++++++++++++++++++++++++++++++++
> > >  hw/net/pvrdma/pvrdma_backend.h      |  74 +++
> > >  hw/net/pvrdma/pvrdma_backend_defs.h |  68 +++
> > >  hw/net/pvrdma/pvrdma_cmd.c          | 338 ++++++++++++
> > >  hw/net/pvrdma/pvrdma_defs.h         | 121 +++++
> > >  hw/net/pvrdma/pvrdma_dev_api.h      | 580 +++++++++++++++++++++
> > >  hw/net/pvrdma/pvrdma_dev_ring.c     | 138 +++++
> > >  hw/net/pvrdma/pvrdma_dev_ring.h     |  42 ++
> > >  hw/net/pvrdma/pvrdma_ib_verbs.h     | 399 +++++++++++++++
> > >  hw/net/pvrdma/pvrdma_main.c         | 664 ++++++++++++++++++++++++
> > >  hw/net/pvrdma/pvrdma_qp_ops.c       | 187 +++++++
> > >  hw/net/pvrdma/pvrdma_qp_ops.h       |  26 +
> > >  hw/net/pvrdma/pvrdma_ring.h         | 134 +++++
> > >  hw/net/pvrdma/pvrdma_rm.c           | 791 +++++++++++++++++++++++++++++
> > >  hw/net/pvrdma/pvrdma_rm.h           |  54 ++
> > >  hw/net/pvrdma/pvrdma_rm_defs.h      | 111 ++++
> > >  hw/net/pvrdma/pvrdma_types.h        |  37 ++
> > >  hw/net/pvrdma/pvrdma_utils.c        | 133 +++++
> > >  hw/net/pvrdma/pvrdma_utils.h        |  41 ++
> > >  hw/net/pvrdma/trace-events          |   9 +
> > >  hw/pci/shpc.c                       |  11 +-
> > >  include/exec/memory.h               |  23 +
> > >  include/exec/ram_addr.h             |   3 +-
> > >  include/hw/pci/pci_ids.h            |   3 +
> > >  include/qemu/cutils.h               |  10 +
> > >  include/qemu/osdep.h                |   2 +-
> > >  include/sysemu/hostmem.h            |   2 +-
> > >  include/sysemu/kvm.h                |   2 +-
> > >  memory.c                            |  16 +-
> > >  util/oslib-posix.c                  |   4 +-
> > >  util/oslib-win32.c                  |   2 +-
> > >  44 files changed, 5378 insertions(+), 61 deletions(-)
> > >  create mode 100644 docs/pvrdma.txt
> > >  create mode 100644 hw/net/pvrdma/pvrdma.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_backend.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_backend.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_backend_defs.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_defs.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_main.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_ring.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_rm.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_rm.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_rm_defs.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_types.h
> > >  create mode 100644 hw/net/pvrdma/pvrdma_utils.c
> > >  create mode 100644 hw/net/pvrdma/pvrdma_utils.h
> > >  create mode 100644 hw/net/pvrdma/trace-events
> > > 
> > > -- 
> > > 2.13.5

      reply	other threads:[~2017-12-20 18:05 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-17 12:54 [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 1/5] pci/shpc: Move function to generic header file Marcel Apfelbaum
2017-12-17 18:16   ` Philippe Mathieu-Daudé
2017-12-17 19:03     ` Yuval Shaia
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 2/5] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 3/5] docs: add pvrdma device documentation Marcel Apfelbaum
2017-12-19 17:47   ` Michael S. Tsirkin
2017-12-20 14:45     ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 4/5] pvrdma: initial implementation Marcel Apfelbaum
2017-12-19 16:12   ` Michael S. Tsirkin
2017-12-19 17:29     ` Marcel Apfelbaum
2017-12-19 17:48   ` Michael S. Tsirkin
2017-12-20 15:25     ` Yuval Shaia
2017-12-20 18:01       ` Michael S. Tsirkin
2017-12-19 19:13   ` Philippe Mathieu-Daudé
2017-12-20  4:08     ` Michael S. Tsirkin
2017-12-20 14:46       ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 5/5] MAINTAINERS: add entry for hw/net/pvrdma Marcel Apfelbaum
2017-12-19 17:49   ` Michael S. Tsirkin
2017-12-19 18:05 ` [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Michael S. Tsirkin
2017-12-20 15:07   ` Marcel Apfelbaum
2017-12-21  0:05     ` Michael S. Tsirkin
2017-12-21  7:27       ` Yuval Shaia
2017-12-21 14:22         ` Michael S. Tsirkin
2017-12-21 15:59           ` Marcel Apfelbaum
2017-12-21 20:46             ` Michael S. Tsirkin
2017-12-21 22:30               ` Yuval Shaia
2017-12-22  4:58                 ` Marcel Apfelbaum
2017-12-20 17:56   ` Yuval Shaia
2017-12-20 18:05     ` Michael S. Tsirkin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171220200159-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=imammedo@redhat.com \
    --cc=marcel@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=yuval.shaia@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.