From: Yuval Shaia <yuval.shaia@oracle.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel@redhat.com>,
qemu-devel@nongnu.org, ehabkost@redhat.com, imammedo@redhat.com,
pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation
Date: Wed, 20 Dec 2017 19:56:47 +0200 [thread overview]
Message-ID: <20171220175646.GA2729@yuvallap> (raw)
In-Reply-To: <20171219194951-mutt-send-email-mst@kernel.org>
On Tue, Dec 19, 2017 at 08:05:18PM +0200, Michael S. Tsirkin wrote:
> On Sun, Dec 17, 2017 at 02:54:52PM +0200, Marcel Apfelbaum wrote:
> > RFC -> V2:
> > - Full implementation of the pvrdma device
> > - Backend is an ibdevice interface, no need for the KDBR module
> >
> > General description
> > ===================
> > PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> > It works with its Linux Kernel driver AS IS, no need for any special guest
> > modifications.
> >
> > While it complies with the VMware device, it can also communicate with bare
> > metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> > can work with Soft-RoCE (rxe).
> >
> > It does not require the whole guest RAM to be pinned
>
> What happens if guest attempts to register all its memory?
>
> > allowing memory
> > over-commit
> > and, even if not implemented yet, migration support will be
> > possible with some HW assistance.
>
> What does "HW assistance" mean here?
> Can it work with any existing hardware?
>
> >
> > Design
> > ======
> > - Follows the behavior of VMware's pvrdma device, however is not tightly
> > coupled with it
>
> Everything seems to be in pvrdma. Since it's not coupled, could you
> split code to pvrdma specific and generic parts?
>
> > and most of the code can be reused if we decide to
> > continue to a Virtio based RDMA device.
The current design takes into account a future code reuse with virtio-rdma
device although not sure it is 100%.
We divided it to four software layers:
- Front-end interface with PCI:
- pvrdma_main.c
- Front-end interface with pvrdma driver:
- pvrdma_cmd.c
- pvrdma_qp_ops.c
- pvrdma_dev_ring.c
- pvrdma_utils.c
- Device emulation:
- pvrdma_rm.c
- Back-end interface:
- pvrdma_backend.c
So in the future, when starting to work on virtio-rdma device we will move
the generic code to generic directory.
Any reason why we want to split it now, when we have only one device?
>
> I suspect that without virtio we won't be able to do any future
> extensions.
As i see it these are two different issues, virtio RDMA device is on our
plate but the contribution of VMWare pvrdma device to QEMU is no doubt a
real advantage that will allow customers that runs ESX to easy move to QEMU.
>
> > - It exposes 3 BARs:
> > BAR 0 - MSIX, utilize 3 vectors for command ring, async events and
> > completions
> > BAR 1 - Configuration of registers
>
> What does this mean?
Device control operations:
- Setting of interrupt mask.
- Setup of Device/Driver shared configuration area.
- Reset device, activate device etc.
- Device commands such as create QP, create MR etc.
>
> > BAR 2 - UAR, used to pass HW commands from driver.
>
> A detailed description of above belongs in documentation.
Will do.
>
> > - The device performs internal management of the RDMA
> > resources (PDs, CQs, QPs, ...), meaning the objects
> > are not directly coupled to a physical RDMA device resources.
>
> I am wondering how do you make connections? QP#s are exposed on
> the wire during connection management.
QP#s that guest sees are the QP#s that are used on the wire.
The meaning of "internal management of the RDMA resources" is that we keep
context of internal QP in device (ex rings).
>
> > The pvrdma backend is an ibdevice interface that can be exposed
> > either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> > or an HCA SRIOV function(VF/PF).
> > Note that ibdevice interfaces can't be shared between pvrdma devices,
> > each one requiring a separate instance (rxe or SRIOV VF).
>
> So what's the advantage of this over pass-through then?
>
>
> >
> > Tests and performance
> > =====================
> > Tested with SoftRoCE backend (rxe)/Mellanox ConnectX3,
> > and Mellanox ConnectX4 HCAs with:
> > - VMs in the same host
> > - VMs in different hosts
> > - VMs to bare metal.
> >
> > The best performance achieved with ConnectX HCAs and buffer size
> > bigger than 1MB which was the line rate ~ 50Gb/s.
> > The conclusion is that using the PVRDMA device there are no
> > actual performance penalties compared to bare metal for big enough
> > buffers (which is quite common when using RDMA), while allowing
> > memory overcommit.
> >
> > Marcel Apfelbaum (3):
> > mem: add share parameter to memory-backend-ram
> > docs: add pvrdma device documentation.
> > MAINTAINERS: add entry for hw/net/pvrdma
> >
> > Yuval Shaia (2):
> > pci/shpc: Move function to generic header file
> > pvrdma: initial implementation
> >
> > MAINTAINERS | 7 +
> > Makefile.objs | 1 +
> > backends/hostmem-file.c | 25 +-
> > backends/hostmem-ram.c | 4 +-
> > backends/hostmem.c | 21 +
> > configure | 9 +-
> > default-configs/arm-softmmu.mak | 2 +
> > default-configs/i386-softmmu.mak | 1 +
> > default-configs/x86_64-softmmu.mak | 1 +
> > docs/pvrdma.txt | 145 ++++++
> > exec.c | 26 +-
> > hw/net/Makefile.objs | 7 +
> > hw/net/pvrdma/pvrdma.h | 179 +++++++
> > hw/net/pvrdma/pvrdma_backend.c | 986 ++++++++++++++++++++++++++++++++++++
> > hw/net/pvrdma/pvrdma_backend.h | 74 +++
> > hw/net/pvrdma/pvrdma_backend_defs.h | 68 +++
> > hw/net/pvrdma/pvrdma_cmd.c | 338 ++++++++++++
> > hw/net/pvrdma/pvrdma_defs.h | 121 +++++
> > hw/net/pvrdma/pvrdma_dev_api.h | 580 +++++++++++++++++++++
> > hw/net/pvrdma/pvrdma_dev_ring.c | 138 +++++
> > hw/net/pvrdma/pvrdma_dev_ring.h | 42 ++
> > hw/net/pvrdma/pvrdma_ib_verbs.h | 399 +++++++++++++++
> > hw/net/pvrdma/pvrdma_main.c | 664 ++++++++++++++++++++++++
> > hw/net/pvrdma/pvrdma_qp_ops.c | 187 +++++++
> > hw/net/pvrdma/pvrdma_qp_ops.h | 26 +
> > hw/net/pvrdma/pvrdma_ring.h | 134 +++++
> > hw/net/pvrdma/pvrdma_rm.c | 791 +++++++++++++++++++++++++++++
> > hw/net/pvrdma/pvrdma_rm.h | 54 ++
> > hw/net/pvrdma/pvrdma_rm_defs.h | 111 ++++
> > hw/net/pvrdma/pvrdma_types.h | 37 ++
> > hw/net/pvrdma/pvrdma_utils.c | 133 +++++
> > hw/net/pvrdma/pvrdma_utils.h | 41 ++
> > hw/net/pvrdma/trace-events | 9 +
> > hw/pci/shpc.c | 11 +-
> > include/exec/memory.h | 23 +
> > include/exec/ram_addr.h | 3 +-
> > include/hw/pci/pci_ids.h | 3 +
> > include/qemu/cutils.h | 10 +
> > include/qemu/osdep.h | 2 +-
> > include/sysemu/hostmem.h | 2 +-
> > include/sysemu/kvm.h | 2 +-
> > memory.c | 16 +-
> > util/oslib-posix.c | 4 +-
> > util/oslib-win32.c | 2 +-
> > 44 files changed, 5378 insertions(+), 61 deletions(-)
> > create mode 100644 docs/pvrdma.txt
> > create mode 100644 hw/net/pvrdma/pvrdma.h
> > create mode 100644 hw/net/pvrdma/pvrdma_backend.c
> > create mode 100644 hw/net/pvrdma/pvrdma_backend.h
> > create mode 100644 hw/net/pvrdma/pvrdma_backend_defs.h
> > create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
> > create mode 100644 hw/net/pvrdma/pvrdma_defs.h
> > create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
> > create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.c
> > create mode 100644 hw/net/pvrdma/pvrdma_dev_ring.h
> > create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
> > create mode 100644 hw/net/pvrdma/pvrdma_main.c
> > create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
> > create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
> > create mode 100644 hw/net/pvrdma/pvrdma_ring.h
> > create mode 100644 hw/net/pvrdma/pvrdma_rm.c
> > create mode 100644 hw/net/pvrdma/pvrdma_rm.h
> > create mode 100644 hw/net/pvrdma/pvrdma_rm_defs.h
> > create mode 100644 hw/net/pvrdma/pvrdma_types.h
> > create mode 100644 hw/net/pvrdma/pvrdma_utils.c
> > create mode 100644 hw/net/pvrdma/pvrdma_utils.h
> > create mode 100644 hw/net/pvrdma/trace-events
> >
> > --
> > 2.13.5
next prev parent reply other threads:[~2017-12-20 17:57 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-17 12:54 [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 1/5] pci/shpc: Move function to generic header file Marcel Apfelbaum
2017-12-17 18:16 ` Philippe Mathieu-Daudé
2017-12-17 19:03 ` Yuval Shaia
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 2/5] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 3/5] docs: add pvrdma device documentation Marcel Apfelbaum
2017-12-19 17:47 ` Michael S. Tsirkin
2017-12-20 14:45 ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 4/5] pvrdma: initial implementation Marcel Apfelbaum
2017-12-19 16:12 ` Michael S. Tsirkin
2017-12-19 17:29 ` Marcel Apfelbaum
2017-12-19 17:48 ` Michael S. Tsirkin
2017-12-20 15:25 ` Yuval Shaia
2017-12-20 18:01 ` Michael S. Tsirkin
2017-12-19 19:13 ` Philippe Mathieu-Daudé
2017-12-20 4:08 ` Michael S. Tsirkin
2017-12-20 14:46 ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 5/5] MAINTAINERS: add entry for hw/net/pvrdma Marcel Apfelbaum
2017-12-19 17:49 ` Michael S. Tsirkin
2017-12-19 18:05 ` [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Michael S. Tsirkin
2017-12-20 15:07 ` Marcel Apfelbaum
2017-12-21 0:05 ` Michael S. Tsirkin
2017-12-21 7:27 ` Yuval Shaia
2017-12-21 14:22 ` Michael S. Tsirkin
2017-12-21 15:59 ` Marcel Apfelbaum
2017-12-21 20:46 ` Michael S. Tsirkin
2017-12-21 22:30 ` Yuval Shaia
2017-12-22 4:58 ` Marcel Apfelbaum
2017-12-20 17:56 ` Yuval Shaia [this message]
2017-12-20 18:05 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171220175646.GA2729@yuvallap \
--to=yuval.shaia@oracle.com \
--cc=ehabkost@redhat.com \
--cc=imammedo@redhat.com \
--cc=marcel@redhat.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).