From: "Michael S. Tsirkin" <mst@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>
Cc: qemu-devel@nongnu.org, ehabkost@redhat.com, imammedo@redhat.com,
	yuval.shaia@oracle.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCH V2 3/5] docs: add pvrdma device documentation
Date: Tue, 19 Dec 2017 19:47:55 +0200	[thread overview]
Message-ID: <20171219194739-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20171217125457.3429-4-marcel@redhat.com>

On Sun, Dec 17, 2017 at 02:54:55PM +0200, Marcel Apfelbaum wrote:
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  docs/pvrdma.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
>  create mode 100644 docs/pvrdma.txt
> 
> diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> new file mode 100644
> index 0000000000..74c5cf2495
> --- /dev/null
> +++ b/docs/pvrdma.txt
> @@ -0,0 +1,145 @@
> +Paravirtualized RDMA Device (PVRDMA)
> +====================================
> +
> +
> +1. Description
> +===============
> +PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> +It works with its Linux kernel driver as is; no special guest
> +modifications are needed.
> +
> +While it complies with the VMware device, it can also communicate with
> +bare-metal RDMA-enabled machines. It does not require an RDMA HCA in the
> +host; it can work with Soft-RoCE (rxe).
> +
> +It does not require the whole guest RAM to be pinned, allowing memory
> +over-commit. Migration, while not yet implemented, will be possible with
> +some hardware assistance.
> +
> +A project presentation accompanies this document:
> +- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
> +
> +
> +
> +2. Setup
> +========
> +
> +
> +2.1 Guest setup
> +===============
> +Fedora 27+ kernels work out of the box; older distributions
> +require updating the kernel to 4.14 to get the pvrdma driver.
> +
> +However, the libpvrdma library needed by user-level software is not yet
> +packaged by the distributions, so the rdma-core library needs to be
> +compiled and optionally installed.
> +
> +Please follow the instructions at:
> +  https://github.com/linux-rdma/rdma-core.git
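> +
> +A minimal build sketch (the build.sh entry point and output paths follow the
> +upstream rdma-core README; verify there before relying on them):
> +

```shell
# Clone and build rdma-core, which provides the user-level verbs
# libraries needed by pvrdma but not yet packaged by distributions.
git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
bash build.sh                 # cmake/ninja build into ./build
# Resulting libraries are under build/lib; install system-wide if desired.
```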
> +
> +
> +2.2 Host Setup
> +==============
> +The pvrdma backend is an ibdevice interface that can be exposed
> +either by a Soft-RoCE (rxe) device on machines with no RDMA device,
> +or by an HCA SR-IOV function (VF/PF).
> +Note that ibdevice interfaces can't be shared between pvrdma devices;
> +each one requires a separate instance (rxe or SR-IOV VF).
> +
> +
> +2.2.1 Soft-RoCE backend (rxe)
> +=============================
> +A stable version of rxe is required; Fedora 27+ or a Linux
> +kernel 4.14+ is preferred.
> +
> +The rdma_rxe module is part of the Linux Kernel but not loaded by default.
> +Install the User Level library (librxe) following the instructions from:
> +https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
> +
> +Associate an Ethernet interface with rxe by running:
> +   rxe_cfg add eth0
> +An rxe0 ibdevice interface will be created and can be used as the pvrdma backend.
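> +
> +Collected into one place, the host-side rxe setup might look like the
> +following (eth0 is an example interface name; run as root):
> +

```shell
# Load the soft-RoCE module and bind it to an Ethernet interface
modprobe rdma_rxe
rxe_cfg start
rxe_cfg add eth0          # creates the rxe0 ibdevice
# Sanity-check that the new ibdevice exists and its port is up
ibv_devices
ibv_devinfo -d rxe0
```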
> +
> +
> +2.2.2 RDMA device Virtual Function backend
> +==========================================
> +Nothing special is required; the pvrdma device can work not only with
> +Ethernet links but also with InfiniBand links.
> +All that is needed is an ibdevice with an active port; for Mellanox cards
> +it will be something like mlx5_6, which can be used as the backend.
> +
> +
> +2.2.3 QEMU setup
> +================
> +Configure QEMU with the --enable-rdma flag, after installing
> +the required RDMA libraries.
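> +
> +For example (the target list is illustrative; --enable-rdma is the relevant
> +flag):
> +

```shell
# Build QEMU with RDMA support; requires libibverbs/librdmacm dev packages
./configure --target-list=x86_64-softmmu --enable-rdma
make -j"$(nproc)"
```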
> +
> +
> +3. Usage
> +========
> +Currently the device works only with memory-backend RAM,
> +which must be marked as "shared":
> +   -m 1G \
> +   -object memory-backend-ram,id=mb1,size=1G,share \
> +   -numa node,memdev=mb1 \
> +
> +The pvrdma device is composed of two functions:
> + - Function 0 is a vmxnet3 Ethernet device which is redundant in the guest
> +   but is required to pass the ibdevice GID using its MAC address.
> +   Examples:
> +     For an rxe backend using the eth0 interface, use its MAC:
> +       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> +     For an SR-IOV VF, take the Ethernet interface exposed by it:
> +       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> + - Function 1 is the actual device:
> +       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> +   where the ibdevice can be rxe or an RDMA VF (e.g. mlx5_4)
> + Note: Pay special attention that the GID at backend-gid-idx matches the
> + vmxnet3 device's MAC. The conversion rules are part of the RoCE spec, but
> + since manual conversion is not required, spotting problems is not hard:
> +    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> +             MAC: 7c:fe:90:cb:74:3a
> +    Note the difference between the first byte of the MAC and the GID.
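> +
> +The conversion above is the standard EUI-64 derivation of a link-local GID
> +from a MAC: flip the universal/local bit (0x02) of the first byte and insert
> +ff:fe in the middle. A small shell sketch reproducing the example:
> +

```shell
# Derive the link-local RoCE GID from a MAC address (EUI-64 mapping)
mac="7c:fe:90:cb:74:3a"
IFS=: read -r b1 b2 b3 b4 b5 b6 <<EOF
$mac
EOF
# Flip the universal/local bit of the first byte and insert ff:fe
b1=$(printf '%02x' $(( 0x$b1 ^ 0x02 )))
gid="fe80:0000:0000:0000:${b1}${b2}:${b3}ff:fe${b4}:${b5}${b6}"
echo "$gid"    # fe80:0000:0000:0000:7efe:90ff:fecb:743a
```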
> +
> +
> +4. Implementation details
> +=========================
> +The device acts like a proxy between the Guest Driver and the host
> +ibdevice interface.
> +On configuration path:
> + - For every hardware resource request (PD/QP/CQ/...) the pvrdma device
> +   requests a resource from the backend interface, maintaining a 1-1 mapping
> +   between the guest and the host.
> +On data path:
> + - Every post_send/receive received from the guest is converted into
> +   a post_send/receive for the backend. The buffers' data is not touched
> +   or copied, resulting in near bare-metal performance for large enough buffers.
> + - Completions from the backend interface will result in completions for
> +   the pvrdma device.


Where's the host/guest interface documented?

> +
> +
> +5. Limitations
> +==============
> +- The device is obviously limited by the guest Linux driver's implementation
> +  of the VMware device API.
> +- The memory registration mechanism requires an mremap for every page in the
> +  buffer in order to map it to a contiguous virtual address range. Since this
> +  is not on the data path, it should not matter much.
> +- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
> +  attached, so it can't work with huge pages. This limitation will be
> +  addressed in the future; however, QEMU allocates guest RAM with
> +  MADV_HUGEPAGE, so if enough huge pages are available, QEMU will use them.
> +- As previously stated, migration is not supported yet; however, with some
> +  hardware support it will be possible.
> +
> +
> +
> +6. Performance
> +==============
> +By design the pvrdma device exits on each post-send/receive, so for small
> +buffers performance is affected; for medium buffers it gets close to bare
> +metal, and from 1MB buffers and up it reaches bare-metal performance.
> +(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
> +device.)
> +
> +All the above assumes no memory registration is done on data path.
> -- 
> 2.13.5
