linux-rdma.vger.kernel.org archive mirror
From: "Dr. David Alan Gilbert" <dgilbert-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: virtio-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b@public.gmane.org,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A@public.gmane.org,
	Marcel Apfelbaum
	<marcel.a-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [Qemu-devel] host side todo list for virtio rdma
Date: Wed, 19 Jul 2017 11:55:50 +0100	[thread overview]
Message-ID: <20170719105549.GB3500@work-vm> (raw)
In-Reply-To: <20170719004721-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

* Michael S. Tsirkin (mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org) wrote:
> Here are some thoughts on bits that are still missing to get a working
> virtio-rdma, with some suggestions. These are very preliminary but I
> feel I kept these in my head (and discussed offline) for too long. All
> of the below is just my personal humble opinion.
> 
> Feature Requirements:
> 
> The basic requirement is to be able to do RDMA to/from
> VM memory, with support for VM migration and/or memory
> overcommit and/or autonuma and/or THP.
> Why are migration/overcommit/autonuma required?
> Without these, you can do RDMA with device passthrough,
> with likely better performance.

Is this solution usable on a system without host-RDMA hardware?
i.e. just to run RDMA between two VMs on the same host
without using something like SoftROCE on the host?

> Feature Non-requirements:
> 
> It's not a requirement to support RDMA without VM exits,
> e.g. like with device passthrough. While avoiding exits improves
> performance, it would be handy for more than just RDMA,
> so there seems to be no reason to require it for RDMA when
> we do not have it for e.g. networking.
> 
> Assumptions:
> 
> It's OK to assume specific hardware capabilities at least initially.
> 
> High level architecture:
> 
> Follows the same lines as most other virtio devices:
> 
> +-----------------------------------
> + 
> + guest kernel
> +             ^
> +-------------|----------------------
> +             v
> + host kernel (kvm, vhost)
> + 
> +             ^
> +-------------|----------------------
> +             v
> + 
> + host userspace (QEMU, vhost-user)
> + 
> +-----------------------------------
> 
> Each request is forwarded by the host kernel to QEMU,
> which executes it using the ibverbs library.

Should that be 'forwarded by guest kernel'?
Is there a guest userspace here as well? Most of the
RDMA NICs seem to have a userspace component.

> Most of this should be implementable host-side using existing
> software. However, several issues remain and would need
> infrastructure changes, as outlined below.
> 
> Host-side todo list for virtio-rdma support:
> 
> - Memory registration for guest userspace.
> 
>   The register-memory-region verb accepts a single virtual address,
>   which supplies both the on-wire key for access and the
>   range of memory to access. The guest kernel turns this into a
>   list of pages (e.g. via get_user_pages); when forwarded to the host
>   this turns into an s/g list of virtual addresses in the QEMU
>   address space.
> 
>   Suggestion: add a new verb, along the lines of ibv_register_physical,
>   which splits these two parameters, accepting the on-wire VA key
>   and separately a list of userspace virtual address/size pairs.
> 
> - Memory registration for guest kernels.
> 
>   Another ability used by some in-kernel users is registering all of memory.
>   Ranges not actually present are never accessed - this is OK as
>   kernel users are trusted. Memory hotplug changes which ranges
>   are present.
> 
>   Suggestion: add some throw-away memory and map all
>   non-present ranges there. Add ibv_reregister_physical_mr or similar
>   API to update mappings on guest memory hotplug/unplug.
> 
> - Memory overcommit/autonuma/THP.
> 
>   This includes techniques such as swap, KSM, COW and page migration.
>   All these rely on ability to move pages around without
>   breaking hardware access.
> 
>   Suggestion: for hardware that supports it,
>   enabling on-demand paging for all registered memory seems
>   to address the issue more or less transparently to guests.
>   This isn't supported by all hardware but might be
>   at least a reasonable first step.
> 
> - Migration: memory tracking.
> 
>   Migration requires detecting hardware access to pages
>   either on write (pre-copy) or any access (post-copy).
> 	Post-copy just requires ODP support to work with
> 	userfaultfd properly.

Can you explain what ODP support is?

>   Pre-copy would require a write-tracking API along
>   the lines of the one exposed by KVM or vhost.
>   Each tracked page would be write-protected (causing faults on
>   hardware access); on a hardware write, a fault is generated
>   and recorded, and the page is made writeable.

Can you write-protect like that from the RDMA hardware?
I'd be surprised if the hardware was happy with that.

> - Migration: moving QP numbers.
> 
>   QP numbers are exposed on the wire and so must move together
>   with the VM.
> 
>   Suggestion: allow specifying QP number when creating a QP.
>   To avoid conflicts between multiple users, initial version can limit
>   library to a single user per device. Multiple VMs can simply
>   attach to distinct VFs.
> 
> - Migration: moving QP state.
> 
>   When migrating the VM, a QP has to be torn down
>   on the source and created on the destination.
>   We have to migrate e.g. the current PSN - but what
>   should happen when a new packet arrives at the source
>   after the QP has been torn down?
> 
>   Suggestion 1: move the QP to a special "suspended" state and
>   ignore packets, or cause the source to retransmit with e.g. an
>   out-of-resources error. The retransmit counter might need to be
>   adjusted, relative to what the guest requested, to account for
>   the extra retransmits.
>   Is there a good existing QP state that does this?
> 
>   Suggestion 2: forward packets to the destination somehow.
>   This might overload the fabric, as we are crossing e.g. the
>   PCI bus multiple times.
> 
> - Migration: network update
> 
>   RoCE v1 and InfiniBand seem to tie connections to
>   hardware-specific GIDs which cannot be moved by software.
> 
>   Suggestion: limit migration to RoCE v2 initially.
> 
> - Migration: packet loss recovery.
> 
>   As a RoCE address moves across the network, the network has
>   to be updated, which takes time; meanwhile, packet loss seems
>   hard to avoid.
> 
>   Suggestion: limit initial support to hardware that is
>   able to recover from occasional packet drops, with
>   some slowdown.
> 
> - Migration: suspend/resume API?
>   It might be easier to pack up the state of all resources,
>   such as all QP numbers, the state of all QPs, etc.,
>   into a single memory buffer, migrate, and then unpack it on
>   the destination.
> 
>   This removes the need for two separate APIs: one for the
>   suspended state and one for specifying the QPN on creation.
> 
>   However, this creates a serialization format that will have to
>   be maintained in a compatible way - it is not clear that
>   the maintenance overhead is worth the potential
>   simplification, if any.
> 
> 
> That's it - I hope this helps, feel free to discuss, preferably copying
> virtio-dev (subscription required for now, people are looking into
> fixing this, sorry about that).

Dave

> Thanks!
> 
> -- 
> MST
> 
--
Dr. David Alan Gilbert / dgilbert-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org / Manchester, UK


Thread overview: 5+ messages
2017-07-19  2:05 host side todo list for virtio rdma Michael S. Tsirkin
     [not found] ` <20170719004721-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-07-19 10:55   ` Dr. David Alan Gilbert [this message]
2017-07-25 14:05     ` [Qemu-devel] " Michael S. Tsirkin
     [not found]       ` <20170725165915-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-07-28 18:02         ` Dr. David Alan Gilbert
2017-07-28 23:48           ` Michael S. Tsirkin
