qemu-devel.nongnu.org archive mirror
From: Si-Wei Liu <si-wei.liu@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>
Cc: qemu-devel@nongnu.org, Shannon <shannon.nelson@amd.com>,
	Parav Pandit <parav@mellanox.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	yin31149@gmail.com, Jason Wang <jasowang@redhat.com>,
	Yajun Wu <yajunw@nvidia.com>,
	Zhu Lingshan <lingshan.zhu@intel.com>,
	Lei Yang <leiyang@redhat.com>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	Juan Quintela <quintela@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>,
	Gautam Dawar <gdawar@xilinx.com>
Subject: Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
Date: Fri, 3 Nov 2023 13:19:22 -0700	[thread overview]
Message-ID: <b4cdaabd-45b1-4ff7-a267-05230a754588@oracle.com> (raw)
In-Reply-To: <CAJaqyWeOG0Eoc8W9FYFzJ9r1ENd3Cd1oLGF=ggJvWth2xD6u9A@mail.gmail.com>



On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>>> Current memory operations like pinning may take a lot of time at the
>>> destination.  Currently they are done after the source of the migration is
>>> stopped, and before the workload is resumed at the destination.  This is a
>>> period where neither traffic can flow nor the VM workload can continue
>>> (downtime).
>>>
>>> We can do better, as we know the memory layout of the guest RAM at the
>>> destination from the moment the migration starts.  Moving that operation
>>> earlier allows QEMU to communicate the maps to the kernel while the
>>> workload is still running on the source, so Linux can start mapping them.
>>> Ideally, all of the IOMMU is configured, but even if the vDPA parent driver
>>> uses an on-chip IOMMU and .set_map we still save all of the pinning time.
>> I get what you want to say, though I'm not sure how pinning is relevant to
>> the on-chip IOMMU and .set_map here; essentially, pinning is required for
>> every parent vdpa driver that performs DMA and hence doesn't want VM pages
>> to move around.
> Basically I was highlighting that the work done under .set_map is not only
> pinning, but pinning is a significant fraction of it. It can be reworded or
> deleted for sure.
>
>>>
>>> Note that further device setup at the end of the migration may alter the
>>> guest memory layout. But, same as the previous point, many operations are
>>> still done incrementally, like memory pinning, so we're saving time anyway.
>>>
>>> The first bunch of patches just reorganizes the code so that memory-related
>>> operation parameters are shared between all vhost_vdpa devices.  This is
>>> because the destination does not know which vhost_vdpa struct will have the
>>> registered listener member, so it is easier to place them in a shared
>>> struct rather than to keep them in the vhost_vdpa struct.  Future versions
>>> may squash or omit these patches.
>> It looks like this VhostVDPAShared facility (patches 1-13) is also what I
>> need in my SVQ descriptor group series [*], for which I've built a similar
>> construct there. If possible, please try to merge this in ASAP. I'll rework
>> my series on top of that.
>>
>> [*]
>> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>>
> I can send it individually, for sure.
>
> MST, Jason, can this first part be merged? It doesn't add a lot by
> itself but it helps pave the way for future changes.
If it cannot, it doesn't matter. I can pick it from here and get my 
series posted with your patches 1-13 applied upfront. This should work, 
I think?
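For context, this is roughly the kind of shared struct patches 1-13 build up,
one instance per vhost-vdpa character device instead of one per vhost_vdpa
virtqueue pair. The field names below are only inferred from the patch
subjects of this series, not copied from the actual code, so treat them as
assumptions:

    #include "qemu/osdep.h"
    #include "qemu/queue.h"
    #include "exec/memory.h"               /* MemoryListener */
    #include "hw/virtio/vhost-iova-tree.h" /* VhostIOVATree */
    #include <linux/vhost_types.h>         /* struct vhost_vdpa_iova_range */

    /* Shared by all vhost_vdpa devices backed by the same /dev/vhost-vdpa-N,
     * so the destination does not need to know which vhost_vdpa owns the
     * registered memory listener. */
    typedef struct VhostVDPAShared {
        int device_fd;                  /* /dev/vhost-vdpa-N file descriptor */
        struct vhost_vdpa_iova_range iova_range;
        VhostIOVATree *iova_tree;       /* IOVA allocations for shadow vqs */
        MemoryListener listener;        /* guest memory map/unmap events */
        uint64_t backend_cap;
        bool iotlb_batch_begin_sent;
        bool shadow_data;
        QLIST_HEAD(, vdpa_iommu) iommu_list;
    } VhostVDPAShared;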

>>>
>>> Only tested with vdpa_sim. I'm sending this before a full benchmark, as
>>> some work like [1] can be based on it, and Si-Wei agreed to benchmark this
>>> series with his experience.
>> I haven't done the full benchmark against pre-mapping at the destination
>> yet, though one observation is that the destination QEMU very easily gets
>> stuck for a long time while in the middle of pinning pages. During this
>> period, any client issuing a read-only QMP query or executing an HMP info
>> command gets frozen indefinitely (for a duration proportional to the size
>> of the memory being pinned). Is it possible to unblock those QMP requests
>> or HMP commands (at least the read-only ones) while in migration? Yield
>> from the load_setup coroutine and spawn another thread?
>>
> Ok, I wasn't aware of that.
>
> I think we cannot yield in a coroutine and wait for an ioctl.
I was wondering if we need a separate coroutine outside the general
migration path to support this special code, without overloading
load_setup or its callers. For instance, it should be possible to unblock
the source from sending guest RAM while allowing the destination to pin
pages in parallel.

Regardless, a separate thread is needed to carry out all the heavy
lifting, i.e. the ioctl(2) or write(2) syscalls that map and pin pages.
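To make the write(2) part concrete: on the QEMU side each mapping ends up as
a VHOST_IOTLB_UPDATE message written to the vhost-vdpa device fd (this is
essentially what vhost_vdpa_dma_map() does), and it is that synchronous write
which the kernel turns into page pinning and IOMMU/.set_map work. A minimal
userspace sketch, with error handling trimmed and the fd/asid assumed to come
from the caller:

    #include <linux/vhost.h>   /* pulls in vhost_types.h: VHOST_IOTLB_* */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Ask the vhost-vdpa kernel driver to map (and pin) one HVA range. */
    static int vdpa_map_one(int device_fd, unsigned int asid,
                            uint64_t iova, uint64_t size, void *hva)
    {
        struct vhost_msg_v2 msg;

        memset(&msg, 0, sizeof(msg));
        msg.type = VHOST_IOTLB_MSG_V2;
        msg.asid = asid;                     /* was 'reserved' before ASID support */
        msg.iotlb.iova = iova;
        msg.iotlb.size = size;
        msg.iotlb.uaddr = (uint64_t)(uintptr_t)hva;
        msg.iotlb.perm = VHOST_ACCESS_RW;
        msg.iotlb.type = VHOST_IOTLB_UPDATE; /* pinning happens under this call */

        /* The kernel handles the message synchronously; for large guests this
         * write can block for a long time, which is why it must not run on
         * the thread that services QMP/HMP. */
        if (write(device_fd, &msg, sizeof(msg)) != sizeof(msg)) {
            return -1;
        }
        return 0;
    }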


> One
> option that came to my mind is to effectively use another thread, and
> use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> the migration.
Yes, a separate thread is needed anyway.

>   I'm not sure if there are more points where we can
> check the barrier and tell the migration to continue or stop though.
I think there is; e.g., what if the dma_map fails? There must be a
checkpoint for that.
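One way to combine the two points (offload the blocking syscalls to a thread,
but keep a checkpoint where the migration can be failed if a map fails): a
sketch using QEMU's thread helpers. The vdpa_map_all() helper below is
hypothetical, standing in for whatever issues the VHOST_IOTLB_UPDATE calls;
.load_setup would call the _start() helper and a later checkpoint (e.g.
before starting the device) would call _wait() and cancel on error.

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct VdpaMapJob {
        QemuThread thread;
        QemuEvent done;     /* signalled when all maps have been submitted */
        int ret;            /* 0 on success, negative on dma_map failure */
        void *opaque;       /* whatever vdpa_map_all() needs (fd, ranges, ...) */
    } VdpaMapJob;

    /* Hypothetical helper doing the actual ioctl(2)/write(2) map+pin calls. */
    extern int vdpa_map_all(void *opaque);

    static void *vdpa_map_worker(void *arg)
    {
        VdpaMapJob *job = arg;

        job->ret = vdpa_map_all(job->opaque);  /* may block for a long time */
        qemu_event_set(&job->done);
        return NULL;
    }

    /* From .load_setup: start pinning while RAM is still being received. */
    static void vdpa_map_job_start(VdpaMapJob *job, void *opaque)
    {
        job->opaque = opaque;
        qemu_event_init(&job->done, false);
        qemu_thread_create(&job->thread, "vdpa-map", vdpa_map_worker, job,
                           QEMU_THREAD_JOINABLE);
    }

    /* Checkpoint: block until mapping finished and report failure so the
     * migration can be cancelled if a dma_map failed. */
    static int vdpa_map_job_wait(VdpaMapJob *job)
    {
        qemu_event_wait(&job->done);
        qemu_thread_join(&job->thread);
        qemu_event_destroy(&job->done);
        return job->ret;
    }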

>
> Another option is to effectively start doing these ioctls in an
> asynchronous way, like io_uring cmds, but I'd like to achieve this
> first.
Yes, io_uring or any async API could be another option, though this would
need new uAPI and additional kernel facilities to support it. Anyway, it's
up to you to decide. :)

Regards,
-Siwei
>> Having said that, I'm not sure if .load_setup is a good fit for what we
>> want to do. Looking at all the current users of .load_setup, either the
>> job can be done instantly or the task is time-bound without trapping into
>> the kernel for too long. Maybe pinning is too special a use case here...
>>
>> -Siwei
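As a reference for the .load_setup question quoted above, this is roughly
where the hook sits: a sketch of registering incremental-migration handlers
with a .load_setup callback. The handler and device names here are
illustrative, not the ones used by patch 18 of the series.

    #include "qemu/osdep.h"
    #include "migration/register.h"

    /* A .load_setup callback runs on the destination as soon as the
     * migration stream starts, long before the source stops the guest, so
     * it is a natural place to kick off mapping/pinning early.  Anything
     * started here must either finish quickly or be handed off to a worker
     * thread, since this path runs in the migration coroutine. */
    static int example_load_setup(QEMUFile *f, void *opaque)
    {
        /* e.g. start the vdpa map job here (see the thread sketch above) */
        return 0;
    }

    static int example_load_cleanup(void *opaque)
    {
        /* undo whatever load_setup started if the migration fails */
        return 0;
    }

    static const SaveVMHandlers example_handlers = {
        .load_setup   = example_load_setup,
        .load_cleanup = example_load_cleanup,
    };

    /* Registration, e.g. from device realize:
     *   register_savevm_live("example-dev", 0, 1, &example_handlers, opaque);
     */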
>>>
>>> Future directions on top of this series may include:
>>>
>>> * Iterative migration of virtio-net devices, as it may reduce downtime per
>>>   [1].  vhost-vdpa net can apply the configuration through CVQ in the
>>>   destination while the source is still migrating.
>>> * Move more things ahead of migration time, like DRIVER_OK.
>>> * Check that the devices of the destination are valid, and cancel the
>>>   migration in case they are not.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>
>>> Eugenio Pérez (18):
>>>     vdpa: add VhostVDPAShared
>>>     vdpa: move iova tree to the shared struct
>>>     vdpa: move iova_range to vhost_vdpa_shared
>>>     vdpa: move shadow_data to vhost_vdpa_shared
>>>     vdpa: use vdpa shared for tracing
>>>     vdpa: move file descriptor to vhost_vdpa_shared
>>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>>     vdpa: move backend_cap to vhost_vdpa_shared
>>>     vdpa: remove msg type of vhost_vdpa
>>>     vdpa: move iommu_list to vhost_vdpa_shared
>>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>>     vdpa: use dev_shared in vdpa_iommu
>>>     vdpa: move memory listener to vhost_vdpa_shared
>>>     vdpa: do not set virtio status bits if unneeded
>>>     vdpa: add vhost_vdpa_load_setup
>>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>>     virtio_net: register incremental migration handlers
>>>
>>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>>    include/net/net.h              |   4 +
>>>    hw/net/virtio-net.c            |  23 +++++
>>>    hw/virtio/vdpa-dev.c           |   7 +-
>>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>>    hw/virtio/trace-events         |  14 +--
>>>    7 files changed, 239 insertions(+), 162 deletions(-)



Thread overview: 34+ messages in thread
2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 01/18] vdpa: add VhostVDPAShared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 02/18] vdpa: move iova tree to the shared struct Eugenio Pérez
2023-11-02  9:36   ` Si-Wei Liu
2023-11-24 17:11     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 03/18] vdpa: move iova_range to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 04/18] vdpa: move shadow_data " Eugenio Pérez
2023-11-02  8:47   ` Si-Wei Liu
2023-11-02 15:45     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 05/18] vdpa: use vdpa shared for tracing Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 06/18] vdpa: move file descriptor to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 07/18] vdpa: move iotlb_batch_begin_sent " Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 08/18] vdpa: move backend_cap " Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 09/18] vdpa: remove msg type of vhost_vdpa Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 10/18] vdpa: move iommu_list to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 11/18] vdpa: use VhostVDPAShared in vdpa_dma_map and unmap Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 12/18] vdpa: use dev_shared in vdpa_iommu Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 13/18] vdpa: move memory listener to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 14/18] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
2023-11-02  8:48   ` Si-Wei Liu
2023-11-02 15:24     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 16/18] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 17/18] vdpa: use shadow_data instead of first device v->shadow_vqs_enabled Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 18/18] virtio_net: register incremental migration handlers Eugenio Pérez
2023-11-02  4:36 ` [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Jason Wang
2023-11-02 10:12 ` Si-Wei Liu
2023-11-02 12:37   ` Eugenio Perez Martin
2023-11-03 20:19     ` Si-Wei Liu [this message]
2023-12-05 14:23       ` Eugenio Perez Martin
2023-12-06  0:36         ` Si-Wei Liu
2023-11-06  4:17     ` Jason Wang
2023-11-03 20:40   ` Si-Wei Liu
2023-11-06  9:04     ` Eugenio Perez Martin
