[RFC PATCH 00/18] Map memory at destination .load

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
@ 2023-10-19 14:34 Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 01/18] vdpa: add VhostVDPAShared Eugenio Pérez
                   ` (19 more replies)
  0 siblings, 20 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neigher traffic can flow, nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment the migration starts.  Moving that operation allows
QEMU to communicate the kernel the maps while the workload is still running in
the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
saving all the pinning time.

Note that further devices setup at the end of the migration may alter the guest
memory layout. But same as the previous point, many operations are still done
incrementally, like memory pinning, so we're saving time anyway.

The first bunch of patches just reorganizes the code, so memory related
operation parameters are shared between all vhost_vdpa devices.  This is
because the destination does not know what vhost_vdpa struct will have the
registered listener member, so it is easier to place them in a shared struct
rather to keep them in vhost_vdpa struct.  Future version may squash or omit
these patches.

Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
like [1] can be based on it, and Si-Wei agreed on benchmark this series with
his experience.

Future directions on top of this series may include:
* Iterative migration of virtio-net devices, as it may reduce downtime per [1].
  vhost-vdpa net can apply the configuration through CVQ in the destination
  while the source is still migrating.
* Move more things ahead of migration time, like DRIVER_OK.
* Check that the devices of the destination are valid, and cancel the migration
  in case it is not.

[1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/

Eugenio Pérez (18):
  vdpa: add VhostVDPAShared
  vdpa: move iova tree to the shared struct
  vdpa: move iova_range to vhost_vdpa_shared
  vdpa: move shadow_data to vhost_vdpa_shared
  vdpa: use vdpa shared for tracing
  vdpa: move file descriptor to vhost_vdpa_shared
  vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
  vdpa: move backend_cap to vhost_vdpa_shared
  vdpa: remove msg type of vhost_vdpa
  vdpa: move iommu_list to vhost_vdpa_shared
  vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
  vdpa: use dev_shared in vdpa_iommu
  vdpa: move memory listener to vhost_vdpa_shared
  vdpa: do not set virtio status bits if unneeded
  vdpa: add vhost_vdpa_load_setup
  vdpa: add vhost_vdpa_net_load_setup NetClient callback
  vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
  virtio_net: register incremental migration handlers

 include/hw/virtio/vhost-vdpa.h |  43 +++++---
 include/net/net.h              |   4 +
 hw/net/virtio-net.c            |  23 +++++
 hw/virtio/vdpa-dev.c           |   7 +-
 hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
 net/vhost-vdpa.c               | 127 ++++++++++++-----------
 hw/virtio/trace-events         |  14 +--
 7 files changed, 239 insertions(+), 162 deletions(-)

-- 
2.39.3

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 01/18] vdpa: add VhostVDPAShared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 02/18] vdpa: move iova tree to the shared struct Eugenio Pérez
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

It will hold properties shared among all vhost_vdpa instances associated
with of the same device.  For example, we just need one iova_tree or one
memory listener for the entire device.

Next patches will register the vhost_vdpa memory listener at the
beginning of the VM migration at the destination. This enables QEMU to
map the memory to the device before stopping the VM at the source,
instead of doing while both source and destination are stopped, thus
minimizing the downtime.

However, the destination QEMU is unaware of which vhost_vdpa struct will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the one associated with the CVQ.  Otherwise, it will be the
first one.

Save the memory operations related members in a common place rather than
always in the first / last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  5 +++++
 net/vhost-vdpa.c               | 24 ++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 5407d54fd7..eb1a56d75a 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,10 @@ typedef struct VhostVDPAHostNotifier {
     void *addr;
 } VhostVDPAHostNotifier;
 
+/* Info shared by all vhost_vdpa device models */
+typedef struct vhost_vdpa_shared {
+} VhostVDPAShared;
+
 typedef struct vhost_vdpa {
     int device_fd;
     int index;
@@ -46,6 +50,7 @@ typedef struct vhost_vdpa {
     bool suspended;
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
+    VhostVDPAShared *shared;
     GPtrArray *shadow_vqs;
     const VhostShadowVirtqueueOps *shadow_vq_ops;
     void *shadow_vq_ops_opaque;
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 939c984d5b..a2f9855288 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -238,6 +238,10 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
         qemu_close(s->vhost_vdpa.device_fd);
         s->vhost_vdpa.device_fd = -1;
     }
+    if (s->vhost_vdpa.index != 0) {
+        return;
+    }
+    g_free(s->vhost_vdpa.shared);
 }
 
 static bool vhost_vdpa_has_vnet_hdr(NetClientState *nc)
@@ -1428,6 +1432,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
                                        bool svq,
                                        struct vhost_vdpa_iova_range iova_range,
                                        uint64_t features,
+                                       VhostVDPAShared *shared,
                                        Error **errp)
 {
     NetClientState *nc = NULL;
@@ -1463,6 +1468,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     if (queue_pair_index == 0) {
         vhost_vdpa_net_valid_svq_features(features,
                                           &s->vhost_vdpa.migration_blocker);
+        s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
     } else if (!is_datapath) {
         s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
                                      PROT_READ | PROT_WRITE,
@@ -1475,11 +1481,16 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
         s->vhost_vdpa.shadow_vq_ops_opaque = s;
         s->cvq_isolated = cvq_isolated;
     }
+    if (queue_pair_index != 0) {
+        s->vhost_vdpa.shared = shared;
+    }
+
     ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
     if (ret) {
         qemu_del_net_client(nc);
         return NULL;
     }
+
     return nc;
 }
 
@@ -1591,17 +1602,26 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
 
     for (i = 0; i < queue_pairs; i++) {
+        VhostVDPAShared *shared = NULL;
+
+        if (i) {
+            shared = DO_UPCAST(VhostVDPAState, nc, ncs[0])->vhost_vdpa.shared;
+        }
         ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                      vdpa_device_fd, i, 2, true, opts->x_svq,
-                                     iova_range, features, errp);
+                                     iova_range, features, shared, errp);
         if (!ncs[i])
             goto err;
     }
 
     if (has_cvq) {
+        VhostVDPAState *s0 = DO_UPCAST(VhostVDPAState, nc, ncs[0]);
+        VhostVDPAShared *shared = s0->vhost_vdpa.shared;
+
         nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                  vdpa_device_fd, i, 1, false,
-                                 opts->x_svq, iova_range, features, errp);
+                                 opts->x_svq, iova_range, features, shared,
+                                 errp);
         if (!nc)
             goto err;
     }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 02/18] vdpa: move iova tree to the shared struct
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 01/18] vdpa: add VhostVDPAShared Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-11-02  9:36   ` Si-Wei Liu
  2023-10-19 14:34 ` [RFC PATCH 03/18] vdpa: move iova_range to vhost_vdpa_shared Eugenio Pérez
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the iova tree to VhostVDPAShared so all vhost_vdpa can use it,
rather than always in the first or last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  4 +--
 hw/virtio/vhost-vdpa.c         | 19 ++++++------
 net/vhost-vdpa.c               | 54 +++++++++++++++-------------------
 3 files changed, 35 insertions(+), 42 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index eb1a56d75a..ac036055d3 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -32,6 +32,8 @@ typedef struct VhostVDPAHostNotifier {
 
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
+    /* IOVA mapping used by the Shadow Virtqueue */
+    VhostIOVATree *iova_tree;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
@@ -48,8 +50,6 @@ typedef struct vhost_vdpa {
     bool shadow_data;
     /* Device suspended successfully */
     bool suspended;
-    /* IOVA mapping used by the Shadow Virtqueue */
-    VhostIOVATree *iova_tree;
     VhostVDPAShared *shared;
     GPtrArray *shadow_vqs;
     const VhostShadowVirtqueueOps *shadow_vq_ops;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 819b2d811a..9cee38cb6d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -358,7 +358,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         mem_region.size = int128_get64(llsize) - 1,
         mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
 
-        r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
+        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &mem_region);
         if (unlikely(r != IOVA_OK)) {
             error_report("Can't allocate a mapping (%d)", r);
             goto fail;
@@ -379,7 +379,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 
 fail_map:
     if (v->shadow_data) {
-        vhost_iova_tree_remove(v->iova_tree, mem_region);
+        vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
     }
 
 fail:
@@ -441,13 +441,13 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
             .size = int128_get64(llsize) - 1,
         };
 
-        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
+        result = vhost_iova_tree_find_iova(v->shared->iova_tree, &mem_region);
         if (!result) {
             /* The memory listener map wasn't mapped */
             return;
         }
         iova = result->iova;
-        vhost_iova_tree_remove(v->iova_tree, *result);
+        vhost_iova_tree_remove(v->shared->iova_tree, *result);
     }
     vhost_vdpa_iotlb_batch_begin_once(v);
     /*
@@ -1059,7 +1059,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
     const DMAMap needle = {
         .translated_addr = addr,
     };
-    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, &needle);
+    const DMAMap *result = vhost_iova_tree_find_iova(v->shared->iova_tree,
+                                                     &needle);
     hwaddr size;
     int r;
 
@@ -1075,7 +1076,7 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
         return;
     }
 
-    vhost_iova_tree_remove(v->iova_tree, *result);
+    vhost_iova_tree_remove(v->shared->iova_tree, *result);
 }
 
 static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
@@ -1103,7 +1104,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
 {
     int r;
 
-    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
+    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
     if (unlikely(r != IOVA_OK)) {
         error_setg(errp, "Cannot allocate iova (%d)", r);
         return false;
@@ -1115,7 +1116,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
                            needle->perm == IOMMU_RO);
     if (unlikely(r != 0)) {
         error_setg_errno(errp, -r, "Cannot map region to device");
-        vhost_iova_tree_remove(v->iova_tree, *needle);
+        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
     }
 
     return r == 0;
@@ -1216,7 +1217,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
             goto err;
         }
 
-        vhost_svq_start(svq, dev->vdev, vq, v->iova_tree);
+        vhost_svq_start(svq, dev->vdev, vq, v->shared->iova_tree);
         ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
         if (unlikely(!ok)) {
             goto err_map;
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index a2f9855288..15e7579b13 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -345,8 +345,8 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 
     add_migration_state_change_notifier(&s->migration_state);
     if (v->shadow_vqs_enabled) {
-        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
-                                           v->iova_range.last);
+        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+                                                   v->iova_range.last);
     }
 }
 
@@ -371,11 +371,6 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
         return 0;
     }
 
-    if (v->shadow_vqs_enabled) {
-        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
-        v->iova_tree = s0->vhost_vdpa.iova_tree;
-    }
-
     return 0;
 }
 
@@ -408,9 +403,8 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
 
     dev = s->vhost_vdpa.dev;
     if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-    } else {
-        s->vhost_vdpa.iova_tree = NULL;
+        g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
+                        vhost_iova_tree_delete);
     }
 }
 
@@ -464,7 +458,7 @@ static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
 
 static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
 {
-    VhostIOVATree *tree = v->iova_tree;
+    VhostIOVATree *tree = v->shared->iova_tree;
     DMAMap needle = {
         /*
          * No need to specify size or to look for more translations since
@@ -498,7 +492,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
     map.translated_addr = (hwaddr)(uintptr_t)buf;
     map.size = size - 1;
     map.perm = write ? IOMMU_RW : IOMMU_RO,
-    r = vhost_iova_tree_map_alloc(v->iova_tree, &map);
+    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
     if (unlikely(r != IOVA_OK)) {
         error_report("Cannot map injected element");
         return r;
@@ -513,7 +507,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
     return 0;
 
 dma_map_err:
-    vhost_iova_tree_remove(v->iova_tree, map);
+    vhost_iova_tree_remove(v->shared->iova_tree, map);
     return r;
 }
 
@@ -573,24 +567,22 @@ out:
         return 0;
     }
 
-    if (s0->vhost_vdpa.iova_tree) {
-        /*
-         * SVQ is already configured for all virtqueues.  Reuse IOVA tree for
-         * simplicity, whether CVQ shares ASID with guest or not, because:
-         * - Memory listener need access to guest's memory addresses allocated
-         *   in the IOVA tree.
-         * - There should be plenty of IOVA address space for both ASID not to
-         *   worry about collisions between them.  Guest's translations are
-         *   still validated with virtio virtqueue_pop so there is no risk for
-         *   the guest to access memory that it shouldn't.
-         *
-         * To allocate a iova tree per ASID is doable but it complicates the
-         * code and it is not worth it for the moment.
-         */
-        v->iova_tree = s0->vhost_vdpa.iova_tree;
-    } else {
-        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
-                                           v->iova_range.last);
+    /*
+     * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
+     * whether CVQ shares ASID with guest or not, because:
+     * - Memory listener need access to guest's memory addresses allocated in
+     *   the IOVA tree.
+     * - There should be plenty of IOVA address space for both ASID not to
+     *   worry about collisions between them.  Guest's translations are still
+     *   validated with virtio virtqueue_pop so there is no risk for the guest
+     *   to access memory that it shouldn't.
+     *
+     * To allocate a iova tree per ASID is doable but it complicates the code
+     * and it is not worth it for the moment.
+     */
+    if (!v->shared->iova_tree) {
+        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+                                                   v->iova_range.last);
     }
 
     r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 02/18] vdpa: move iova tree to the shared struct
  2023-10-19 14:34 ` [RFC PATCH 02/18] vdpa: move iova tree to the shared struct Eugenio Pérez
@ 2023-11-02  9:36   ` Si-Wei Liu
  2023-11-24 17:11     ` Eugenio Perez Martin
  0 siblings, 1 reply; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-02  9:36 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> Next patches will register the vhost_vdpa memory listener while the VM
> is migrating at the destination, so we can map the memory to the device
> before stopping the VM at the source.  The main goal is to reduce the
> downtime.
>
> However, the destination QEMU is unaware of which vhost_vdpa device will
> register its memory_listener.  If the source guest has CVQ enabled, it
> will be the CVQ device.  Otherwise, it  will be the first one.
>
> Move the iova tree to VhostVDPAShared so all vhost_vdpa can use it,
> rather than always in the first or last vhost_vdpa.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-vdpa.h |  4 +--
>   hw/virtio/vhost-vdpa.c         | 19 ++++++------
>   net/vhost-vdpa.c               | 54 +++++++++++++++-------------------
>   3 files changed, 35 insertions(+), 42 deletions(-)
>
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index eb1a56d75a..ac036055d3 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -32,6 +32,8 @@ typedef struct VhostVDPAHostNotifier {
>   
>   /* Info shared by all vhost_vdpa device models */
>   typedef struct vhost_vdpa_shared {
> +    /* IOVA mapping used by the Shadow Virtqueue */
> +    VhostIOVATree *iova_tree;
>   } VhostVDPAShared;
>   
>   typedef struct vhost_vdpa {
> @@ -48,8 +50,6 @@ typedef struct vhost_vdpa {
>       bool shadow_data;
>       /* Device suspended successfully */
>       bool suspended;
> -    /* IOVA mapping used by the Shadow Virtqueue */
> -    VhostIOVATree *iova_tree;
>       VhostVDPAShared *shared;
>       GPtrArray *shadow_vqs;
>       const VhostShadowVirtqueueOps *shadow_vq_ops;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 819b2d811a..9cee38cb6d 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -358,7 +358,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>           mem_region.size = int128_get64(llsize) - 1,
>           mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>   
> -        r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> +        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &mem_region);
>           if (unlikely(r != IOVA_OK)) {
>               error_report("Can't allocate a mapping (%d)", r);
>               goto fail;
> @@ -379,7 +379,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>   
>   fail_map:
>       if (v->shadow_data) {
> -        vhost_iova_tree_remove(v->iova_tree, mem_region);
> +        vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
>       }
>   
>   fail:
> @@ -441,13 +441,13 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>               .size = int128_get64(llsize) - 1,
>           };
>   
> -        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> +        result = vhost_iova_tree_find_iova(v->shared->iova_tree, &mem_region);
>           if (!result) {
>               /* The memory listener map wasn't mapped */
>               return;
>           }
>           iova = result->iova;
> -        vhost_iova_tree_remove(v->iova_tree, *result);
> +        vhost_iova_tree_remove(v->shared->iova_tree, *result);
>       }
>       vhost_vdpa_iotlb_batch_begin_once(v);
>       /*
> @@ -1059,7 +1059,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
>       const DMAMap needle = {
>           .translated_addr = addr,
>       };
> -    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, &needle);
> +    const DMAMap *result = vhost_iova_tree_find_iova(v->shared->iova_tree,
> +                                                     &needle);
>       hwaddr size;
>       int r;
>   
> @@ -1075,7 +1076,7 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
>           return;
>       }
>   
> -    vhost_iova_tree_remove(v->iova_tree, *result);
> +    vhost_iova_tree_remove(v->shared->iova_tree, *result);
>   }
>   
>   static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> @@ -1103,7 +1104,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>   {
>       int r;
>   
> -    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> +    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
>       if (unlikely(r != IOVA_OK)) {
>           error_setg(errp, "Cannot allocate iova (%d)", r);
>           return false;
> @@ -1115,7 +1116,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
>                              needle->perm == IOMMU_RO);
>       if (unlikely(r != 0)) {
>           error_setg_errno(errp, -r, "Cannot map region to device");
> -        vhost_iova_tree_remove(v->iova_tree, *needle);
> +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
>       }
>   
>       return r == 0;
> @@ -1216,7 +1217,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
>               goto err;
>           }
>   
> -        vhost_svq_start(svq, dev->vdev, vq, v->iova_tree);
> +        vhost_svq_start(svq, dev->vdev, vq, v->shared->iova_tree);
>           ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
>           if (unlikely(!ok)) {
>               goto err_map;
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index a2f9855288..15e7579b13 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -345,8 +345,8 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>   
>       add_migration_state_change_notifier(&s->migration_state);
>       if (v->shadow_vqs_enabled) {
> -        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> -                                           v->iova_range.last);
> +        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> +                                                   v->iova_range.last);
This code change is okay so far without .load_setup involved, but if 
interacting with .load_setup the iova_tree can be NULL when x-svq=on. 
Below is a stacktrace showing the problem.

#0  0x00005582bf00944c in vhost_iova_tree_map_alloc (tree=0x0, 
map=map@entry=0x7fb16bfffde0) at ../hw/virtio/vhost-iova-tree.c:89
#1  0x00005582bee8cdb3 in vhost_vdpa_listener_region_add 
(listener=0x5582c138ee58, section=0x7fb16bfffe50)
     at ../include/qemu/int128.h:33
#2  0x00005582bf029d4b in memory_listener_register (as=0x5582bfb53d20 
<address_space_memory>, listener=0x5582c138ee58)
     at ../system/memory.c:3026
#3  0x00005582bf029d4b in memory_listener_register 
(listener=0x5582c138ee58, as=0x5582bfb53d20 <address_space_memory>)
     at ../system/memory.c:3096
#4  0x00005582bee8e712 in vhost_vdpa_load_setup (shared=0x5582c138ee50, 
dma_as=0x5582bfb53d20 <address_space_memory>)
     at ../hw/virtio/vhost-vdpa.c:1550
#5  0x00005582bef0b7df in vhost_vdpa_net_load_setup (nc=0x7fb172a27010, 
nic=<optimized out>) at ../net/vhost-vdpa.c:415
#6  0x00005582beeeb4f5 in qemu_loadvm_state (f=0x5582c1c1a800) at 
../migration/savevm.c:2682
#7  0x00005582beeeb4f5 in qemu_loadvm_state (f=0x5582c1c1a800) at 
../migration/savevm.c:2866
#8  0x00005582beed5e17 in process_incoming_migration_co 
(opaque=<optimized out>) at ../migration/migration.c:548
#9  0x00005582bf21b29b in coroutine_trampoline (i0=<optimized out>, 
i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#10 0x00007fb16e448190 in __start_context () at /lib64/libc.so.6

-Siwei
>       }
>   }
>   
> @@ -371,11 +371,6 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
>           return 0;
>       }
>   
> -    if (v->shadow_vqs_enabled) {
> -        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
> -        v->iova_tree = s0->vhost_vdpa.iova_tree;
> -    }
> -
>       return 0;
>   }
>   
> @@ -408,9 +403,8 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>   
>       dev = s->vhost_vdpa.dev;
>       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> -    } else {
> -        s->vhost_vdpa.iova_tree = NULL;
> +        g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
> +                        vhost_iova_tree_delete);
>       }
>   }
>   
> @@ -464,7 +458,7 @@ static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
>   
>   static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
>   {
> -    VhostIOVATree *tree = v->iova_tree;
> +    VhostIOVATree *tree = v->shared->iova_tree;
>       DMAMap needle = {
>           /*
>            * No need to specify size or to look for more translations since
> @@ -498,7 +492,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>       map.translated_addr = (hwaddr)(uintptr_t)buf;
>       map.size = size - 1;
>       map.perm = write ? IOMMU_RW : IOMMU_RO,
> -    r = vhost_iova_tree_map_alloc(v->iova_tree, &map);
> +    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
>       if (unlikely(r != IOVA_OK)) {
>           error_report("Cannot map injected element");
>           return r;
> @@ -513,7 +507,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
>       return 0;
>   
>   dma_map_err:
> -    vhost_iova_tree_remove(v->iova_tree, map);
> +    vhost_iova_tree_remove(v->shared->iova_tree, map);
>       return r;
>   }
>   
> @@ -573,24 +567,22 @@ out:
>           return 0;
>       }
>   
> -    if (s0->vhost_vdpa.iova_tree) {
> -        /*
> -         * SVQ is already configured for all virtqueues.  Reuse IOVA tree for
> -         * simplicity, whether CVQ shares ASID with guest or not, because:
> -         * - Memory listener need access to guest's memory addresses allocated
> -         *   in the IOVA tree.
> -         * - There should be plenty of IOVA address space for both ASID not to
> -         *   worry about collisions between them.  Guest's translations are
> -         *   still validated with virtio virtqueue_pop so there is no risk for
> -         *   the guest to access memory that it shouldn't.
> -         *
> -         * To allocate a iova tree per ASID is doable but it complicates the
> -         * code and it is not worth it for the moment.
> -         */
> -        v->iova_tree = s0->vhost_vdpa.iova_tree;
> -    } else {
> -        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> -                                           v->iova_range.last);
> +    /*
> +     * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
> +     * whether CVQ shares ASID with guest or not, because:
> +     * - Memory listener need access to guest's memory addresses allocated in
> +     *   the IOVA tree.
> +     * - There should be plenty of IOVA address space for both ASID not to
> +     *   worry about collisions between them.  Guest's translations are still
> +     *   validated with virtio virtqueue_pop so there is no risk for the guest
> +     *   to access memory that it shouldn't.
> +     *
> +     * To allocate a iova tree per ASID is doable but it complicates the code
> +     * and it is not worth it for the moment.
> +     */
> +    if (!v->shared->iova_tree) {
> +        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> +                                                   v->iova_range.last);
>       }
>   
>       r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 02/18] vdpa: move iova tree to the shared struct
  2023-11-02  9:36   ` Si-Wei Liu
@ 2023-11-24 17:11     ` Eugenio Perez Martin
  0 siblings, 0 replies; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-11-24 17:11 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Thu, Nov 2, 2023 at 10:37 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > Next patches will register the vhost_vdpa memory listener while the VM
> > is migrating at the destination, so we can map the memory to the device
> > before stopping the VM at the source.  The main goal is to reduce the
> > downtime.
> >
> > However, the destination QEMU is unaware of which vhost_vdpa device will
> > register its memory_listener.  If the source guest has CVQ enabled, it
> > will be the CVQ device.  Otherwise, it  will be the first one.
> >
> > Move the iova tree to VhostVDPAShared so all vhost_vdpa can use it,
> > rather than always in the first or last vhost_vdpa.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-vdpa.h |  4 +--
> >   hw/virtio/vhost-vdpa.c         | 19 ++++++------
> >   net/vhost-vdpa.c               | 54 +++++++++++++++-------------------
> >   3 files changed, 35 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index eb1a56d75a..ac036055d3 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -32,6 +32,8 @@ typedef struct VhostVDPAHostNotifier {
> >
> >   /* Info shared by all vhost_vdpa device models */
> >   typedef struct vhost_vdpa_shared {
> > +    /* IOVA mapping used by the Shadow Virtqueue */
> > +    VhostIOVATree *iova_tree;
> >   } VhostVDPAShared;
> >
> >   typedef struct vhost_vdpa {
> > @@ -48,8 +50,6 @@ typedef struct vhost_vdpa {
> >       bool shadow_data;
> >       /* Device suspended successfully */
> >       bool suspended;
> > -    /* IOVA mapping used by the Shadow Virtqueue */
> > -    VhostIOVATree *iova_tree;
> >       VhostVDPAShared *shared;
> >       GPtrArray *shadow_vqs;
> >       const VhostShadowVirtqueueOps *shadow_vq_ops;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 819b2d811a..9cee38cb6d 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -358,7 +358,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >           mem_region.size = int128_get64(llsize) - 1,
> >           mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >
> > -        r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);
> > +        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &mem_region);
> >           if (unlikely(r != IOVA_OK)) {
> >               error_report("Can't allocate a mapping (%d)", r);
> >               goto fail;
> > @@ -379,7 +379,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >
> >   fail_map:
> >       if (v->shadow_data) {
> > -        vhost_iova_tree_remove(v->iova_tree, mem_region);
> > +        vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
> >       }
> >
> >   fail:
> > @@ -441,13 +441,13 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >               .size = int128_get64(llsize) - 1,
> >           };
> >
> > -        result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
> > +        result = vhost_iova_tree_find_iova(v->shared->iova_tree, &mem_region);
> >           if (!result) {
> >               /* The memory listener map wasn't mapped */
> >               return;
> >           }
> >           iova = result->iova;
> > -        vhost_iova_tree_remove(v->iova_tree, *result);
> > +        vhost_iova_tree_remove(v->shared->iova_tree, *result);
> >       }
> >       vhost_vdpa_iotlb_batch_begin_once(v);
> >       /*
> > @@ -1059,7 +1059,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
> >       const DMAMap needle = {
> >           .translated_addr = addr,
> >       };
> > -    const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, &needle);
> > +    const DMAMap *result = vhost_iova_tree_find_iova(v->shared->iova_tree,
> > +                                                     &needle);
> >       hwaddr size;
> >       int r;
> >
> > @@ -1075,7 +1076,7 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
> >           return;
> >       }
> >
> > -    vhost_iova_tree_remove(v->iova_tree, *result);
> > +    vhost_iova_tree_remove(v->shared->iova_tree, *result);
> >   }
> >
> >   static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
> > @@ -1103,7 +1104,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >   {
> >       int r;
> >
> > -    r = vhost_iova_tree_map_alloc(v->iova_tree, needle);
> > +    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >       if (unlikely(r != IOVA_OK)) {
> >           error_setg(errp, "Cannot allocate iova (%d)", r);
> >           return false;
> > @@ -1115,7 +1116,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
> >                              needle->perm == IOMMU_RO);
> >       if (unlikely(r != 0)) {
> >           error_setg_errno(errp, -r, "Cannot map region to device");
> > -        vhost_iova_tree_remove(v->iova_tree, *needle);
> > +        vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >       }
> >
> >       return r == 0;
> > @@ -1216,7 +1217,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
> >               goto err;
> >           }
> >
> > -        vhost_svq_start(svq, dev->vdev, vq, v->iova_tree);
> > +        vhost_svq_start(svq, dev->vdev, vq, v->shared->iova_tree);
> >           ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
> >           if (unlikely(!ok)) {
> >               goto err_map;
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index a2f9855288..15e7579b13 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -345,8 +345,8 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
> >
> >       add_migration_state_change_notifier(&s->migration_state);
> >       if (v->shadow_vqs_enabled) {
> > -        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > -                                           v->iova_range.last);
> > +        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > +                                                   v->iova_range.last);
> This code change is okay so far without .load_setup involved, but if
> interacting with .load_setup the iova_tree can be NULL when x-svq=on.
> Below is a stacktrace showing the problem.
>

Right, the next version of the series including .load_setup will take
this into account. Thanks!

> #0  0x00005582bf00944c in vhost_iova_tree_map_alloc (tree=0x0,
> map=map@entry=0x7fb16bfffde0) at ../hw/virtio/vhost-iova-tree.c:89
> #1  0x00005582bee8cdb3 in vhost_vdpa_listener_region_add
> (listener=0x5582c138ee58, section=0x7fb16bfffe50)
>      at ../include/qemu/int128.h:33
> #2  0x00005582bf029d4b in memory_listener_register (as=0x5582bfb53d20
> <address_space_memory>, listener=0x5582c138ee58)
>      at ../system/memory.c:3026
> #3  0x00005582bf029d4b in memory_listener_register
> (listener=0x5582c138ee58, as=0x5582bfb53d20 <address_space_memory>)
>      at ../system/memory.c:3096
> #4  0x00005582bee8e712 in vhost_vdpa_load_setup (shared=0x5582c138ee50,
> dma_as=0x5582bfb53d20 <address_space_memory>)
>      at ../hw/virtio/vhost-vdpa.c:1550
> #5  0x00005582bef0b7df in vhost_vdpa_net_load_setup (nc=0x7fb172a27010,
> nic=<optimized out>) at ../net/vhost-vdpa.c:415
> #6  0x00005582beeeb4f5 in qemu_loadvm_state (f=0x5582c1c1a800) at
> ../migration/savevm.c:2682
> #7  0x00005582beeeb4f5 in qemu_loadvm_state (f=0x5582c1c1a800) at
> ../migration/savevm.c:2866
> #8  0x00005582beed5e17 in process_incoming_migration_co
> (opaque=<optimized out>) at ../migration/migration.c:548
> #9  0x00005582bf21b29b in coroutine_trampoline (i0=<optimized out>,
> i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #10 0x00007fb16e448190 in __start_context () at /lib64/libc.so.6
>
> -Siwei
> >       }
> >   }
> >
> > @@ -371,11 +371,6 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
> >           return 0;
> >       }
> >
> > -    if (v->shadow_vqs_enabled) {
> > -        VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > -        v->iova_tree = s0->vhost_vdpa.iova_tree;
> > -    }
> > -
> >       return 0;
> >   }
> >
> > @@ -408,9 +403,8 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
> >
> >       dev = s->vhost_vdpa.dev;
> >       if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
> > -        g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
> > -    } else {
> > -        s->vhost_vdpa.iova_tree = NULL;
> > +        g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
> > +                        vhost_iova_tree_delete);
> >       }
> >   }
> >
> > @@ -464,7 +458,7 @@ static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
> >
> >   static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
> >   {
> > -    VhostIOVATree *tree = v->iova_tree;
> > +    VhostIOVATree *tree = v->shared->iova_tree;
> >       DMAMap needle = {
> >           /*
> >            * No need to specify size or to look for more translations since
> > @@ -498,7 +492,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >       map.translated_addr = (hwaddr)(uintptr_t)buf;
> >       map.size = size - 1;
> >       map.perm = write ? IOMMU_RW : IOMMU_RO,
> > -    r = vhost_iova_tree_map_alloc(v->iova_tree, &map);
> > +    r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >       if (unlikely(r != IOVA_OK)) {
> >           error_report("Cannot map injected element");
> >           return r;
> > @@ -513,7 +507,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
> >       return 0;
> >
> >   dma_map_err:
> > -    vhost_iova_tree_remove(v->iova_tree, map);
> > +    vhost_iova_tree_remove(v->shared->iova_tree, map);
> >       return r;
> >   }
> >
> > @@ -573,24 +567,22 @@ out:
> >           return 0;
> >       }
> >
> > -    if (s0->vhost_vdpa.iova_tree) {
> > -        /*
> > -         * SVQ is already configured for all virtqueues.  Reuse IOVA tree for
> > -         * simplicity, whether CVQ shares ASID with guest or not, because:
> > -         * - Memory listener need access to guest's memory addresses allocated
> > -         *   in the IOVA tree.
> > -         * - There should be plenty of IOVA address space for both ASID not to
> > -         *   worry about collisions between them.  Guest's translations are
> > -         *   still validated with virtio virtqueue_pop so there is no risk for
> > -         *   the guest to access memory that it shouldn't.
> > -         *
> > -         * To allocate a iova tree per ASID is doable but it complicates the
> > -         * code and it is not worth it for the moment.
> > -         */
> > -        v->iova_tree = s0->vhost_vdpa.iova_tree;
> > -    } else {
> > -        v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > -                                           v->iova_range.last);
> > +    /*
> > +     * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
> > +     * whether CVQ shares ASID with guest or not, because:
> > +     * - Memory listener need access to guest's memory addresses allocated in
> > +     *   the IOVA tree.
> > +     * - There should be plenty of IOVA address space for both ASID not to
> > +     *   worry about collisions between them.  Guest's translations are still
> > +     *   validated with virtio virtqueue_pop so there is no risk for the guest
> > +     *   to access memory that it shouldn't.
> > +     *
> > +     * To allocate a iova tree per ASID is doable but it complicates the code
> > +     * and it is not worth it for the moment.
> > +     */
> > +    if (!v->shared->iova_tree) {
> > +        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
> > +                                                   v->iova_range.last);
> >       }
> >
> >       r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 03/18] vdpa: move iova_range to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 01/18] vdpa: add VhostVDPAShared Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 02/18] vdpa: move iova tree to the shared struct Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 04/18] vdpa: move shadow_data " Eugenio Pérez
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the iova range to VhostVDPAShared so all vhost_vdpa can use it,
rather than always in the first or last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  3 ++-
 hw/virtio/vdpa-dev.c           |  5 ++++-
 hw/virtio/vhost-vdpa.c         | 16 ++++++++++------
 net/vhost-vdpa.c               | 10 +++++-----
 4 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index ac036055d3..8d52a7e498 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -32,6 +32,8 @@ typedef struct VhostVDPAHostNotifier {
 
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
+    struct vhost_vdpa_iova_range iova_range;
+
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
 } VhostVDPAShared;
@@ -43,7 +45,6 @@ typedef struct vhost_vdpa {
     bool iotlb_batch_begin_sent;
     uint32_t address_space_id;
     MemoryListener listener;
-    struct vhost_vdpa_iova_range iova_range;
     uint64_t acked_features;
     bool shadow_vqs_enabled;
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index f22d5d5bc0..457960d28a 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -114,7 +114,8 @@ static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
                    strerror(-ret));
         goto free_vqs;
     }
-    v->vdpa.iova_range = iova_range;
+    v->vdpa.shared = g_new0(VhostVDPAShared, 1);
+    v->vdpa.shared->iova_range = iova_range;
 
     ret = vhost_dev_init(&v->dev, &v->vdpa, VHOST_BACKEND_TYPE_VDPA, 0, NULL);
     if (ret < 0) {
@@ -162,6 +163,7 @@ vhost_cleanup:
     vhost_dev_cleanup(&v->dev);
 free_vqs:
     g_free(vqs);
+    g_free(v->vdpa.shared);
 out:
     qemu_close(v->vhostfd);
     v->vhostfd = -1;
@@ -184,6 +186,7 @@ static void vhost_vdpa_device_unrealize(DeviceState *dev)
     g_free(s->config);
     g_free(s->dev.vqs);
     vhost_dev_cleanup(&s->dev);
+    g_free(s->vdpa.shared);
     qemu_close(s->vhostfd);
     s->vhostfd = -1;
 }
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 9cee38cb6d..2bceadd118 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -213,10 +213,10 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     RCU_READ_LOCK_GUARD();
     /* check if RAM section out of device range */
     llend = int128_add(int128_makes64(iotlb->addr_mask), int128_makes64(iova));
-    if (int128_gt(llend, int128_make64(v->iova_range.last))) {
+    if (int128_gt(llend, int128_make64(v->shared->iova_range.last))) {
         error_report("RAM section out of device range (max=0x%" PRIx64
                      ", end addr=0x%" PRIx64 ")",
-                     v->iova_range.last, int128_get64(llend));
+                     v->shared->iova_range.last, int128_get64(llend));
         return;
     }
 
@@ -316,8 +316,10 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
     int page_size = qemu_target_page_size();
     int page_mask = -page_size;
 
-    if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
-                                            v->iova_range.last, page_mask)) {
+    if (vhost_vdpa_listener_skipped_section(section,
+                                            v->shared->iova_range.first,
+                                            v->shared->iova_range.last,
+                                            page_mask)) {
         return;
     }
     if (memory_region_is_iommu(section->mr)) {
@@ -403,8 +405,10 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
     int page_size = qemu_target_page_size();
     int page_mask = -page_size;
 
-    if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
-                                            v->iova_range.last, page_mask)) {
+    if (vhost_vdpa_listener_skipped_section(section,
+                                            v->shared->iova_range.first,
+                                            v->shared->iova_range.last,
+                                            page_mask)) {
         return;
     }
     if (memory_region_is_iommu(section->mr)) {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 15e7579b13..9648b0ef7e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -345,8 +345,8 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 
     add_migration_state_change_notifier(&s->migration_state);
     if (v->shadow_vqs_enabled) {
-        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
-                                                   v->iova_range.last);
+        v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
+                                                   v->shared->iova_range.last);
     }
 }
 
@@ -581,8 +581,8 @@ out:
      * and it is not worth it for the moment.
      */
     if (!v->shared->iova_tree) {
-        v->shared->iova_tree = vhost_iova_tree_new(v->iova_range.first,
-                                                   v->iova_range.last);
+        v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
+                                                   v->shared->iova_range.last);
     }
 
     r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
@@ -1455,12 +1455,12 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     s->always_svq = svq;
     s->migration_state.notify = vdpa_net_migration_state_notifier;
     s->vhost_vdpa.shadow_vqs_enabled = svq;
-    s->vhost_vdpa.iova_range = iova_range;
     s->vhost_vdpa.shadow_data = svq;
     if (queue_pair_index == 0) {
         vhost_vdpa_net_valid_svq_features(features,
                                           &s->vhost_vdpa.migration_blocker);
         s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
+        s->vhost_vdpa.shared->iova_range = iova_range;
     } else if (!is_datapath) {
         s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
                                      PROT_READ | PROT_WRITE,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 04/18] vdpa: move shadow_data to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (2 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 03/18] vdpa: move iova_range to vhost_vdpa_shared Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-11-02  8:47   ` Si-Wei Liu
  2023-10-19 14:34 ` [RFC PATCH 05/18] vdpa: use vdpa shared for tracing Eugenio Pérez
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the shadow_data member to VhostVDPAShared so all vhost_vdpa can use
it, rather than always in the first or last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  5 +++--
 hw/virtio/vhost-vdpa.c         |  6 +++---
 net/vhost-vdpa.c               | 23 ++++++-----------------
 3 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8d52a7e498..01e0f25e27 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -36,6 +36,9 @@ typedef struct vhost_vdpa_shared {
 
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
+
+    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
+    bool shadow_data;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
@@ -47,8 +50,6 @@ typedef struct vhost_vdpa {
     MemoryListener listener;
     uint64_t acked_features;
     bool shadow_vqs_enabled;
-    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
-    bool shadow_data;
     /* Device suspended successfully */
     bool suspended;
     VhostVDPAShared *shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 2bceadd118..ec028e4c56 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -353,7 +353,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
                                          vaddr, section->readonly);
 
     llsize = int128_sub(llend, int128_make64(iova));
-    if (v->shadow_data) {
+    if (v->shared->shadow_data) {
         int r;
 
         mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
@@ -380,7 +380,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
     return;
 
 fail_map:
-    if (v->shadow_data) {
+    if (v->shared->shadow_data) {
         vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
     }
 
@@ -435,7 +435,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     llsize = int128_sub(llend, int128_make64(iova));
 
-    if (v->shadow_data) {
+    if (v->shared->shadow_data) {
         const DMAMap *result;
         const void *vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 9648b0ef7e..01202350ea 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -282,15 +282,6 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
     return size;
 }
 
-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
-{
-    NICState *nic = qemu_get_nic(s->nc.peer);
-    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
-
-    return DO_UPCAST(VhostVDPAState, nc, nc0);
-}
-
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
 {
     struct vhost_vdpa *v = &s->vhost_vdpa;
@@ -360,10 +351,10 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
     if (s->always_svq ||
         migration_is_setup_or_active(migrate_get_current()->state)) {
         v->shadow_vqs_enabled = true;
-        v->shadow_data = true;
+        v->shared->shadow_data = true;
     } else {
         v->shadow_vqs_enabled = false;
-        v->shadow_data = false;
+        v->shared->shadow_data = false;
     }
 
     if (v->index == 0) {
@@ -513,7 +504,7 @@ dma_map_err:
 
 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-    VhostVDPAState *s, *s0;
+    VhostVDPAState *s;
     struct vhost_vdpa *v;
     int64_t cvq_group;
     int r;
@@ -524,12 +515,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
     s = DO_UPCAST(VhostVDPAState, nc, nc);
     v = &s->vhost_vdpa;
 
-    s0 = vhost_vdpa_net_first_nc_vdpa(s);
-    v->shadow_data = s0->vhost_vdpa.shadow_vqs_enabled;
-    v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
+    v->shadow_vqs_enabled = s->always_svq;
     s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
 
-    if (s->vhost_vdpa.shadow_data) {
+    if (v->shared->shadow_data) {
         /* SVQ is already configured for all virtqueues */
         goto out;
     }
@@ -1455,12 +1444,12 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     s->always_svq = svq;
     s->migration_state.notify = vdpa_net_migration_state_notifier;
     s->vhost_vdpa.shadow_vqs_enabled = svq;
-    s->vhost_vdpa.shadow_data = svq;
     if (queue_pair_index == 0) {
         vhost_vdpa_net_valid_svq_features(features,
                                           &s->vhost_vdpa.migration_blocker);
         s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
         s->vhost_vdpa.shared->iova_range = iova_range;
+        s->vhost_vdpa.shared->shadow_data = svq;
     } else if (!is_datapath) {
         s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
                                      PROT_READ | PROT_WRITE,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 04/18] vdpa: move shadow_data to vhost_vdpa_shared
  2023-10-19 14:34 ` [RFC PATCH 04/18] vdpa: move shadow_data " Eugenio Pérez
@ 2023-11-02  8:47   ` Si-Wei Liu
  2023-11-02 15:45     ` Eugenio Perez Martin
  0 siblings, 1 reply; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-02  8:47 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> Next patches will register the vhost_vdpa memory listener while the VM
> is migrating at the destination, so we can map the memory to the device
> before stopping the VM at the source.  The main goal is to reduce the
> downtime.
>
> However, the destination QEMU is unaware of which vhost_vdpa device will
> register its memory_listener.  If the source guest has CVQ enabled, it
> will be the CVQ device.  Otherwise, it  will be the first one.
>
> Move the shadow_data member to VhostVDPAShared so all vhost_vdpa can use
> it, rather than always in the first or last vhost_vdpa.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-vdpa.h |  5 +++--
>   hw/virtio/vhost-vdpa.c         |  6 +++---
>   net/vhost-vdpa.c               | 23 ++++++-----------------
>   3 files changed, 12 insertions(+), 22 deletions(-)
>
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 8d52a7e498..01e0f25e27 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -36,6 +36,9 @@ typedef struct vhost_vdpa_shared {
>   
>       /* IOVA mapping used by the Shadow Virtqueue */
>       VhostIOVATree *iova_tree;
> +
> +    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
> +    bool shadow_data;
>   } VhostVDPAShared;
>   
>   typedef struct vhost_vdpa {
> @@ -47,8 +50,6 @@ typedef struct vhost_vdpa {
>       MemoryListener listener;
>       uint64_t acked_features;
>       bool shadow_vqs_enabled;
> -    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
> -    bool shadow_data;
>       /* Device suspended successfully */
>       bool suspended;
>       VhostVDPAShared *shared;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 2bceadd118..ec028e4c56 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -353,7 +353,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>                                            vaddr, section->readonly);
>   
>       llsize = int128_sub(llend, int128_make64(iova));
> -    if (v->shadow_data) {
> +    if (v->shared->shadow_data) {
>           int r;
>   
>           mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> @@ -380,7 +380,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>       return;
>   
>   fail_map:
> -    if (v->shadow_data) {
> +    if (v->shared->shadow_data) {
>           vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
>       }
>   
> @@ -435,7 +435,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>   
>       llsize = int128_sub(llend, int128_make64(iova));
>   
> -    if (v->shadow_data) {
> +    if (v->shared->shadow_data) {
>           const DMAMap *result;
>           const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>               section->offset_within_region +
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 9648b0ef7e..01202350ea 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -282,15 +282,6 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
>       return size;
>   }
>   
> -/** From any vdpa net client, get the netclient of the first queue pair */
> -static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> -{
> -    NICState *nic = qemu_get_nic(s->nc.peer);
> -    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
> -
> -    return DO_UPCAST(VhostVDPAState, nc, nc0);
> -}
> -
>   static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
>   {
>       struct vhost_vdpa *v = &s->vhost_vdpa;
> @@ -360,10 +351,10 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
>       if (s->always_svq ||
>           migration_is_setup_or_active(migrate_get_current()->state)) {
>           v->shadow_vqs_enabled = true;
> -        v->shadow_data = true;
> +        v->shared->shadow_data = true;
>       } else {
>           v->shadow_vqs_enabled = false;
> -        v->shadow_data = false;
> +        v->shared->shadow_data = false;
>       }
>   
>       if (v->index == 0) {
> @@ -513,7 +504,7 @@ dma_map_err:
>   
>   static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>   {
> -    VhostVDPAState *s, *s0;
> +    VhostVDPAState *s;
>       struct vhost_vdpa *v;
>       int64_t cvq_group;
>       int r;
> @@ -524,12 +515,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
>       s = DO_UPCAST(VhostVDPAState, nc, nc);
>       v = &s->vhost_vdpa;
>   
> -    s0 = vhost_vdpa_net_first_nc_vdpa(s);
> -    v->shadow_data = s0->vhost_vdpa.shadow_vqs_enabled;
> -    v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
> +    v->shadow_vqs_enabled = s->always_svq;
This doesn't seem equivalent to the previous code. If always_svq is not 
set and migration is active, will it cause CVQ not shadowed at all? The 
"goto out;" line below would effectively return from this function, 
resulting in cvq's shadow_vqs_enabled left behind as false.


>       s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
>   
> -    if (s->vhost_vdpa.shadow_data) {
> +    if (v->shared->shadow_data) {
>           /* SVQ is already configured for all virtqueues */
>           goto out;
>       }
> @@ -1455,12 +1444,12 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>       s->always_svq = svq;
>       s->migration_state.notify = vdpa_net_migration_state_notifier;
>       s->vhost_vdpa.shadow_vqs_enabled = svq;
> -    s->vhost_vdpa.shadow_data = svq;
>       if (queue_pair_index == 0) {
>           vhost_vdpa_net_valid_svq_features(features,
>                                             &s->vhost_vdpa.migration_blocker);
>           s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
>           s->vhost_vdpa.shared->iova_range = iova_range;
> +        s->vhost_vdpa.shared->shadow_data = svq;
>       } else if (!is_datapath) {
>           s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
>                                        PROT_READ | PROT_WRITE,



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 04/18] vdpa: move shadow_data to vhost_vdpa_shared
  2023-11-02  8:47   ` Si-Wei Liu
@ 2023-11-02 15:45     ` Eugenio Perez Martin
  0 siblings, 0 replies; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-11-02 15:45 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Thu, Nov 2, 2023 at 9:48 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > Next patches will register the vhost_vdpa memory listener while the VM
> > is migrating at the destination, so we can map the memory to the device
> > before stopping the VM at the source.  The main goal is to reduce the
> > downtime.
> >
> > However, the destination QEMU is unaware of which vhost_vdpa device will
> > register its memory_listener.  If the source guest has CVQ enabled, it
> > will be the CVQ device.  Otherwise, it  will be the first one.
> >
> > Move the shadow_data member to VhostVDPAShared so all vhost_vdpa can use
> > it, rather than always in the first or last vhost_vdpa.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-vdpa.h |  5 +++--
> >   hw/virtio/vhost-vdpa.c         |  6 +++---
> >   net/vhost-vdpa.c               | 23 ++++++-----------------
> >   3 files changed, 12 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 8d52a7e498..01e0f25e27 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -36,6 +36,9 @@ typedef struct vhost_vdpa_shared {
> >
> >       /* IOVA mapping used by the Shadow Virtqueue */
> >       VhostIOVATree *iova_tree;
> > +
> > +    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
> > +    bool shadow_data;
> >   } VhostVDPAShared;
> >
> >   typedef struct vhost_vdpa {
> > @@ -47,8 +50,6 @@ typedef struct vhost_vdpa {
> >       MemoryListener listener;
> >       uint64_t acked_features;
> >       bool shadow_vqs_enabled;
> > -    /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
> > -    bool shadow_data;
> >       /* Device suspended successfully */
> >       bool suspended;
> >       VhostVDPAShared *shared;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 2bceadd118..ec028e4c56 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -353,7 +353,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >                                            vaddr, section->readonly);
> >
> >       llsize = int128_sub(llend, int128_make64(iova));
> > -    if (v->shadow_data) {
> > +    if (v->shared->shadow_data) {
> >           int r;
> >
> >           mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> > @@ -380,7 +380,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >       return;
> >
> >   fail_map:
> > -    if (v->shadow_data) {
> > +    if (v->shared->shadow_data) {
> >           vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
> >       }
> >
> > @@ -435,7 +435,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >
> >       llsize = int128_sub(llend, int128_make64(iova));
> >
> > -    if (v->shadow_data) {
> > +    if (v->shared->shadow_data) {
> >           const DMAMap *result;
> >           const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> >               section->offset_within_region +
> > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > index 9648b0ef7e..01202350ea 100644
> > --- a/net/vhost-vdpa.c
> > +++ b/net/vhost-vdpa.c
> > @@ -282,15 +282,6 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
> >       return size;
> >   }
> >
> > -/** From any vdpa net client, get the netclient of the first queue pair */
> > -static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
> > -{
> > -    NICState *nic = qemu_get_nic(s->nc.peer);
> > -    NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
> > -
> > -    return DO_UPCAST(VhostVDPAState, nc, nc0);
> > -}
> > -
> >   static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
> >   {
> >       struct vhost_vdpa *v = &s->vhost_vdpa;
> > @@ -360,10 +351,10 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
> >       if (s->always_svq ||
> >           migration_is_setup_or_active(migrate_get_current()->state)) {
> >           v->shadow_vqs_enabled = true;
> > -        v->shadow_data = true;
> > +        v->shared->shadow_data = true;
> >       } else {
> >           v->shadow_vqs_enabled = false;
> > -        v->shadow_data = false;
> > +        v->shared->shadow_data = false;
> >       }
> >
> >       if (v->index == 0) {
> > @@ -513,7 +504,7 @@ dma_map_err:
> >
> >   static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >   {
> > -    VhostVDPAState *s, *s0;
> > +    VhostVDPAState *s;
> >       struct vhost_vdpa *v;
> >       int64_t cvq_group;
> >       int r;
> > @@ -524,12 +515,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
> >       s = DO_UPCAST(VhostVDPAState, nc, nc);
> >       v = &s->vhost_vdpa;
> >
> > -    s0 = vhost_vdpa_net_first_nc_vdpa(s);
> > -    v->shadow_data = s0->vhost_vdpa.shadow_vqs_enabled;
> > -    v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
> > +    v->shadow_vqs_enabled = s->always_svq;
> This doesn't seem equivalent to the previous code. If always_svq is not
> set and migration is active, will it cause CVQ not shadowed at all? The
> "goto out;" line below would effectively return from this function,
> resulting in cvq's shadow_vqs_enabled left behind as false.
>

You're right, this is probably a bad rebase. Thanks for the catch!

>
> >       s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
> >
> > -    if (s->vhost_vdpa.shadow_data) {
> > +    if (v->shared->shadow_data) {
> >           /* SVQ is already configured for all virtqueues */
> >           goto out;
> >       }
> > @@ -1455,12 +1444,12 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
> >       s->always_svq = svq;
> >       s->migration_state.notify = vdpa_net_migration_state_notifier;
> >       s->vhost_vdpa.shadow_vqs_enabled = svq;
> > -    s->vhost_vdpa.shadow_data = svq;
> >       if (queue_pair_index == 0) {
> >           vhost_vdpa_net_valid_svq_features(features,
> >                                             &s->vhost_vdpa.migration_blocker);
> >           s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
> >           s->vhost_vdpa.shared->iova_range = iova_range;
> > +        s->vhost_vdpa.shared->shadow_data = svq;
> >       } else if (!is_datapath) {
> >           s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
> >                                        PROT_READ | PROT_WRITE,
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 05/18] vdpa: use vdpa shared for tracing
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (3 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 04/18] vdpa: move shadow_data " Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 06/18] vdpa: move file descriptor to vhost_vdpa_shared Eugenio Pérez
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

By the end of this series dma_map and dma_unmap functions don't have the
vdpa device for tracing.  Movinge trace function to shared member one.
Print it also in the vdpa initialization so log reader can relate them.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 23 ++++++++++++-----------
 hw/virtio/trace-events | 14 +++++++-------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ec028e4c56..4f52bfa9ee 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -101,7 +101,7 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
     msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
-    trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.asid, msg.iotlb.iova,
+    trace_vhost_vdpa_dma_map(v->shared, fd, msg.type, msg.asid, msg.iotlb.iova,
                              msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
                              msg.iotlb.type);
 
@@ -131,7 +131,7 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     msg.iotlb.size = size;
     msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
-    trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.asid, msg.iotlb.iova,
+    trace_vhost_vdpa_dma_unmap(v->shared, fd, msg.type, msg.asid, msg.iotlb.iova,
                                msg.iotlb.size, msg.iotlb.type);
 
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
@@ -151,7 +151,7 @@ static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
         .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
     };
 
-    trace_vhost_vdpa_listener_begin_batch(v, fd, msg.type, msg.iotlb.type);
+    trace_vhost_vdpa_listener_begin_batch(v->shared, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
                      fd, errno, strerror(errno));
@@ -186,7 +186,7 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
     msg.type = v->msg_type;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
-    trace_vhost_vdpa_listener_commit(v, fd, msg.type, msg.iotlb.type);
+    trace_vhost_vdpa_listener_commit(v->shared, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
                      fd, errno, strerror(errno));
@@ -329,7 +329,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 
     if (unlikely((section->offset_within_address_space & ~page_mask) !=
                  (section->offset_within_region & ~page_mask))) {
-        trace_vhost_vdpa_listener_region_add_unaligned(v, section->mr->name,
+        trace_vhost_vdpa_listener_region_add_unaligned(v->shared,
+                       section->mr->name,
                        section->offset_within_address_space & ~page_mask,
                        section->offset_within_region & ~page_mask);
         return;
@@ -349,7 +350,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
             section->offset_within_region +
             (iova - section->offset_within_address_space);
 
-    trace_vhost_vdpa_listener_region_add(v, iova, int128_get64(llend),
+    trace_vhost_vdpa_listener_region_add(v->shared, iova, int128_get64(llend),
                                          vaddr, section->readonly);
 
     llsize = int128_sub(llend, int128_make64(iova));
@@ -417,7 +418,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     if (unlikely((section->offset_within_address_space & ~page_mask) !=
                  (section->offset_within_region & ~page_mask))) {
-        trace_vhost_vdpa_listener_region_del_unaligned(v, section->mr->name,
+        trace_vhost_vdpa_listener_region_del_unaligned(v->shared,
+                       section->mr->name,
                        section->offset_within_address_space & ~page_mask,
                        section->offset_within_region & ~page_mask);
         return;
@@ -426,7 +428,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
     iova = ROUND_UP(section->offset_within_address_space, page_size);
     llend = vhost_vdpa_section_end(section, page_mask);
 
-    trace_vhost_vdpa_listener_region_del(v, iova,
+    trace_vhost_vdpa_listener_region_del(v->shared, iova,
         int128_get64(int128_sub(llend, int128_one())));
 
     if (int128_ge(int128_make64(iova), llend)) {
@@ -583,12 +585,11 @@ static void vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v)
 
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
-    struct vhost_vdpa *v;
+    struct vhost_vdpa *v = opaque;
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
-    trace_vhost_vdpa_init(dev, opaque);
+    trace_vhost_vdpa_init(dev, v->shared, opaque);
     int ret;
 
-    v = opaque;
     v->dev = dev;
     dev->opaque =  opaque ;
     v->listener = vhost_vdpa_memory_listener;
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 1cb9027d1e..ad2196bc39 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -30,16 +30,16 @@ vhost_user_write(uint32_t req, uint32_t flags) "req:%d flags:0x%"PRIx32""
 vhost_user_create_notifier(int idx, void *n) "idx:%d n:%p"
 
 # vhost-vdpa.c
-vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, uint64_t iova, uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) "vdpa:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" uaddr: 0x%"PRIx64" perm: 0x%"PRIx8" type: %"PRIu8
-vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, uint64_t iova, uint64_t size, uint8_t type) "vdpa:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
-vhost_vdpa_listener_begin_batch(void *v, int fd, uint32_t msg_type, uint8_t type)  "vdpa:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
-vhost_vdpa_listener_commit(void *v, int fd, uint32_t msg_type, uint8_t type)  "vdpa:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
-vhost_vdpa_listener_region_add_unaligned(void *v, const char *name, uint64_t offset_as, uint64_t offset_page) "vdpa: %p region %s offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
+vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, uint64_t iova, uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) "vdpa_shared:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" uaddr: 0x%"PRIx64" perm: 0x%"PRIx8" type: %"PRIu8
+vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, uint64_t iova, uint64_t size, uint8_t type) "vdpa_shared:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
+vhost_vdpa_listener_begin_batch(void *v, int fd, uint32_t msg_type, uint8_t type)  "vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_listener_commit(void *v, int fd, uint32_t msg_type, uint8_t type)  "vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_listener_region_add_unaligned(void *v, const char *name, uint64_t offset_as, uint64_t offset_page) "vdpa_shared: %p region %s offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
 vhost_vdpa_listener_region_add(void *vdpa, uint64_t iova, uint64_t llend, void *vaddr, bool readonly) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64" vaddr: %p read-only: %d"
-vhost_vdpa_listener_region_del_unaligned(void *v, const char *name, uint64_t offset_as, uint64_t offset_page) "vdpa: %p region %s offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
+vhost_vdpa_listener_region_del_unaligned(void *v, const char *name, uint64_t offset_as, uint64_t offset_page) "vdpa_shared: %p region %s offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
 vhost_vdpa_listener_region_del(void *vdpa, uint64_t iova, uint64_t llend) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64
 vhost_vdpa_add_status(void *dev, uint8_t status) "dev: %p status: 0x%"PRIx8
-vhost_vdpa_init(void *dev, void *vdpa) "dev: %p vdpa: %p"
+vhost_vdpa_init(void *dev, void *s, void *vdpa) "dev: %p, common dev: %p vdpa: %p"
 vhost_vdpa_cleanup(void *dev, void *vdpa) "dev: %p vdpa: %p"
 vhost_vdpa_memslots_limit(void *dev, int ret) "dev: %p = 0x%x"
 vhost_vdpa_set_mem_table(void *dev, uint32_t nregions, uint32_t padding) "dev: %p nregions: %"PRIu32" padding: 0x%"PRIx32
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 06/18] vdpa: move file descriptor to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (4 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 05/18] vdpa: use vdpa shared for tracing Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 07/18] vdpa: move iotlb_batch_begin_sent " Eugenio Pérez
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the file descriptor to VhostVDPAShared so all vhost_vdpa can use
it, rather than always in the first / last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  2 +-
 hw/virtio/vdpa-dev.c           |  2 +-
 hw/virtio/vhost-vdpa.c         | 14 +++++++-------
 net/vhost-vdpa.c               | 11 ++++-------
 4 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 01e0f25e27..796a180afa 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -32,6 +32,7 @@ typedef struct VhostVDPAHostNotifier {
 
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
+    int device_fd;
     struct vhost_vdpa_iova_range iova_range;
 
     /* IOVA mapping used by the Shadow Virtqueue */
@@ -42,7 +43,6 @@ typedef struct vhost_vdpa_shared {
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
-    int device_fd;
     int index;
     uint32_t msg_type;
     bool iotlb_batch_begin_sent;
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index 457960d28a..8774986571 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -66,7 +66,6 @@ static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
     if (*errp) {
         return;
     }
-    v->vdpa.device_fd = v->vhostfd;
 
     v->vdev_id = vhost_vdpa_device_get_u32(v->vhostfd,
                                            VHOST_VDPA_GET_DEVICE_ID, errp);
@@ -115,6 +114,7 @@ static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
         goto free_vqs;
     }
     v->vdpa.shared = g_new0(VhostVDPAShared, 1);
+    v->vdpa.shared->device_fd = v->vhostfd;
     v->vdpa.shared->iova_range = iova_range;
 
     ret = vhost_dev_init(&v->dev, &v->vdpa, VHOST_BACKEND_TYPE_VDPA, 0, NULL);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 4f52bfa9ee..298aefd065 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -90,7 +90,7 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly)
 {
     struct vhost_msg_v2 msg = {};
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
     int ret = 0;
 
     msg.type = v->msg_type;
@@ -122,7 +122,7 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
                          hwaddr size)
 {
     struct vhost_msg_v2 msg = {};
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
     int ret = 0;
 
     msg.type = v->msg_type;
@@ -145,7 +145,7 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
 
 static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
 {
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
     struct vhost_msg_v2 msg = {
         .type = v->msg_type,
         .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
@@ -173,7 +173,7 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
     struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
     struct vhost_dev *dev = v->dev;
     struct vhost_msg_v2 msg = {};
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
 
     if (!(dev->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
         return;
@@ -498,7 +498,7 @@ static int vhost_vdpa_call(struct vhost_dev *dev, unsigned long int request,
                              void *arg)
 {
     struct vhost_vdpa *v = dev->opaque;
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
     int ret;
 
     assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
@@ -656,7 +656,7 @@ static int vhost_vdpa_host_notifier_init(struct vhost_dev *dev, int queue_index)
     struct vhost_vdpa *v = dev->opaque;
     VirtIODevice *vdev = dev->vdev;
     VhostVDPAHostNotifier *n;
-    int fd = v->device_fd;
+    int fd = v->shared->device_fd;
     void *addr;
     char *name;
 
@@ -1285,7 +1285,7 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
 
     if (dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) {
         trace_vhost_vdpa_suspend(dev);
-        r = ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+        r = ioctl(v->shared->device_fd, VHOST_VDPA_SUSPEND);
         if (unlikely(r)) {
             error_report("Cannot suspend: %s(%d)", g_strerror(errno), errno);
         } else {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 01202350ea..4d42d7a742 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -233,14 +233,11 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
         vhost_net_cleanup(s->vhost_net);
         g_free(s->vhost_net);
         s->vhost_net = NULL;
-    }
-     if (s->vhost_vdpa.device_fd >= 0) {
-        qemu_close(s->vhost_vdpa.device_fd);
-        s->vhost_vdpa.device_fd = -1;
     }
     if (s->vhost_vdpa.index != 0) {
         return;
     }
+    qemu_close(s->vhost_vdpa.shared->device_fd);
     g_free(s->vhost_vdpa.shared);
 }
 
@@ -439,7 +436,7 @@ static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
     };
     int r;
 
-    r = ioctl(v->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
+    r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
     if (unlikely(r < 0)) {
         error_report("Can't set vq group %u asid %u, errno=%d (%s)",
                      asid.index, asid.num, errno, g_strerror(errno));
@@ -535,7 +532,7 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
         return 0;
     }
 
-    cvq_group = vhost_vdpa_get_vring_group(v->device_fd,
+    cvq_group = vhost_vdpa_get_vring_group(v->shared->device_fd,
                                            v->dev->vq_index_end - 1,
                                            &err);
     if (unlikely(cvq_group < 0)) {
@@ -1439,7 +1436,6 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
     qemu_set_info_str(nc, TYPE_VHOST_VDPA);
     s = DO_UPCAST(VhostVDPAState, nc, nc);
 
-    s->vhost_vdpa.device_fd = vdpa_device_fd;
     s->vhost_vdpa.index = queue_pair_index;
     s->always_svq = svq;
     s->migration_state.notify = vdpa_net_migration_state_notifier;
@@ -1448,6 +1444,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
         vhost_vdpa_net_valid_svq_features(features,
                                           &s->vhost_vdpa.migration_blocker);
         s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
+        s->vhost_vdpa.shared->device_fd = vdpa_device_fd;
         s->vhost_vdpa.shared->iova_range = iova_range;
         s->vhost_vdpa.shared->shadow_data = svq;
     } else if (!is_datapath) {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 07/18] vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (5 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 06/18] vdpa: move file descriptor to vhost_vdpa_shared Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 08/18] vdpa: move backend_cap " Eugenio Pérez
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the iotlb_batch_begin_sent member to VhostVDPAShared so all
vhost_vdpa can use it, rather than always in the first / last
vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 3 ++-
 hw/virtio/vhost-vdpa.c         | 8 ++++----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 796a180afa..05219bbcf7 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -38,6 +38,8 @@ typedef struct vhost_vdpa_shared {
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
 
+    bool iotlb_batch_begin_sent;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 } VhostVDPAShared;
@@ -45,7 +47,6 @@ typedef struct vhost_vdpa_shared {
 typedef struct vhost_vdpa {
     int index;
     uint32_t msg_type;
-    bool iotlb_batch_begin_sent;
     uint32_t address_space_id;
     MemoryListener listener;
     uint64_t acked_features;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 298aefd065..3da5a19a9e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -161,11 +161,11 @@ static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
 static void vhost_vdpa_iotlb_batch_begin_once(struct vhost_vdpa *v)
 {
     if (v->dev->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
-        !v->iotlb_batch_begin_sent) {
+        !v->shared->iotlb_batch_begin_sent) {
         vhost_vdpa_listener_begin_batch(v);
     }
 
-    v->iotlb_batch_begin_sent = true;
+    v->shared->iotlb_batch_begin_sent = true;
 }
 
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
@@ -179,7 +179,7 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
         return;
     }
 
-    if (!v->iotlb_batch_begin_sent) {
+    if (!v->shared->iotlb_batch_begin_sent) {
         return;
     }
 
@@ -192,7 +192,7 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
                      fd, errno, strerror(errno));
     }
 
-    v->iotlb_batch_begin_sent = false;
+    v->shared->iotlb_batch_begin_sent = false;
 }
 
 static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 08/18] vdpa: move backend_cap to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (6 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 07/18] vdpa: move iotlb_batch_begin_sent " Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 09/18] vdpa: remove msg type of vhost_vdpa Eugenio Pérez
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the backend_cap member to VhostVDPAShared so all vhost_vdpa can use
it, rather than always in the first / last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 3 +++
 hw/virtio/vhost-vdpa.c         | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 05219bbcf7..11ac14085a 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -38,6 +38,9 @@ typedef struct vhost_vdpa_shared {
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
 
+    /* Copy of backend features */
+    uint64_t backend_cap;
+
     bool iotlb_batch_begin_sent;
 
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3da5a19a9e..124323de43 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -160,7 +160,7 @@ static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
 
 static void vhost_vdpa_iotlb_batch_begin_once(struct vhost_vdpa *v)
 {
-    if (v->dev->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
+    if (v->shared->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
         !v->shared->iotlb_batch_begin_sent) {
         vhost_vdpa_listener_begin_batch(v);
     }
@@ -171,11 +171,10 @@ static void vhost_vdpa_iotlb_batch_begin_once(struct vhost_vdpa *v)
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
 {
     struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
-    struct vhost_dev *dev = v->dev;
     struct vhost_msg_v2 msg = {};
     int fd = v->shared->device_fd;
 
-    if (!(dev->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
+    if (!(v->shared->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
         return;
     }
 
@@ -833,6 +832,8 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
 
 static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 {
+    struct vhost_vdpa *v = dev->opaque;
+
     uint64_t features;
     uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
         0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
@@ -854,6 +855,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
     }
 
     dev->backend_cap = features;
+    v->shared->backend_cap = features;
 
     return 0;
 }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 09/18] vdpa: remove msg type of vhost_vdpa
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (7 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 08/18] vdpa: move backend_cap " Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 10/18] vdpa: move iommu_list to vhost_vdpa_shared Eugenio Pérez
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

It is always VHOST_IOTLB_MSG_V2. We can always make it back per
vhost_dev if needed.

This change makes easier for vhost_vdpa_map and unmap not to depend on
vhost_vdpa but only in VhostVDPAShared.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 1 -
 hw/virtio/vhost-vdpa.c         | 9 ++++-----
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 11ac14085a..5bd964dac5 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -49,7 +49,6 @@ typedef struct vhost_vdpa_shared {
 
 typedef struct vhost_vdpa {
     int index;
-    uint32_t msg_type;
     uint32_t address_space_id;
     MemoryListener listener;
     uint64_t acked_features;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 124323de43..66ae8ff6f4 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -93,7 +93,7 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     int fd = v->shared->device_fd;
     int ret = 0;
 
-    msg.type = v->msg_type;
+    msg.type = VHOST_IOTLB_MSG_V2;
     msg.asid = asid;
     msg.iotlb.iova = iova;
     msg.iotlb.size = size;
@@ -125,7 +125,7 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     int fd = v->shared->device_fd;
     int ret = 0;
 
-    msg.type = v->msg_type;
+    msg.type = VHOST_IOTLB_MSG_V2;
     msg.asid = asid;
     msg.iotlb.iova = iova;
     msg.iotlb.size = size;
@@ -147,7 +147,7 @@ static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
 {
     int fd = v->shared->device_fd;
     struct vhost_msg_v2 msg = {
-        .type = v->msg_type,
+        .type = VHOST_IOTLB_MSG_V2,
         .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
     };
 
@@ -182,7 +182,7 @@ static void vhost_vdpa_listener_commit(MemoryListener *listener)
         return;
     }
 
-    msg.type = v->msg_type;
+    msg.type = VHOST_IOTLB_MSG_V2;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
     trace_vhost_vdpa_listener_commit(v->shared, fd, msg.type, msg.iotlb.type);
@@ -592,7 +592,6 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     v->dev = dev;
     dev->opaque =  opaque ;
     v->listener = vhost_vdpa_memory_listener;
-    v->msg_type = VHOST_IOTLB_MSG_V2;
     vhost_vdpa_init_svq(dev, v);
 
     error_propagate(&dev->migration_blocker, v->migration_blocker);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 10/18] vdpa: move iommu_list to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (8 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 09/18] vdpa: remove msg type of vhost_vdpa Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 11/18] vdpa: use VhostVDPAShared in vdpa_dma_map and unmap Eugenio Pérez
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the iommu_list member to VhostVDPAShared so all vhost_vdpa can use
it, rather than always in the first / last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 2 +-
 hw/virtio/vhost-vdpa.c         | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 5bd964dac5..3880b9e7f2 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -34,6 +34,7 @@ typedef struct VhostVDPAHostNotifier {
 typedef struct vhost_vdpa_shared {
     int device_fd;
     struct vhost_vdpa_iova_range iova_range;
+    QLIST_HEAD(, vdpa_iommu) iommu_list;
 
     /* IOVA mapping used by the Shadow Virtqueue */
     VhostIOVATree *iova_tree;
@@ -62,7 +63,6 @@ typedef struct vhost_vdpa {
     struct vhost_dev *dev;
     Error *migration_blocker;
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
-    QLIST_HEAD(, vdpa_iommu) iommu_list;
     IOMMUNotifier n;
 } VhostVDPA;
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 66ae8ff6f4..2295daa7cc 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -278,7 +278,7 @@ static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
         return;
     }
 
-    QLIST_INSERT_HEAD(&v->iommu_list, iommu, iommu_next);
+    QLIST_INSERT_HEAD(&v->shared->iommu_list, iommu, iommu_next);
     memory_region_iommu_replay(iommu->iommu_mr, &iommu->n);
 
     return;
@@ -291,7 +291,7 @@ static void vhost_vdpa_iommu_region_del(MemoryListener *listener,
 
     struct vdpa_iommu *iommu;
 
-    QLIST_FOREACH(iommu, &v->iommu_list, iommu_next)
+    QLIST_FOREACH(iommu, &v->shared->iommu_list, iommu_next)
     {
         if (MEMORY_REGION(iommu->iommu_mr) == section->mr &&
             iommu->n.start == section->offset_within_region) {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 11/18] vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (9 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 10/18] vdpa: move iommu_list to vhost_vdpa_shared Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 12/18] vdpa: use dev_shared in vdpa_iommu Eugenio Pérez
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

The callers only have the shared information by the end of this series.
Start converting this functions.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  4 +--
 hw/virtio/vhost-vdpa.c         | 47 +++++++++++++++++-----------------
 net/vhost-vdpa.c               |  5 ++--
 3 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 3880b9e7f2..705c754776 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -69,9 +69,9 @@ typedef struct vhost_vdpa {
 int vhost_vdpa_get_iova_range(int fd, struct vhost_vdpa_iova_range *iova_range);
 int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx);
 
-int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly);
-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
 
 typedef struct vdpa_iommu {
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 2295daa7cc..0ed6550aad 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -86,11 +86,11 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
  * The caller must set asid = 0 if the device does not support asid.
  * This is not an ABI break since it is set to 0 by the initializer anyway.
  */
-int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly)
 {
     struct vhost_msg_v2 msg = {};
-    int fd = v->shared->device_fd;
+    int fd = s->device_fd;
     int ret = 0;
 
     msg.type = VHOST_IOTLB_MSG_V2;
@@ -101,7 +101,7 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
     msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
-    trace_vhost_vdpa_dma_map(v->shared, fd, msg.type, msg.asid, msg.iotlb.iova,
+    trace_vhost_vdpa_dma_map(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                              msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
                              msg.iotlb.type);
 
@@ -118,11 +118,11 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
  * The caller must set asid = 0 if the device does not support asid.
  * This is not an ABI break since it is set to 0 by the initializer anyway.
  */
-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size)
 {
     struct vhost_msg_v2 msg = {};
-    int fd = v->shared->device_fd;
+    int fd = s->device_fd;
     int ret = 0;
 
     msg.type = VHOST_IOTLB_MSG_V2;
@@ -131,7 +131,7 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     msg.iotlb.size = size;
     msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
-    trace_vhost_vdpa_dma_unmap(v->shared, fd, msg.type, msg.asid, msg.iotlb.iova,
+    trace_vhost_vdpa_dma_unmap(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                                msg.iotlb.size, msg.iotlb.type);
 
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
@@ -143,29 +143,29 @@ int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
     return ret;
 }
 
-static void vhost_vdpa_listener_begin_batch(struct vhost_vdpa *v)
+static void vhost_vdpa_listener_begin_batch(VhostVDPAShared *s)
 {
-    int fd = v->shared->device_fd;
+    int fd = s->device_fd;
     struct vhost_msg_v2 msg = {
         .type = VHOST_IOTLB_MSG_V2,
         .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
     };
 
-    trace_vhost_vdpa_listener_begin_batch(v->shared, fd, msg.type, msg.iotlb.type);
+    trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
                      fd, errno, strerror(errno));
     }
 }
 
-static void vhost_vdpa_iotlb_batch_begin_once(struct vhost_vdpa *v)
+static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 {
-    if (v->shared->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
-        !v->shared->iotlb_batch_begin_sent) {
-        vhost_vdpa_listener_begin_batch(v);
+    if (s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH) &&
+        !s->iotlb_batch_begin_sent) {
+        vhost_vdpa_listener_begin_batch(s);
     }
 
-    v->shared->iotlb_batch_begin_sent = true;
+    s->iotlb_batch_begin_sent = true;
 }
 
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
@@ -225,7 +225,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL)) {
             return;
         }
-        ret = vhost_vdpa_dma_map(v, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_map(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
                                  iotlb->addr_mask + 1, vaddr, read_only);
         if (ret) {
             error_report("vhost_vdpa_dma_map(%p, 0x%" HWADDR_PRIx ", "
@@ -233,7 +233,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          v, iova, iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vhost_vdpa_dma_unmap(v, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
                                    iotlb->addr_mask + 1);
         if (ret) {
             error_report("vhost_vdpa_dma_unmap(%p, 0x%" HWADDR_PRIx ", "
@@ -369,8 +369,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         iova = mem_region.iova;
     }
 
-    vhost_vdpa_iotlb_batch_begin_once(v);
-    ret = vhost_vdpa_dma_map(v, VHOST_VDPA_GUEST_PA_ASID, iova,
+    vhost_vdpa_iotlb_batch_begin_once(v->shared);
+    ret = vhost_vdpa_dma_map(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
                              int128_get64(llsize), vaddr, section->readonly);
     if (ret) {
         error_report("vhost vdpa map fail!");
@@ -454,13 +454,13 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
         iova = result->iova;
         vhost_iova_tree_remove(v->shared->iova_tree, *result);
     }
-    vhost_vdpa_iotlb_batch_begin_once(v);
+    vhost_vdpa_iotlb_batch_begin_once(v->shared);
     /*
      * The unmap ioctl doesn't accept a full 64-bit. need to check it
      */
     if (int128_eq(llsize, int128_2_64())) {
         llsize = int128_rshift(llsize, 1);
-        ret = vhost_vdpa_dma_unmap(v, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
                                    int128_get64(llsize));
 
         if (ret) {
@@ -470,7 +470,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
         }
         iova += int128_get64(llsize);
     }
-    ret = vhost_vdpa_dma_unmap(v, VHOST_VDPA_GUEST_PA_ASID, iova,
+    ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
                                int128_get64(llsize));
 
     if (ret) {
@@ -1076,7 +1076,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)
     }
 
     size = ROUND_UP(result->size, qemu_real_host_page_size());
-    r = vhost_vdpa_dma_unmap(v, v->address_space_id, result->iova, size);
+    r = vhost_vdpa_dma_unmap(v->shared, v->address_space_id, result->iova,
+                             size);
     if (unlikely(r < 0)) {
         error_report("Unable to unmap SVQ vring: %s (%d)", g_strerror(-r), -r);
         return;
@@ -1116,7 +1117,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap *needle,
         return false;
     }
 
-    r = vhost_vdpa_dma_map(v, v->address_space_id, needle->iova,
+    r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
                            needle->size + 1,
                            (void *)(uintptr_t)needle->translated_addr,
                            needle->perm == IOMMU_RO);
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4d42d7a742..31df7e91a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -462,7 +462,8 @@ static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
         return;
     }
 
-    r = vhost_vdpa_dma_unmap(v, v->address_space_id, map->iova, map->size + 1);
+    r = vhost_vdpa_dma_unmap(v->shared, v->address_space_id, map->iova,
+                             map->size + 1);
     if (unlikely(r != 0)) {
         error_report("Device cannot unmap: %s(%d)", g_strerror(r), r);
     }
@@ -486,7 +487,7 @@ static int vhost_vdpa_cvq_map_buf(struct vhost_vdpa *v, void *buf, size_t size,
         return r;
     }
 
-    r = vhost_vdpa_dma_map(v, v->address_space_id, map.iova,
+    r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
                            vhost_vdpa_net_cvq_cmd_page_len(), buf, !write);
     if (unlikely(r < 0)) {
         goto dma_map_err;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 12/18] vdpa: use dev_shared in vdpa_iommu
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (10 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 11/18] vdpa: use VhostVDPAShared in vdpa_dma_map and unmap Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 13/18] vdpa: move memory listener to vhost_vdpa_shared Eugenio Pérez
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

The memory listener functions can call these too.  Make vdpa_iommu work
with VhostVDPAShared.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  2 +-
 hw/virtio/vhost-vdpa.c         | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 705c754776..2abee2164a 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -75,7 +75,7 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
 
 typedef struct vdpa_iommu {
-    struct vhost_vdpa *dev;
+    VhostVDPAShared *dev_shared;
     IOMMUMemoryRegion *iommu_mr;
     hwaddr iommu_offset;
     IOMMUNotifier n;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 0ed6550aad..61553ad196 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -199,7 +199,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     struct vdpa_iommu *iommu = container_of(n, struct vdpa_iommu, n);
 
     hwaddr iova = iotlb->iova + iommu->iommu_offset;
-    struct vhost_vdpa *v = iommu->dev;
+    VhostVDPAShared *s = iommu->dev_shared;
     void *vaddr;
     int ret;
     Int128 llend;
@@ -212,10 +212,10 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     RCU_READ_LOCK_GUARD();
     /* check if RAM section out of device range */
     llend = int128_add(int128_makes64(iotlb->addr_mask), int128_makes64(iova));
-    if (int128_gt(llend, int128_make64(v->shared->iova_range.last))) {
+    if (int128_gt(llend, int128_make64(s->iova_range.last))) {
         error_report("RAM section out of device range (max=0x%" PRIx64
                      ", end addr=0x%" PRIx64 ")",
-                     v->shared->iova_range.last, int128_get64(llend));
+                     s->iova_range.last, int128_get64(llend));
         return;
     }
 
@@ -225,20 +225,20 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL)) {
             return;
         }
-        ret = vhost_vdpa_dma_map(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                  iotlb->addr_mask + 1, vaddr, read_only);
         if (ret) {
             error_report("vhost_vdpa_dma_map(%p, 0x%" HWADDR_PRIx ", "
                          "0x%" HWADDR_PRIx ", %p) = %d (%m)",
-                         v, iova, iotlb->addr_mask + 1, vaddr, ret);
+                         s, iova, iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_unmap(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                    iotlb->addr_mask + 1);
         if (ret) {
             error_report("vhost_vdpa_dma_unmap(%p, 0x%" HWADDR_PRIx ", "
                          "0x%" HWADDR_PRIx ") = %d (%m)",
-                         v, iova, iotlb->addr_mask + 1, ret);
+                         s, iova, iotlb->addr_mask + 1, ret);
         }
     }
 }
@@ -270,7 +270,7 @@ static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
                         iommu_idx);
     iommu->iommu_offset = section->offset_within_address_space -
                           section->offset_within_region;
-    iommu->dev = v;
+    iommu->dev_shared = v->shared;
 
     ret = memory_region_register_iommu_notifier(section->mr, &iommu->n, NULL);
     if (ret) {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 13/18] vdpa: move memory listener to vhost_vdpa_shared
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (11 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 12/18] vdpa: use dev_shared in vdpa_iommu Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 14/18] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the memory listener to a common place rather than always in the
first / last vhost_vdpa.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  2 +-
 hw/virtio/vhost-vdpa.c         | 84 ++++++++++++++++------------------
 2 files changed, 40 insertions(+), 46 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 2abee2164a..8f54e5edd4 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -33,6 +33,7 @@ typedef struct VhostVDPAHostNotifier {
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
     int device_fd;
+    MemoryListener listener;
     struct vhost_vdpa_iova_range iova_range;
     QLIST_HEAD(, vdpa_iommu) iommu_list;
 
@@ -51,7 +52,6 @@ typedef struct vhost_vdpa_shared {
 typedef struct vhost_vdpa {
     int index;
     uint32_t address_space_id;
-    MemoryListener listener;
     uint64_t acked_features;
     bool shadow_vqs_enabled;
     /* Device suspended successfully */
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 61553ad196..7500c2fc82 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -170,28 +170,28 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
 {
-    struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
     struct vhost_msg_v2 msg = {};
-    int fd = v->shared->device_fd;
+    int fd = s->device_fd;
 
-    if (!(v->shared->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
+    if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
         return;
     }
 
-    if (!v->shared->iotlb_batch_begin_sent) {
+    if (!s->iotlb_batch_begin_sent) {
         return;
     }
 
     msg.type = VHOST_IOTLB_MSG_V2;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
-    trace_vhost_vdpa_listener_commit(v->shared, fd, msg.type, msg.iotlb.type);
+    trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
                      fd, errno, strerror(errno));
     }
 
-    v->shared->iotlb_batch_begin_sent = false;
+    s->iotlb_batch_begin_sent = false;
 }
 
 static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
@@ -246,7 +246,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
 {
-    struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
 
     struct vdpa_iommu *iommu;
     Int128 end;
@@ -270,7 +270,7 @@ static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
                         iommu_idx);
     iommu->iommu_offset = section->offset_within_address_space -
                           section->offset_within_region;
-    iommu->dev_shared = v->shared;
+    iommu->dev_shared = s;
 
     ret = memory_region_register_iommu_notifier(section->mr, &iommu->n, NULL);
     if (ret) {
@@ -278,7 +278,7 @@ static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
         return;
     }
 
-    QLIST_INSERT_HEAD(&v->shared->iommu_list, iommu, iommu_next);
+    QLIST_INSERT_HEAD(&s->iommu_list, iommu, iommu_next);
     memory_region_iommu_replay(iommu->iommu_mr, &iommu->n);
 
     return;
@@ -287,11 +287,11 @@ static void vhost_vdpa_iommu_region_add(MemoryListener *listener,
 static void vhost_vdpa_iommu_region_del(MemoryListener *listener,
                                         MemoryRegionSection *section)
 {
-    struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
 
     struct vdpa_iommu *iommu;
 
-    QLIST_FOREACH(iommu, &v->shared->iommu_list, iommu_next)
+    QLIST_FOREACH(iommu, &s->iommu_list, iommu_next)
     {
         if (MEMORY_REGION(iommu->iommu_mr) == section->mr &&
             iommu->n.start == section->offset_within_region) {
@@ -307,7 +307,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
                                            MemoryRegionSection *section)
 {
     DMAMap mem_region = {};
-    struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
     hwaddr iova;
     Int128 llend, llsize;
     void *vaddr;
@@ -315,10 +315,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
     int page_size = qemu_target_page_size();
     int page_mask = -page_size;
 
-    if (vhost_vdpa_listener_skipped_section(section,
-                                            v->shared->iova_range.first,
-                                            v->shared->iova_range.last,
-                                            page_mask)) {
+    if (vhost_vdpa_listener_skipped_section(section, s->iova_range.first,
+                                            s->iova_range.last, page_mask)) {
         return;
     }
     if (memory_region_is_iommu(section->mr)) {
@@ -328,8 +326,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 
     if (unlikely((section->offset_within_address_space & ~page_mask) !=
                  (section->offset_within_region & ~page_mask))) {
-        trace_vhost_vdpa_listener_region_add_unaligned(v->shared,
-                       section->mr->name,
+        trace_vhost_vdpa_listener_region_add_unaligned(s, section->mr->name,
                        section->offset_within_address_space & ~page_mask,
                        section->offset_within_region & ~page_mask);
         return;
@@ -349,18 +346,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
             section->offset_within_region +
             (iova - section->offset_within_address_space);
 
-    trace_vhost_vdpa_listener_region_add(v->shared, iova, int128_get64(llend),
+    trace_vhost_vdpa_listener_region_add(s, iova, int128_get64(llend),
                                          vaddr, section->readonly);
 
     llsize = int128_sub(llend, int128_make64(iova));
-    if (v->shared->shadow_data) {
+    if (s->shadow_data) {
         int r;
 
         mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
         mem_region.size = int128_get64(llsize) - 1,
         mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
 
-        r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &mem_region);
+        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
         if (unlikely(r != IOVA_OK)) {
             error_report("Can't allocate a mapping (%d)", r);
             goto fail;
@@ -369,8 +366,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         iova = mem_region.iova;
     }
 
-    vhost_vdpa_iotlb_batch_begin_once(v->shared);
-    ret = vhost_vdpa_dma_map(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
+    vhost_vdpa_iotlb_batch_begin_once(s);
+    ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                              int128_get64(llsize), vaddr, section->readonly);
     if (ret) {
         error_report("vhost vdpa map fail!");
@@ -380,8 +377,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
     return;
 
 fail_map:
-    if (v->shared->shadow_data) {
-        vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
+    if (s->shadow_data) {
+        vhost_iova_tree_remove(s->iova_tree, mem_region);
     }
 
 fail:
@@ -398,17 +395,15 @@ fail:
 static void vhost_vdpa_listener_region_del(MemoryListener *listener,
                                            MemoryRegionSection *section)
 {
-    struct vhost_vdpa *v = container_of(listener, struct vhost_vdpa, listener);
+    VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
     hwaddr iova;
     Int128 llend, llsize;
     int ret;
     int page_size = qemu_target_page_size();
     int page_mask = -page_size;
 
-    if (vhost_vdpa_listener_skipped_section(section,
-                                            v->shared->iova_range.first,
-                                            v->shared->iova_range.last,
-                                            page_mask)) {
+    if (vhost_vdpa_listener_skipped_section(section, s->iova_range.first,
+                                            s->iova_range.last, page_mask)) {
         return;
     }
     if (memory_region_is_iommu(section->mr)) {
@@ -417,8 +412,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     if (unlikely((section->offset_within_address_space & ~page_mask) !=
                  (section->offset_within_region & ~page_mask))) {
-        trace_vhost_vdpa_listener_region_del_unaligned(v->shared,
-                       section->mr->name,
+        trace_vhost_vdpa_listener_region_del_unaligned(s, section->mr->name,
                        section->offset_within_address_space & ~page_mask,
                        section->offset_within_region & ~page_mask);
         return;
@@ -427,7 +421,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
     iova = ROUND_UP(section->offset_within_address_space, page_size);
     llend = vhost_vdpa_section_end(section, page_mask);
 
-    trace_vhost_vdpa_listener_region_del(v->shared, iova,
+    trace_vhost_vdpa_listener_region_del(s, iova,
         int128_get64(int128_sub(llend, int128_one())));
 
     if (int128_ge(int128_make64(iova), llend)) {
@@ -436,7 +430,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     llsize = int128_sub(llend, int128_make64(iova));
 
-    if (v->shared->shadow_data) {
+    if (s->shadow_data) {
         const DMAMap *result;
         const void *vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
@@ -446,37 +440,37 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
             .size = int128_get64(llsize) - 1,
         };
 
-        result = vhost_iova_tree_find_iova(v->shared->iova_tree, &mem_region);
+        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
         if (!result) {
             /* The memory listener map wasn't mapped */
             return;
         }
         iova = result->iova;
-        vhost_iova_tree_remove(v->shared->iova_tree, *result);
+        vhost_iova_tree_remove(s->iova_tree, *result);
     }
-    vhost_vdpa_iotlb_batch_begin_once(v->shared);
+    vhost_vdpa_iotlb_batch_begin_once(s);
     /*
      * The unmap ioctl doesn't accept a full 64-bit. need to check it
      */
     if (int128_eq(llsize, int128_2_64())) {
         llsize = int128_rshift(llsize, 1);
-        ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
+        ret = vhost_vdpa_dma_unmap(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                    int128_get64(llsize));
 
         if (ret) {
             error_report("vhost_vdpa_dma_unmap(%p, 0x%" HWADDR_PRIx ", "
                          "0x%" HWADDR_PRIx ") = %d (%m)",
-                         v, iova, int128_get64(llsize), ret);
+                         s, iova, int128_get64(llsize), ret);
         }
         iova += int128_get64(llsize);
     }
-    ret = vhost_vdpa_dma_unmap(v->shared, VHOST_VDPA_GUEST_PA_ASID, iova,
+    ret = vhost_vdpa_dma_unmap(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                int128_get64(llsize));
 
     if (ret) {
         error_report("vhost_vdpa_dma_unmap(%p, 0x%" HWADDR_PRIx ", "
                      "0x%" HWADDR_PRIx ") = %d (%m)",
-                     v, iova, int128_get64(llsize), ret);
+                     s, iova, int128_get64(llsize), ret);
     }
 
     memory_region_unref(section->mr);
@@ -591,7 +585,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 
     v->dev = dev;
     dev->opaque =  opaque ;
-    v->listener = vhost_vdpa_memory_listener;
+    v->shared->listener = vhost_vdpa_memory_listener;
     vhost_vdpa_init_svq(dev, v);
 
     error_propagate(&dev->migration_blocker, v->migration_blocker);
@@ -754,7 +748,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
     }
 
     vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
-    memory_listener_unregister(&v->listener);
+    memory_listener_unregister(&v->shared->listener);
     vhost_vdpa_svq_cleanup(dev);
 
     dev->opaque = NULL;
@@ -1327,7 +1321,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->listener, dev->vdev->dma_as);
+        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1346,7 +1340,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_reset_device(dev);
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
-    memory_listener_unregister(&v->listener);
+    memory_listener_unregister(&v->shared->listener);
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 14/18] vdpa: do not set virtio status bits if unneeded
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (12 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 13/18] vdpa: move memory listener to vhost_vdpa_shared Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Next commits will set DRIVER and ACKNOWLEDGE flags repeatedly in the
case of a migration destination.  Let's save ioctls with this.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 7500c2fc82..cc252fc2d8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -510,6 +510,10 @@ static int vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
     if (ret < 0) {
         return ret;
     }
+    if ((s & status) == status) {
+        /* Don't set bits already set */
+        return 0;
+    }
 
     s |= status;
 
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (13 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 14/18] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-11-02  8:48   ` Si-Wei Liu
  2023-10-19 14:34 ` [RFC PATCH 16/18] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Callers can use this function to setup the incoming migration.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  7 +++++++
 hw/virtio/vhost-vdpa.c         | 17 ++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..edc08b7a02 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -45,6 +45,12 @@ typedef struct vhost_vdpa_shared {
 
     bool iotlb_batch_begin_sent;
 
+    /*
+     * The memory listener has been registered, so DMA maps have been sent to
+     * the device.
+     */
+    bool listener_registered;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 } VhostVDPAShared;
@@ -73,6 +79,7 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly);
 int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
 
 typedef struct vdpa_iommu {
     VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index cc252fc2d8..bfbe4673af 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1325,7 +1325,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        if (!v->shared->listener_registered) {
+            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        }
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1528,3 +1530,16 @@ const VhostOps vdpa_ops = {
         .vhost_set_config_call = vhost_vdpa_set_config_call,
         .vhost_reset_status = vhost_vdpa_reset_status,
 };
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    memory_listener_register(&shared->listener, dma_as);
+    shared->listener_registered = true;
+    return 0;
+}
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup
  2023-10-19 14:34 ` [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
@ 2023-11-02  8:48   ` Si-Wei Liu
  2023-11-02 15:24     ` Eugenio Perez Martin
  0 siblings, 1 reply; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-02  8:48 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> Callers can use this function to setup the incoming migration.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-vdpa.h |  7 +++++++
>   hw/virtio/vhost-vdpa.c         | 17 ++++++++++++++++-
>   2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 8f54e5edd4..edc08b7a02 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -45,6 +45,12 @@ typedef struct vhost_vdpa_shared {
>   
>       bool iotlb_batch_begin_sent;
>   
> +    /*
> +     * The memory listener has been registered, so DMA maps have been sent to
> +     * the device.
> +     */
> +    bool listener_registered;
> +
>       /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
>       bool shadow_data;
>   } VhostVDPAShared;
> @@ -73,6 +79,7 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
>                          hwaddr size, void *vaddr, bool readonly);
>   int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
>                            hwaddr size);
> +int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
>   
>   typedef struct vdpa_iommu {
>       VhostVDPAShared *dev_shared;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index cc252fc2d8..bfbe4673af 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1325,7 +1325,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>                            "IOMMU and try again");
>               return -1;
>           }
> -        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
> +        if (!v->shared->listener_registered) {
> +            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
> +        }
Set listener_registered to true after registration; in addition, it 
looks like the memory_listener_unregister in vhost_vdpa_reset_status 
doesn't clear the listener_registered flag after unregistration. This 
code path can be called during SVQ switching, if not doing so mapping 
can't be added back after a couple rounds of SVQ switching or live 
migration.

-Siwei

>   
>           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>       }
> @@ -1528,3 +1530,16 @@ const VhostOps vdpa_ops = {
>           .vhost_set_config_call = vhost_vdpa_set_config_call,
>           .vhost_reset_status = vhost_vdpa_reset_status,
>   };
> +
> +int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
> +{
> +    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
> +    if (unlikely(r < 0)) {
> +        return r;
> +    }
> +
> +    memory_listener_register(&shared->listener, dma_as);
> +    shared->listener_registered = true;
> +    return 0;
> +}



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup
  2023-11-02  8:48   ` Si-Wei Liu
@ 2023-11-02 15:24     ` Eugenio Perez Martin
  0 siblings, 0 replies; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-11-02 15:24 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Thu, Nov 2, 2023 at 9:49 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > Callers can use this function to setup the incoming migration.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-vdpa.h |  7 +++++++
> >   hw/virtio/vhost-vdpa.c         | 17 ++++++++++++++++-
> >   2 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 8f54e5edd4..edc08b7a02 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -45,6 +45,12 @@ typedef struct vhost_vdpa_shared {
> >
> >       bool iotlb_batch_begin_sent;
> >
> > +    /*
> > +     * The memory listener has been registered, so DMA maps have been sent to
> > +     * the device.
> > +     */
> > +    bool listener_registered;
> > +
> >       /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
> >       bool shadow_data;
> >   } VhostVDPAShared;
> > @@ -73,6 +79,7 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
> >                          hwaddr size, void *vaddr, bool readonly);
> >   int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
> >                            hwaddr size);
> > +int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
> >
> >   typedef struct vdpa_iommu {
> >       VhostVDPAShared *dev_shared;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index cc252fc2d8..bfbe4673af 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -1325,7 +1325,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >                            "IOMMU and try again");
> >               return -1;
> >           }
> > -        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
> > +        if (!v->shared->listener_registered) {
> > +            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
> > +        }
> Set listener_registered to true after registration; in addition, it
> looks like the memory_listener_unregister in vhost_vdpa_reset_status
> doesn't clear the listener_registered flag after unregistration. This
> code path can be called during SVQ switching, if not doing so mapping
> can't be added back after a couple rounds of SVQ switching or live
> migration.
>

That's right, it should be also set to false in the unregister. Thanks
for the catch!

> -Siwei
>
> >
> >           return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >       }
> > @@ -1528,3 +1530,16 @@ const VhostOps vdpa_ops = {
> >           .vhost_set_config_call = vhost_vdpa_set_config_call,
> >           .vhost_reset_status = vhost_vdpa_reset_status,
> >   };
> > +
> > +int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
> > +{
> > +    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> > +    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
> > +    if (unlikely(r < 0)) {
> > +        return r;
> > +    }
> > +
> > +    memory_listener_register(&shared->listener, dma_as);
> > +    shared->listener_registered = true;
> > +    return 0;
> > +}
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 16/18] vdpa: add vhost_vdpa_net_load_setup NetClient callback
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (14 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 17/18] vdpa: use shadow_data instead of first device v->shadow_vqs_enabled Eugenio Pérez
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

So the vDPA backend knows when a migration incoming starts.  NicState
argument is needed so we can get the dma address space.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/net/net.h |  4 ++++
 net/vhost-vdpa.c  | 10 ++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 2fb1c9181c..e0ecf85bf6 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -42,6 +42,7 @@ typedef struct NICConf {
 
 /* Net clients */
 
+struct NICState;
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
 typedef int (NetStart)(NetClientState *);
@@ -69,6 +70,8 @@ typedef void (SocketReadStateFinalize)(SocketReadState *rs);
 typedef void (NetAnnounce)(NetClientState *);
 typedef bool (SetSteeringEBPF)(NetClientState *, int);
 typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
+/* This can be called before start & pair, so get also the peer */
+typedef int (NetMigrationLoadSetup)(NetClientState *, struct NICState *);
 
 typedef struct NetClientInfo {
     NetClientDriver type;
@@ -98,6 +101,7 @@ typedef struct NetClientInfo {
     NetAnnounce *announce;
     SetSteeringEBPF *set_steering_ebpf;
     NetCheckPeerType *check_peer_type;
+    NetMigrationLoadSetup *load_setup;
 } NetClientInfo;
 
 struct NetClientState {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 31df7e91a1..507cb946d2 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -396,6 +396,15 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
     }
 }
 
+static int vhost_vdpa_net_load_setup(NetClientState *nc, NICState *nic)
+{
+    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    VirtIONet *n = qemu_get_nic_opaque(&nic->ncs[0]);
+
+    vhost_vdpa_load_setup(s->vhost_vdpa.shared, n->parent_obj.dma_as);
+    return 0;
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
         .type = NET_CLIENT_DRIVER_VHOST_VDPA,
         .size = sizeof(VhostVDPAState),
@@ -407,6 +416,7 @@ static NetClientInfo net_vhost_vdpa_info = {
         .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
         .has_ufo = vhost_vdpa_has_ufo,
         .check_peer_type = vhost_vdpa_check_peer_type,
+        .load_setup = vhost_vdpa_net_load_setup,
 };
 
 static int64_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 17/18] vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (15 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 16/18] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-10-19 14:34 ` [RFC PATCH 18/18] virtio_net: register incremental migration handlers Eugenio Pérez
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

Previously we trusted vq_index == 0 will be started first. By the end of
this series this might not be true, so let's check proper shadow_data
for this.
---
 hw/virtio/vhost-vdpa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bfbe4673af..1f2ef953f6 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -802,7 +802,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
         return 0;
     }
 
-    if (v->shadow_vqs_enabled) {
+    if (v->shared->shadow_data) {
         if ((v->acked_features ^ features) == BIT_ULL(VHOST_F_LOG_ALL)) {
             /*
              * QEMU is just trying to enable or disable logging. SVQ handles
@@ -1353,7 +1353,7 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
                                      struct vhost_log *log)
 {
     struct vhost_vdpa *v = dev->opaque;
-    if (v->shadow_vqs_enabled || !vhost_vdpa_first_dev(dev)) {
+    if (v->shared->shadow_data || !vhost_vdpa_first_dev(dev)) {
         return 0;
     }
 
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 18/18] virtio_net: register incremental migration handlers
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (16 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 17/18] vdpa: use shadow_data instead of first device v->shadow_vqs_enabled Eugenio Pérez
@ 2023-10-19 14:34 ` Eugenio Pérez
  2023-11-02  4:36 ` [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Jason Wang
  2023-11-02 10:12 ` Si-Wei Liu
  19 siblings, 0 replies; 34+ messages in thread
From: Eugenio Pérez @ 2023-10-19 14:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

This way VirtIONet can detect when the incoming migration starts.

While registering in the backend (nc->peer) seems more logical, we need
nic dma address space, and we cannot get it from the backend.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
This could be done in vhost_vdpa or VirtIODevice struct, but future
series will add state restore through CVQ so it's easier to start in
VirtIONEt directly.  If we need to make this more generic, we can move
to VirtIODevice and expose callbacks from VirtIONet class.

Also, the pointer may not be the best id, but there are not a lot of
things initialized in n.
---
 hw/net/virtio-net.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 29e33ea5ed..470338fa7c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -38,6 +38,7 @@
 #include "qapi/qapi-events-migration.h"
 #include "hw/virtio/virtio-access.h"
 #include "migration/misc.h"
+#include "migration/register.h"
 #include "standard-headers/linux/ethtool.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
@@ -3808,9 +3809,27 @@ static void virtio_net_device_unrealize(DeviceState *dev)
     virtio_cleanup(vdev);
 }
 
+static int virtio_net_load_setup(QEMUFile *f, void *opaque)
+{
+    VirtIONet *n = opaque;
+    NetClientState *nc = qemu_get_subqueue(n->nic, 0);
+
+    if (nc->peer->info->load_setup) {
+        return nc->peer->info->load_setup(nc->peer, n->nic);
+    }
+
+    return 0;
+}
+
+static const SaveVMHandlers savevm_virtio_net_handlers = {
+    .load_setup = virtio_net_load_setup,
+};
+
+
 static void virtio_net_instance_init(Object *obj)
 {
     VirtIONet *n = VIRTIO_NET(obj);
+    g_autoptr(GString) id = g_string_new(NULL);
 
     /*
      * The default config_size is sizeof(struct virtio_net_config).
@@ -3822,6 +3841,10 @@ static void virtio_net_instance_init(Object *obj)
                                   DEVICE(n));
 
     ebpf_rss_init(&n->ebpf_rss);
+
+    g_string_printf(id, "%p", n);
+    register_savevm_live(id->str, VMSTATE_INSTANCE_ID_ANY, 1,
+                         &savevm_virtio_net_handlers, n);
 }
 
 static int virtio_net_pre_save(void *opaque)
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (17 preceding siblings ...)
  2023-10-19 14:34 ` [RFC PATCH 18/18] virtio_net: register incremental migration handlers Eugenio Pérez
@ 2023-11-02  4:36 ` Jason Wang
  2023-11-02 10:12 ` Si-Wei Liu
  19 siblings, 0 replies; 34+ messages in thread
From: Jason Wang @ 2023-11-02  4:36 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, si-wei.liu,
	Gautam Dawar

On Thu, Oct 19, 2023 at 10:35 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment the migration starts.  Moving that operation allows
> QEMU to communicate the kernel the maps while the workload is still running in
> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> saving all the pinning time.
>
> Note that further devices setup at the end of the migration may alter the guest
> memory layout. But same as the previous point, many operations are still done
> incrementally, like memory pinning, so we're saving time anyway.
>
> The first bunch of patches just reorganizes the code, so memory related
> operation parameters are shared between all vhost_vdpa devices.  This is
> because the destination does not know what vhost_vdpa struct will have the
> registered listener member, so it is easier to place them in a shared struct
> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> these patches.
>
> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> his experience.

I'd expect we can see some improvement even without other
optimizations? For example, do we see improvement on mlx5?

(Or we can probably add some delay to the simulator to see)

Thanks

>
> Future directions on top of this series may include:
> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>   vhost-vdpa net can apply the configuration through CVQ in the destination
>   while the source is still migrating.
> * Move more things ahead of migration time, like DRIVER_OK.
> * Check that the devices of the destination are valid, and cancel the migration
>   in case it is not.
>
> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
> Eugenio Pérez (18):
>   vdpa: add VhostVDPAShared
>   vdpa: move iova tree to the shared struct
>   vdpa: move iova_range to vhost_vdpa_shared
>   vdpa: move shadow_data to vhost_vdpa_shared
>   vdpa: use vdpa shared for tracing
>   vdpa: move file descriptor to vhost_vdpa_shared
>   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>   vdpa: move backend_cap to vhost_vdpa_shared
>   vdpa: remove msg type of vhost_vdpa
>   vdpa: move iommu_list to vhost_vdpa_shared
>   vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>   vdpa: use dev_shared in vdpa_iommu
>   vdpa: move memory listener to vhost_vdpa_shared
>   vdpa: do not set virtio status bits if unneeded
>   vdpa: add vhost_vdpa_load_setup
>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
>   vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>   virtio_net: register incremental migration handlers
>
>  include/hw/virtio/vhost-vdpa.h |  43 +++++---
>  include/net/net.h              |   4 +
>  hw/net/virtio-net.c            |  23 +++++
>  hw/virtio/vdpa-dev.c           |   7 +-
>  hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>  net/vhost-vdpa.c               | 127 ++++++++++++-----------
>  hw/virtio/trace-events         |  14 +--
>  7 files changed, 239 insertions(+), 162 deletions(-)
>
> --
> 2.39.3
>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
                   ` (18 preceding siblings ...)
  2023-11-02  4:36 ` [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Jason Wang
@ 2023-11-02 10:12 ` Si-Wei Liu
  2023-11-02 12:37   ` Eugenio Perez Martin
  2023-11-03 20:40   ` Si-Wei Liu
  19 siblings, 2 replies; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-02 10:12 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> Current memory operations like pinning may take a lot of time at the
>
> destination.  Currently they are done after the source of the migration is
>
> stopped, and before the workload is resumed at the destination.  This is a
>
> period where neigher traffic can flow, nor the VM workload can continue
>
> (downtime).
>
>
>
> We can do better as we know the memory layout of the guest RAM at the
>
> destination from the moment the migration starts.  Moving that operation allows
>
> QEMU to communicate the kernel the maps while the workload is still running in
>
> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>
> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>
> saving all the pinning time.
I get what you want to say, though not sure how pinning is relevant to 
on-chip IOMMU and .set_map here, essentially pinning is required for all 
parent vdpa drivers that perform DMA hence don't want VM pages to move 
around.
>
>
>
> Note that further devices setup at the end of the migration may alter the guest
>
> memory layout. But same as the previous point, many operations are still done
>
> incrementally, like memory pinning, so we're saving time anyway.
>
>
>
> The first bunch of patches just reorganizes the code, so memory related
>
> operation parameters are shared between all vhost_vdpa devices.  This is
>
> because the destination does not know what vhost_vdpa struct will have the
>
> registered listener member, so it is easier to place them in a shared struct
>
> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>
> these patches.
It looks this VhostVDPAShared facility (patch 1-13) is also what I need 
in my SVQ descriptor group series [*], for which I've built similar 
construct there. If possible please try to merge this in ASAP. I'll 
rework my series on top of that.

[*] 
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5

>
>
>
> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>
> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>
> his experience.
Haven't done the full benchmark compared to pre-map at destination yet, 
though an observation is that the destination QEMU seems very easy to 
get stuck for very long time while in mid of pinning pages. During this 
period, any client doing read-only QMP query or executing HMP info 
command got frozen indefinitely (subject to how large size the memory is 
being pinned). Is it possible to unblock those QMP request or HMP 
command from being executed (at least the read-only ones) while in 
migration? Yield from the load_setup corourtine and spawn another thread?

Having said, not sure if .load_setup is a good fit for what we want to 
do. Searching all current users of .load_setup, either the job can be 
done instantly or the task is time bound without trapping into kernel 
for too long. Maybe pinning is too special use case here...

-Siwei
>
>
>
> Future directions on top of this series may include:
>
> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>
>    vhost-vdpa net can apply the configuration through CVQ in the destination
>
>    while the source is still migrating.
>
> * Move more things ahead of migration time, like DRIVER_OK.
>
> * Check that the devices of the destination are valid, and cancel the migration
>
>    in case it is not.
>
>
>
> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
>
>
> Eugenio Pérez (18):
>
>    vdpa: add VhostVDPAShared
>
>    vdpa: move iova tree to the shared struct
>
>    vdpa: move iova_range to vhost_vdpa_shared
>
>    vdpa: move shadow_data to vhost_vdpa_shared
>
>    vdpa: use vdpa shared for tracing
>
>    vdpa: move file descriptor to vhost_vdpa_shared
>
>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>
>    vdpa: move backend_cap to vhost_vdpa_shared
>
>    vdpa: remove msg type of vhost_vdpa
>
>    vdpa: move iommu_list to vhost_vdpa_shared
>
>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>
>    vdpa: use dev_shared in vdpa_iommu
>
>    vdpa: move memory listener to vhost_vdpa_shared
>
>    vdpa: do not set virtio status bits if unneeded
>
>    vdpa: add vhost_vdpa_load_setup
>
>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
>
>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>
>    virtio_net: register incremental migration handlers
>
>
>
>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
>
>   include/net/net.h              |   4 +
>
>   hw/net/virtio-net.c            |  23 +++++
>
>   hw/virtio/vdpa-dev.c           |   7 +-
>
>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>
>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
>
>   hw/virtio/trace-events         |  14 +--
>
>   7 files changed, 239 insertions(+), 162 deletions(-)
>
>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-02 10:12 ` Si-Wei Liu
@ 2023-11-02 12:37   ` Eugenio Perez Martin
  2023-11-03 20:19     ` Si-Wei Liu
  2023-11-06  4:17     ` Jason Wang
  2023-11-03 20:40   ` Si-Wei Liu
  1 sibling, 2 replies; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-11-02 12:37 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > Current memory operations like pinning may take a lot of time at the
> >
> > destination.  Currently they are done after the source of the migration is
> >
> > stopped, and before the workload is resumed at the destination.  This is a
> >
> > period where neigher traffic can flow, nor the VM workload can continue
> >
> > (downtime).
> >
> >
> >
> > We can do better as we know the memory layout of the guest RAM at the
> >
> > destination from the moment the migration starts.  Moving that operation allows
> >
> > QEMU to communicate the kernel the maps while the workload is still running in
> >
> > the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> >
> > but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> >
> > saving all the pinning time.
> I get what you want to say, though not sure how pinning is relevant to
> on-chip IOMMU and .set_map here, essentially pinning is required for all
> parent vdpa drivers that perform DMA hence don't want VM pages to move
> around.

Basically highlighting that the work done under .set_map is not only
pinning, but it is a significant fraction on it. It can be reworded or
deleted for sure.

> >
> >
> >
> > Note that further devices setup at the end of the migration may alter the guest
> >
> > memory layout. But same as the previous point, many operations are still done
> >
> > incrementally, like memory pinning, so we're saving time anyway.
> >
> >
> >
> > The first bunch of patches just reorganizes the code, so memory related
> >
> > operation parameters are shared between all vhost_vdpa devices.  This is
> >
> > because the destination does not know what vhost_vdpa struct will have the
> >
> > registered listener member, so it is easier to place them in a shared struct
> >
> > rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> >
> > these patches.
> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> in my SVQ descriptor group series [*], for which I've built similar
> construct there. If possible please try to merge this in ASAP. I'll
> rework my series on top of that.
>
> [*]
> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>

I can send it individually, for sure.

MST, Jason, can this first part be merged? It doesn't add a lot by
itself but it helps pave the way for future changes.

> >
> >
> >
> > Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> >
> > like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> >
> > his experience.
> Haven't done the full benchmark compared to pre-map at destination yet,
> though an observation is that the destination QEMU seems very easy to
> get stuck for very long time while in mid of pinning pages. During this
> period, any client doing read-only QMP query or executing HMP info
> command got frozen indefinitely (subject to how large size the memory is
> being pinned). Is it possible to unblock those QMP request or HMP
> command from being executed (at least the read-only ones) while in
> migration? Yield from the load_setup corourtine and spawn another thread?
>

Ok, I wasn't aware of that.

I think we cannot yield in a coroutine and wait for an ioctl. One
option that came to my mind is to effectively use another thread, and
use a POSIX barrier (or equivalent on glib / QEMU) before finishing
the migration. I'm not sure if there are more points where we can
check the barrier and tell the migration to continue or stop though.

Another option is to effectively start doing these ioctls in an
asynchronous way, io_uring cmds like, but I'd like to achieve this
first.

> Having said, not sure if .load_setup is a good fit for what we want to
> do. Searching all current users of .load_setup, either the job can be
> done instantly or the task is time bound without trapping into kernel
> for too long. Maybe pinning is too special use case here...
>
> -Siwei
> >
> >
> >
> > Future directions on top of this series may include:
> >
> > * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
> >
> >    vhost-vdpa net can apply the configuration through CVQ in the destination
> >
> >    while the source is still migrating.
> >
> > * Move more things ahead of migration time, like DRIVER_OK.
> >
> > * Check that the devices of the destination are valid, and cancel the migration
> >
> >    in case it is not.
> >
> >
> >
> > [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >
> >
> >
> > Eugenio Pérez (18):
> >
> >    vdpa: add VhostVDPAShared
> >
> >    vdpa: move iova tree to the shared struct
> >
> >    vdpa: move iova_range to vhost_vdpa_shared
> >
> >    vdpa: move shadow_data to vhost_vdpa_shared
> >
> >    vdpa: use vdpa shared for tracing
> >
> >    vdpa: move file descriptor to vhost_vdpa_shared
> >
> >    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >
> >    vdpa: move backend_cap to vhost_vdpa_shared
> >
> >    vdpa: remove msg type of vhost_vdpa
> >
> >    vdpa: move iommu_list to vhost_vdpa_shared
> >
> >    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >
> >    vdpa: use dev_shared in vdpa_iommu
> >
> >    vdpa: move memory listener to vhost_vdpa_shared
> >
> >    vdpa: do not set virtio status bits if unneeded
> >
> >    vdpa: add vhost_vdpa_load_setup
> >
> >    vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >
> >    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >
> >    virtio_net: register incremental migration handlers
> >
> >
> >
> >   include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >
> >   include/net/net.h              |   4 +
> >
> >   hw/net/virtio-net.c            |  23 +++++
> >
> >   hw/virtio/vdpa-dev.c           |   7 +-
> >
> >   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >
> >   net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >
> >   hw/virtio/trace-events         |  14 +--
> >
> >   7 files changed, 239 insertions(+), 162 deletions(-)
> >
> >
> >
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-02 12:37   ` Eugenio Perez Martin
@ 2023-11-03 20:19     ` Si-Wei Liu
  2023-12-05 14:23       ` Eugenio Perez Martin
  2023-11-06  4:17     ` Jason Wang
  1 sibling, 1 reply; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-03 20:19 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar



On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>>> Current memory operations like pinning may take a lot of time at the
>>>
>>> destination.  Currently they are done after the source of the migration is
>>>
>>> stopped, and before the workload is resumed at the destination.  This is a
>>>
>>> period where neigher traffic can flow, nor the VM workload can continue
>>>
>>> (downtime).
>>>
>>>
>>>
>>> We can do better as we know the memory layout of the guest RAM at the
>>>
>>> destination from the moment the migration starts.  Moving that operation allows
>>>
>>> QEMU to communicate the kernel the maps while the workload is still running in
>>>
>>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>>>
>>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>>>
>>> saving all the pinning time.
>> I get what you want to say, though not sure how pinning is relevant to
>> on-chip IOMMU and .set_map here, essentially pinning is required for all
>> parent vdpa drivers that perform DMA hence don't want VM pages to move
>> around.
> Basically highlighting that the work done under .set_map is not only
> pinning, but it is a significant fraction on it. It can be reworded or
> deleted for sure.
>
>>>
>>>
>>> Note that further devices setup at the end of the migration may alter the guest
>>>
>>> memory layout. But same as the previous point, many operations are still done
>>>
>>> incrementally, like memory pinning, so we're saving time anyway.
>>>
>>>
>>>
>>> The first bunch of patches just reorganizes the code, so memory related
>>>
>>> operation parameters are shared between all vhost_vdpa devices.  This is
>>>
>>> because the destination does not know what vhost_vdpa struct will have the
>>>
>>> registered listener member, so it is easier to place them in a shared struct
>>>
>>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>>>
>>> these patches.
>> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
>> in my SVQ descriptor group series [*], for which I've built similar
>> construct there. If possible please try to merge this in ASAP. I'll
>> rework my series on top of that.
>>
>> [*]
>> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>>
> I can send it individually, for sure.
>
> MST, Jason, can this first part be merged? It doesn't add a lot by
> itself but it helps pave the way for future changes.
If it cannot, it doesn't matter. I can pick it from here and get my 
series posted with your patches 1-13 applied upfront. This should work, 
I think?

>>>
>>>
>>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>>>
>>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>>>
>>> his experience.
>> Haven't done the full benchmark compared to pre-map at destination yet,
>> though an observation is that the destination QEMU seems very easy to
>> get stuck for very long time while in mid of pinning pages. During this
>> period, any client doing read-only QMP query or executing HMP info
>> command got frozen indefinitely (subject to how large size the memory is
>> being pinned). Is it possible to unblock those QMP request or HMP
>> command from being executed (at least the read-only ones) while in
>> migration? Yield from the load_setup corourtine and spawn another thread?
>>
> Ok, I wasn't aware of that.
>
> I think we cannot yield in a coroutine and wait for an ioctl.
I was wondering if we need a separate coroutine out of the general 
migration path to support this special code without overloading 
load_setup or its callers. For instance, unblock the source from sending 
guest rams while allow destination pin pages in parallel should be 
possible.

Regardless, a separate thread is needed to carry out all the heavy 
lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.


> One
> option that came to my mind is to effectively use another thread, and
> use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> the migration.
Yes, a separate thread is needed anyway.

>   I'm not sure if there are more points where we can
> check the barrier and tell the migration to continue or stop though.
I think there is, for e.g. what if the dma_map fails. There must be a 
check point for that.

>
> Another option is to effectively start doing these ioctls in an
> asynchronous way, io_uring cmds like, but I'd like to achieve this
> first.
Yes, io_uring or any async API could be another option. Though this 
needs new uAPI through additional kernel facility to support. Anyway, 
it's up to you to decide. :)

Regards,
-Siwei
>> Having said, not sure if .load_setup is a good fit for what we want to
>> do. Searching all current users of .load_setup, either the job can be
>> done instantly or the task is time bound without trapping into kernel
>> for too long. Maybe pinning is too special use case here...
>>
>> -Siwei
>>>
>>>
>>> Future directions on top of this series may include:
>>>
>>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>>>
>>>     vhost-vdpa net can apply the configuration through CVQ in the destination
>>>
>>>     while the source is still migrating.
>>>
>>> * Move more things ahead of migration time, like DRIVER_OK.
>>>
>>> * Check that the devices of the destination are valid, and cancel the migration
>>>
>>>     in case it is not.
>>>
>>>
>>>
>>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>
>>>
>>>
>>> Eugenio Pérez (18):
>>>
>>>     vdpa: add VhostVDPAShared
>>>
>>>     vdpa: move iova tree to the shared struct
>>>
>>>     vdpa: move iova_range to vhost_vdpa_shared
>>>
>>>     vdpa: move shadow_data to vhost_vdpa_shared
>>>
>>>     vdpa: use vdpa shared for tracing
>>>
>>>     vdpa: move file descriptor to vhost_vdpa_shared
>>>
>>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>>
>>>     vdpa: move backend_cap to vhost_vdpa_shared
>>>
>>>     vdpa: remove msg type of vhost_vdpa
>>>
>>>     vdpa: move iommu_list to vhost_vdpa_shared
>>>
>>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>>
>>>     vdpa: use dev_shared in vdpa_iommu
>>>
>>>     vdpa: move memory listener to vhost_vdpa_shared
>>>
>>>     vdpa: do not set virtio status bits if unneeded
>>>
>>>     vdpa: add vhost_vdpa_load_setup
>>>
>>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>>
>>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>>
>>>     virtio_net: register incremental migration handlers
>>>
>>>
>>>
>>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>>
>>>    include/net/net.h              |   4 +
>>>
>>>    hw/net/virtio-net.c            |  23 +++++
>>>
>>>    hw/virtio/vdpa-dev.c           |   7 +-
>>>
>>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>>
>>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>>
>>>    hw/virtio/trace-events         |  14 +--
>>>
>>>    7 files changed, 239 insertions(+), 162 deletions(-)
>>>
>>>
>>>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-03 20:19     ` Si-Wei Liu
@ 2023-12-05 14:23       ` Eugenio Perez Martin
  2023-12-06  0:36         ` Si-Wei Liu
  0 siblings, 1 reply; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-12-05 14:23 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> > On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >>> Current memory operations like pinning may take a lot of time at the
> >>>
> >>> destination.  Currently they are done after the source of the migration is
> >>>
> >>> stopped, and before the workload is resumed at the destination.  This is a
> >>>
> >>> period where neigher traffic can flow, nor the VM workload can continue
> >>>
> >>> (downtime).
> >>>
> >>>
> >>>
> >>> We can do better as we know the memory layout of the guest RAM at the
> >>>
> >>> destination from the moment the migration starts.  Moving that operation allows
> >>>
> >>> QEMU to communicate the kernel the maps while the workload is still running in
> >>>
> >>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> >>>
> >>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> >>>
> >>> saving all the pinning time.
> >> I get what you want to say, though not sure how pinning is relevant to
> >> on-chip IOMMU and .set_map here, essentially pinning is required for all
> >> parent vdpa drivers that perform DMA hence don't want VM pages to move
> >> around.
> > Basically highlighting that the work done under .set_map is not only
> > pinning, but it is a significant fraction on it. It can be reworded or
> > deleted for sure.
> >
> >>>
> >>>
> >>> Note that further devices setup at the end of the migration may alter the guest
> >>>
> >>> memory layout. But same as the previous point, many operations are still done
> >>>
> >>> incrementally, like memory pinning, so we're saving time anyway.
> >>>
> >>>
> >>>
> >>> The first bunch of patches just reorganizes the code, so memory related
> >>>
> >>> operation parameters are shared between all vhost_vdpa devices.  This is
> >>>
> >>> because the destination does not know what vhost_vdpa struct will have the
> >>>
> >>> registered listener member, so it is easier to place them in a shared struct
> >>>
> >>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> >>>
> >>> these patches.
> >> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> >> in my SVQ descriptor group series [*], for which I've built similar
> >> construct there. If possible please try to merge this in ASAP. I'll
> >> rework my series on top of that.
> >>
> >> [*]
> >> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >>
> > I can send it individually, for sure.
> >
> > MST, Jason, can this first part be merged? It doesn't add a lot by
> > itself but it helps pave the way for future changes.
> If it cannot, it doesn't matter. I can pick it from here and get my
> series posted with your patches 1-13 applied upfront. This should work,
> I think?
>
> >>>
> >>>
> >>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> >>>
> >>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> >>>
> >>> his experience.
> >> Haven't done the full benchmark compared to pre-map at destination yet,
> >> though an observation is that the destination QEMU seems very easy to
> >> get stuck for very long time while in mid of pinning pages. During this
> >> period, any client doing read-only QMP query or executing HMP info
> >> command got frozen indefinitely (subject to how large size the memory is
> >> being pinned). Is it possible to unblock those QMP request or HMP
> >> command from being executed (at least the read-only ones) while in
> >> migration? Yield from the load_setup corourtine and spawn another thread?
> >>
> > Ok, I wasn't aware of that.
> >
> > I think we cannot yield in a coroutine and wait for an ioctl.
> I was wondering if we need a separate coroutine out of the general
> migration path to support this special code without overloading
> load_setup or its callers. For instance, unblock the source from sending
> guest rams while allow destination pin pages in parallel should be
> possible.
>

Hi Si-Wei,

I'm working on this, I think I'll be able to send a new version soon.
Just a question, when the mapping is done in vhost_vdpa_dev_start as
the current upstream master does, are you able to interact with QMP?

Thanks!

> Regardless, a separate thread is needed to carry out all the heavy
> lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.
>
>
> > One
> > option that came to my mind is to effectively use another thread, and
> > use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> > the migration.
> Yes, a separate thread is needed anyway.
>
> >   I'm not sure if there are more points where we can
> > check the barrier and tell the migration to continue or stop though.
> I think there is, for e.g. what if the dma_map fails. There must be a
> check point for that.
>
> >
> > Another option is to effectively start doing these ioctls in an
> > asynchronous way, io_uring cmds like, but I'd like to achieve this
> > first.
> Yes, io_uring or any async API could be another option. Though this
> needs new uAPI through additional kernel facility to support. Anyway,
> it's up to you to decide. :)
>
> Regards,
> -Siwei
> >> Having said, not sure if .load_setup is a good fit for what we want to
> >> do. Searching all current users of .load_setup, either the job can be
> >> done instantly or the task is time bound without trapping into kernel
> >> for too long. Maybe pinning is too special use case here...
> >>
> >> -Siwei
> >>>
> >>>
> >>> Future directions on top of this series may include:
> >>>
> >>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
> >>>
> >>>     vhost-vdpa net can apply the configuration through CVQ in the destination
> >>>
> >>>     while the source is still migrating.
> >>>
> >>> * Move more things ahead of migration time, like DRIVER_OK.
> >>>
> >>> * Check that the devices of the destination are valid, and cancel the migration
> >>>
> >>>     in case it is not.
> >>>
> >>>
> >>>
> >>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>
> >>>
> >>>
> >>> Eugenio Pérez (18):
> >>>
> >>>     vdpa: add VhostVDPAShared
> >>>
> >>>     vdpa: move iova tree to the shared struct
> >>>
> >>>     vdpa: move iova_range to vhost_vdpa_shared
> >>>
> >>>     vdpa: move shadow_data to vhost_vdpa_shared
> >>>
> >>>     vdpa: use vdpa shared for tracing
> >>>
> >>>     vdpa: move file descriptor to vhost_vdpa_shared
> >>>
> >>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>>
> >>>     vdpa: move backend_cap to vhost_vdpa_shared
> >>>
> >>>     vdpa: remove msg type of vhost_vdpa
> >>>
> >>>     vdpa: move iommu_list to vhost_vdpa_shared
> >>>
> >>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>>
> >>>     vdpa: use dev_shared in vdpa_iommu
> >>>
> >>>     vdpa: move memory listener to vhost_vdpa_shared
> >>>
> >>>     vdpa: do not set virtio status bits if unneeded
> >>>
> >>>     vdpa: add vhost_vdpa_load_setup
> >>>
> >>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>>
> >>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>>
> >>>     virtio_net: register incremental migration handlers
> >>>
> >>>
> >>>
> >>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>>
> >>>    include/net/net.h              |   4 +
> >>>
> >>>    hw/net/virtio-net.c            |  23 +++++
> >>>
> >>>    hw/virtio/vdpa-dev.c           |   7 +-
> >>>
> >>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>>
> >>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>>
> >>>    hw/virtio/trace-events         |  14 +--
> >>>
> >>>    7 files changed, 239 insertions(+), 162 deletions(-)
> >>>
> >>>
> >>>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-12-05 14:23       ` Eugenio Perez Martin
@ 2023-12-06  0:36         ` Si-Wei Liu
  0 siblings, 0 replies; 34+ messages in thread
From: Si-Wei Liu @ 2023-12-06  0:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar



On 12/5/2023 6:23 AM, Eugenio Perez Martin wrote:
> On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
>>> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>
>>>>> destination.  Currently they are done after the source of the migration is
>>>>>
>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>
>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>
>>>>> (downtime).
>>>>>
>>>>>
>>>>>
>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>
>>>>> destination from the moment the migration starts.  Moving that operation allows
>>>>>
>>>>> QEMU to communicate the kernel the maps while the workload is still running in
>>>>>
>>>>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>>>>>
>>>>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>>>>>
>>>>> saving all the pinning time.
>>>> I get what you want to say, though not sure how pinning is relevant to
>>>> on-chip IOMMU and .set_map here, essentially pinning is required for all
>>>> parent vdpa drivers that perform DMA hence don't want VM pages to move
>>>> around.
>>> Basically highlighting that the work done under .set_map is not only
>>> pinning, but it is a significant fraction on it. It can be reworded or
>>> deleted for sure.
>>>
>>>>>
>>>>> Note that further devices setup at the end of the migration may alter the guest
>>>>>
>>>>> memory layout. But same as the previous point, many operations are still done
>>>>>
>>>>> incrementally, like memory pinning, so we're saving time anyway.
>>>>>
>>>>>
>>>>>
>>>>> The first bunch of patches just reorganizes the code, so memory related
>>>>>
>>>>> operation parameters are shared between all vhost_vdpa devices.  This is
>>>>>
>>>>> because the destination does not know what vhost_vdpa struct will have the
>>>>>
>>>>> registered listener member, so it is easier to place them in a shared struct
>>>>>
>>>>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>>>>>
>>>>> these patches.
>>>> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
>>>> in my SVQ descriptor group series [*], for which I've built similar
>>>> construct there. If possible please try to merge this in ASAP. I'll
>>>> rework my series on top of that.
>>>>
>>>> [*]
>>>> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>>>>
>>> I can send it individually, for sure.
>>>
>>> MST, Jason, can this first part be merged? It doesn't add a lot by
>>> itself but it helps pave the way for future changes.
>> If it cannot, it doesn't matter. I can pick it from here and get my
>> series posted with your patches 1-13 applied upfront. This should work,
>> I think?
>>
>>>>>
>>>>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>>>>>
>>>>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>>>>>
>>>>> his experience.
>>>> Haven't done the full benchmark compared to pre-map at destination yet,
>>>> though an observation is that the destination QEMU seems very easy to
>>>> get stuck for very long time while in mid of pinning pages. During this
>>>> period, any client doing read-only QMP query or executing HMP info
>>>> command got frozen indefinitely (subject to how large size the memory is
>>>> being pinned). Is it possible to unblock those QMP request or HMP
>>>> command from being executed (at least the read-only ones) while in
>>>> migration? Yield from the load_setup corourtine and spawn another thread?
>>>>
>>> Ok, I wasn't aware of that.
>>>
>>> I think we cannot yield in a coroutine and wait for an ioctl.
>> I was wondering if we need a separate coroutine out of the general
>> migration path to support this special code without overloading
>> load_setup or its callers. For instance, unblock the source from sending
>> guest rams while allow destination pin pages in parallel should be
>> possible.
>>
> Hi Si-Wei,
>
> I'm working on this, I think I'll be able to send a new version soon.
> Just a question, when the mapping is done in vhost_vdpa_dev_start as
> the current upstream master does, are you able to interact with QMP?
Hi Eugenio,

Yes, the latest version works pretty well! Did not get to all of the QMP 
commands, but at least I can do read-only QMP without a problem. That is 
able to address our typical usages. Thanks for the prompt fix!

I've rebased my series on top the .load_setup series instead of the top 
13 patches for 9.0, as there are some other dependent patches from this 
series to avoid duplicate work. Am debugging some problems I ran into 
after the code merge. Once they are sorted out I'll post my patch series 
soon!

Thanks,
-Siwei



>
> Thanks!
>
>> Regardless, a separate thread is needed to carry out all the heavy
>> lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.
>>
>>
>>> One
>>> option that came to my mind is to effectively use another thread, and
>>> use a POSIX barrier (or equivalent on glib / QEMU) before finishing
>>> the migration.
>> Yes, a separate thread is needed anyway.
>>
>>>    I'm not sure if there are more points where we can
>>> check the barrier and tell the migration to continue or stop though.
>> I think there is, for e.g. what if the dma_map fails. There must be a
>> check point for that.
>>
>>> Another option is to effectively start doing these ioctls in an
>>> asynchronous way, io_uring cmds like, but I'd like to achieve this
>>> first.
>> Yes, io_uring or any async API could be another option. Though this
>> needs new uAPI through additional kernel facility to support. Anyway,
>> it's up to you to decide. :)
>>
>> Regards,
>> -Siwei
>>>> Having said, not sure if .load_setup is a good fit for what we want to
>>>> do. Searching all current users of .load_setup, either the job can be
>>>> done instantly or the task is time bound without trapping into kernel
>>>> for too long. Maybe pinning is too special use case here...
>>>>
>>>> -Siwei
>>>>>
>>>>> Future directions on top of this series may include:
>>>>>
>>>>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>>>>>
>>>>>      vhost-vdpa net can apply the configuration through CVQ in the destination
>>>>>
>>>>>      while the source is still migrating.
>>>>>
>>>>> * Move more things ahead of migration time, like DRIVER_OK.
>>>>>
>>>>> * Check that the devices of the destination are valid, and cancel the migration
>>>>>
>>>>>      in case it is not.
>>>>>
>>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>>>
>>>>>
>>>>>
>>>>> Eugenio Pérez (18):
>>>>>
>>>>>      vdpa: add VhostVDPAShared
>>>>>
>>>>>      vdpa: move iova tree to the shared struct
>>>>>
>>>>>      vdpa: move iova_range to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move shadow_data to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: use vdpa shared for tracing
>>>>>
>>>>>      vdpa: move file descriptor to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move backend_cap to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: remove msg type of vhost_vdpa
>>>>>
>>>>>      vdpa: move iommu_list to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>>>>
>>>>>      vdpa: use dev_shared in vdpa_iommu
>>>>>
>>>>>      vdpa: move memory listener to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: do not set virtio status bits if unneeded
>>>>>
>>>>>      vdpa: add vhost_vdpa_load_setup
>>>>>
>>>>>      vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>>>>
>>>>>      vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>>>>
>>>>>      virtio_net: register incremental migration handlers
>>>>>
>>>>>
>>>>>
>>>>>     include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>>>>
>>>>>     include/net/net.h              |   4 +
>>>>>
>>>>>     hw/net/virtio-net.c            |  23 +++++
>>>>>
>>>>>     hw/virtio/vdpa-dev.c           |   7 +-
>>>>>
>>>>>     hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>>>>
>>>>>     net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>>>>
>>>>>     hw/virtio/trace-events         |  14 +--
>>>>>
>>>>>     7 files changed, 239 insertions(+), 162 deletions(-)
>>>>>
>>>>>
>>>>>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-02 12:37   ` Eugenio Perez Martin
  2023-11-03 20:19     ` Si-Wei Liu
@ 2023-11-06  4:17     ` Jason Wang
  1 sibling, 0 replies; 34+ messages in thread
From: Jason Wang @ 2023-11-06  4:17 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Si-Wei Liu, qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar

On Thu, Nov 2, 2023 at 8:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > > Current memory operations like pinning may take a lot of time at the
> > >
> > > destination.  Currently they are done after the source of the migration is
> > >
> > > stopped, and before the workload is resumed at the destination.  This is a
> > >
> > > period where neigher traffic can flow, nor the VM workload can continue
> > >
> > > (downtime).
> > >
> > >
> > >
> > > We can do better as we know the memory layout of the guest RAM at the
> > >
> > > destination from the moment the migration starts.  Moving that operation allows
> > >
> > > QEMU to communicate the kernel the maps while the workload is still running in
> > >
> > > the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> > >
> > > but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> > >
> > > saving all the pinning time.
> > I get what you want to say, though not sure how pinning is relevant to
> > on-chip IOMMU and .set_map here, essentially pinning is required for all
> > parent vdpa drivers that perform DMA hence don't want VM pages to move
> > around.
>
> Basically highlighting that the work done under .set_map is not only
> pinning, but it is a significant fraction on it. It can be reworded or
> deleted for sure.
>
> > >
> > >
> > >
> > > Note that further devices setup at the end of the migration may alter the guest
> > >
> > > memory layout. But same as the previous point, many operations are still done
> > >
> > > incrementally, like memory pinning, so we're saving time anyway.
> > >
> > >
> > >
> > > The first bunch of patches just reorganizes the code, so memory related
> > >
> > > operation parameters are shared between all vhost_vdpa devices.  This is
> > >
> > > because the destination does not know what vhost_vdpa struct will have the
> > >
> > > registered listener member, so it is easier to place them in a shared struct
> > >
> > > rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> > >
> > > these patches.
> > It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> > in my SVQ descriptor group series [*], for which I've built similar
> > construct there. If possible please try to merge this in ASAP. I'll
> > rework my series on top of that.
> >
> > [*]
> > https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >
>
> I can send it individually, for sure.
>
> MST, Jason, can this first part be merged? It doesn't add a lot by
> itself but it helps pave the way for future changes.
>

I'm fine with that, or it might be easier if you resend it as a
standalone series.

Thanks



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-02 10:12 ` Si-Wei Liu
  2023-11-02 12:37   ` Eugenio Perez Martin
@ 2023-11-03 20:40   ` Si-Wei Liu
  2023-11-06  9:04     ` Eugenio Perez Martin
  1 sibling, 1 reply; 34+ messages in thread
From: Si-Wei Liu @ 2023-11-03 20:40 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Shannon, Parav Pandit, Stefano Garzarella, Michael S. Tsirkin,
	yin31149, Jason Wang, Yajun Wu, Zhu Lingshan, Lei Yang,
	Dragos Tatulea, Juan Quintela, Laurent Vivier, Gautam Dawar



On 11/2/2023 3:12 AM, Si-Wei Liu wrote:
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>> Current memory operations like pinning may take a lot of time at the
>>
>> destination.  Currently they are done after the source of the 
>> migration is
>>
>> stopped, and before the workload is resumed at the destination. This 
>> is a
>>
>> period where neigher traffic can flow, nor the VM workload can continue
>>
>> (downtime).
>>
>>
>>
>> We can do better as we know the memory layout of the guest RAM at the
>>
>> destination from the moment the migration starts.  Moving that 
>> operation allows
>>
>> QEMU to communicate the kernel the maps while the workload is still 
>> running in
>>
>> the source, so Linux can start mapping them.  Ideally, all IOMMU is 
>> configured,
>>
>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're 
>> still
>>
>> saving all the pinning time.
> I get what you want to say, though not sure how pinning is relevant to 
> on-chip IOMMU and .set_map here, essentially pinning is required for 
> all parent vdpa drivers that perform DMA hence don't want VM pages to 
> move around.
>>
>>
>>
>> Note that further devices setup at the end of the migration may alter 
>> the guest
>>
>> memory layout. But same as the previous point, many operations are 
>> still done
>>
>> incrementally, like memory pinning, so we're saving time anyway.
>>
>>
>>
>> The first bunch of patches just reorganizes the code, so memory related
>>
>> operation parameters are shared between all vhost_vdpa devices. This is
>>
>> because the destination does not know what vhost_vdpa struct will 
>> have the
>>
>> registered listener member, so it is easier to place them in a shared 
>> struct
>>
>> rather to keep them in vhost_vdpa struct.  Future version may squash 
>> or omit
>>
>> these patches.
> It looks this VhostVDPAShared facility (patch 1-13) is also what I 
> need in my SVQ descriptor group series [*], for which I've built 
> similar construct there. If possible please try to merge this in ASAP. 
> I'll rework my series on top of that.
>
> [*] 
> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>
>>
>>
>>
>> Only tested with vdpa_sim. I'm sending this before full benchmark, as 
>> some work
>>
>> like [1] can be based on it, and Si-Wei agreed on benchmark this 
>> series with
>>
>> his experience.
> Haven't done the full benchmark compared to pre-map at destination yet,
Hi Eugenio,

I just notice one thing that affects the performance benchmark for this 
series in terms of migration total_time (to be fair, it's mlx5_vdpa 
specific). It looks like iotlb map batching is not acked (via 
vhost_vdpa_set_backend_cap) at the point of vhost-vdpa_load_setup, 
effectively causing quite extensive time spent on hundreds of dma_map 
calls from listener_register().  While the equivalent code had been 
implemented in my destination pre-map patch [1]. Although I can 
benchmark the current patchset by remove batching from my code, I guess 
that's not the goal of this benchmark, right?

If would be the best to have map batching in place, so benchmark for 
both options could match. What do you think?

Thanks,
-Siwei

[1]
https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639

> though an observation is that the destination QEMU seems very easy to 
> get stuck for very long time while in mid of pinning pages. During 
> this period, any client doing read-only QMP query or executing HMP 
> info command got frozen indefinitely (subject to how large size the 
> memory is being pinned). Is it possible to unblock those QMP request 
> or HMP command from being executed (at least the read-only ones) while 
> in migration? Yield from the load_setup corourtine and spawn another 
> thread?
>
> Having said, not sure if .load_setup is a good fit for what we want to 
> do. Searching all current users of .load_setup, either the job can be 
> done instantly or the task is time bound without trapping into kernel 
> for too long. Maybe pinning is too special use case here...
>
> -Siwei
>>
>>
>>
>> Future directions on top of this series may include:
>>
>> * Iterative migration of virtio-net devices, as it may reduce 
>> downtime per [1].
>>
>>    vhost-vdpa net can apply the configuration through CVQ in the 
>> destination
>>
>>    while the source is still migrating.
>>
>> * Move more things ahead of migration time, like DRIVER_OK.
>>
>> * Check that the devices of the destination are valid, and cancel the 
>> migration
>>
>>    in case it is not.
>>
>>
>>
>> [1] 
>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>
>>
>>
>> Eugenio Pérez (18):
>>
>>    vdpa: add VhostVDPAShared
>>
>>    vdpa: move iova tree to the shared struct
>>
>>    vdpa: move iova_range to vhost_vdpa_shared
>>
>>    vdpa: move shadow_data to vhost_vdpa_shared
>>
>>    vdpa: use vdpa shared for tracing
>>
>>    vdpa: move file descriptor to vhost_vdpa_shared
>>
>>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>
>>    vdpa: move backend_cap to vhost_vdpa_shared
>>
>>    vdpa: remove msg type of vhost_vdpa
>>
>>    vdpa: move iommu_list to vhost_vdpa_shared
>>
>>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>
>>    vdpa: use dev_shared in vdpa_iommu
>>
>>    vdpa: move memory listener to vhost_vdpa_shared
>>
>>    vdpa: do not set virtio status bits if unneeded
>>
>>    vdpa: add vhost_vdpa_load_setup
>>
>>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>
>>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>
>>    virtio_net: register incremental migration handlers
>>
>>
>>
>>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>
>>   include/net/net.h              |   4 +
>>
>>   hw/net/virtio-net.c            |  23 +++++
>>
>>   hw/virtio/vdpa-dev.c           |   7 +-
>>
>>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>
>>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>
>>   hw/virtio/trace-events         |  14 +--
>>
>>   7 files changed, 239 insertions(+), 162 deletions(-)
>>
>>
>>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
  2023-11-03 20:40   ` Si-Wei Liu
@ 2023-11-06  9:04     ` Eugenio Perez Martin
  0 siblings, 0 replies; 34+ messages in thread
From: Eugenio Perez Martin @ 2023-11-06  9:04 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Shannon, Parav Pandit, Stefano Garzarella,
	Michael S. Tsirkin, yin31149, Jason Wang, Yajun Wu, Zhu Lingshan,
	Lei Yang, Dragos Tatulea, Juan Quintela, Laurent Vivier,
	Gautam Dawar

On Fri, Nov 3, 2023 at 9:41 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 3:12 AM, Si-Wei Liu wrote:
> >
> >
> > On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >>
> >> destination.  Currently they are done after the source of the
> >> migration is
> >>
> >> stopped, and before the workload is resumed at the destination. This
> >> is a
> >>
> >> period where neigher traffic can flow, nor the VM workload can continue
> >>
> >> (downtime).
> >>
> >>
> >>
> >> We can do better as we know the memory layout of the guest RAM at the
> >>
> >> destination from the moment the migration starts.  Moving that
> >> operation allows
> >>
> >> QEMU to communicate the kernel the maps while the workload is still
> >> running in
> >>
> >> the source, so Linux can start mapping them.  Ideally, all IOMMU is
> >> configured,
> >>
> >> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're
> >> still
> >>
> >> saving all the pinning time.
> > I get what you want to say, though not sure how pinning is relevant to
> > on-chip IOMMU and .set_map here, essentially pinning is required for
> > all parent vdpa drivers that perform DMA hence don't want VM pages to
> > move around.
> >>
> >>
> >>
> >> Note that further devices setup at the end of the migration may alter
> >> the guest
> >>
> >> memory layout. But same as the previous point, many operations are
> >> still done
> >>
> >> incrementally, like memory pinning, so we're saving time anyway.
> >>
> >>
> >>
> >> The first bunch of patches just reorganizes the code, so memory related
> >>
> >> operation parameters are shared between all vhost_vdpa devices. This is
> >>
> >> because the destination does not know what vhost_vdpa struct will
> >> have the
> >>
> >> registered listener member, so it is easier to place them in a shared
> >> struct
> >>
> >> rather to keep them in vhost_vdpa struct.  Future version may squash
> >> or omit
> >>
> >> these patches.
> > It looks this VhostVDPAShared facility (patch 1-13) is also what I
> > need in my SVQ descriptor group series [*], for which I've built
> > similar construct there. If possible please try to merge this in ASAP.
> > I'll rework my series on top of that.
> >
> > [*]
> > https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >
> >>
> >>
> >>
> >> Only tested with vdpa_sim. I'm sending this before full benchmark, as
> >> some work
> >>
> >> like [1] can be based on it, and Si-Wei agreed on benchmark this
> >> series with
> >>
> >> his experience.
> > Haven't done the full benchmark compared to pre-map at destination yet,
> Hi Eugenio,
>
> I just notice one thing that affects the performance benchmark for this
> series in terms of migration total_time (to be fair, it's mlx5_vdpa
> specific). It looks like iotlb map batching is not acked (via
> vhost_vdpa_set_backend_cap) at the point of vhost-vdpa_load_setup,
> effectively causing quite extensive time spent on hundreds of dma_map
> calls from listener_register().  While the equivalent code had been
> implemented in my destination pre-map patch [1]. Although I can
> benchmark the current patchset by remove batching from my code, I guess
> that's not the goal of this benchmark, right?
>
> If would be the best to have map batching in place, so benchmark for
> both options could match. What do you think?
>

This is definitely a bug I need to fix too, thanks for catching it!
:). But I'm not sure how this happens, as net_init_vhost_vdpa should
be called before .load_setup. I'll take a deeper look, thanks!

> Thanks,
> -Siwei
>
> [1]
> https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639
>
> > though an observation is that the destination QEMU seems very easy to
> > get stuck for very long time while in mid of pinning pages. During
> > this period, any client doing read-only QMP query or executing HMP
> > info command got frozen indefinitely (subject to how large size the
> > memory is being pinned). Is it possible to unblock those QMP request
> > or HMP command from being executed (at least the read-only ones) while
> > in migration? Yield from the load_setup corourtine and spawn another
> > thread?
> >
> > Having said, not sure if .load_setup is a good fit for what we want to
> > do. Searching all current users of .load_setup, either the job can be
> > done instantly or the task is time bound without trapping into kernel
> > for too long. Maybe pinning is too special use case here...
> >
> > -Siwei
> >>
> >>
> >>
> >> Future directions on top of this series may include:
> >>
> >> * Iterative migration of virtio-net devices, as it may reduce
> >> downtime per [1].
> >>
> >>    vhost-vdpa net can apply the configuration through CVQ in the
> >> destination
> >>
> >>    while the source is still migrating.
> >>
> >> * Move more things ahead of migration time, like DRIVER_OK.
> >>
> >> * Check that the devices of the destination are valid, and cancel the
> >> migration
> >>
> >>    in case it is not.
> >>
> >>
> >>
> >> [1]
> >> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>
> >>
> >>
> >> Eugenio Pérez (18):
> >>
> >>    vdpa: add VhostVDPAShared
> >>
> >>    vdpa: move iova tree to the shared struct
> >>
> >>    vdpa: move iova_range to vhost_vdpa_shared
> >>
> >>    vdpa: move shadow_data to vhost_vdpa_shared
> >>
> >>    vdpa: use vdpa shared for tracing
> >>
> >>    vdpa: move file descriptor to vhost_vdpa_shared
> >>
> >>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>
> >>    vdpa: move backend_cap to vhost_vdpa_shared
> >>
> >>    vdpa: remove msg type of vhost_vdpa
> >>
> >>    vdpa: move iommu_list to vhost_vdpa_shared
> >>
> >>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>
> >>    vdpa: use dev_shared in vdpa_iommu
> >>
> >>    vdpa: move memory listener to vhost_vdpa_shared
> >>
> >>    vdpa: do not set virtio status bits if unneeded
> >>
> >>    vdpa: add vhost_vdpa_load_setup
> >>
> >>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>
> >>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>
> >>    virtio_net: register incremental migration handlers
> >>
> >>
> >>
> >>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>
> >>   include/net/net.h              |   4 +
> >>
> >>   hw/net/virtio-net.c            |  23 +++++
> >>
> >>   hw/virtio/vdpa-dev.c           |   7 +-
> >>
> >>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>
> >>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>
> >>   hw/virtio/trace-events         |  14 +--
> >>
> >>   7 files changed, 239 insertions(+), 162 deletions(-)
> >>
> >>
> >>
> >
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-12-06  0:37 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-19 14:34 [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 01/18] vdpa: add VhostVDPAShared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 02/18] vdpa: move iova tree to the shared struct Eugenio Pérez
2023-11-02  9:36   ` Si-Wei Liu
2023-11-24 17:11     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 03/18] vdpa: move iova_range to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 04/18] vdpa: move shadow_data " Eugenio Pérez
2023-11-02  8:47   ` Si-Wei Liu
2023-11-02 15:45     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 05/18] vdpa: use vdpa shared for tracing Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 06/18] vdpa: move file descriptor to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 07/18] vdpa: move iotlb_batch_begin_sent " Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 08/18] vdpa: move backend_cap " Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 09/18] vdpa: remove msg type of vhost_vdpa Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 10/18] vdpa: move iommu_list to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 11/18] vdpa: use VhostVDPAShared in vdpa_dma_map and unmap Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 12/18] vdpa: use dev_shared in vdpa_iommu Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 13/18] vdpa: move memory listener to vhost_vdpa_shared Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 14/18] vdpa: do not set virtio status bits if unneeded Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup Eugenio Pérez
2023-11-02  8:48   ` Si-Wei Liu
2023-11-02 15:24     ` Eugenio Perez Martin
2023-10-19 14:34 ` [RFC PATCH 16/18] vdpa: add vhost_vdpa_net_load_setup NetClient callback Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 17/18] vdpa: use shadow_data instead of first device v->shadow_vqs_enabled Eugenio Pérez
2023-10-19 14:34 ` [RFC PATCH 18/18] virtio_net: register incremental migration handlers Eugenio Pérez
2023-11-02  4:36 ` [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration Jason Wang
2023-11-02 10:12 ` Si-Wei Liu
2023-11-02 12:37   ` Eugenio Perez Martin
2023-11-03 20:19     ` Si-Wei Liu
2023-12-05 14:23       ` Eugenio Perez Martin
2023-12-06  0:36         ` Si-Wei Liu
2023-11-06  4:17     ` Jason Wang
2023-11-03 20:40   ` Si-Wei Liu
2023-11-06  9:04     ` Eugenio Perez Martin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).