qemu-devel.nongnu.org archive mirror
* [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
@ 2025-05-07 18:46 Jonah Palmer
  2025-05-07 18:46 ` [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start Jonah Palmer
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

Memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration
is stopped and before the workload is resumed at the destination.  This
is a period where neither traffic can flow nor the VM workload can
continue (downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment all devices are initialized.  Moving that
operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running on the source, so Linux can start
mapping them.
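
At the uAPI level, "communicating the maps to the kernel" means sending
VHOST_IOTLB_UPDATE messages on the vhost-vdpa device fd.  A minimal
sketch, assuming only <linux/vhost_types.h> and an already-open device
fd (the helper name and error handling are illustrative, not code from
this series):

    #include <stdint.h>
    #include <unistd.h>
    #include <linux/vhost_types.h>  /* struct vhost_msg_v2, VHOST_IOTLB_* */

    /*
     * Ask the vhost-vdpa kernel side to map (and therefore pin) one chunk
     * of guest memory: iova -> uaddr, of the given size.
     */
    static int vdpa_dma_map(int vdpa_fd, uint64_t iova, uint64_t size,
                            uint64_t uaddr, uint8_t perm)
    {
        struct vhost_msg_v2 msg = {
            .type = VHOST_IOTLB_MSG_V2,
            .iotlb = {
                .iova  = iova,
                .size  = size,
                .uaddr = uaddr,
                .perm  = perm,              /* e.g. VHOST_ACCESS_RW */
                .type  = VHOST_IOTLB_UPDATE,
            },
        };

        return write(vdpa_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
    }

QEMU issues the equivalent from its memory listener callbacks; this
series only changes when those updates start to be sent.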

As a small drawback, there is a window during initialization where QEMU
cannot respond to QMP and the like.  In some testing, this window is
about 0.2 seconds.  It may be further reduced (or increased) depending
on the vdpa driver and the platform hardware, and it is dominated by the
cost of memory pinning.

This roughly matches the time that is moved out of the so-called
downtime window.  The downtime is measured by checking trace timestamps,
from the moment the source suspends the device to the moment the
destination starts the eighth and last virtqueue pair.  For a 39G guest,
it goes from ~2.2526 secs to ~2.0949 secs, an improvement of ~0.16 secs
that is in line with the ~0.2 secs now spent at initialization.

Future directions on top of this series may include moving more things
ahead of the migration time, like setting DRIVER_OK or performing actual
iterative migration of virtio-net devices.

Comments are welcome.

This series is a different approach from series [1].  As the title no
longer reflects the changes, please refer to the previous one for the
series history.

This series is based on [2] and must be applied after it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v4:
---
* Add memory listener unregistration to vhost_vdpa_reset_device.
* Remove memory listener unregistration from vhost_vdpa_reset_status.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.

[1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/

Jonah - note: I'll be on vacation from May 10-19. Will respond to
              comments when I return.

Eugenio Pérez (7):
  vdpa: check for iova tree initialized at net_client_start
  vdpa: reorder vhost_vdpa_set_backend_cap
  vdpa: set backend capabilities at vhost_vdpa_init
  vdpa: add listener_registered
  vdpa: reorder listener assignment
  vdpa: move iova_tree allocation to net_vhost_vdpa_init
  vdpa: move memory listener register to vhost_vdpa_init

 hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
 include/hw/virtio/vhost-vdpa.h |  22 ++++++-
 net/vhost-vdpa.c               |  34 +----------
 3 files changed, 93 insertions(+), 70 deletions(-)

-- 
2.43.5




* [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  1:52   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap Jonah Palmer
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

To map the guest memory while it is migrating we need to create the
iova_tree, as long as the destination uses x-svq=on.  Check for it first
so it is not overridden.

The function vhost_vdpa_net_client_stop clears it if the device is
stopped.  If the guest starts the device again, the iova tree is
recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start
if needed, so the old behavior is kept.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 net/vhost-vdpa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 7ca8b46eee..decb826868 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -366,7 +366,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 
     migration_add_notifier(&s->migration_state,
                            vdpa_net_migration_state_notifier);
-    if (v->shadow_vqs_enabled) {
+
+    /* iova_tree may be initialized by vhost_vdpa_net_load_setup */
+    if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
         v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
                                                    v->shared->iova_range.last);
     }
-- 
2.43.5




* [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
  2025-05-07 18:46 ` [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  1:53   ` Jason Wang
  2025-05-16  1:56   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 3/7] vdpa: set backend capabilities at vhost_vdpa_init Jonah Palmer
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

It will be used directly by vhost_vdpa_init.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-vdpa.c | 60 +++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1ab2c11fa8..6b242ca56a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -594,6 +594,36 @@ static void vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v)
     v->shadow_vqs = g_steal_pointer(&shadow_vqs);
 }
 
+static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+
+    uint64_t features;
+    uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
+        0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
+        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
+        0x1ULL << VHOST_BACKEND_F_SUSPEND;
+    int r;
+
+    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
+        return -EFAULT;
+    }
+
+    features &= f;
+
+    if (vhost_vdpa_first_dev(dev)) {
+        r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
+        if (r) {
+            return -EFAULT;
+        }
+    }
+
+    dev->backend_cap = features;
+    v->shared->backend_cap = features;
+
+    return 0;
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
     struct vhost_vdpa *v = opaque;
@@ -841,36 +871,6 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
     return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
 }
 
-static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
-{
-    struct vhost_vdpa *v = dev->opaque;
-
-    uint64_t features;
-    uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
-        0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
-        0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
-        0x1ULL << VHOST_BACKEND_F_SUSPEND;
-    int r;
-
-    if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
-        return -EFAULT;
-    }
-
-    features &= f;
-
-    if (vhost_vdpa_first_dev(dev)) {
-        r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
-        if (r) {
-            return -EFAULT;
-        }
-    }
-
-    dev->backend_cap = features;
-    v->shared->backend_cap = features;
-
-    return 0;
-}
-
 static int vhost_vdpa_get_device_id(struct vhost_dev *dev,
                                     uint32_t *device_id)
 {
-- 
2.43.5




* [PATCH v4 3/7] vdpa: set backend capabilities at vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
  2025-05-07 18:46 ` [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start Jonah Palmer
  2025-05-07 18:46 ` [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  1:57   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 4/7] vdpa: add listener_registered Jonah Palmer
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

The backend does not reset the backend capabilities until the vdpa file
descriptor is closed, so there is no harm in setting them only once.

This allows the destination of a live migration to premap memory in
batches, using VHOST_BACKEND_F_IOTLB_BATCH.
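
For reference, "premap memory in batches" means wrapping many IOTLB
updates between VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END
messages, so the vdpa backend commits the whole set at once instead of
after every update.  A minimal sketch against <linux/vhost_types.h>
(helper name and error handling are illustrative, not code from this
patch):

    #include <stddef.h>
    #include <unistd.h>
    #include <linux/vhost_types.h>  /* struct vhost_msg_v2, VHOST_IOTLB_* */

    /*
     * Send a set of already-filled VHOST_IOTLB_UPDATE messages wrapped in
     * a single batch, so the vhost-vdpa kernel side maps them in one go.
     */
    static int vdpa_map_batch(int vdpa_fd,
                              const struct vhost_msg_v2 *updates, size_t n)
    {
        struct vhost_msg_v2 ctl = { .type = VHOST_IOTLB_MSG_V2 };

        ctl.iotlb.type = VHOST_IOTLB_BATCH_BEGIN;
        if (write(vdpa_fd, &ctl, sizeof(ctl)) != sizeof(ctl)) {
            return -1;
        }

        for (size_t i = 0; i < n; i++) {  /* each .iotlb.type is UPDATE */
            if (write(vdpa_fd, &updates[i], sizeof(updates[i])) !=
                sizeof(updates[i])) {
                return -1;
            }
        }

        ctl.iotlb.type = VHOST_IOTLB_BATCH_END;
        return write(vdpa_fd, &ctl, sizeof(ctl)) == sizeof(ctl) ? 0 : -1;
    }

QEMU's memory listener does the same wrapping when the capability is
negotiated (see iotlb_batch_begin_sent in vhost_vdpa_shared).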

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-vdpa.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 6b242ca56a..e9826ede2c 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -634,6 +634,12 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     v->dev = dev;
     dev->opaque =  opaque ;
     v->shared->listener = vhost_vdpa_memory_listener;
+
+    ret = vhost_vdpa_set_backend_cap(dev);
+    if (unlikely(ret != 0)) {
+        return ret;
+    }
+
     vhost_vdpa_init_svq(dev, v);
 
     error_propagate(&dev->migration_blocker, v->migration_blocker);
@@ -1563,7 +1569,6 @@ const VhostOps vdpa_ops = {
         .vhost_set_vring_kick = vhost_vdpa_set_vring_kick,
         .vhost_set_vring_call = vhost_vdpa_set_vring_call,
         .vhost_get_features = vhost_vdpa_get_features,
-        .vhost_set_backend_cap = vhost_vdpa_set_backend_cap,
         .vhost_set_owner = vhost_vdpa_set_owner,
         .vhost_set_vring_endian = NULL,
         .vhost_backend_memslots_limit = vhost_vdpa_memslots_limit,
-- 
2.43.5




* [PATCH v4 4/7] vdpa: add listener_registered
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (2 preceding siblings ...)
  2025-05-07 18:46 ` [PATCH v4 3/7] vdpa: set backend capabilities at vhost_vdpa_init Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  2:00   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 5/7] vdpa: reorder listener assignment Jonah Palmer
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

Track whether the listener has been registered, so the device start path
only registers it again when needed.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-vdpa.c         | 7 ++++++-
 include/hw/virtio/vhost-vdpa.h | 6 ++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e9826ede2c..450f68f117 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1379,7 +1379,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        if (!v->shared->listener_registered) {
+            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+            v->shared->listener_registered = true;
+        }
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1399,6 +1402,8 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
     memory_listener_unregister(&v->shared->listener);
+    v->shared->listener_registered = false;
+
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 0a9575b469..221840987e 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -51,6 +51,12 @@ typedef struct vhost_vdpa_shared {
 
     bool iotlb_batch_begin_sent;
 
+    /*
+     * The memory listener has been registered, so DMA maps have been sent to
+     * the device.
+     */
+    bool listener_registered;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 
-- 
2.43.5




* [PATCH v4 5/7] vdpa: reorder listener assignment
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (3 preceding siblings ...)
  2025-05-07 18:46 ` [PATCH v4 4/7] vdpa: add listener_registered Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  2:01   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init Jonah Palmer
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

Since commit f6fe3e333f ("vdpa: move memory listener to
vhost_vdpa_shared") this piece of code repeatedly assigns the
shared->listener members.  This was not a problem, as the listener was
not used until device start.

However, the next patches move the listener registration to this
vhost_vdpa_init function.  Once the listener is registered it is linked
into an embedded linked list, so assigning its members again would
corrupt the linked list node.

Do the right thing and only set it for the first vdpa device.
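
The corruption is the classic intrusive-list hazard: when the listener
is registered it is linked into QEMU's listener lists through link
fields embedded in the MemoryListener itself, so a later whole-struct
assignment overwrites those links while the node is still queued.  A
self-contained illustration of the failure mode with a generic node
(same pattern, not QEMU's actual MemoryListener):

    #include <stdio.h>

    /* A node that embeds its own list linkage, like MemoryListener does. */
    struct node {
        int id;
        struct node *prev, *next;    /* embedded linkage */
    };

    static struct node head = { .prev = &head, .next = &head };

    static void node_register(struct node *n)
    {
        n->prev = head.prev;
        n->next = &head;
        head.prev->next = n;
        head.prev = n;
    }

    int main(void)
    {
        static const struct node tmpl = { .id = 42 };  /* links are NULL */
        struct node n;

        n = tmpl;           /* fine: not registered yet */
        node_register(&n);
        n = tmpl;           /* bug: clobbers prev/next while &n is queued */

        printf("n.next = %p (list traversal from head would now crash)\n",
               (void *)n.next);
        return 0;
    }

Reassigning v->shared->listener on every vhost_vdpa_init call becomes
that second assignment once the listener is registered during init,
hence setting it only once, for the first device.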

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 450f68f117..de834f2ebd 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -633,7 +633,6 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 
     v->dev = dev;
     dev->opaque =  opaque ;
-    v->shared->listener = vhost_vdpa_memory_listener;
 
     ret = vhost_vdpa_set_backend_cap(dev);
     if (unlikely(ret != 0)) {
@@ -675,6 +674,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
 
+    v->shared->listener = vhost_vdpa_memory_listener;
     return 0;
 }
 
-- 
2.43.5




* [PATCH v4 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (4 preceding siblings ...)
  2025-05-07 18:46 ` [PATCH v4 5/7] vdpa: reorder listener assignment Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-16  2:07   ` Jason Wang
  2025-05-07 18:46 ` [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

As we are moving to keep the mapping for the whole vdpa device lifetime,
instead of resetting it at VirtIO reset, we need to move all of its
dependencies to the initialization too.  In particular, devices with
x-svq=on need a valid iova_tree from the beginning.

This also simplifies the code by consolidating the two creation points:
the first data vq when SVQ is active, and CVQ start when only CVQ uses
it.

Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 include/hw/virtio/vhost-vdpa.h | 16 ++++++++++++++-
 net/vhost-vdpa.c               | 36 +++-------------------------------
 2 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 221840987e..449bf5c840 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -43,7 +43,21 @@ typedef struct vhost_vdpa_shared {
     struct vhost_vdpa_iova_range iova_range;
     QLIST_HEAD(, vdpa_iommu) iommu_list;
 
-    /* IOVA mapping used by the Shadow Virtqueue */
+    /*
+     * IOVA mapping used by the Shadow Virtqueue
+     *
+     * It is shared among all ASID for simplicity, whether CVQ shares ASID with
+     * guest or not:
+     * - Memory listener need access to guest's memory addresses allocated in
+     *   the IOVA tree.
+     * - There should be plenty of IOVA address space for both ASID not to
+     *   worry about collisions between them.  Guest's translations are still
+     *   validated with virtio virtqueue_pop so there is no risk for the guest
+     *   to access memory that it shouldn't.
+     *
+     * To allocate a iova tree per ASID is doable but it complicates the code
+     * and it is not worth it for the moment.
+     */
     VhostIOVATree *iova_tree;
 
     /* Copy of backend features */
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index decb826868..58d738945d 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -235,6 +235,7 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
         return;
     }
     qemu_close(s->vhost_vdpa.shared->device_fd);
+    g_clear_pointer(&s->vhost_vdpa.shared->iova_tree, vhost_iova_tree_delete);
     g_free(s->vhost_vdpa.shared);
 }
 
@@ -362,16 +363,8 @@ static int vdpa_net_migration_state_notifier(NotifierWithReturn *notifier,
 
 static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 {
-    struct vhost_vdpa *v = &s->vhost_vdpa;
-
     migration_add_notifier(&s->migration_state,
                            vdpa_net_migration_state_notifier);
-
-    /* iova_tree may be initialized by vhost_vdpa_net_load_setup */
-    if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
-        v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
-                                                   v->shared->iova_range.last);
-    }
 }
 
 static int vhost_vdpa_net_data_start(NetClientState *nc)
@@ -418,19 +411,12 @@ static int vhost_vdpa_net_data_load(NetClientState *nc)
 static void vhost_vdpa_net_client_stop(NetClientState *nc)
 {
     VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-    struct vhost_dev *dev;
 
     assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
     if (s->vhost_vdpa.index == 0) {
         migration_remove_notifier(&s->migration_state);
     }
-
-    dev = s->vhost_vdpa.dev;
-    if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-        g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
-                        vhost_iova_tree_delete);
-    }
 }
 
 static NetClientInfo net_vhost_vdpa_info = {
@@ -602,24 +588,6 @@ out:
         return 0;
     }
 
-    /*
-     * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
-     * whether CVQ shares ASID with guest or not, because:
-     * - Memory listener need access to guest's memory addresses allocated in
-     *   the IOVA tree.
-     * - There should be plenty of IOVA address space for both ASID not to
-     *   worry about collisions between them.  Guest's translations are still
-     *   validated with virtio virtqueue_pop so there is no risk for the guest
-     *   to access memory that it shouldn't.
-     *
-     * To allocate a iova tree per ASID is doable but it complicates the code
-     * and it is not worth it for the moment.
-     */
-    if (!v->shared->iova_tree) {
-        v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
-                                                   v->shared->iova_range.last);
-    }
-
     r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
                                vhost_vdpa_net_cvq_cmd_page_len(), false);
     if (unlikely(r < 0)) {
@@ -1728,6 +1696,8 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
         s->vhost_vdpa.shared->device_fd = vdpa_device_fd;
         s->vhost_vdpa.shared->iova_range = iova_range;
         s->vhost_vdpa.shared->shadow_data = svq;
+        s->vhost_vdpa.shared->iova_tree = vhost_iova_tree_new(iova_range.first,
+                                                              iova_range.last);
     } else if (!is_datapath) {
         s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
                                      PROT_READ | PROT_WRITE,
-- 
2.43.5




* [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (5 preceding siblings ...)
  2025-05-07 18:46 ` [PATCH v4 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init Jonah Palmer
@ 2025-05-07 18:46 ` Jonah Palmer
  2025-05-15  5:42   ` Michael S. Tsirkin
  2025-05-14  1:42 ` [PATCH v4 0/7] Move " Lei Yang
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-05-07 18:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: jonah.palmer, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

From: Eugenio Pérez <eperezma@redhat.com>

Memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration
is stopped and before the workload is resumed at the destination.  This
is a period where neither traffic can flow nor the VM workload can
continue (downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment all devices are initialized.  Moving that
operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running on the source, so Linux can start
mapping them.

As a small drawback, there is a window during initialization where QEMU
cannot respond to QMP and the like.  In some testing, this window is
about 0.2 seconds.  It may be further reduced (or increased) depending
on the vdpa driver and the platform hardware, and it is dominated by the
cost of memory pinning.

This roughly matches the time that is moved out of the so-called
downtime window.  The downtime is measured by checking trace timestamps,
from the moment the source suspends the device to the moment the
destination starts the eighth and last virtqueue pair.  For a 39G guest,
it goes from ~2.2526 secs to ~2.0949 secs.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>

v3:
---
Move memory listener unregistration from vhost_vdpa_reset_status to
vhost_vdpa_reset_device. By unregistering the listener here, we can
guarantee that every reset leaves the device in an expected state.
Also remove the duplicate call in vhost_vdpa_reset_status.

Reported-by: Lei Yang <leiyang@redhat.com>
Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>

--
v2:
Move the memory listener registration to the vhost_vdpa_set_owner
function.  In the case of hotplugging the vdpa device, the memory is
already set up, and leaving the memory listener register call in the
init function made maps occur before the set owner call.

To be 100% safe, let's put it right after the set_owner call.

Reported-by: Lei Yang <leiyang@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index de834f2ebd..e20da95f30 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -894,8 +894,14 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
 
     ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
     trace_vhost_vdpa_reset_device(dev);
+    if (ret) {
+        return ret;
+    }
+
+    memory_listener_unregister(&v->shared->listener);
+    v->shared->listener_registered = false;
     v->suspended = false;
-    return ret;
+    return 0;
 }
 
 static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
@@ -1379,6 +1385,11 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
+        if (v->shared->listener_registered &&
+            dev->vdev->dma_as != v->shared->listener.address_space) {
+            memory_listener_unregister(&v->shared->listener);
+            v->shared->listener_registered = false;
+        }
         if (!v->shared->listener_registered) {
             memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
             v->shared->listener_registered = true;
@@ -1392,8 +1403,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 
 static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
-    struct vhost_vdpa *v = dev->opaque;
-
     if (!vhost_vdpa_last_dev(dev)) {
         return;
     }
@@ -1401,9 +1410,6 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_reset_device(dev);
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
-    memory_listener_unregister(&v->shared->listener);
-    v->shared->listener_registered = false;
-
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1537,12 +1543,27 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
 
 static int vhost_vdpa_set_owner(struct vhost_dev *dev)
 {
+    int r;
+    struct vhost_vdpa *v;
+
     if (!vhost_vdpa_first_dev(dev)) {
         return 0;
     }
 
     trace_vhost_vdpa_set_owner(dev);
-    return vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
+    r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    /*
+     * Being optimistic and listening address space memory. If the device
+     * uses vIOMMU, it is changed at vhost_vdpa_dev_start.
+     */
+    v = dev->opaque;
+    memory_listener_register(&v->shared->listener, &address_space_memory);
+    v->shared->listener_registered = true;
+    return 0;
 }
 
 static int vhost_vdpa_vq_get_addr(struct vhost_dev *dev,
-- 
2.43.5




* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (6 preceding siblings ...)
  2025-05-07 18:46 ` [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init Jonah Palmer
@ 2025-05-14  1:42 ` Lei Yang
  2025-05-14 15:49 ` Eugenio Perez Martin
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Lei Yang @ 2025-05-14  1:42 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, jasowang, lvivier, dtatulea,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

Passed the vhost_vdpa device regression tests.

Tested-by: Lei Yang <leiyang@redhat.com>

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration
> is stopped and before the workload is resumed at the destination.  This
> is a period where neither traffic can flow nor the VM workload can
> continue (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment all devices are initialized.  Moving that
> operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running on the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a window during initialization where QEMU
> cannot respond to QMP and the like.  In some testing, this window is
> about 0.2 seconds.  It may be further reduced (or increased) depending
> on the vdpa driver and the platform hardware, and it is dominated by
> the cost of memory pinning.
>
> This roughly matches the time that is moved out of the so-called
> downtime window.  The downtime is measured by checking trace
> timestamps, from the moment the source suspends the device to the
> moment the destination starts the eighth and last virtqueue pair.  For
> a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
>
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach of series [1]. As the title does not
> reflect the changes anymore, please refer to the previous one to know the
> series history.
>
> This series is based on [2], it must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>               comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>  include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>  net/vhost-vdpa.c               |  34 +----------
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>




* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (7 preceding siblings ...)
  2025-05-14  1:42 ` [PATCH v4 0/7] Move " Lei Yang
@ 2025-05-14 15:49 ` Eugenio Perez Martin
  2025-05-15  0:17   ` Si-Wei Liu
  2025-05-20 13:27   ` Jonah Palmer
  2025-05-14 23:00 ` Si-Wei Liu
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-05-14 15:49 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, peterx, mst, jasowang, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration
> is stopped and before the workload is resumed at the destination.  This
> is a period where neither traffic can flow nor the VM workload can
> continue (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment all devices are initialized.  Moving that
> operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running on the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a window during initialization where QEMU
> cannot respond to QMP and the like.  In some testing, this window is
> about 0.2 seconds.  It may be further reduced (or increased) depending
> on the vdpa driver and the platform hardware, and it is dominated by
> the cost of memory pinning.
>
> This roughly matches the time that is moved out of the so-called
> downtime window.  The downtime is measured by checking trace
> timestamps, from the moment the source suspends the device to the
> moment the destination starts the eighth and last virtqueue pair.  For
> a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
>

Hi Jonah,

Could you update this benchmark? I don't think it has changed a lot, but
just to keep it as up to date as possible.

I think I cannot ack the series as I sent the first revision. Jason or
Si-Wei, could you ack it?

Thanks!

> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach of series [1]. As the title does not
> reflect the changes anymore, please refer to the previous one to know the
> series history.
>
> This series is based on [2], it must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>               comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>  include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>  net/vhost-vdpa.c               |  34 +----------
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>




* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (8 preceding siblings ...)
  2025-05-14 15:49 ` Eugenio Perez Martin
@ 2025-05-14 23:00 ` Si-Wei Liu
  2025-05-16  1:47 ` Jason Wang
  2025-05-16  1:51 ` Jason Wang
  11 siblings, 0 replies; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-14 23:00 UTC (permalink / raw)
  To: Jonah Palmer, qemu-devel
  Cc: eperezma, peterx, mst, jasowang, lvivier, dtatulea, leiyang,
	parav, sgarzare, lingshan.zhu, boris.ostrovsky

For the series:

Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>

On 5/7/2025 11:46 AM, Jonah Palmer wrote:
> Memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration
> is stopped and before the workload is resumed at the destination.  This
> is a period where neither traffic can flow nor the VM workload can
> continue (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment all devices are initialized.  Moving that
> operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running on the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a window during initialization where QEMU
> cannot respond to QMP and the like.  In some testing, this window is
> about 0.2 seconds.  It may be further reduced (or increased) depending
> on the vdpa driver and the platform hardware, and it is dominated by
> the cost of memory pinning.
>
> This roughly matches the time that is moved out of the so-called
> downtime window.  The downtime is measured by checking trace
> timestamps, from the moment the source suspends the device to the
> moment the destination starts the eighth and last virtqueue pair.  For
> a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
>
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach of series [1]. As the title does not
> reflect the changes anymore, please refer to the previous one to know the
> series history.
>
> This series is based on [2], it must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>                comments when I return.
>
> Eugenio Pérez (7):
>    vdpa: check for iova tree initialized at net_client_start
>    vdpa: reorder vhost_vdpa_set_backend_cap
>    vdpa: set backend capabilities at vhost_vdpa_init
>    vdpa: add listener_registered
>    vdpa: reorder listener assignment
>    vdpa: move iova_tree allocation to net_vhost_vdpa_init
>    vdpa: move memory listener register to vhost_vdpa_init
>
>   hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>   include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>   net/vhost-vdpa.c               |  34 +----------
>   3 files changed, 93 insertions(+), 70 deletions(-)
>




* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-14 15:49 ` Eugenio Perez Martin
@ 2025-05-15  0:17   ` Si-Wei Liu
  2025-05-15  5:43     ` Michael S. Tsirkin
                       ` (2 more replies)
  2025-05-20 13:27   ` Jonah Palmer
  1 sibling, 3 replies; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-15  0:17 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jonah Palmer
  Cc: qemu-devel, peterx, mst, jasowang, lvivier, dtatulea, leiyang,
	parav, sgarzare, lingshan.zhu, boris.ostrovsky

Hi Eugenio,

On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>> Memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration
>> is stopped and before the workload is resumed at the destination.  This
>> is a period where neither traffic can flow nor the VM workload can
>> continue (downtime).
>>
>> We can do better, as we know the memory layout of the guest RAM at the
>> destination from the moment all devices are initialized.  Moving that
>> operation earlier allows QEMU to communicate the maps to the kernel
>> while the workload is still running on the source, so Linux can start
>> mapping them.
>>
>> As a small drawback, there is a window during initialization where QEMU
>> cannot respond to QMP and the like.  In some testing, this window is
>> about 0.2 seconds.  It may be further reduced (or increased) depending
>> on the vdpa driver and the platform hardware, and it is dominated by
>> the cost of memory pinning.
>>
>> This roughly matches the time that is moved out of the so-called
>> downtime window.  The downtime is measured by checking trace
>> timestamps, from the moment the source suspends the device to the
>> moment the destination starts the eighth and last virtqueue pair.  For
>> a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
>>
> Hi Jonah,
>
> Could you update this benchmark? I don't think it changed a lot but
> just to be as updated as possible.
Jonah is off this week and won't be back until next Tuesday, but I recall
he did some downtime tests with a VM with 128GB of memory before taking
off, which showed an obvious improvement from around 10 seconds to 5.8
seconds after applying this series. Since this only concerns an update to
the cover letter, would it be okay for you and Jason to ack now and then
proceed to Michael for the upcoming merge?

>
> I think I cannot ack the series as I sent the first revision. Jason or
> Si-Wei, could you ack it?
Sure, I just gave my R-b; this series looks good to me. Hopefully Jason
can ack it on his own.

Thanks!
-Siwei

>
> Thanks!
>
>> Future directions on top of this series may include to move more things ahead
>> of the migration time, like set DRIVER_OK or perform actual iterative migration
>> of virtio-net devices.
>>
>> Comments are welcome.
>>
>> This series is a different approach of series [1]. As the title does not
>> reflect the changes anymore, please refer to the previous one to know the
>> series history.
>>
>> This series is based on [2], it must be applied after it.
>>
>> [Jonah Palmer]
>> This series was rebased after [3] was pulled in, as [3] was a prerequisite
>> fix for this series.
>>
>> v4:
>> ---
>> * Add memory listener unregistration to vhost_vdpa_reset_device.
>> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>>
>> v3:
>> ---
>> * Rebase
>>
>> v2:
>> ---
>> * Move the memory listener registration to vhost_vdpa_set_owner function.
>> * Move the iova_tree allocation to net_vhost_vdpa_init.
>>
>> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>>
>> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
>> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>>
>> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>>                comments when I return.
>>
>> Eugenio Pérez (7):
>>    vdpa: check for iova tree initialized at net_client_start
>>    vdpa: reorder vhost_vdpa_set_backend_cap
>>    vdpa: set backend capabilities at vhost_vdpa_init
>>    vdpa: add listener_registered
>>    vdpa: reorder listener assignment
>>    vdpa: move iova_tree allocation to net_vhost_vdpa_init
>>    vdpa: move memory listener register to vhost_vdpa_init
>>
>>   hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>>   include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>>   net/vhost-vdpa.c               |  34 +----------
>>   3 files changed, 93 insertions(+), 70 deletions(-)
>>
>> --
>> 2.43.5
>>




* Re: [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 ` [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init Jonah Palmer
@ 2025-05-15  5:42   ` Michael S. Tsirkin
  2025-05-15 17:36     ` Si-Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2025-05-15  5:42 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

On Wed, May 07, 2025 at 02:46:47PM -0400, Jonah Palmer wrote:
> From: Eugenio Pérez <eperezma@redhat.com>
> 
> Memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration
> is stopped and before the workload is resumed at the destination.  This
> is a period where neither traffic can flow nor the VM workload can
> continue (downtime).
>
> We can do better, as we know the memory layout of the guest RAM at the
> destination from the moment all devices are initialized.  Moving that
> operation earlier allows QEMU to communicate the maps to the kernel
> while the workload is still running on the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a window during initialization where QEMU
> cannot respond to QMP and the like.  In some testing, this window is
> about 0.2 seconds.  It may be further reduced (or increased) depending
> on the vdpa driver and the platform hardware, and it is dominated by
> the cost of memory pinning.
>
> This roughly matches the time that is moved out of the so-called
> downtime window.  The downtime is measured by checking trace
> timestamps, from the moment the source suspends the device to the
> moment the destination starts the eighth and last virtqueue pair.  For
> a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> 
> v3:
> ---

Just know that everything beyond this line is not going into the git
commit log.


> Move memory listener unregistration from vhost_vdpa_reset_status to
> vhost_vdpa_reset_device. By unregistering the listener here, we can
> guarantee that every reset leaves the device in an expected state.
> Also remove the duplicate call in vhost_vdpa_reset_status.
> 
> Reported-by: Lei Yang <leiyang@redhat.com>
> Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
> 
> --
> v2:
> Move the memory listener registration to vhost_vdpa_set_owner function.
> In case of hotplug the vdpa device, the memory is already set up, and
> leaving memory listener register call in the init function made maps
> occur before set owner call.
> 
> To be 100% safe, let's put it right after set_owner call.
> 
> Reported-by: Lei Yang <leiyang@redhat.com>
> ---
>  hw/virtio/vhost-vdpa.c | 35 ++++++++++++++++++++++++++++-------
>  1 file changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index de834f2ebd..e20da95f30 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -894,8 +894,14 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>  
>      ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>      trace_vhost_vdpa_reset_device(dev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    memory_listener_unregister(&v->shared->listener);
> +    v->shared->listener_registered = false;
>      v->suspended = false;
> -    return ret;
> +    return 0;
>  }
>  
>  static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
> @@ -1379,6 +1385,11 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>                           "IOMMU and try again");
>              return -1;
>          }
> +        if (v->shared->listener_registered &&
> +            dev->vdev->dma_as != v->shared->listener.address_space) {
> +            memory_listener_unregister(&v->shared->listener);
> +            v->shared->listener_registered = false;
> +        }
>          if (!v->shared->listener_registered) {
>              memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
>              v->shared->listener_registered = true;
> @@ -1392,8 +1403,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>  
>  static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>  {
> -    struct vhost_vdpa *v = dev->opaque;
> -
>      if (!vhost_vdpa_last_dev(dev)) {
>          return;
>      }
> @@ -1401,9 +1410,6 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>      vhost_vdpa_reset_device(dev);
>      vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>                                 VIRTIO_CONFIG_S_DRIVER);
> -    memory_listener_unregister(&v->shared->listener);
> -    v->shared->listener_registered = false;
> -
>  }
>  
>  static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
> @@ -1537,12 +1543,27 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
>  
>  static int vhost_vdpa_set_owner(struct vhost_dev *dev)
>  {
> +    int r;
> +    struct vhost_vdpa *v;
> +
>      if (!vhost_vdpa_first_dev(dev)) {
>          return 0;
>      }
>  
>      trace_vhost_vdpa_set_owner(dev);
> -    return vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
> +    r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
> +    if (unlikely(r < 0)) {
> +        return r;
> +    }
> +
> +    /*
> +     * Being optimistic and listening address space memory. If the device
> +     * uses vIOMMU, it is changed at vhost_vdpa_dev_start.
> +     */
> +    v = dev->opaque;
> +    memory_listener_register(&v->shared->listener, &address_space_memory);
> +    v->shared->listener_registered = true;
> +    return 0;
>  }
>  
>  static int vhost_vdpa_vq_get_addr(struct vhost_dev *dev,
> -- 
> 2.43.5




* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-15  0:17   ` Si-Wei Liu
@ 2025-05-15  5:43     ` Michael S. Tsirkin
  2025-05-15 17:41       ` Si-Wei Liu
  2025-05-15  8:30     ` Eugenio Perez Martin
  2025-05-16  1:49     ` Jason Wang
  2 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2025-05-15  5:43 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, Jonah Palmer, qemu-devel, peterx, jasowang,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:
> Hi Eugenio,
> 
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > Memory operations like pinning may take a lot of time at the
> > > destination.  Currently they are done after the source of the migration
> > > is stopped and before the workload is resumed at the destination.  This
> > > is a period where neither traffic can flow nor the VM workload can
> > > continue (downtime).
> > >
> > > We can do better, as we know the memory layout of the guest RAM at the
> > > destination from the moment all devices are initialized.  Moving that
> > > operation earlier allows QEMU to communicate the maps to the kernel
> > > while the workload is still running on the source, so Linux can start
> > > mapping them.
> > >
> > > As a small drawback, there is a window during initialization where QEMU
> > > cannot respond to QMP and the like.  In some testing, this window is
> > > about 0.2 seconds.  It may be further reduced (or increased) depending
> > > on the vdpa driver and the platform hardware, and it is dominated by
> > > the cost of memory pinning.
> > >
> > > This roughly matches the time that is moved out of the so-called
> > > downtime window.  The downtime is measured by checking trace
> > > timestamps, from the moment the source suspends the device to the
> > > moment the destination starts the eighth and last virtqueue pair.  For
> > > a 39G guest, it goes from ~2.2526 secs to ~2.0949 secs.
> > > 
> > Hi Jonah,
> > 
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back until next Tuesday, but I recall he
> indeed did some downtime test with VM with 128GB memory before taking off,
> which shows obvious improvement from around 10 seconds to 5.8 seconds after
> applying this series. Since this is related to update on the cover letter,
> would it be okay for you and Jason to ack now and then proceed to Michael
> for upcoming merge?
> 
> > 
> > I think I cannot ack the series as I sent the first revision. Jason or
> > Si-Wei, could you ack it?
> Sure, I just give my R-b, this series look good to me. Hopefully Jason can
> ack on his own.
> 
> Thanks!
> -Siwei

I just sent a pull, next one in a week or two, so - no rush.


> > 
> > Thanks!
> > 
> > > Future directions on top of this series may include to move more things ahead
> > > of the migration time, like set DRIVER_OK or perform actual iterative migration
> > > of virtio-net devices.
> > > 
> > > Comments are welcome.
> > > 
> > > This series is a different approach of series [1]. As the title does not
> > > reflect the changes anymore, please refer to the previous one to know the
> > > series history.
> > > 
> > > This series is based on [2], it must be applied after it.
> > > 
> > > [Jonah Palmer]
> > > This series was rebased after [3] was pulled in, as [3] was a prerequisite
> > > fix for this series.
> > > 
> > > v4:
> > > ---
> > > * Add memory listener unregistration to vhost_vdpa_reset_device.
> > > * Remove memory listener unregistration from vhost_vdpa_reset_status.
> > > 
> > > v3:
> > > ---
> > > * Rebase
> > > 
> > > v2:
> > > ---
> > > * Move the memory listener registration to vhost_vdpa_set_owner function.
> > > * Move the iova_tree allocation to net_vhost_vdpa_init.
> > > 
> > > v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > > 
> > > [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> > > [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > > [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
> > > 
> > > Jonah - note: I'll be on vacation from May 10-19. Will respond to
> > >                comments when I return.
> > > 
> > > Eugenio Pérez (7):
> > >    vdpa: check for iova tree initialized at net_client_start
> > >    vdpa: reorder vhost_vdpa_set_backend_cap
> > >    vdpa: set backend capabilities at vhost_vdpa_init
> > >    vdpa: add listener_registered
> > >    vdpa: reorder listener assignment
> > >    vdpa: move iova_tree allocation to net_vhost_vdpa_init
> > >    vdpa: move memory listener register to vhost_vdpa_init
> > > 
> > >   hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
> > >   include/hw/virtio/vhost-vdpa.h |  22 ++++++-
> > >   net/vhost-vdpa.c               |  34 +----------
> > >   3 files changed, 93 insertions(+), 70 deletions(-)
> > > 
> > > --
> > > 2.43.5
> > > 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-15  0:17   ` Si-Wei Liu
  2025-05-15  5:43     ` Michael S. Tsirkin
@ 2025-05-15  8:30     ` Eugenio Perez Martin
  2025-05-16  1:49     ` Jason Wang
  2 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-05-15  8:30 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jonah Palmer, qemu-devel, peterx, mst, jasowang, lvivier,
	dtatulea, leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky

On Thu, May 15, 2025 at 2:17 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Hi Eugenio,
>
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >> destination.  Currently they are done after the source of the migration is
> >> stopped, and before the workload is resumed at the destination.  This is a
> >> period where neigher traffic can flow, nor the VM workload can continue
> >> (downtime).
> >>
> >> We can do better as we know the memory layout of the guest RAM at the
> >> destination from the moment that all devices are initializaed.  So
> >> moving that operation allows QEMU to communicate the kernel the maps
> >> while the workload is still running in the source, so Linux can start
> >> mapping them.
> >>
> >> As a small drawback, there is a time in the initialization where QEMU
> >> cannot respond to QMP etc.  By some testing, this time is about
> >> 0.2seconds.  This may be further reduced (or increased) depending on the
> >> vdpa driver and the platform hardware, and it is dominated by the cost
> >> of memory pinning.
> >>
> >> This matches the time that we move out of the called downtime window.
> >> The downtime is measured as checking the trace timestamp from the moment
> >> the source suspend the device to the moment the destination starts the
> >> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> >> secs to 2.0949.
> >>
> > Hi Jonah,
> >
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back until next Tuesday, but I recall
> he indeed did some downtime test with VM with 128GB memory before taking
> off, which shows obvious improvement from around 10 seconds to 5.8
> seconds after applying this series. Since this is related to update on
> the cover letter, would it be okay for you and Jason to ack now and then
> proceed to Michael for upcoming merge?
>

Oh yes, that's what I meant; I should have been more explicit about that :).


> >
> > I think I cannot ack the series as I sent the first revision. Jason or
> > Si-Wei, could you ack it?
> Sure, I just give my R-b, this series look good to me. Hopefully Jason
> can ack on his own.
>
> Thanks!
> -Siwei
>
> >
> > Thanks!
> >
> >> Future directions on top of this series may include to move more things ahead
> >> of the migration time, like set DRIVER_OK or perform actual iterative migration
> >> of virtio-net devices.
> >>
> >> Comments are welcome.
> >>
> >> This series is a different approach of series [1]. As the title does not
> >> reflect the changes anymore, please refer to the previous one to know the
> >> series history.
> >>
> >> This series is based on [2], it must be applied after it.
> >>
> >> [Jonah Palmer]
> >> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> >> fix for this series.
> >>
> >> v4:
> >> ---
> >> * Add memory listener unregistration to vhost_vdpa_reset_device.
> >> * Remove memory listener unregistration from vhost_vdpa_reset_status.
> >>
> >> v3:
> >> ---
> >> * Rebase
> >>
> >> v2:
> >> ---
> >> * Move the memory listener registration to vhost_vdpa_set_owner function.
> >> * Move the iova_tree allocation to net_vhost_vdpa_init.
> >>
> >> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> >>
> >> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> >> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> >> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
> >>
> >> Jonah - note: I'll be on vacation from May 10-19. Will respond to
> >>                comments when I return.
> >>
> >> Eugenio Pérez (7):
> >>    vdpa: check for iova tree initialized at net_client_start
> >>    vdpa: reorder vhost_vdpa_set_backend_cap
> >>    vdpa: set backend capabilities at vhost_vdpa_init
> >>    vdpa: add listener_registered
> >>    vdpa: reorder listener assignment
> >>    vdpa: move iova_tree allocation to net_vhost_vdpa_init
> >>    vdpa: move memory listener register to vhost_vdpa_init
> >>
> >>   hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
> >>   include/hw/virtio/vhost-vdpa.h |  22 ++++++-
> >>   net/vhost-vdpa.c               |  34 +----------
> >>   3 files changed, 93 insertions(+), 70 deletions(-)
> >>
> >> --
> >> 2.43.5
> >>
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init
  2025-05-15  5:42   ` Michael S. Tsirkin
@ 2025-05-15 17:36     ` Si-Wei Liu
  2025-05-20 13:23       ` Jonah Palmer
  0 siblings, 1 reply; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-15 17:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky



On 5/14/2025 10:42 PM, Michael S. Tsirkin wrote:
> On Wed, May 07, 2025 at 02:46:47PM -0400, Jonah Palmer wrote:
>> From: Eugenio Pérez <eperezma@redhat.com>
>>
>> Current memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration is
>> stopped, and before the workload is resumed at the destination.  This is a
>> period where neigher traffic can flow, nor the VM workload can continue
>> (downtime).
>>
>> We can do better as we know the memory layout of the guest RAM at the
>> destination from the moment that all devices are initializaed.  So
>> moving that operation allows QEMU to communicate the kernel the maps
>> while the workload is still running in the source, so Linux can start
>> mapping them.
>>
>> As a small drawback, there is a time in the initialization where QEMU
>> cannot respond to QMP etc.  By some testing, this time is about
>> 0.2seconds.  This may be further reduced (or increased) depending on the
>> vdpa driver and the platform hardware, and it is dominated by the cost
>> of memory pinning.
>>
>> This matches the time that we move out of the called downtime window.
>> The downtime is measured as checking the trace timestamp from the moment
>> the source suspend the device to the moment the destination starts the
>> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
>> secs to 2.0949.
>>
>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>
>> v3:
>> ---
> just know that everything beyond this line is not going into
> git commit log.
I guess that was the intent? It should be fine without them in the
commit log, I think. These are interim notes that capture what was
changed to fix specific bugs *in previously posted versions*.

(Having said that, please help edit the log and remove the "v3:" line,
which should go after the --- separator line, thx!)
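
For reference, a rough sketch of the conventional patch layout (the
subject and diffstat below are taken from this patch, the rest is
illustrative): everything between the subject and the first "---" line
ends up in the commit message, while version notes placed after the
"---", next to the diffstat, are dropped by git am:

    vdpa: move memory listener register to vhost_vdpa_init

    <commit message body>

    Signed-off-by: ...
    ---
    v4/v3/v2 change notes go here; git am discards them

     hw/virtio/vhost-vdpa.c | 35 ++++++++++++++++++++++++++++-------
     1 file changed, 28 insertions(+), 7 deletions(-)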

-Siwei

>
>
>> Move memory listener unregistration from vhost_vdpa_reset_status to
>> vhost_vdpa_reset_device. By unregistering the listener here, we can
>> guarantee that every reset leaves the device in an expected state.
>> Also remove the duplicate call in vhost_vdpa_reset_status.
>>
>> Reported-by: Lei Yang <leiyang@redhat.com>
>> Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>
>> --
>> v2:
>> Move the memory listener registration to vhost_vdpa_set_owner function.
>> In case of hotplug the vdpa device, the memory is already set up, and
>> leaving memory listener register call in the init function made maps
>> occur before set owner call.
>>
>> To be 100% safe, let's put it right after set_owner call.
>>
>> Reported-by: Lei Yang <leiyang@redhat.com>
>> ---
>>   hw/virtio/vhost-vdpa.c | 35 ++++++++++++++++++++++++++++-------
>>   1 file changed, 28 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index de834f2ebd..e20da95f30 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -894,8 +894,14 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
>>   
>>       ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>       trace_vhost_vdpa_reset_device(dev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    memory_listener_unregister(&v->shared->listener);
>> +    v->shared->listener_registered = false;
>>       v->suspended = false;
>> -    return ret;
>> +    return 0;
>>   }
>>   
>>   static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>> @@ -1379,6 +1385,11 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>                            "IOMMU and try again");
>>               return -1;
>>           }
>> +        if (v->shared->listener_registered &&
>> +            dev->vdev->dma_as != v->shared->listener.address_space) {
>> +            memory_listener_unregister(&v->shared->listener);
>> +            v->shared->listener_registered = false;
>> +        }
>>           if (!v->shared->listener_registered) {
>>               memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
>>               v->shared->listener_registered = true;
>> @@ -1392,8 +1403,6 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>   
>>   static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>>   {
>> -    struct vhost_vdpa *v = dev->opaque;
>> -
>>       if (!vhost_vdpa_last_dev(dev)) {
>>           return;
>>       }
>> @@ -1401,9 +1410,6 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>>       vhost_vdpa_reset_device(dev);
>>       vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>                                  VIRTIO_CONFIG_S_DRIVER);
>> -    memory_listener_unregister(&v->shared->listener);
>> -    v->shared->listener_registered = false;
>> -
>>   }
>>   
>>   static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
>> @@ -1537,12 +1543,27 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
>>   
>>   static int vhost_vdpa_set_owner(struct vhost_dev *dev)
>>   {
>> +    int r;
>> +    struct vhost_vdpa *v;
>> +
>>       if (!vhost_vdpa_first_dev(dev)) {
>>           return 0;
>>       }
>>   
>>       trace_vhost_vdpa_set_owner(dev);
>> -    return vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
>> +    r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
>> +    if (unlikely(r < 0)) {
>> +        return r;
>> +    }
>> +
>> +    /*
>> +     * Being optimistic and listening address space memory. If the device
>> +     * uses vIOMMU, it is changed at vhost_vdpa_dev_start.
>> +     */
>> +    v = dev->opaque;
>> +    memory_listener_register(&v->shared->listener, &address_space_memory);
>> +    v->shared->listener_registered = true;
>> +    return 0;
>>   }
>>   
>>   static int vhost_vdpa_vq_get_addr(struct vhost_dev *dev,
>> -- 
>> 2.43.5
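
A rough summary of the listener lifecycle that results from the diff
above, in comment form (a simplified sketch using only the function and
field names that appear in the hunks, not the exact code):

/*
 * vhost_vdpa_set_owner():
 *     after VHOST_SET_OWNER succeeds,
 *     memory_listener_register(&v->shared->listener,
 *                              &address_space_memory) and set
 *     listener_registered = true, so guest RAM is mapped (and pinned)
 *     already at device initialization.
 *
 * vhost_vdpa_dev_start(started == true):
 *     if the listener is registered but dev->vdev->dma_as is a
 *     different address space (vIOMMU), unregister and re-register it
 *     on dev->vdev->dma_as.
 *
 * vhost_vdpa_reset_device():
 *     memory_listener_unregister(&v->shared->listener) and clear
 *     listener_registered, so every reset leaves the device in a known
 *     state with no stale mappings.
 */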



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-15  5:43     ` Michael S. Tsirkin
@ 2025-05-15 17:41       ` Si-Wei Liu
  2025-05-16 10:45         ` Michael S. Tsirkin
  0 siblings, 1 reply; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-15 17:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eugenio Perez Martin, Jonah Palmer, qemu-devel, peterx, jasowang,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky



On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote:
> On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:
>> Hi Eugenio,
>>
>> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
>>> On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>> Current memory operations like pinning may take a lot of time at the
>>>> destination.  Currently they are done after the source of the migration is
>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>> (downtime).
>>>>
>>>> We can do better as we know the memory layout of the guest RAM at the
>>>> destination from the moment that all devices are initializaed.  So
>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>> while the workload is still running in the source, so Linux can start
>>>> mapping them.
>>>>
>>>> As a small drawback, there is a time in the initialization where QEMU
>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>> 0.2seconds.  This may be further reduced (or increased) depending on the
>>>> vdpa driver and the platform hardware, and it is dominated by the cost
>>>> of memory pinning.
>>>>
>>>> This matches the time that we move out of the called downtime window.
>>>> The downtime is measured as checking the trace timestamp from the moment
>>>> the source suspend the device to the moment the destination starts the
>>>> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
>>>> secs to 2.0949.
>>>>
>>> Hi Jonah,
>>>
>>> Could you update this benchmark? I don't think it changed a lot but
>>> just to be as updated as possible.
>> Jonah is off this week and will be back until next Tuesday, but I recall he
>> indeed did some downtime test with VM with 128GB memory before taking off,
>> which shows obvious improvement from around 10 seconds to 5.8 seconds after
>> applying this series. Since this is related to update on the cover letter,
>> would it be okay for you and Jason to ack now and then proceed to Michael
>> for upcoming merge?
>>
>>> I think I cannot ack the series as I sent the first revision. Jason or
>>> Si-Wei, could you ack it?
>> Sure, I just give my R-b, this series look good to me. Hopefully Jason can
>> ack on his own.
>>
>> Thanks!
>> -Siwei
> I just sent a pull, next one in a week or two, so - no rush.
All right, should be good to wait. In any case you have to repost a v2 
PULL, hope this series can be piggy-back'ed as we did extensive tests 
about it. ;-)

-Siwei

>
>
>>> Thanks!
>>>
>>>> Future directions on top of this series may include to move more things ahead
>>>> of the migration time, like set DRIVER_OK or perform actual iterative migration
>>>> of virtio-net devices.
>>>>
>>>> Comments are welcome.
>>>>
>>>> This series is a different approach of series [1]. As the title does not
>>>> reflect the changes anymore, please refer to the previous one to know the
>>>> series history.
>>>>
>>>> This series is based on [2], it must be applied after it.
>>>>
>>>> [Jonah Palmer]
>>>> This series was rebased after [3] was pulled in, as [3] was a prerequisite
>>>> fix for this series.
>>>>
>>>> v4:
>>>> ---
>>>> * Add memory listener unregistration to vhost_vdpa_reset_device.
>>>> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>>>>
>>>> v3:
>>>> ---
>>>> * Rebase
>>>>
>>>> v2:
>>>> ---
>>>> * Move the memory listener registration to vhost_vdpa_set_owner function.
>>>> * Move the iova_tree allocation to net_vhost_vdpa_init.
>>>>
>>>> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>>>>
>>>> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
>>>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
>>>> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>>>>
>>>> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>>>>                 comments when I return.
>>>>
>>>> Eugenio Pérez (7):
>>>>     vdpa: check for iova tree initialized at net_client_start
>>>>     vdpa: reorder vhost_vdpa_set_backend_cap
>>>>     vdpa: set backend capabilities at vhost_vdpa_init
>>>>     vdpa: add listener_registered
>>>>     vdpa: reorder listener assignment
>>>>     vdpa: move iova_tree allocation to net_vhost_vdpa_init
>>>>     vdpa: move memory listener register to vhost_vdpa_init
>>>>
>>>>    hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>>>>    include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>>>>    net/vhost-vdpa.c               |  34 +----------
>>>>    3 files changed, 93 insertions(+), 70 deletions(-)
>>>>
>>>> --
>>>> 2.43.5
>>>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (9 preceding siblings ...)
  2025-05-14 23:00 ` Si-Wei Liu
@ 2025-05-16  1:47 ` Jason Wang
  2025-05-16  1:51 ` Jason Wang
  11 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:47 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initializaed.  So
> moving that operation allows QEMU to communicate the kernel the maps
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2seconds.  This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the called downtime window.
> The downtime is measured as checking the trace timestamp from the moment
> the source suspend the device to the moment the destination starts the
> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach of series [1]. As the title does not
> reflect the changes anymore, please refer to the previous one to know the
> series history.
>
> This series is based on [2], it must be applied after it.

Note that this has been merged.

Thanks

>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>               comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>  include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>  net/vhost-vdpa.c               |  34 +----------
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-15  0:17   ` Si-Wei Liu
  2025-05-15  5:43     ` Michael S. Tsirkin
  2025-05-15  8:30     ` Eugenio Perez Martin
@ 2025-05-16  1:49     ` Jason Wang
  2 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:49 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, Jonah Palmer, qemu-devel, peterx, mst,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

On Thu, May 15, 2025 at 8:17 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Hi Eugenio,
>
> On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >> destination.  Currently they are done after the source of the migration is
> >> stopped, and before the workload is resumed at the destination.  This is a
> >> period where neigher traffic can flow, nor the VM workload can continue
> >> (downtime).
> >>
> >> We can do better as we know the memory layout of the guest RAM at the
> >> destination from the moment that all devices are initializaed.  So
> >> moving that operation allows QEMU to communicate the kernel the maps
> >> while the workload is still running in the source, so Linux can start
> >> mapping them.
> >>
> >> As a small drawback, there is a time in the initialization where QEMU
> >> cannot respond to QMP etc.  By some testing, this time is about
> >> 0.2seconds.  This may be further reduced (or increased) depending on the
> >> vdpa driver and the platform hardware, and it is dominated by the cost
> >> of memory pinning.
> >>
> >> This matches the time that we move out of the called downtime window.
> >> The downtime is measured as checking the trace timestamp from the moment
> >> the source suspend the device to the moment the destination starts the
> >> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> >> secs to 2.0949.
> >>
> > Hi Jonah,
> >
> > Could you update this benchmark? I don't think it changed a lot but
> > just to be as updated as possible.
> Jonah is off this week and will be back until next Tuesday, but I recall
> he indeed did some downtime test with VM with 128GB memory before taking
> off, which shows obvious improvement from around 10 seconds to 5.8
> seconds after applying this series. Since this is related to update on
> the cover letter, would it be okay for you and Jason to ack now and then
> proceed to Michael for upcoming merge?

I will go through the series.

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-07 18:46 [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init Jonah Palmer
                   ` (10 preceding siblings ...)
  2025-05-16  1:47 ` Jason Wang
@ 2025-05-16  1:51 ` Jason Wang
  2025-05-16  6:40   ` Markus Armbruster
  11 siblings, 1 reply; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:51 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky,
	Armbruster, Markus

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment that all devices are initializaed.  So
> moving that operation allows QEMU to communicate the kernel the maps
> while the workload is still running in the source, so Linux can start
> mapping them.
>
> As a small drawback, there is a time in the initialization where QEMU
> cannot respond to QMP etc.  By some testing, this time is about
> 0.2seconds.

Adding Markus to see if this is a real problem or not.

(I remember VFIO has some optimization in the speed of the pinning,
could vDPA do the same?)

Thanks

> This may be further reduced (or increased) depending on the
> vdpa driver and the platform hardware, and it is dominated by the cost
> of memory pinning.
>
> This matches the time that we move out of the called downtime window.
> The downtime is measured as checking the trace timestamp from the moment
> the source suspend the device to the moment the destination starts the
> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> secs to 2.0949.
>
> Future directions on top of this series may include to move more things ahead
> of the migration time, like set DRIVER_OK or perform actual iterative migration
> of virtio-net devices.
>
> Comments are welcome.
>
> This series is a different approach of series [1]. As the title does not
> reflect the changes anymore, please refer to the previous one to know the
> series history.
>
> This series is based on [2], it must be applied after it.
>
> [Jonah Palmer]
> This series was rebased after [3] was pulled in, as [3] was a prerequisite
> fix for this series.
>
> v4:
> ---
> * Add memory listener unregistration to vhost_vdpa_reset_device.
> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>
> v3:
> ---
> * Rebase
>
> v2:
> ---
> * Move the memory listener registration to vhost_vdpa_set_owner function.
> * Move the iova_tree allocation to net_vhost_vdpa_init.
>
> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>
> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>
> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>               comments when I return.
>
> Eugenio Pérez (7):
>   vdpa: check for iova tree initialized at net_client_start
>   vdpa: reorder vhost_vdpa_set_backend_cap
>   vdpa: set backend capabilities at vhost_vdpa_init
>   vdpa: add listener_registered
>   vdpa: reorder listener assignment
>   vdpa: move iova_tree allocation to net_vhost_vdpa_init
>   vdpa: move memory listener register to vhost_vdpa_init
>
>  hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>  include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>  net/vhost-vdpa.c               |  34 +----------
>  3 files changed, 93 insertions(+), 70 deletions(-)
>
> --
> 2.43.5
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start
  2025-05-07 18:46 ` [PATCH v4 1/7] vdpa: check for iova tree initialized at net_client_start Jonah Palmer
@ 2025-05-16  1:52   ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:52 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> To map the guest memory while it is migrating we need to create the
> iova_tree, as long as the destination uses x-svq=on. Checking to not
> override it.
>
> The function vhost_vdpa_net_client_stop clear it if the device is
> stopped. If the guest starts the device again, the iova tree is
> recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start
> if needed, so old behavior is kept.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

> ---
>  net/vhost-vdpa.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 7ca8b46eee..decb826868 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -366,7 +366,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>
>      migration_add_notifier(&s->migration_state,
>                             vdpa_net_migration_state_notifier);
> -    if (v->shadow_vqs_enabled) {
> +
> +    /* iova_tree may be initialized by vhost_vdpa_net_load_setup */
> +    if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
>          v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
>                                                     v->shared->iova_range.last);
>      }
> --
> 2.43.5
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap
  2025-05-07 18:46 ` [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap Jonah Palmer
@ 2025-05-16  1:53   ` Jason Wang
  2025-05-16  1:56   ` Jason Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:53 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:48 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> It will be used directly by vhost_vdpa_init.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap
  2025-05-07 18:46 ` [PATCH v4 2/7] vdpa: reorder vhost_vdpa_set_backend_cap Jonah Palmer
  2025-05-16  1:53   ` Jason Wang
@ 2025-05-16  1:56   ` Jason Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:56 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:48 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> It will be used directly by vhost_vdpa_init.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 3/7] vdpa: set backend capabilities at vhost_vdpa_init
  2025-05-07 18:46 ` [PATCH v4 3/7] vdpa: set backend capabilities at vhost_vdpa_init Jonah Palmer
@ 2025-05-16  1:57   ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  1:57 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> The backend does not reset them until the vdpa file descriptor is closed
> so there is no harm in doing it only once.
>
> This allows the destination of a live migration to premap memory in
> batches, using VHOST_BACKEND_F_IOTLB_BATCH.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
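
As a rough illustration of what "premap memory in batches" means at the
vhost-vdpa uAPI level, here is a minimal user-space sketch. It is not
QEMU code: it assumes 'fd' is an already-opened and owned
/dev/vhost-vdpa-N descriptor that has negotiated
VHOST_BACKEND_F_IOTLB_BATCH, and it skips most error handling.

#include <linux/vhost.h>
#include <linux/vhost_types.h>
#include <string.h>
#include <unistd.h>

/* Write one IOTLB message to the vhost-vdpa fd. */
static int send_iotlb(int fd, __u8 type, __u64 iova, __u64 size, __u64 uaddr)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.type = type;
    if (type == VHOST_IOTLB_UPDATE) {
        msg.iotlb.iova = iova;
        msg.iotlb.size = size;
        msg.iotlb.uaddr = uaddr;
        msg.iotlb.perm = VHOST_ACCESS_RW;
    }
    return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}

/*
 * Wrap a series of mappings in BATCH_BEGIN/BATCH_END so the kernel can
 * apply them in one go instead of committing once per memory section.
 */
static void premap_guest_ram(int fd, const struct vhost_iotlb_msg *m, int n)
{
    send_iotlb(fd, VHOST_IOTLB_BATCH_BEGIN, 0, 0, 0);
    for (int i = 0; i < n; i++) {
        send_iotlb(fd, VHOST_IOTLB_UPDATE, m[i].iova, m[i].size, m[i].uaddr);
    }
    send_iotlb(fd, VHOST_IOTLB_BATCH_END, 0, 0, 0);
}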

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/7] vdpa: add listener_registered
  2025-05-07 18:46 ` [PATCH v4 4/7] vdpa: add listener_registered Jonah Palmer
@ 2025-05-16  2:00   ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  2:00 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> Check if the listener has been registered or not, so it needs to be
> registered again at start.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/7] vdpa: reorder listener assignment
  2025-05-07 18:46 ` [PATCH v4 5/7] vdpa: reorder listener assignment Jonah Palmer
@ 2025-05-16  2:01   ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  2:01 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> Since commit f6fe3e333f ("vdpa: move memory listener to
> vhost_vdpa_shared") this piece of code repeatedly assign
> shared->listener members.  This was not a problem as it was not used
> until device start.
>
> However next patches move the listener registration to this
> vhost_vdpa_init function.  When the listener is registered it is added
> to an embedded linked list, so setting its members again will cause
> memory corruption to the linked list node.
>
> Do the right thing and only set it in the first vdpa device.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
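
The corruption described above can be illustrated with a self-contained
toy sketch (not QEMU code: QEMU uses a QTAILQ, but a plain singly linked
list is enough to show why re-assigning an element that is already on an
intrusive list loses whatever was linked after it):

#include <stdio.h>

struct listener {
    const char *name;
    struct listener *next;      /* embedded link, like the QTAILQ entry */
};

static struct listener *head;

/* Append to the global list, as a registration function would. */
static void register_listener(struct listener *l)
{
    struct listener **p = &head;

    while (*p) {
        p = &(*p)->next;
    }
    l->next = NULL;
    *p = l;
}

int main(void)
{
    static struct listener shared, other;

    /* First vdpa device: set up the shared listener and register it. */
    shared = (struct listener){ .name = "vdpa-shared" };
    register_listener(&shared);

    /* Another listener gets registered after it. */
    other = (struct listener){ .name = "other" };
    register_listener(&other);

    /*
     * A second vdpa device repeating the assignment overwrites
     * shared.next, which register_listener() had pointed at 'other',
     * so everything linked after 'shared' is silently dropped.
     */
    shared = (struct listener){ .name = "vdpa-shared" };

    for (struct listener *l = head; l; l = l->next) {
        printf("%s\n", l->name);    /* prints only "vdpa-shared" */
    }
    return 0;
}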

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
  2025-05-07 18:46 ` [PATCH v4 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init Jonah Palmer
@ 2025-05-16  2:07   ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2025-05-16  2:07 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: qemu-devel, eperezma, peterx, mst, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> From: Eugenio Pérez <eperezma@redhat.com>
>
> As we are moving to keep the mapping through all the vdpa device life
> instead of resetting it at VirtIO reset, we need to move all its
> dependencies to the initialization too.  In particular devices with
> x-svq=on need a valid iova_tree from the beginning.
>
> Simplify the code also consolidating the two creation points: the first
> data vq in case of SVQ active and CVQ start in case only CVQ uses it.
>
> Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-16  1:51 ` Jason Wang
@ 2025-05-16  6:40   ` Markus Armbruster
  2025-05-16 19:09     ` Si-Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-05-16  6:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonah Palmer, qemu-devel, eperezma, peterx, mst, lvivier,
	dtatulea, leiyang, parav, sgarzare, si-wei.liu, lingshan.zhu,
	boris.ostrovsky

Jason Wang <jasowang@redhat.com> writes:

> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>> Current memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration is
>> stopped, and before the workload is resumed at the destination.  This is a
>> period where neigher traffic can flow, nor the VM workload can continue
>> (downtime).
>>
>> We can do better as we know the memory layout of the guest RAM at the
>> destination from the moment that all devices are initializaed.  So
>> moving that operation allows QEMU to communicate the kernel the maps
>> while the workload is still running in the source, so Linux can start
>> mapping them.
>>
>> As a small drawback, there is a time in the initialization where QEMU
>> cannot respond to QMP etc.  By some testing, this time is about
>> 0.2seconds.
>
> Adding Markus to see if this is a real problem or not.

I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care about is the time from executing qemu-system-FOO to
the guest finishing booting, and the guest takes 10s to boot, then an
extra 0.2s won't matter much.

When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.

In what scenarios exactly is QMP delayed?

You told us an absolute delay you observed.  What's the relative delay,
i.e. what's the delay with and without these patches?

We need QMP to become available earlier in the startup sequence for
other reasons.  Could we bypass the delay that way?  Please understand
that this would likely be quite difficult: we know from experience that
messing with the startup sequence is prone to introduce subtle
compatibility breaks and even bugs.

> (I remember VFIO has some optimization in the speed of the pinning,
> could vDPA do the same?)

That's well outside my bailiwick :)

[...]



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-15 17:41       ` Si-Wei Liu
@ 2025-05-16 10:45         ` Michael S. Tsirkin
  0 siblings, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2025-05-16 10:45 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, Jonah Palmer, qemu-devel, peterx, jasowang,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

On Thu, May 15, 2025 at 10:41:45AM -0700, Si-Wei Liu wrote:
> 
> 
> On 5/14/2025 10:43 PM, Michael S. Tsirkin wrote:
> > On Wed, May 14, 2025 at 05:17:15PM -0700, Si-Wei Liu wrote:
> > > Hi Eugenio,
> > > 
> > > On 5/14/2025 8:49 AM, Eugenio Perez Martin wrote:
> > > > On Wed, May 7, 2025 at 8:47 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> > > > > Current memory operations like pinning may take a lot of time at the
> > > > > destination.  Currently they are done after the source of the migration is
> > > > > stopped, and before the workload is resumed at the destination.  This is a
> > > > > period where neigher traffic can flow, nor the VM workload can continue
> > > > > (downtime).
> > > > > 
> > > > > We can do better as we know the memory layout of the guest RAM at the
> > > > > destination from the moment that all devices are initializaed.  So
> > > > > moving that operation allows QEMU to communicate the kernel the maps
> > > > > while the workload is still running in the source, so Linux can start
> > > > > mapping them.
> > > > > 
> > > > > As a small drawback, there is a time in the initialization where QEMU
> > > > > cannot respond to QMP etc.  By some testing, this time is about
> > > > > 0.2seconds.  This may be further reduced (or increased) depending on the
> > > > > vdpa driver and the platform hardware, and it is dominated by the cost
> > > > > of memory pinning.
> > > > > 
> > > > > This matches the time that we move out of the called downtime window.
> > > > > The downtime is measured as checking the trace timestamp from the moment
> > > > > the source suspend the device to the moment the destination starts the
> > > > > eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
> > > > > secs to 2.0949.
> > > > > 
> > > > Hi Jonah,
> > > > 
> > > > Could you update this benchmark? I don't think it changed a lot but
> > > > just to be as updated as possible.
> > > Jonah is off this week and will be back until next Tuesday, but I recall he
> > > indeed did some downtime test with VM with 128GB memory before taking off,
> > > which shows obvious improvement from around 10 seconds to 5.8 seconds after
> > > applying this series. Since this is related to update on the cover letter,
> > > would it be okay for you and Jason to ack now and then proceed to Michael
> > > for upcoming merge?
> > > 
> > > > I think I cannot ack the series as I sent the first revision. Jason or
> > > > Si-Wei, could you ack it?
> > > Sure, I just give my R-b, this series look good to me. Hopefully Jason can
> > > ack on his own.
> > > 
> > > Thanks!
> > > -Siwei
> > I just sent a pull, next one in a week or two, so - no rush.
> All right, should be good to wait. In any case you have to repost a v2 PULL,
> hope this series can be piggy-back'ed as we did extensive tests about it.
> ;-)
> 
> -Siwei

You mean "in case"?

> > 
> > 
> > > > Thanks!
> > > > 
> > > > > Future directions on top of this series may include to move more things ahead
> > > > > of the migration time, like set DRIVER_OK or perform actual iterative migration
> > > > > of virtio-net devices.
> > > > > 
> > > > > Comments are welcome.
> > > > > 
> > > > > This series is a different approach of series [1]. As the title does not
> > > > > reflect the changes anymore, please refer to the previous one to know the
> > > > > series history.
> > > > > 
> > > > > This series is based on [2], it must be applied after it.
> > > > > 
> > > > > [Jonah Palmer]
> > > > > This series was rebased after [3] was pulled in, as [3] was a prerequisite
> > > > > fix for this series.
> > > > > 
> > > > > v4:
> > > > > ---
> > > > > * Add memory listener unregistration to vhost_vdpa_reset_device.
> > > > > * Remove memory listener unregistration from vhost_vdpa_reset_status.
> > > > > 
> > > > > v3:
> > > > > ---
> > > > > * Rebase
> > > > > 
> > > > > v2:
> > > > > ---
> > > > > * Move the memory listener registration to vhost_vdpa_set_owner function.
> > > > > * Move the iova_tree allocation to net_vhost_vdpa_init.
> > > > > 
> > > > > v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > > > > 
> > > > > [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
> > > > > [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > > > > [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
> > > > > 
> > > > > Jonah - note: I'll be on vacation from May 10-19. Will respond to
> > > > >                 comments when I return.
> > > > > 
> > > > > Eugenio Pérez (7):
> > > > >     vdpa: check for iova tree initialized at net_client_start
> > > > >     vdpa: reorder vhost_vdpa_set_backend_cap
> > > > >     vdpa: set backend capabilities at vhost_vdpa_init
> > > > >     vdpa: add listener_registered
> > > > >     vdpa: reorder listener assignment
> > > > >     vdpa: move iova_tree allocation to net_vhost_vdpa_init
> > > > >     vdpa: move memory listener register to vhost_vdpa_init
> > > > > 
> > > > >    hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
> > > > >    include/hw/virtio/vhost-vdpa.h |  22 ++++++-
> > > > >    net/vhost-vdpa.c               |  34 +----------
> > > > >    3 files changed, 93 insertions(+), 70 deletions(-)
> > > > > 
> > > > > --
> > > > > 2.43.5
> > > > > 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-16  6:40   ` Markus Armbruster
@ 2025-05-16 19:09     ` Si-Wei Liu
  2025-05-26  9:16       ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-16 19:09 UTC (permalink / raw)
  To: Markus Armbruster, Jason Wang
  Cc: Jonah Palmer, qemu-devel, eperezma, peterx, mst, lvivier,
	dtatulea, leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky



On 5/15/2025 11:40 PM, Markus Armbruster wrote:
> Jason Wang <jasowang@redhat.com> writes:
>
>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>> Current memory operations like pinning may take a lot of time at the
>>> destination.  Currently they are done after the source of the migration is
>>> stopped, and before the workload is resumed at the destination.  This is a
>>> period where neigher traffic can flow, nor the VM workload can continue
>>> (downtime).
>>>
>>> We can do better as we know the memory layout of the guest RAM at the
>>> destination from the moment that all devices are initializaed.  So
>>> moving that operation allows QEMU to communicate the kernel the maps
>>> while the workload is still running in the source, so Linux can start
>>> mapping them.
>>>
>>> As a small drawback, there is a time in the initialization where QEMU
>>> cannot respond to QMP etc.  By some testing, this time is about
>>> 0.2seconds.
>> Adding Markus to see if this is a real problem or not.
> I guess the answer is "depends", and to get a more useful one, we need
> more information.
>
> When all you care is time from executing qemu-system-FOO to guest
> finish booting, and the guest takes 10s to boot, then an extra 0.2s
> won't matter much.
There's no extra delay of 0.2s or higher per se; the series just shifts
the page pinning hiccup, be it 0.2s or something else, from the time the
guest is booting up to before the guest boots. This wins back guest boot
time or start-up delay, but in turn the same delay is effectively
charged to VM launch time. We follow the same model as VFIO, which sees
the same hiccup during launch (at an early stage that no real mgmt
software would care about).
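
(A rough sanity check with the numbers already quoted in this thread:
for the 39G guest the measured downtime goes from ~2.2526s to ~2.0949s,
i.e. about 0.16s less, the same order as the ~0.2s initialization stall
that moved earlier; for the 128GB test the drop is from ~10s to ~5.8s.)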

>
> When a management application runs qemu-system-FOO several times to
> probe its capabilities via QMP, then even milliseconds can hurt.
It's not like that: the page pinning hiccup is a one-time event that
occurs at a very early stage when launching QEMU, i.e. there's no
recurring delay every time QMP is called. The delay in the QMP response
at that point depends on how much memory the VM has, but it is specific
to VMs with VFIO or vDPA devices that have to pin memory for DMA. Having
said that, there's no extra delay at all if the QEMU command line has no
vDPA device assignment; on the other hand, there's the same delay or QMP
hiccup when VFIO is present on the QEMU command line.
> In what scenarios exactly is QMP delayed?
Having said that, this is not a new problem for QEMU in particular; this
QMP delay is nothing peculiar, and it already exists with VFIO as well.

Thanks,
-Siwei

>
> You told us an absolute delay you observed.  What's the relative delay,
> i.e. what's the delay with and without these patches?
>
> We need QMP to become available earlier in the startup sequence for
> other reasons.  Could we bypass the delay that way?  Please understand
> that this would likely be quite difficult: we know from experience that
> messing with the startup sequence is prone to introduce subtle
> compatility breaks and even bugs.
>
>> (I remember VFIO has some optimization in the speed of the pinning,
>> could vDPA do the same?)
> That's well outside my bailiwick :)
>
> [...]
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 7/7] vdpa: move memory listener register to vhost_vdpa_init
  2025-05-15 17:36     ` Si-Wei Liu
@ 2025-05-20 13:23       ` Jonah Palmer
  0 siblings, 0 replies; 44+ messages in thread
From: Jonah Palmer @ 2025-05-20 13:23 UTC (permalink / raw)
  To: Si-Wei Liu, Michael S. Tsirkin
  Cc: qemu-devel, eperezma, peterx, jasowang, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky


On 5/15/25 1:36 PM, Si-Wei Liu wrote:
>
>
> On 5/14/2025 10:42 PM, Michael S. Tsirkin wrote:
>> On Wed, May 07, 2025 at 02:46:47PM -0400, Jonah Palmer wrote:
>>> From: Eugenio Pérez <eperezma@redhat.com>
>>>
>>> Current memory operations like pinning may take a lot of time at the
>>> destination.  Currently they are done after the source of the 
>>> migration is
>>> stopped, and before the workload is resumed at the destination.  
>>> This is a
>>> period where neigher traffic can flow, nor the VM workload can continue
>>> (downtime).
>>>
>>> We can do better as we know the memory layout of the guest RAM at the
>>> destination from the moment that all devices are initializaed.  So
>>> moving that operation allows QEMU to communicate the kernel the maps
>>> while the workload is still running in the source, so Linux can start
>>> mapping them.
>>>
>>> As a small drawback, there is a time in the initialization where QEMU
>>> cannot respond to QMP etc.  By some testing, this time is about
>>> 0.2seconds.  This may be further reduced (or increased) depending on 
>>> the
>>> vdpa driver and the platform hardware, and it is dominated by the cost
>>> of memory pinning.
>>>
>>> This matches the time that we move out of the called downtime window.
>>> The downtime is measured as checking the trace timestamp from the 
>>> moment
>>> the source suspend the device to the moment the destination starts the
>>> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
>>> secs to 2.0949.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>
>>> v3:
>>> ---
>> just know that everything beyond this line is not going into
>> git commit log.
> I guess that was the intent? Should be fine without them in the commit 
> log I think. These are interim to capture what change was made to fix 
> specific bug *in previous posted versions*.
>
> (having said, please help edit the log and remove the "v3:" line which 
> should be after the --- separator line, thx!)
>
> -Siwei
Whoops, will fix this. Sorry about that.
>
>>
>>
>>> Move memory listener unregistration from vhost_vdpa_reset_status to
>>> vhost_vdpa_reset_device. By unregistering the listener here, we can
>>> guarantee that every reset leaves the device in an expected state.
>>> Also remove the duplicate call in vhost_vdpa_reset_status.
>>>
>>> Reported-by: Lei Yang <leiyang@redhat.com>
>>> Suggested-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>>
>>> -- 
>>> v2:
>>> Move the memory listener registration to vhost_vdpa_set_owner function.
>>> In case of hotplug the vdpa device, the memory is already set up, and
>>> leaving memory listener register call in the init function made maps
>>> occur before set owner call.
>>>
>>> To be 100% safe, let's put it right after set_owner call.
>>>
>>> Reported-by: Lei Yang <leiyang@redhat.com>
>>> ---
>>>   hw/virtio/vhost-vdpa.c | 35 ++++++++++++++++++++++++++++-------
>>>   1 file changed, 28 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index de834f2ebd..e20da95f30 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -894,8 +894,14 @@ static int vhost_vdpa_reset_device(struct 
>>> vhost_dev *dev)
>>>         ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);
>>>       trace_vhost_vdpa_reset_device(dev);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> + memory_listener_unregister(&v->shared->listener);
>>> +    v->shared->listener_registered = false;
>>>       v->suspended = false;
>>> -    return ret;
>>> +    return 0;
>>>   }
>>>     static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
>>> @@ -1379,6 +1385,11 @@ static int vhost_vdpa_dev_start(struct 
>>> vhost_dev *dev, bool started)
>>>                            "IOMMU and try again");
>>>               return -1;
>>>           }
>>> +        if (v->shared->listener_registered &&
>>> +            dev->vdev->dma_as != v->shared->listener.address_space) {
>>> + memory_listener_unregister(&v->shared->listener);
>>> +            v->shared->listener_registered = false;
>>> +        }
>>>           if (!v->shared->listener_registered) {
>>> memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
>>>               v->shared->listener_registered = true;
>>> @@ -1392,8 +1403,6 @@ static int vhost_vdpa_dev_start(struct 
>>> vhost_dev *dev, bool started)
>>>     static void vhost_vdpa_reset_status(struct vhost_dev *dev)
>>>   {
>>> -    struct vhost_vdpa *v = dev->opaque;
>>> -
>>>       if (!vhost_vdpa_last_dev(dev)) {
>>>           return;
>>>       }
>>> @@ -1401,9 +1410,6 @@ static void vhost_vdpa_reset_status(struct 
>>> vhost_dev *dev)
>>>       vhost_vdpa_reset_device(dev);
>>>       vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>>                                  VIRTIO_CONFIG_S_DRIVER);
>>> - memory_listener_unregister(&v->shared->listener);
>>> -    v->shared->listener_registered = false;
>>> -
>>>   }
>>>     static int vhost_vdpa_set_log_base(struct vhost_dev *dev, 
>>> uint64_t base,
>>> @@ -1537,12 +1543,27 @@ static int vhost_vdpa_get_features(struct 
>>> vhost_dev *dev,
>>>     static int vhost_vdpa_set_owner(struct vhost_dev *dev)
>>>   {
>>> +    int r;
>>> +    struct vhost_vdpa *v;
>>> +
>>>       if (!vhost_vdpa_first_dev(dev)) {
>>>           return 0;
>>>       }
>>>         trace_vhost_vdpa_set_owner(dev);
>>> -    return vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
>>> +    r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
>>> +    if (unlikely(r < 0)) {
>>> +        return r;
>>> +    }
>>> +
>>> +    /*
>>> +     * Being optimistic and listening address space memory. If the 
>>> device
>>> +     * uses vIOMMU, it is changed at vhost_vdpa_dev_start.
>>> +     */
>>> +    v = dev->opaque;
>>> +    memory_listener_register(&v->shared->listener, 
>>> &address_space_memory);
>>> +    v->shared->listener_registered = true;
>>> +    return 0;
>>>   }
>>>     static int vhost_vdpa_vq_get_addr(struct vhost_dev *dev,
>>> -- 
>>> 2.43.5
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-14 15:49 ` Eugenio Perez Martin
  2025-05-15  0:17   ` Si-Wei Liu
@ 2025-05-20 13:27   ` Jonah Palmer
  1 sibling, 0 replies; 44+ messages in thread
From: Jonah Palmer @ 2025-05-20 13:27 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, peterx, mst, jasowang, lvivier, dtatulea, leiyang,
	parav, sgarzare, si-wei.liu, lingshan.zhu, boris.ostrovsky



On 5/14/25 11:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 7, 2025 at 8:47 PM Jonah Palmer<jonah.palmer@oracle.com> wrote:
>> Current memory operations like pinning may take a lot of time at the
>> destination.  Currently they are done after the source of the migration is
>> stopped, and before the workload is resumed at the destination.  This is a
>> period where neigher traffic can flow, nor the VM workload can continue
>> (downtime).
>>
>> We can do better as we know the memory layout of the guest RAM at the
>> destination from the moment that all devices are initializaed.  So
>> moving that operation allows QEMU to communicate the kernel the maps
>> while the workload is still running in the source, so Linux can start
>> mapping them.
>>
>> As a small drawback, there is a time in the initialization where QEMU
>> cannot respond to QMP etc.  By some testing, this time is about
>> 0.2seconds.  This may be further reduced (or increased) depending on the
>> vdpa driver and the platform hardware, and it is dominated by the cost
>> of memory pinning.
>>
>> This matches the time that we move out of the called downtime window.
>> The downtime is measured as checking the trace timestamp from the moment
>> the source suspend the device to the moment the destination starts the
>> eight and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
>> secs to 2.0949.
>>
> Hi Jonah,
>
> Could you update this benchmark? I don't think it changed a lot but
> just to be as updated as possible.

Yes, will update this for 39G guest and for 128G guests :)

>
> I think I cannot ack the series as I sent the first revision. Jason or
> Si-Wei, could you ack it?
>
> Thanks!
>
>> Future directions on top of this series may include to move more things ahead
>> of the migration time, like set DRIVER_OK or perform actual iterative migration
>> of virtio-net devices.
>>
>> Comments are welcome.
>>
>> This series is a different approach of series [1]. As the title does not
>> reflect the changes anymore, please refer to the previous one to know the
>> series history.
>>
>> This series is based on [2], it must be applied after it.
>>
>> [Jonah Palmer]
>> This series was rebased after [3] was pulled in, as [3] was a prerequisite
>> fix for this series.
>>
>> v4:
>> ---
>> * Add memory listener unregistration to vhost_vdpa_reset_device.
>> * Remove memory listener unregistration from vhost_vdpa_reset_status.
>>
>> v3:
>> ---
>> * Rebase
>>
>> v2:
>> ---
>> * Move the memory listener registration to vhost_vdpa_set_owner function.
>> * Move the iova_tree allocation to net_vhost_vdpa_init.
>>
>> v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
>>
>> [1] https://patchwork.kernel.org/project/qemu-devel/cover/20231215172830.2540987-1-eperezma@redhat.com/
>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
>> [3] https://lore.kernel.org/qemu-devel/20250217144936.3589907-1-jonah.palmer@oracle.com/
>>
>> Jonah - note: I'll be on vacation from May 10-19. Will respond to
>>                comments when I return.
>>
>> Eugenio Pérez (7):
>>    vdpa: check for iova tree initialized at net_client_start
>>    vdpa: reorder vhost_vdpa_set_backend_cap
>>    vdpa: set backend capabilities at vhost_vdpa_init
>>    vdpa: add listener_registered
>>    vdpa: reorder listener assignment
>>    vdpa: move iova_tree allocation to net_vhost_vdpa_init
>>    vdpa: move memory listener register to vhost_vdpa_init
>>
>>   hw/virtio/vhost-vdpa.c         | 107 +++++++++++++++++++++------------
>>   include/hw/virtio/vhost-vdpa.h |  22 ++++++-
>>   net/vhost-vdpa.c               |  34 +----------
>>   3 files changed, 93 insertions(+), 70 deletions(-)
>>
>> --
>> 2.43.5
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-16 19:09     ` Si-Wei Liu
@ 2025-05-26  9:16       ` Markus Armbruster
  2025-05-29  7:57         ` Si-Wei Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-05-26  9:16 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, Jonah Palmer, qemu-devel, eperezma, peterx, mst,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

Si-Wei Liu <si-wei.liu@oracle.com> writes:

> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>> Jason Wang <jasowang@redhat.com> writes:
>>
>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>> Current memory operations like pinning may take a lot of time at the
>>>> destination.  Currently they are done after the source of the migration is
>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>> (downtime).
>>>>
>>>> We can do better as we know the memory layout of the guest RAM at the
>>>> destination from the moment that all devices are initializaed.  So
>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>> while the workload is still running in the source, so Linux can start
>>>> mapping them.
>>>>
>>>> As a small drawback, there is a time in the initialization where QEMU
>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>> 0.2seconds.
>>>
>>> Adding Markus to see if this is a real problem or not.
>>
>> I guess the answer is "depends", and to get a more useful one, we need
>> more information.
>>
>> When all you care is time from executing qemu-system-FOO to guest
>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>> won't matter much.
>
> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>
>> When a management application runs qemu-system-FOO several times to
>> probe its capabilities via QMP, then even milliseconds can hurt.
>>
> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>
>> In what scenarios exactly is QMP delayed?
>
> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.

In what scenarios exactly is QMP delayed compared to before the patch?

> Thanks,
> -Siwei
>
>>
>> You told us an absolute delay you observed.  What's the relative delay,
>> i.e. what's the delay with and without these patches?

Can you answer this question?

>> We need QMP to become available earlier in the startup sequence for
>> other reasons.  Could we bypass the delay that way?  Please understand
>> that this would likely be quite difficult: we know from experience that
>> messing with the startup sequence is prone to introduce subtle
>> compatility breaks and even bugs.
>>
>>> (I remember VFIO has some optimization in the speed of the pinning,
>>> could vDPA do the same?)
>>
>> That's well outside my bailiwick :)
>>
>> [...]
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-26  9:16       ` Markus Armbruster
@ 2025-05-29  7:57         ` Si-Wei Liu
  2025-06-02  8:08           ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Si-Wei Liu @ 2025-05-29  7:57 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Wang, Jonah Palmer, qemu-devel, eperezma, peterx, mst,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky



On 5/26/2025 2:16 AM, Markus Armbruster wrote:
> Si-Wei Liu <si-wei.liu@oracle.com> writes:
>
>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>> Jason Wang <jasowang@redhat.com> writes:
>>>
>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>> Current memory operations like pinning may take a lot of time at the
>>>>> destination.  Currently they are done after the source of the migration is
>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>> (downtime).
>>>>>
>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>> destination from the moment that all devices are initializaed.  So
>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>> while the workload is still running in the source, so Linux can start
>>>>> mapping them.
>>>>>
>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>> 0.2seconds.
>>>> Adding Markus to see if this is a real problem or not.
>>> I guess the answer is "depends", and to get a more useful one, we need
>>> more information.
>>>
>>> When all you care is time from executing qemu-system-FOO to guest
>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>> won't matter much.
>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>
>>> When a management application runs qemu-system-FOO several times to
>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>
>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>
>>> In what scenarios exactly is QMP delayed?
>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
> In what scenarios exactly is QMP delayed compared to before the patch?
The page pinning process now runs in a pretty early phase of
qemu_init(), e.g. machine_run_board_init(), before any QMP command can
be serviced; commands only get serviced once qemu_main_loop() is
running and the AIO context gets polled and its bottom halves
dispatched. Technically it's not a delay of a specific QMP command, but
rather an extended span of initialization that takes place before the
very first QMP request, usually qmp_capabilities, is serviced. It's
natural for mgmt software to expect an initialization delay in the
first qmp_capabilities response if it issues one immediately after
launching QEMU; with a large guest that has hundreds of GBs of memory
and a passthrough device that must pin memory for DMA, e.g. VFIO, the
delayed effect of QEMU initialization is very visible too. On the other
hand, before the patch, if memory happened to be in the middle of being
pinned, any ongoing QMP command couldn't be serviced by the QEMU main
loop either.

I'd also like to highlight that without this patch the rather high
delay due to page pinning is visible to the guest as well, not just to
QMP, and it already hurts guest boot time with a vDPA device. This is
long standing, and every VM user with a vDPA device would like to
avoid such a high delay on the first boot, which is not seen with a
comparable device, e.g. VFIO passthrough.

>
>> Thanks,
>> -Siwei
>>
>>> You told us an absolute delay you observed.  What's the relative delay,
>>> i.e. what's the delay with and without these patches?
> Can you answer this question?
I thought I already answered that in an earlier reply. The relative
delay depends on the size of guest memory. Usually mgmt software won't
be able to notice it, unless the guest has more than 100GB of THP
memory to pin, for DMA or whatever reason.


>
>>> We need QMP to become available earlier in the startup sequence for
>>> other reasons.  Could we bypass the delay that way?  Please understand
>>> that this would likely be quite difficult: we know from experience that
>>> messing with the startup sequence is prone to introduce subtle
>>> compatility breaks and even bugs.
>>>
>>>> (I remember VFIO has some optimization in the speed of the pinning,
>>>> could vDPA do the same?)
>>> That's well outside my bailiwick :)

Please understand that any possible optimization is out of scope for
this patch series; there are certainly ways around it already, to be
carried out in the future, as Peter alluded to in an earlier discussion
thread:

https://lore.kernel.org/qemu-devel/ZZT7wuq-_IhfN_wR@x1n/
https://lore.kernel.org/qemu-devel/ZZZUNsOVxxqr-H5S@x1n/

Thanks,
-Siwei

>>>
>>> [...]
>>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-05-29  7:57         ` Si-Wei Liu
@ 2025-06-02  8:08           ` Markus Armbruster
  2025-06-02  8:29             ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-06-02  8:08 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, Jonah Palmer, qemu-devel, eperezma, peterx, mst,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

Si-Wei Liu <si-wei.liu@oracle.com> writes:

> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>> Si-Wei Liu <si-wei.liu@oracle.com> writes:
>>
>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>> Jason Wang <jasowang@redhat.com> writes:
>>>>
>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>> destination.  Currently they are done after the source of the migration is
>>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>> (downtime).
>>>>>>
>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>> destination from the moment that all devices are initializaed.  So
>>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>>> while the workload is still running in the source, so Linux can start
>>>>>> mapping them.
>>>>>>
>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>>> 0.2seconds.
>>>>> Adding Markus to see if this is a real problem or not.
>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>> more information.
>>>>
>>>> When all you care is time from executing qemu-system-FOO to guest
>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>> won't matter much.
>>>
>>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>>
>>>> When a management application runs qemu-system-FOO several times to
>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>
>>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>>
>>>> In what scenarios exactly is QMP delayed?
>>>
>>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
>>
>> In what scenarios exactly is QMP delayed compared to before the patch?
>
> The page pinning process now runs in a pretty early phase at
> qemu_init() e.g. machine_run_board_init(),

It runs within

    qemu_init()
        qmp_x_exit_preconfig()
            qemu_init_board()
                machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command
x-exit-preconfig.

Correct?

> before any QMP command can be serviced, the latter of which typically
> would be able to get run from qemu_main_loop() until the AIO gets
> chance to be started to get polled and dispatched to bh.

We create the QMP monitor within qemu_create_late_backends(), which runs
before qmp_x_exit_preconfig(), but commands get processed only in the
main loop, which we enter later.

Correct?

> Technically it's not a real delay for specific QMP command, but rather
> an extended span of initialization process may take place before the
> very first QMP request, usually qmp_capabilities, will be
> serviced. It's natural for mgmt software to expect initialization
> delay for the first qmp_capabilities response if it has to immediately
> issue one after launching qemu, especially when you have a large guest
> with hundred GBs of memory and with passthrough device that has to pin
> memory for DMA e.g. VFIO, the delayed effect from the QEMU
> initialization process is very visible too.



>                                             On the other hand, before
> the patch, if memory happens to be in the middle of being pinned, any
> ongoing QMP can't be serviced by the QEMU main loop, either.
>
> I'd also like to highlight that without this patch, the pretty high
> delay due to page pinning is even visible to the guest in addition to
> just QMP delay, which largely affected guest boot time with vDPA
> device already. It is long standing, and every VM user with vDPA
> device would like to avoid such high delay for the first boot, which
> is not seen with similar device e.g. VFIO passthrough.
>
>>
>>> Thanks,
>>> -Siwei
>>>
>>>> You told us an absolute delay you observed.  What's the relative delay,
>>>> i.e. what's the delay with and without these patches?
>>
>> Can you answer this question?
>
> I thought I already got that answered in earlier reply. The relative
> delay is subject to the size of memory. Usually mgmt software won't be
> able to notice, unless the guest has more than 100GB of THP memory to
> pin, for DMA or whatever reason.
>
>
>>
>>>> We need QMP to become available earlier in the startup sequence for
>>>> other reasons.  Could we bypass the delay that way?  Please understand
>>>> that this would likely be quite difficult: we know from experience that
>>>> messing with the startup sequence is prone to introduce subtle
>>>> compatility breaks and even bugs.
>>>>
>>>>> (I remember VFIO has some optimization in the speed of the pinning,
>>>>> could vDPA do the same?)
>>>>
>>>> That's well outside my bailiwick :)
>
> Please be understood that any possible optimization is out of scope of
> this patch series, while there's certainly way around that already and
> to be carry out in the future, as Peter alluded to in earlier
> discussion thread:
>
> https://lore.kernel.org/qemu-devel/ZZT7wuq-_IhfN_wR@x1n/
> https://lore.kernel.org/qemu-devel/ZZZUNsOVxxqr-H5S@x1n/
>
> Thanks,
> -Siwei
>
>>>>
>>>> [...]
>>>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-06-02  8:08           ` Markus Armbruster
@ 2025-06-02  8:29             ` Markus Armbruster
  2025-06-06 16:21               ` Jonah Palmer
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-06-02  8:29 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, Jonah Palmer, qemu-devel, eperezma, peterx, mst,
	lvivier, dtatulea, leiyang, parav, sgarzare, lingshan.zhu,
	boris.ostrovsky

Butterfingers...  let's try this again.

Markus Armbruster <armbru@redhat.com> writes:

> Si-Wei Liu <si-wei.liu@oracle.com> writes:
>
>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>> Si-Wei Liu <si-wei.liu@oracle.com> writes:
>>>
>>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>>> Jason Wang <jasowang@redhat.com> writes:
>>>>>
>>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>>> destination.  Currently they are done after the source of the migration is
>>>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>>> (downtime).
>>>>>>>
>>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>>> destination from the moment that all devices are initializaed.  So
>>>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>>>> while the workload is still running in the source, so Linux can start
>>>>>>> mapping them.
>>>>>>>
>>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>>>> 0.2seconds.
>>>>>> Adding Markus to see if this is a real problem or not.
>>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>>> more information.
>>>>>
>>>>> When all you care is time from executing qemu-system-FOO to guest
>>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>>> won't matter much.
>>>>
>>>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>>>
>>>>> When a management application runs qemu-system-FOO several times to
>>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>>
>>>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>>>
>>>>> In what scenarios exactly is QMP delayed?
>>>>
>>>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
>>>
>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>
>> The page pinning process now runs in a pretty early phase at
>> qemu_init() e.g. machine_run_board_init(),
>
> It runs within
>
>     qemu_init()
>         qmp_x_exit_preconfig()
>             qemu_init_board()
>                 machine_run_board_init()
>
> Except when --preconfig is given, it instead runs within QMP command
> x-exit-preconfig.
>
> Correct?
>
>> before any QMP command can be serviced, the latter of which typically
>> would be able to get run from qemu_main_loop() until the AIO gets
>> chance to be started to get polled and dispatched to bh.
>
> We create the QMP monitor within qemu_create_late_backends(), which runs
> before qmp_x_exit_preconfig(), but commands get processed only in the
> main loop, which we enter later.
>
> Correct?
>
>> Technically it's not a real delay for specific QMP command, but rather
>> an extended span of initialization process may take place before the
>> very first QMP request, usually qmp_capabilities, will be
>> serviced. It's natural for mgmt software to expect initialization
>> delay for the first qmp_capabilities response if it has to immediately
>> issue one after launching qemu, especially when you have a large guest
>> with hundred GBs of memory and with passthrough device that has to pin
>> memory for DMA e.g. VFIO, the delayed effect from the QEMU
>> initialization process is very visible too.

The work clearly needs to be done.  Whether it needs to be blocking
other things is less clear.

Even if it doesn't need to be blocking, we may choose not to avoid
blocking for now.  That should be an informed decision, though.

All I'm trying to do here is understand the tradeoffs, so I can give
useful advice.

>>                                             On the other hand, before
>> the patch, if memory happens to be in the middle of being pinned, any
>> ongoing QMP can't be serviced by the QEMU main loop, either.

When exactly does this pinning happen before the patch?  In which
function?

>> I'd also like to highlight that without this patch, the pretty high
>> delay due to page pinning is even visible to the guest in addition to
>> just QMP delay, which largely affected guest boot time with vDPA
>> device already. It is long standing, and every VM user with vDPA
>> device would like to avoid such high delay for the first boot, which
>> is not seen with similar device e.g. VFIO passthrough.

I understand that hiding the delay from the guest could be useful.

>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> You told us an absolute delay you observed.  What's the relative delay,
>>>>> i.e. what's the delay with and without these patches?
>>>
>>> Can you answer this question?
>>
>> I thought I already got that answered in earlier reply. The relative
>> delay is subject to the size of memory. Usually mgmt software won't be
>> able to notice, unless the guest has more than 100GB of THP memory to
>> pin, for DMA or whatever reason.

Alright, what are the delays you observe with and without these patches
for three test cases that pin 50 / 100 / 200 GiB of THP memory
respectively?

>>>>> We need QMP to become available earlier in the startup sequence for
>>>>> other reasons.  Could we bypass the delay that way?  Please understand
>>>>> that this would likely be quite difficult: we know from experience that
>>>>> messing with the startup sequence is prone to introduce subtle
>>>>> compatility breaks and even bugs.
>>>>>
>>>>>> (I remember VFIO has some optimization in the speed of the pinning,
>>>>>> could vDPA do the same?)
>>>>>
>>>>> That's well outside my bailiwick :)
>>
>> Please be understood that any possible optimization is out of scope of
>> this patch series, while there's certainly way around that already and
>> to be carry out in the future, as Peter alluded to in earlier
>> discussion thread:
>>
>> https://lore.kernel.org/qemu-devel/ZZT7wuq-_IhfN_wR@x1n/
>> https://lore.kernel.org/qemu-devel/ZZZUNsOVxxqr-H5S@x1n/

Got it.

>> Thanks,
>> -Siwei
>>
>>>>>
>>>>> [...]
>>>>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-06-02  8:29             ` Markus Armbruster
@ 2025-06-06 16:21               ` Jonah Palmer
  2025-06-26 12:08                 ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-06-06 16:21 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu

[-- Attachment #1: Type: text/plain, Size: 10032 bytes --]


On 6/2/25 4:29 AM, Markus Armbruster wrote:
> Butterfingers...  let's try this again.
>
> Markus Armbruster<armbru@redhat.com> writes:
>
>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>
>>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>
>>>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>>>> Jason Wang<jasowang@redhat.com> writes:
>>>>>>
>>>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer<jonah.palmer@oracle.com> wrote:
>>>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>>>> destination.  Currently they are done after the source of the migration is
>>>>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>>>> (downtime).
>>>>>>>>
>>>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>>>> destination from the moment that all devices are initializaed.  So
>>>>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>>>>> while the workload is still running in the source, so Linux can start
>>>>>>>> mapping them.
>>>>>>>>
>>>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>>>>> 0.2seconds.
>>>>>>> Adding Markus to see if this is a real problem or not.
>>>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>>>> more information.
>>>>>>
>>>>>> When all you care is time from executing qemu-system-FOO to guest
>>>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>>>> won't matter much.
>>>>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>>>>
>>>>>> When a management application runs qemu-system-FOO several times to
>>>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>>>
>>>>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>>>>
>>>>>> In what scenarios exactly is QMP delayed?
>>>>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
>>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>> The page pinning process now runs in a pretty early phase at
>>> qemu_init() e.g. machine_run_board_init(),
>> It runs within
>>
>>      qemu_init()
>>          qmp_x_exit_preconfig()
>>              qemu_init_board()
>>                  machine_run_board_init()
>>
>> Except when --preconfig is given, it instead runs within QMP command
>> x-exit-preconfig.
>>
>> Correct?
>>
>>> before any QMP command can be serviced, the latter of which typically
>>> would be able to get run from qemu_main_loop() until the AIO gets
>>> chance to be started to get polled and dispatched to bh.
>> We create the QMP monitor within qemu_create_late_backends(), which runs
>> before qmp_x_exit_preconfig(), but commands get processed only in the
>> main loop, which we enter later.
>>
>> Correct?
>>
>>> Technically it's not a real delay for specific QMP command, but rather
>>> an extended span of initialization process may take place before the
>>> very first QMP request, usually qmp_capabilities, will be
>>> serviced. It's natural for mgmt software to expect initialization
>>> delay for the first qmp_capabilities response if it has to immediately
>>> issue one after launching qemu, especially when you have a large guest
>>> with hundred GBs of memory and with passthrough device that has to pin
>>> memory for DMA e.g. VFIO, the delayed effect from the QEMU
>>> initialization process is very visible too.
> The work clearly needs to be done.  Whether it needs to be blocking
> other things is less clear.
>
> Even if it doesn't need to be blocking, we may choose not to avoid
> blocking for now.  That should be an informed decision, though.
>
> All I'm trying to do here is understand the tradeoffs, so I can give
> useful advice.
>
>>>                                              On the other hand, before
>>> the patch, if memory happens to be in the middle of being pinned, any
>>> ongoing QMP can't be serviced by the QEMU main loop, either.
> When exactly does this pinning happen before the patch?  In which
> function?

Before the patches, the memory listener was registered in
vhost_vdpa_dev_start(), well after device initialization.

And by device initialization here I mean the
qemu_create_late_backends() function.

With these patches, the memory listener is now being
registered in vhost_vdpa_set_owner(), called from
vhost_dev_init(), which is part of the device
initialization phase.

However, even though memory_listener_register() is
called during the device initialization phase, the
actual pinning happens (very shortly) after
qemu_create_late_backends() returns, because guest
RAM is only initialized later.

---

So, without these patches, and based on my measurements,
memory pinning starts ~2.9s after qemu_create_late_backends()
returns.

With these patches, memory pinning starts ~0.003s after
qemu_create_late_backends() returns.

>>> I'd also like to highlight that without this patch, the pretty high
>>> delay due to page pinning is even visible to the guest in addition to
>>> just QMP delay, which largely affected guest boot time with vDPA
>>> device already. It is long standing, and every VM user with vDPA
>>> device would like to avoid such high delay for the first boot, which
>>> is not seen with similar device e.g. VFIO passthrough.
> I understand that hiding the delay from the guest could be useful.
>
>>>>> Thanks,
>>>>> -Siwei
>>>>>
>>>>>> You told us an absolute delay you observed.  What's the relative delay,
>>>>>> i.e. what's the delay with and without these patches?
>>>> Can you answer this question?
>>> I thought I already got that answered in earlier reply. The relative
>>> delay is subject to the size of memory. Usually mgmt software won't be
>>> able to notice, unless the guest has more than 100GB of THP memory to
>>> pin, for DMA or whatever reason.
> Alright, what are the delays you observe with and without these patches
> for three test cases that pin 50 / 100 / 200 GiB of THP memory
> respectively?

So with THP memory specifically, these are my measurements below.
For these measurements, I simply started up a guest, traced the
vhost_vdpa_listener_region_add() calls, and found the difference
in time between the first and last calls. In other words, this is
roughly the time it took to pin all of guest memory. I did 5 runs
for each memory size:

Before patches:
===============
50G:   7.652s,  7.992s,  7.981s,  7.631s,  7.953s (Avg.  7.841s)
100G:  8.990s,  8.656s,  9.003s,  8.683s,  8.669s (Avg.  8.800s)
200G: 10.705s, 10.841s, 10.816s, 10.772s, 10.818s (Avg. 10.790s)

After patches:
==============
50G:  12.091s, 11.685s, 11.626s, 11.952s, 11.656s (Avg. 11.802s)
100G: 14.121s, 14.079s, 13.700s, 14.023s, 14.130s (Avg. 14.010s)
200G: 18.134s, 18.350s, 18.387s, 17.800s, 18.401s (Avg. 18.214s)

The reason we're seeing a jump here may be that, with the memory
pinning happening earlier, the pinning now runs before QEMU has
fully faulted in the guest's RAM.

As far as I understand, before these patches, by the time we
reached vhost_vdpa_dev_start(), all pages were already resident
(and THP splits already happened with the prealloc=on step), so
get_user_pages() pinned "warm" pages much faster.

With these patches, the memory listener runs on cold memory.
Every get_user_pages() call has to fault in its 4KiB subpage (and,
where the range is backed by THP, split a 2MiB hugepage) before
handing back a 'struct page'.

I believe this to be the case since in my measurements I noticed
some larger time gaps (fault + split overhead) in between some of
the vhost_vdpa_listener_region_add() calls.

However I'm still learning some of these memory pinning details,
so please let me know if I'm misunderstanding anything here.
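
As a rough way to see that warm-vs-cold effect outside of QEMU, here is
a standalone sketch (my own illustration, not part of the series; it
uses mlock() as a stand-in for the DMA pinning and an arbitrary 1 GiB
buffer) that times locking an untouched anonymous mapping against
locking the same range after pre-faulting it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double time_mlock(size_t size, int prefault)
{
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    double t0, t1;

    if (buf == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    if (prefault) {
        memset(buf, 0, size);   /* fault the pages in first ("warm") */
    }

    t0 = now_sec();
    if (mlock(buf, size)) {     /* stand-in for the pinning step */
        perror("mlock (raise RLIMIT_MEMLOCK or shrink the buffer)");
        exit(1);
    }
    t1 = now_sec();

    munlock(buf, size);
    munmap(buf, size);
    return t1 - t0;
}

int main(void)
{
    size_t size = 1UL << 30;    /* 1 GiB */

    printf("cold: %.3fs\n", time_mlock(size, 0));
    printf("warm: %.3fs\n", time_mlock(size, 1));
    return 0;
}

The cold case should come out noticeably slower because every page has
to be faulted in before it can be locked, which is the same effect the
earlier listener registration exposes; the absolute numbers will of
course differ from the vDPA case.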

Jonah

>>>>>> We need QMP to become available earlier in the startup sequence for
>>>>>> other reasons.  Could we bypass the delay that way?  Please understand
>>>>>> that this would likely be quite difficult: we know from experience that
>>>>>> messing with the startup sequence is prone to introduce subtle
>>>>>> compatility breaks and even bugs.
>>>>>>
>>>>>>> (I remember VFIO has some optimization in the speed of the pinning,
>>>>>>> could vDPA do the same?)
>>>>>> That's well outside my bailiwick :)
>>> Please be understood that any possible optimization is out of scope of
>>> this patch series, while there's certainly way around that already and
>>> to be carry out in the future, as Peter alluded to in earlier
>>> discussion thread:
>>>
>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/ZZT7wuq-_IhfN_wR@x1n/__;!!ACWV5N9M2RV99hQ!Kf05aVONsyUTIqQv0RRYw5IaU2V4k0KB6Fur5X_ocHbuT0SQV-xMr4tjggz6NJ4qgkUfONJVoswPoECB$ 
>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/ZZZUNsOVxxqr-H5S@x1n/__;!!ACWV5N9M2RV99hQ!Kf05aVONsyUTIqQv0RRYw5IaU2V4k0KB6Fur5X_ocHbuT0SQV-xMr4tjggz6NJ4qgkUfONJVovdnMan-$ 
> Got it.
>
>>> Thanks,
>>> -Siwei
>>>
>>>>>> [...]
>>>>>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-06-06 16:21               ` Jonah Palmer
@ 2025-06-26 12:08                 ` Markus Armbruster
  2025-07-02 19:31                   ` Jonah Palmer
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-06-26 12:08 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu

Jonah Palmer <jonah.palmer@oracle.com> writes:

> On 6/2/25 4:29 AM, Markus Armbruster wrote:
>> Butterfingers...  let's try this again.
>>
>> Markus Armbruster<armbru@redhat.com> writes:
>>
>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>
>>>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>>
>>>>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>>>>> Jason Wang<jasowang@redhat.com> writes:
>>>>>>>
>>>>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer<jonah.palmer@oracle.com> wrote:
>>>>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>>>>> destination.  Currently they are done after the source of the migration is
>>>>>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>>>>> (downtime).
>>>>>>>>>
>>>>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>>>>> destination from the moment that all devices are initializaed.  So
>>>>>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>>>>>> while the workload is still running in the source, so Linux can start
>>>>>>>>> mapping them.
>>>>>>>>>
>>>>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>>>>>> 0.2seconds.
>>>>>>>> Adding Markus to see if this is a real problem or not.
>>>>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>>>>> more information.
>>>>>>>
>>>>>>> When all you care is time from executing qemu-system-FOO to guest
>>>>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>>>>> won't matter much.
>>>>>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>>>>>
>>>>>>> When a management application runs qemu-system-FOO several times to
>>>>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>>>>
>>>>>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>>>>>
>>>>>>> In what scenarios exactly is QMP delayed?
>>>>>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
>>>>>
>>>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>>>
>>>> The page pinning process now runs in a pretty early phase at
>>>> qemu_init() e.g. machine_run_board_init(),
>>>
>>> It runs within
>>>
>>>      qemu_init()
>>>          qmp_x_exit_preconfig()
>>>              qemu_init_board()
>>>                  machine_run_board_init()
>>>
>>> Except when --preconfig is given, it instead runs within QMP command
>>> x-exit-preconfig.
>>>
>>> Correct?
>>>
>>>> before any QMP command can be serviced, the latter of which typically
>>>> would be able to get run from qemu_main_loop() until the AIO gets
>>>> chance to be started to get polled and dispatched to bh.
>>>
>>> We create the QMP monitor within qemu_create_late_backends(), which runs
>>> before qmp_x_exit_preconfig(), but commands get processed only in the
>>> main loop, which we enter later.
>>>
>>> Correct?
>>>
>>>> Technically it's not a real delay for specific QMP command, but rather
>>>> an extended span of initialization process may take place before the
>>>> very first QMP request, usually qmp_capabilities, will be
>>>> serviced. It's natural for mgmt software to expect initialization
>>>> delay for the first qmp_capabilities response if it has to immediately
>>>> issue one after launching qemu, especially when you have a large guest
>>>> with hundred GBs of memory and with passthrough device that has to pin
>>>> memory for DMA e.g. VFIO, the delayed effect from the QEMU
>>>> initialization process is very visible too.
>>
>> The work clearly needs to be done.  Whether it needs to be blocking
>> other things is less clear.
>>
>> Even if it doesn't need to be blocking, we may choose not to avoid
>> blocking for now.  That should be an informed decision, though.
>>
>> All I'm trying to do here is understand the tradeoffs, so I can give
>> useful advice.
>>
>>>>                                              On the other hand, before
>>>> the patch, if memory happens to be in the middle of being pinned, any
>>>> ongoing QMP can't be serviced by the QEMU main loop, either.
>>
>> When exactly does this pinning happen before the patch?  In which
>> function?
>
> Before the patches, the memory listener was registered in
> vhost_vdpa_dev_start(), well after device initialization.
>
> And by device initialization here I mean the
> qemu_create_late_backends() function.
>
> With these patches, the memory listener is now being
> registered in vhost_vdpa_set_owner(), called from
> vhost_dev_init(), which is part of the device
> initialization phase.
>
> However, even though the memory_listener_register() is
> called during the device initialization phase, the actual
> pinning happens (very shortly) after
> qemu_create_late_backends() returns (due to RAM being
> initialized later).
>
> ---
>
> So, without these patches, and based on my measurements,
> memory pinning starts ~2.9s after qemu_create_late_backends()
> returns.
>
> With these patches, memory pinning starts ~0.003s after
> qemu_create_late_backends() returns.

So, we're registering the memory listener earlier, which makes it do its
expensive work (pinning) earlier ("very shortly after
qemu_create_late_backends()").  I still don't understand where exactly
the pinning happens (where at runtime and where in the code).  Not sure
I have to.

>>>> I'd also like to highlight that without this patch, the pretty high
>>>> delay due to page pinning is even visible to the guest in addition to
>>>> just QMP delay, which largely affected guest boot time with vDPA
>>>> device already. It is long standing, and every VM user with vDPA
>>>> device would like to avoid such high delay for the first boot, which
>>>> is not seen with similar device e.g. VFIO passthrough.
>>
>> I understand that hiding the delay from the guest could be useful.
>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>> You told us an absolute delay you observed.  What's the relative delay,
>>>>>>> i.e. what's the delay with and without these patches?
>>>>>
>>>>> Can you answer this question?
>>>>
>>>> I thought I already got that answered in earlier reply. The relative
>>>> delay is subject to the size of memory. Usually mgmt software won't be
>>>> able to notice, unless the guest has more than 100GB of THP memory to
>>>> pin, for DMA or whatever reason.
>>
>> Alright, what are the delays you observe with and without these patches
>> for three test cases that pin 50 / 100 / 200 GiB of THP memory
>> respectively?
>
> So with THP memory specifically, these are my measurements below.
> For these measurements, I simply started up a guest, traced the
> vhost_vdpa_listener_region_add() calls, and found the difference
> in time between the first and last calls. In other words, this is
> roughly the time it took to pin all of guest memory. I did 5 runs
> for each memory size:
>
> Before patches:
> ===============
> 50G:   7.652s,  7.992s,  7.981s,  7.631s,  7.953s (Avg.  7.841s)
> 100G:  8.990s,  8.656s,  9.003s,  8.683s,  8.669s (Avg.  8.800s)
> 200G: 10.705s, 10.841s, 10.816s, 10.772s, 10.818s (Avg. 10.790s)
>
> After patches:
> ==============
> 50G:  12.091s, 11.685s, 11.626s, 11.952s, 11.656s (Avg. 11.802s)
> 100G: 14.121s, 14.079s, 13.700s, 14.023s, 14.130s (Avg. 14.010s)
> 200G: 18.134s, 18.350s, 18.387s, 17.800s, 18.401s (Avg. 18.214s)
>
> The reason we're seeing a jump here may be that with the memory
> pinning happening earlier, the pinning happens before Qemu has
> fully faulted in the guest's RAM.
>
> As far as I understand, before these patches, by the time we
> reached vhost_vdpa_dev_start(), all pages were already resident
> (and THP splits already happened with the prealloc=on step), so
> get_user_pages() pinned "warm" pages much faster.
>
> With these patches, the memory listener is running on cold memory.
> Every get_user_pages() call would fault in its 4KiB subpage (and
> if THP was folded, split a 2MiB hugepage) before handing in a
> 'struct page'.

Let's see whether I understand...  Please correct my mistakes.

Memory pinning takes several seconds for large guests.

Your patch makes pinning much slower.  You're theorizing this is because
pinning cold memory is slower than pinning warm memory.

I suppose the extra time is saved elsewhere, i.e. the entire startup
time remains roughly the same.  Have you verified this experimentally?

Your stated reason for moving the pinning is moving it from within
migration downtime to before migration downtime.  I understand why
that's useful.

You mentioned "a small drawback [...] a time in the initialization where
QEMU cannot respond to QMP".  Here's what I've been trying to figure out
about this drawback since the beginning:

* Under what circumstances is QMP responsiveness affected?  I *guess*
  it's only when we start a guest with plenty of memory and a certain
  vhost-vdpa configuration.  What configuration exactly?

* How is QMP responsiveness affected?  Delay until the QMP greeting is
  sent?  Delay until the first command gets processed?  Delay at some
  later time?

* What's the absolute and the relative time of QMP non-responsiveness?
  0.2s were mentioned.  I'm looking for something like "when we're not
  pinning, it takes 0.8s until the first QMP command is processed, and
  when we are, it takes 1.0s".

> I believe this to be the case since in my measurements I noticed
> some larger time gaps (fault + split overhead) in between some of
> the vhost_vdpa_listener_region_add() calls.
>
> However I'm still learning some of these memory pinning details,
> so please let me know if I'm misunderstanding anything here.

Thank you!

[...]



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-06-26 12:08                 ` Markus Armbruster
@ 2025-07-02 19:31                   ` Jonah Palmer
  2025-07-04 15:00                     ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-07-02 19:31 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu



On 6/26/25 8:08 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
> 
>> On 6/2/25 4:29 AM, Markus Armbruster wrote:
>>> Butterfingers...  let's try this again.
>>>
>>> Markus Armbruster<armbru@redhat.com> writes:
>>>
>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>
>>>>> On 5/26/2025 2:16 AM, Markus Armbruster wrote:
>>>>>> Si-Wei Liu<si-wei.liu@oracle.com> writes:
>>>>>>
>>>>>>> On 5/15/2025 11:40 PM, Markus Armbruster wrote:
>>>>>>>> Jason Wang<jasowang@redhat.com> writes:
>>>>>>>>
>>>>>>>>> On Thu, May 8, 2025 at 2:47 AM Jonah Palmer<jonah.palmer@oracle.com> wrote:
>>>>>>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>>>>>> destination.  Currently they are done after the source of the migration is
>>>>>>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>>>>>> (downtime).
>>>>>>>>>>
>>>>>>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>>>>>> destination from the moment that all devices are initializaed.  So
>>>>>>>>>> moving that operation allows QEMU to communicate the kernel the maps
>>>>>>>>>> while the workload is still running in the source, so Linux can start
>>>>>>>>>> mapping them.
>>>>>>>>>>
>>>>>>>>>> As a small drawback, there is a time in the initialization where QEMU
>>>>>>>>>> cannot respond to QMP etc.  By some testing, this time is about
>>>>>>>>>> 0.2seconds.
>>>>>>>>> Adding Markus to see if this is a real problem or not.
>>>>>>>> I guess the answer is "depends", and to get a more useful one, we need
>>>>>>>> more information.
>>>>>>>>
>>>>>>>> When all you care is time from executing qemu-system-FOO to guest
>>>>>>>> finish booting, and the guest takes 10s to boot, then an extra 0.2s
>>>>>>>> won't matter much.
>>>>>>> There's no such delay of an extra 0.2s or higher per se, it's just shifting around the page pinning hiccup, no matter it is 0.2s or something else, from the time of guest booting up to before guest is booted. This saves back guest boot time or start up delay, but in turn the same delay effectively will be charged to VM launch time. We follow the same model with VFIO, which would see the same hiccup during launch (at an early stage where no real mgmt software would care about).
>>>>>>>
>>>>>>>> When a management application runs qemu-system-FOO several times to
>>>>>>>> probe its capabilities via QMP, then even milliseconds can hurt.
>>>>>>>>
>>>>>>> Not something like that, this page pinning hiccup is one time only that occurs in the very early stage when launching QEMU, i.e. there's no consistent delay every time when QMP is called. The delay in QMP response at that very point depends on how much memory the VM has, but this is just specif to VM with VFIO or vDPA devices that have to pin memory for DMA. Having said, there's no extra delay at all if QEMU args has no vDPA device assignment, on the other hand, there's same delay or QMP hiccup when VFIO is around in QEMU args.
>>>>>>>
>>>>>>>> In what scenarios exactly is QMP delayed?
>>>>>>> Having said, this is not a new problem to QEMU in particular, this QMP delay is not peculiar, it's existent on VFIO as well.
>>>>>>
>>>>>> In what scenarios exactly is QMP delayed compared to before the patch?
>>>>>
>>>>> The page pinning process now runs in a pretty early phase at
>>>>> qemu_init() e.g. machine_run_board_init(),
>>>>
>>>> It runs within
>>>>
>>>>       qemu_init()
>>>>           qmp_x_exit_preconfig()
>>>>               qemu_init_board()
>>>>                   machine_run_board_init()
>>>>
>>>> Except when --preconfig is given, it instead runs within QMP command
>>>> x-exit-preconfig.
>>>>
>>>> Correct?
>>>>
>>>>> before any QMP command can be serviced, the latter of which typically
>>>>> would be able to get run from qemu_main_loop() until the AIO gets
>>>>> chance to be started to get polled and dispatched to bh.
>>>>
>>>> We create the QMP monitor within qemu_create_late_backends(), which runs
>>>> before qmp_x_exit_preconfig(), but commands get processed only in the
>>>> main loop, which we enter later.
>>>>
>>>> Correct?
>>>>
>>>>> Technically it's not a real delay for specific QMP command, but rather
>>>>> an extended span of initialization process may take place before the
>>>>> very first QMP request, usually qmp_capabilities, will be
>>>>> serviced. It's natural for mgmt software to expect initialization
>>>>> delay for the first qmp_capabilities response if it has to immediately
>>>>> issue one after launching qemu, especially when you have a large guest
>>>>> with hundred GBs of memory and with passthrough device that has to pin
>>>>> memory for DMA e.g. VFIO, the delayed effect from the QEMU
>>>>> initialization process is very visible too.
>>>
>>> The work clearly needs to be done.  Whether it needs to be blocking
>>> other things is less clear.
>>>
>>> Even if it doesn't need to be blocking, we may choose not to avoid
>>> blocking for now.  That should be an informed decision, though.
>>>
>>> All I'm trying to do here is understand the tradeoffs, so I can give
>>> useful advice.
>>>
>>>>>                                               On the other hand, before
>>>>> the patch, if memory happens to be in the middle of being pinned, any
>>>>> ongoing QMP can't be serviced by the QEMU main loop, either.
>>>
>>> When exactly does this pinning happen before the patch?  In which
>>> function?
>>
>> Before the patches, the memory listener was registered in
>> vhost_vdpa_dev_start(), well after device initialization.
>>
>> And by device initialization here I mean the
>> qemu_create_late_backends() function.
>>
>> With these patches, the memory listener is now being
>> registered in vhost_vdpa_set_owner(), called from
>> vhost_dev_init(), which is part of the device
>> initialization phase.
>>
>> However, even though the memory_listener_register() is
>> called during the device initialization phase, the actual
>> pinning happens (very shortly) after
>> qemu_create_late_backends() returns (due to RAM being
>> initialized later).
>>
>> ---
>>
>> So, without these patches, and based on my measurements,
>> memory pinning starts ~2.9s after qemu_create_late_backends()
>> returns.
>>
>> With these patches, memory pinning starts ~0.003s after
>> qemu_create_late_backends() returns.
> 
> So, we're registering the memory listener earlier, which makes it do its
> expensive work (pinning) earlier ("very shortly after
> qemu_create_late_backends()).  I still don't understand where exactly
> the pinning happens (where at runtime and where in the code).  Not sure
> I have to.
> 

Apologies for the delay in getting back to you. I just wanted to be 
thorough and answer everything as accurately and clearly as possible.

----

Before these patches, pinning started in vhost_vdpa_dev_start(), where
the memory listener was registered; the listener's
vhost_vdpa_listener_region_add() callbacks then did the actual memory
pinning. This happens after entering qemu_main_loop().

After these patches, pinning starts in vhost_dev_init() (specifically
in vhost_vdpa_set_owner(), where the memory listener registration was
moved to). This happens *before* entering qemu_main_loop().
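
To make the relocation concrete, here is a minimal sketch of the idea
(simplified and written from memory, not the exact diff in the series;
the shared-state field names are assumptions on my part):

/* Sketch only: where the listener registration conceptually moves to. */
static int vhost_vdpa_set_owner(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;
    int r;

    r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);
    if (r) {
        return r;
    }

    /*
     * Registering here means the listener's region_add callbacks (and
     * therefore the expensive pinning in the kernel) run during device
     * initialization, i.e. before qemu_main_loop(). Previously this
     * call lived in vhost_vdpa_dev_start().
     */
    memory_listener_register(&v->shared->listener, &address_space_memory);
    v->shared->listener_registered = true;

    return 0;
}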

However, not all of the pinning happens before qemu_main_loop(). The
pinning that happens before we enter qemu_main_loop() covers the full
guest RAM, which is the main, heavy-lifting part of the work.

The rest of the pinning work happens after entering qemu_main_loop()
(at roughly the same point as where pinning started before these
patches). But since we already did the heavy lifting pre
qemu_main_loop() (all pages were already allocated and pinned), we're
just re-pinning here (i.e. the kernel just updates its IOTLB tables
for pages that are already mapped and locked in RAM).

This makes the pinning work we do after entering qemu_main_loop() much
faster than the same pinning we had to do before these patches.

However, we have to pay a cost for this. Because we do the
heavy-lifting work earlier, pre qemu_main_loop(), we're pinning cold
memory. That is, the guest hasn't touched its memory yet: all host
pages are still anonymous and unallocated. This means that doing the
pinning earlier is more expensive time-wise, since we also need to
allocate physical pages for each chunk of memory.
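
To illustrate the cold-vs-warm difference outside of QEMU, here is a
small standalone sketch that uses mmap/mlock as a stand-in for the
kernel's get_user_pages() pinning path (illustrative only, not QEMU
code; you may need to raise RLIMIT_MEMLOCK to run it):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double pin_seconds(void *buf, size_t len)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    if (mlock(buf, len)) {              /* stand-in for page pinning */
        perror("mlock");
        exit(1);
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    size_t len = 1UL << 30;             /* 1 GiB per region */
    void *cold = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *warm = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (cold == MAP_FAILED || warm == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(warm, 0, len);               /* fault the "warm" pages in first */

    printf("cold pin: %.3fs\n", pin_seconds(cold, len));
    printf("warm pin: %.3fs\n", pin_seconds(warm, len));
    return 0;
}

On my understanding, the "cold" region here behaves like guest RAM at
vhost_vdpa_set_owner() time (pages still need to be allocated and
faulted), while the "warm" region behaves like guest RAM by the time
vhost_vdpa_dev_start() ran before these patches.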

To (hopefully) show this more clearly, I ran some tests before and after 
these patches and averaged the results. I used a 50G guest with real 
vDPA hardware (Mellanox CX-6Dx):

0.) How many vhost_vdpa_listener_region_add() (pins) calls?

               | Total | Before qemu_main_loop | After qemu_main_loop
---------------|-------|-----------------------|---------------------
Before patches |   6   |           0           |          6
---------------|-------|-----------------------|---------------------
After patches  |  11   |           5           |          6

- After the patches, this looks like we doubled the work we're doing
(given the extra 5 calls); however, the 6 calls that happen after
entering qemu_main_loop() are essentially replays of the first 5 we did.

  * In other words, after the patches, the 6 calls made after entering 
qemu_main_loop() are performed much faster than the same 6 calls before 
the patches.

  * From my measurements, these are the timings it took to perform those 
6 calls after entering qemu_main_loop():
    > Before patches: 0.0770s
    > After patches:  0.0065s

---

1.) Time from starting the guest to entering qemu_main_loop():
  * Before patches: 0.112s
  * After patches:  3.900s

- This is due to the 5 early pins we're now doing with these patches,
whereas before these patches no pinning happened at all prior to
entering qemu_main_loop().

- Measuring the time between the first and last
vhost_vdpa_listener_region_add() calls during this period gives ~3s
for the early pinning.

>>>>> I'd also like to highlight that without this patch, the pretty high
>>>>> delay due to page pinning is even visible to the guest in addition to
>>>>> just QMP delay, which largely affected guest boot time with vDPA
>>>>> device already. It is long standing, and every VM user with vDPA
>>>>> device would like to avoid such high delay for the first boot, which
>>>>> is not seen with similar device e.g. VFIO passthrough.
>>>
>>> I understand that hiding the delay from the guest could be useful.
>>>
>>>>>>> Thanks,
>>>>>>> -Siwei
>>>>>>>
>>>>>>>> You told us an absolute delay you observed.  What's the relative delay,
>>>>>>>> i.e. what's the delay with and without these patches?
>>>>>>
>>>>>> Can you answer this question?
>>>>>
>>>>> I thought I already got that answered in earlier reply. The relative
>>>>> delay is subject to the size of memory. Usually mgmt software won't be
>>>>> able to notice, unless the guest has more than 100GB of THP memory to
>>>>> pin, for DMA or whatever reason.
>>>
>>> Alright, what are the delays you observe with and without these patches
>>> for three test cases that pin 50 / 100 / 200 GiB of THP memory
>>> respectively?
>>
>> So with THP memory specifically, these are my measurements below.
>> For these measurements, I simply started up a guest, traced the
>> vhost_vdpa_listener_region_add() calls, and found the difference
>> in time between the first and last calls. In other words, this is
>> roughly the time it took to pin all of guest memory. I did 5 runs
>> for each memory size:
>>
>> Before patches:
>> ===============
>> 50G:   7.652s,  7.992s,  7.981s,  7.631s,  7.953s (Avg.  7.841s)
>> 100G:  8.990s,  8.656s,  9.003s,  8.683s,  8.669s (Avg.  8.800s)
>> 200G: 10.705s, 10.841s, 10.816s, 10.772s, 10.818s (Avg. 10.790s)
>>
>> After patches:
>> ==============
>> 50G:  12.091s, 11.685s, 11.626s, 11.952s, 11.656s (Avg. 11.802s)
>> 100G: 14.121s, 14.079s, 13.700s, 14.023s, 14.130s (Avg. 14.010s)
>> 200G: 18.134s, 18.350s, 18.387s, 17.800s, 18.401s (Avg. 18.214s)
>>
>> The reason we're seeing a jump here may be that with the memory
>> pinning happening earlier, the pinning happens before Qemu has
>> fully faulted in the guest's RAM.
>>
>> As far as I understand, before these patches, by the time we
>> reached vhost_vdpa_dev_start(), all pages were already resident
>> (and THP splits already happened with the prealloc=on step), so
>> get_user_pages() pinned "warm" pages much faster.
>>
>> With these patches, the memory listener is running on cold memory.
>> Every get_user_pages() call would fault in its 4KiB subpage (and
>> if THP was folded, split a 2MiB hugepage) before handing in a
>> 'struct page'.
> 
> Let's see whether I understand...  Please correct my mistakes.
> 
> Memory pinning takes several seconds for large guests.
> 
> Your patch makes pinning much slower.  You're theorizing this is because
> pinning cold memory is slower than pinning warm memory.
> 
> I suppose the extra time is saved elsewhere, i.e. the entire startup
> time remains roughly the same.  Have you verified this experimentally?
> 

Based on my measurements, we pay a ~3s increase in initialization time
(pre qemu_main_loop()) to handle the heavy lifting of the memory
pinning earlier for a vhost-vDPA device. This resulted in:

* Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).

* Shorter downtime phase during live migration (see below).

* Slight increase in time for the device to be operational (e.g. guest
sets DRIVER_OK).
   > This measures the time from guest start to the guest setting
DRIVER_OK for the device:

     Before patches: 22.46s
     After patches:  23.40s

The real timesaver here is the guest-visible downtime during live
migration (when using a vhost-vDPA device). Since the heavy lifting of
the memory pinning is done during the initialization phase, it's no
longer included as part of the stop-and-copy phase, which results in a
much shorter guest-visible downtime.

From v5's cover letter:

Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
the series reduces guest-visible downtime during back-to-back live
migrations by more than half:
- 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
- 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)

Essentially, we pay a slightly increased startup-time tax to buy
ourselves a much shorter downtime window when we perform a live
migration with a vhost-vDPA networking device.

> Your stated reason for moving the pinning is moving it from within
> migration downtime to before migration downtime.  I understand why
> that's useful.
> 
> You mentioned "a small drawback [...] a time in the initialization where
> QEMU cannot respond to QMP".  Here's what I've been trying to figure out
> about this drawback since the beginning:
> 
> * Under what circumstances is QMP responsiveness affected?  I *guess*
>    it's only when we start a guest with plenty of memory and a certain
>    vhost-vdpa configuration.  What configuration exactly?
> 

Regardless of these patches, as I understand it, QMP cannot run any
command that requires the BQL while we're pinning memory (memory
pinning needs to hold the lock).

However, the BQL is not held for the entirety of the pinning process;
it's periodically released throughout. But those windows are *very*
short and are only caught if you're hammering QMP with commands very
rapidly.

From a realistic point of view, it's more practical to think of QMP as
being fully ready once all pinning has finished, i.e.
time_spent_memory_pinning ≈ time_QMP_is_blocked.

---

As I understand it, QMP is not fully ready and cannot service requests 
until early on in qemu_main_loop().

Given that these patches increase the time it takes to reach 
qemu_main_loop() (due to the early pinning work), this means that QMP 
will also be delayed for this time.

I created a test that hammers QMP with commands until one is properly
serviced, and recorded how long it took from guest start to the first
fulfilled request:
  * Before patches: 0.167s
  * After patches:  4.080s

This aligns with the time measured to reach qemu_main_loop() and the
time we're spending on the early memory pinning.
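
For reference, the probe is conceptually something like the sketch
below (a rough reconstruction rather than the exact test I ran; it
assumes QEMU was launched with something like
-qmp unix:/tmp/qmp.sock,server=on,wait=off and that the probe starts
right alongside QEMU):

/* Retry the QMP handshake until a command is actually serviced and
 * print how long that took from the probe's start. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static int read_line(int fd, char *buf, size_t len)
{
    size_t n = 0;

    while (n + 1 < len) {
        if (read(fd, buf + n, 1) <= 0) {
            return -1;
        }
        if (buf[n++] == '\n') {
            break;
        }
    }
    buf[n] = '\0';
    return 0;
}

static int send_cmd(int fd, const char *json)
{
    size_t len = strlen(json);

    return write(fd, json, len) == (ssize_t)len ? 0 : -1;
}

int main(void)
{
    const char *path = "/tmp/qmp.sock";          /* assumed QMP socket */
    double start = now();
    char line[4096];

    for (;;) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };

        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        if (fd >= 0 &&
            connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
            read_line(fd, line, sizeof(line)) == 0 &&     /* greeting */
            send_cmd(fd, "{\"execute\":\"qmp_capabilities\"}\n") == 0 &&
            read_line(fd, line, sizeof(line)) == 0 &&     /* caps reply */
            send_cmd(fd, "{\"execute\":\"query-status\"}\n") == 0 &&
            read_line(fd, line, sizeof(line)) == 0) {     /* serviced */
            printf("first QMP command serviced after %.3fs\n",
                   now() - start);
            close(fd);
            return 0;
        }
        if (fd >= 0) {
            close(fd);
        }
        usleep(1000);                                     /* try again */
    }
}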

All in all, the larger the amount of memory we need to pin, the longer 
it will take for us to reach qemu_main_loop(), the larger 
time_spent_memory_pinning will be, and thus the longer it will take for 
QMP to be ready and fully functional.

----

I don't believe this is related to any specific vhost-vDPA
configuration. I think the bottom line is that if we're using a
vhost-vDPA device, we'll spend more time reaching qemu_main_loop(), so
QMP has to wait until we get there.

> * How is QMP responsiveness affected?  Delay until the QMP greeting is
>    sent?  Delay until the first command gets processed?  Delay at some
>    later time?
> 

Responsiveness: Longer initial delay due to early pinning work we need 
to do before we can bring QMP up.

Greeting delay: No greeting delay. Greeting is flushed earlier, even 
before we start the early pinning work.

* For both before and after the patches, this was ~0.052s for me.

Delay until first command processed: Longer initial delay at startup.

Delay at later time: None.

> * What's the absolute and the relative time of QMP non-responsiveness?
>    0.2s were mentioned.  I'm looking for something like "when we're not
>    pinning, it takes 0.8s until the first QMP command is processed, and
>    when we are, it takes 1.0s".
> 

The numbers below are based on my recent testing and measurements. This 
was with a 50G guest with real vDPA hardware.

Before patches:
---------------
* From the start time of the guest to the earliest time QMP is able to 
process a request (e.g. query-status): 0.167s.
   > This timing is pretty much the same regardless of whether or not 
we're pinning memory.

* Time spent pinning memory (QMP cannot handle requests during this 
window): 0.077s.

After patches:
--------------
* From the start time of the guest to the earliest time QMP is able to 
process a request (e.g. query-status): 4.08s
   > If we're not early pinning memory, it's ~0.167s.

* Time spent pinning memory *after entering qemu_main_loop()* (QMP 
cannot handle requests during this window): 0.0065s.

>> I believe this to be the case since in my measurements I noticed
>> some larger time gaps (fault + split overhead) in between some of
>> the vhost_vdpa_listener_region_add() calls.
>>
>> However I'm still learning some of these memory pinning details,
>> so please let me know if I'm misunderstanding anything here.
> 
> Thank you!
> 
> [...]
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-07-02 19:31                   ` Jonah Palmer
@ 2025-07-04 15:00                     ` Markus Armbruster
  2025-07-07 13:21                       ` Jonah Palmer
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-07-04 15:00 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu

Jonah Palmer <jonah.palmer@oracle.com> writes:

> On 6/26/25 8:08 AM, Markus Armbruster wrote:

[...]

> Apologies for the delay in getting back to you. I just wanted to be thorough and answer everything as accurately and clearly as possible.
>
> ----
>
> Before these patches, pinning started in vhost_vdpa_dev_start(), where the memory listener was registered, and began calling vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This happens after entering qemu_main_loop().
>
> After these patches, pinning started in vhost_dev_init() (specifically vhost_vdpa_set_owner()), where the memory listener registration was moved to. This happens *before* entering qemu_main_loop().
>
> However, the entirety of pinning doesn't all happen pre qemu_main_loop(). The pinning that happens before we enter qemu_main_loop() is the full guest RAM pinning, which is the main, heavy lifting work when it comes to pinning memory.
>
> The rest of the pinning work happens after entering qemu_main_loop() (approximately around the same timing as when pinning started before these patches). But, since we already did the heavy lifting of the pinning work pre qemu_main_loop() (e.g. all pages were already allocated and pinned), we're just re-pinning here (i.e. kernel just updates its IOTLB tables for pages that're already mapped and locked in RAM).
>
> This makes the pinning work we do after entering qemu_main_loop() much faster than compared to the same pinning we had to do before these patches.
>
> However, we have to pay a cost for this. Because we do the heavy lifting work earlier pre qemu_main_loop(), we're pinning with cold memory. That is, the guest hasn't yet touched its memory yet, all host pages are still anonymous and unallocated. This essentially means that doing the pinning earlier is more expensive time-wise given that we need to also allocate physical pages for each chunk of memory.
>
> To (hopefully) show this more clearly, I ran some tests before and after these patches and averaged the results. I used a 50G guest with real vDPA hardware (Mellanox CX-6Dx):
>
> 0.) How many vhost_vdpa_listener_region_add() (pins) calls?
>
>                | Total | Before qemu_main_loop | After qemu_main_loop
> ---------------|-------|-----------------------|---------------------
> Before patches |   6   |           0           |          6
> ---------------|-------|-----------------------|---------------------
> After patches  |  11   |           5           |          6
>
> - After the patches, this looks like we doubled the work we're doing (given the extra 5 calls), however, the 6 calls that happen after entering qemu_main_loop() are essentially replays of the first 5 we did.
>
>  * In other words, after the patches, the 6 calls made after entering qemu_main_loop() are performed much faster than the same 6 calls before the patches.
>
>  * From my measurements, these are the timings it took to perform those 6 calls after entering qemu_main_loop():
>    > Before patches: 0.0770s
>    > After patches:  0.0065s
>
> ---
>
> 1.) Time from starting the guest to entering qemu_main_loop():
>  * Before patches: 0.112s
>  * After patches:  3.900s
>
> - This is due to the 5 early pins we're doing now with these patches, whereas before we never did any pinning work at all.
>
> - From measuring the time between the first and last vhost_vdpa_listener_region_add() calls during this period, this comes out to ~3s for the early pinning.

So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?

We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?

[...]

>> Let's see whether I understand...  Please correct my mistakes.
>> 
>> Memory pinning takes several seconds for large guests.
>> 
>> Your patch makes pinning much slower.  You're theorizing this is because
>> pinning cold memory is slower than pinning warm memory.
>> 
>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>> time remains roughly the same.  Have you verified this experimentally?
>
> Based on my measurements that I did, we pay a ~3s increase in initialization time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for a vhost-vDPA device. This resulted in:
>
> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>
> * Shorter downtime phase during live migration (see below).
>
> * Slight increase in time for the device to be operational (e.g. guest sets DRIVER_OK).
>   > This measured the start time of the guest to guest setting DRIVER_OK for the device:
>
>     Before patches: 22.46s
>     After patches:  23.40s
>
> The real timesaver here is the guest-visible downtime during live migration (when using a vhost-vDPA device). Since the heavy lifting of the memory pinning is done during the initialization phase, it's no longer included as part of the stop-and-copy phase, which results in a much shorter guest-visible downtime.
>
> From v5's CV:
>
> Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
> the series reduces guest-visible downtime during back-to-back live
> migrations by more than half:
> - 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
> - 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)
>
> Essentially, we pay a slight increased startup time tax to buy ourselves a much shorter downtime window when we want to perform a live migration with a vhost-vDPA networking device.
>
>> Your stated reason for moving the pinning is moving it from within
>> migration downtime to before migration downtime.  I understand why
>> that's useful.
>> 
>> You mentioned "a small drawback [...] a time in the initialization where
>> QEMU cannot respond to QMP".  Here's what I've been trying to figure out
>> about this drawback since the beginning:
>> 
>> * Under what circumstances is QMP responsiveness affected?  I *guess*
>>   it's only when we start a guest with plenty of memory and a certain
>>   vhost-vdpa configuration.  What configuration exactly?
>> 
>
> Regardless of these patches, as I understand it, QMP cannot actually run any command that requires the BQL while we're pinning memory (memory pinning needs to use the lock).
>
> However, the BQL is not held during the entirety of the pinning process. That is, it's periodically released throughout the entire pinning process. But those windows are *very* short and are only caught if you're really hammering QMP with commands very rapidly.
>
> From a realistic point of view, it's more practical to think of QMP being fully ready once all pinning has finished, e.g. time_spent_memory_pinning ≈ time_QMP_is_blocked.
>
> ---
>
> As I understand it, QMP is not fully ready and cannot service requests until early on in qemu_main_loop().

It's a fair bit more complicated than that, but it'll do here.

> Given that these patches increase the time it takes to reach qemu_main_loop() (due to the early pinning work), this means that QMP will also be delayed for this time.
>
> I created a test that hammers QMP with commands until it's able to properly service the request and recorded how long it took from guest start to when it was able to fulfill the request:
>  * Before patches: 0.167s
>  * After patches:  4.080s
>
> This aligns with time measured to reach qemu_main_loop() and the time we're spending doing the early memory pinning.
>
> All in all, the larger the amount of memory we need to pin, the longer it will take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning will be, and thus the longer it will take for QMP to be ready and fully functional.
>
> ----
>
> I don't believe this related to any specific vhost-vDPA configuration. I think bottom line is that if we're using a vhost-vDPA device, we'll be spending more time to reach qemu_main_loop(), so QMP has to wait until we get there.

Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?

The answer seems to be "only when we're using a vhost-vDPA device".
Correct?

We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models.  Correct?

>> * How is QMP responsiveness affected?  Delay until the QMP greeting is
>>   sent?  Delay until the first command gets processed?  Delay at some
>>   later time?
>> 
>
> Responsiveness: Longer initial delay due to early pinning work we need to do before we can bring QMP up.
>
> Greeting delay: No greeting delay. Greeting is flushed earlier, even before we start the early pinning work.
>
> * For both before and after the patches, this was ~0.052s for me.
>
> Delay until first command processed: Longer initial delay at startup.
>
> Delay at later time: None.

Got it.

>> * What's the absolute and the relative time of QMP non-responsiveness?
>>   0.2s were mentioned.  I'm looking for something like "when we're not
>>   pinning, it takes 0.8s until the first QMP command is processed, and
>>   when we are, it takes 1.0s".
>> 
>
> The numbers below are based on my recent testing and measurements. This was with a 50G guest with real vDPA hardware.
>
> Before patches:
> ---------------
> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 0.167s.
>   > This timing is pretty much the same regardless of whether or not we're pinning memory.
>
> * Time spent pinning memory (QMP cannot handle requests during this window): 0.077s.
>
> After patches:
> --------------
> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 4.08s
>   > If we're not early pinning memory, it's ~0.167s.
>
> * Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot handle requests during this window): 0.0065s.

Let me recap:

* No change at all unless we're pinning memory early, and we're doing
  that only when we're using a vhost-vDPA device.  Correct?

* If we are using a vhost-vDPA device:

  - Total startup time (until we're done pinning) increases.

  - QMP becomes available later.

  - Main loop behavior improves: less guest-visible downtime, QMP more
    responsive (once it's available)

  This is a tradeoff we want always.  There is no need to let users pick
  "faster startup, worse main loop behavior."

Correct?

[...]



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-07-04 15:00                     ` Markus Armbruster
@ 2025-07-07 13:21                       ` Jonah Palmer
  2025-07-08  8:17                         ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-07-07 13:21 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu



On 7/4/25 11:00 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
> 
>> On 6/26/25 8:08 AM, Markus Armbruster wrote:
> 
> [...]
> 
>> Apologies for the delay in getting back to you. I just wanted to be thorough and answer everything as accurately and clearly as possible.
>>
>> ----
>>
>> Before these patches, pinning started in vhost_vdpa_dev_start(), where the memory listener was registered, and began calling vhost_vdpa_listener_region_add() to invoke the actual memory pinning. This happens after entering qemu_main_loop().
>>
>> After these patches, pinning started in vhost_dev_init() (specifically vhost_vdpa_set_owner()), where the memory listener registration was moved to. This happens *before* entering qemu_main_loop().
>>
>> However, the entirety of pinning doesn't all happen pre qemu_main_loop(). The pinning that happens before we enter qemu_main_loop() is the full guest RAM pinning, which is the main, heavy lifting work when it comes to pinning memory.
>>
>> The rest of the pinning work happens after entering qemu_main_loop() (approximately around the same timing as when pinning started before these patches). But, since we already did the heavy lifting of the pinning work pre qemu_main_loop() (e.g. all pages were already allocated and pinned), we're just re-pinning here (i.e. kernel just updates its IOTLB tables for pages that're already mapped and locked in RAM).
>>
>> This makes the pinning work we do after entering qemu_main_loop() much faster than compared to the same pinning we had to do before these patches.
>>
>> However, we have to pay a cost for this. Because we do the heavy lifting work earlier pre qemu_main_loop(), we're pinning with cold memory. That is, the guest hasn't yet touched its memory yet, all host pages are still anonymous and unallocated. This essentially means that doing the pinning earlier is more expensive time-wise given that we need to also allocate physical pages for each chunk of memory.
>>
>> To (hopefully) show this more clearly, I ran some tests before and after these patches and averaged the results. I used a 50G guest with real vDPA hardware (Mellanox CX-6Dx):
>>
>> 0.) How many vhost_vdpa_listener_region_add() (pins) calls?
>>
>>                | Total | Before qemu_main_loop | After qemu_main_loop
>> ---------------|-------|-----------------------|---------------------
>> Before patches |   6   |           0           |          6
>> ---------------|-------|-----------------------|---------------------
>> After patches  |  11   |           5           |          6
>>
>> - After the patches, this looks like we doubled the work we're doing (given the extra 5 calls), however, the 6 calls that happen after entering qemu_main_loop() are essentially replays of the first 5 we did.
>>
>>   * In other words, after the patches, the 6 calls made after entering qemu_main_loop() are performed much faster than the same 6 calls before the patches.
>>
>>   * From my measurements, these are the timings it took to perform those 6 calls after entering qemu_main_loop():
>>     > Before patches: 0.0770s
>>     > After patches:  0.0065s
>>
>> ---
>>
>> 1.) Time from starting the guest to entering qemu_main_loop():
>>   * Before patches: 0.112s
>>   * After patches:  3.900s
>>
>> - This is due to the 5 early pins we're doing now with these patches, whereas before we never did any pinning work at all.
>>
>> - From measuring the time between the first and last vhost_vdpa_listener_region_add() calls during this period, this comes out to ~3s for the early pinning.
> 
> So, total time increases: early pinning (before main loop) takes more
> time than we save pinning (in the main loop).  Correct?
> 

Correct. We only save ~0.07s from the pinning that happens in the main 
loop. But the extra 3s we now need to spend pinning before 
qemu_main_loop() overshadows it.

> We want this trade, because the time spent in the main loop is a
> problem: guest-visible downtime.  Correct?
> 
> [...]
> 

Correct. Though whether or not we want this trade I suppose is 
subjective. But the 50-60% reduction in guest-visible downtime is pretty 
nice if we can stomach the initial startup costs.

>>> Let's see whether I understand...  Please correct my mistakes.
>>>
>>> Memory pinning takes several seconds for large guests.
>>>
>>> Your patch makes pinning much slower.  You're theorizing this is because
>>> pinning cold memory is slower than pinning warm memory.
>>>
>>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>>> time remains roughly the same.  Have you verified this experimentally?
>>
>> Based on my measurements that I did, we pay a ~3s increase in initialization time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for a vhost-vDPA device. This resulted in:
>>
>> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>>
>> * Shorter downtime phase during live migration (see below).
>>
>> * Slight increase in time for the device to be operational (e.g. guest sets DRIVER_OK).
>>    > This measured the start time of the guest to guest setting DRIVER_OK for the device:
>>
>>      Before patches: 22.46s
>>      After patches:  23.40s
>>
>> The real timesaver here is the guest-visible downtime during live migration (when using a vhost-vDPA device). Since the heavy lifting of the memory pinning is done during the initialization phase, it's no longer included as part of the stop-and-copy phase, which results in a much shorter guest-visible downtime.
>>
>>  From v5's CV:
>>
>> Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
>> the series reduces guest-visible downtime during back-to-back live
>> migrations by more than half:
>> - 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
>> - 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)
>>
>> Essentially, we pay a slight increased startup time tax to buy ourselves a much shorter downtime window when we want to perform a live migration with a vhost-vDPA networking device.
>>
>>> Your stated reason for moving the pinning is moving it from within
>>> migration downtime to before migration downtime.  I understand why
>>> that's useful.
>>>
>>> You mentioned "a small drawback [...] a time in the initialization where
>>> QEMU cannot respond to QMP".  Here's what I've been trying to figure out
>>> about this drawback since the beginning:
>>>
>>> * Under what circumstances is QMP responsiveness affected?  I *guess*
>>>    it's only when we start a guest with plenty of memory and a certain
>>>    vhost-vdpa configuration.  What configuration exactly?
>>>
>>
>> Regardless of these patches, as I understand it, QMP cannot actually run any command that requires the BQL while we're pinning memory (memory pinning needs to use the lock).
>>
>> However, the BQL is not held during the entirety of the pinning process. That is, it's periodically released throughout the entire pinning process. But those windows are *very* short and are only caught if you're really hammering QMP with commands very rapidly.
>>
>>  From a realistic point of view, it's more practical to think of QMP being fully ready once all pinning has finished, e.g. time_spent_memory_pinning ≈ time_QMP_is_blocked.
>>
>> ---
>>
>> As I understand it, QMP is not fully ready and cannot service requests until early on in qemu_main_loop().
> 
> It's a fair bit more complicated than that, but it'll do here.
> 
>> Given that these patches increase the time it takes to reach qemu_main_loop() (due to the early pinning work), this means that QMP will also be delayed for this time.
>>
>> I created a test that hammers QMP with commands until it's able to properly service the request and recorded how long it took from guest start to when it was able to fulfill the request:
>>   * Before patches: 0.167s
>>   * After patches:  4.080s
>>
>> This aligns with time measured to reach qemu_main_loop() and the time we're spending doing the early memory pinning.
>>
>> All in all, the larger the amount of memory we need to pin, the longer it will take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning will be, and thus the longer it will take for QMP to be ready and fully functional.
>>
>> ----
>>
>> I don't believe this related to any specific vhost-vDPA configuration. I think bottom line is that if we're using a vhost-vDPA device, we'll be spending more time to reach qemu_main_loop(), so QMP has to wait until we get there.
> 
> Let me circle back to my question: Under what circumstances is QMP
> responsiveness affected?
> 
> The answer seems to be "only when we're using a vhost-vDPA device".
> Correct?
> 

Correct, since using one of these devices causes us to do this memory
pinning. If we're not using one, it's business as usual for QEMU.

> We're using one exactly when QEMU is running with one of its
> vhost-vdpa-device-pci* device models.  Correct?
> 

Yeah, or something like:

-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \
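
(For completeness: the generic device model Markus mentions above would
look roughly like the line below. The exact property spelling may
differ between QEMU versions, so treat it as illustrative only.)

-device vhost-vdpa-device-pci,vhostdev=/dev/vhost-vdpa-0,id=vdpa0,...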


>>> * How is QMP responsiveness affected?  Delay until the QMP greeting is
>>>    sent?  Delay until the first command gets processed?  Delay at some
>>>    later time?
>>>
>>
>> Responsiveness: Longer initial delay due to early pinning work we need to do before we can bring QMP up.
>>
>> Greeting delay: No greeting delay. Greeting is flushed earlier, even before we start the early pinning work.
>>
>> * For both before and after the patches, this was ~0.052s for me.
>>
>> Delay until first command processed: Longer initial delay at startup.
>>
>> Delay at later time: None.
> 
> Got it.
> 
>>> * What's the absolute and the relative time of QMP non-responsiveness?
>>>    0.2s were mentioned.  I'm looking for something like "when we're not
>>>    pinning, it takes 0.8s until the first QMP command is processed, and
>>>    when we are, it takes 1.0s".
>>>
>>
>> The numbers below are based on my recent testing and measurements. This was with a 50G guest with real vDPA hardware.
>>
>> Before patches:
>> ---------------
>> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 0.167s.
>>    > This timing is pretty much the same regardless of whether or not we're pinning memory.
>>
>> * Time spent pinning memory (QMP cannot handle requests during this window): 0.077s.
>>
>> After patches:
>> --------------
>> * From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 4.08s
>>    > If we're not early pinning memory, it's ~0.167s.
>>
>> * Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot handle requests during this window): 0.0065s.
> 
> Let me recap:
> 
> * No change at all unless we're pinning memory early, and we're doing
>    that only when we're using a vhost-vDPA device.  Correct?
> 
> * If we are using a vhost-vDPA device:
> 
>    - Total startup time (until we're done pinning) increases.
> 

Correct.

>    - QMP becomes available later.
> 

Correct.

>    - Main loop behavior improves: less guest-visible downtime, QMP more
>      responsive (once it's available)
> 

Correct. Though the improvement is modest at best if we put aside the 
guest-visible downtime improvement.

>    This is a tradeoff we want always.  There is no need to let users pick
>    "faster startup, worse main loop behavior."
> 

"Always" might be subjective here. For example, if there's no desire to 
perform live migration, then the user kinda just gets stuck with the cons.

Whether or not we want to make this configurable though is another 
discussion.

> Correct?
> 
> [...]
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-07-07 13:21                       ` Jonah Palmer
@ 2025-07-08  8:17                         ` Markus Armbruster
  2025-07-09 19:57                           ` Jonah Palmer
  0 siblings, 1 reply; 44+ messages in thread
From: Markus Armbruster @ 2025-07-08  8:17 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu

Jonah Palmer <jonah.palmer@oracle.com> writes:

> On 7/4/25 11:00 AM, Markus Armbruster wrote:
>> Jonah Palmer <jonah.palmer@oracle.com> writes:

[...]

>> So, total time increases: early pinning (before main loop) takes more
>> time than we save pinning (in the main loop).  Correct?
>
> Correct. We only save ~0.07s from the pinning that happens in the main loop. But the extra 3s we now need to spend pinning before qemu_main_loop() overshadows it.

Got it.

>> We want this trade, because the time spent in the main loop is a
>> problem: guest-visible downtime.  Correct?
>> [...]
>
> Correct. Though whether or not we want this trade I suppose is subjective. But the 50-60% reduction in guest-visible downtime is pretty nice if we can stomach the initial startup costs.

I'll get back to this at the end.

[...]

>> Let me circle back to my question: Under what circumstances is QMP
>> responsiveness affected?
>> 
>> The answer seems to be "only when we're using a vhost-vDPA device".
>> Correct?
>
> Correct, since using one of these guys causes us to do this memory pinning. If we're not using one, it's business as usual for Qemu.

Got it.

>> We're using one exactly when QEMU is running with one of its
>> vhost-vdpa-device-pci* device models.  Correct?
>
> Yea, or something like:
>
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \

I'll get back to this at the end.

[...]

>> Let me recap:
>> 
>> * No change at all unless we're pinning memory early, and we're doing
>>    that only when we're using a vhost-vDPA device.  Correct?
>> 
>> * If we are using a vhost-vDPA device:
>>    - Total startup time (until we're done pinning) increases.
>
> Correct.
>
>>    - QMP becomes available later.
>
> Correct.
>
>>    - Main loop behavior improves: less guest-visible downtime, QMP more
>>      responsive (once it's available)
>
> Correct. Though the improvement is modest at best if we put aside the guest-visible downtime improvement.
>
>>    This is a tradeoff we want always.  There is no need to let users pick
>>    "faster startup, worse main loop behavior."
>> 
>
> "Always" might be subjective here. For example, if there's no desire to perform live migration, then the user kinda just gets stuck with the cons.
>
> Whether or not we want to make this configurable though is another discussion.
>
>> Correct?
>> 
>> [...]

I think I finally know enough to give you constructive feedback.

Your commit messages should answer the questions I had.  Specifically:

* Why are we doing this?  To shorten guest-visible downtime.

* How are we doing this?  We additionally pin memory before entering the
  main loop.  This speeds up the pinning we still do in the main loop.

* Drawback: slower startup.  In particular, QMP becomes
  available later.

* Secondary benefit: main loop responsiveness improves, in particular
  QMP.

* What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
  the ways to get vhost-vDPA, please.

* There's a tradeoff.  Show your numbers.  Discuss whether this needs to
  be configurable.

If you can make a case for pinning memory this way always, do so.  If
you believe making it configurable would be a good idea, do so.  If
you're not sure, say so in the cover letter, and add a suitable TODO
comment.

Questions?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-07-08  8:17                         ` Markus Armbruster
@ 2025-07-09 19:57                           ` Jonah Palmer
  2025-07-10  5:31                             ` Markus Armbruster
  0 siblings, 1 reply; 44+ messages in thread
From: Jonah Palmer @ 2025-07-09 19:57 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu



On 7/8/25 4:17 AM, Markus Armbruster wrote:
> Jonah Palmer <jonah.palmer@oracle.com> writes:
> 
>> On 7/4/25 11:00 AM, Markus Armbruster wrote:
>>> Jonah Palmer <jonah.palmer@oracle.com> writes:
> 
> [...]
> 
>>> So, total time increases: early pinning (before main loop) takes more
>>> time than we save pinning (in the main loop).  Correct?
>>
>> Correct. We only save ~0.07s from the pinning that happens in the main loop. But the extra 3s we now need to spend pinning before qemu_main_loop() overshadows it.
> 
> Got it.
> 
>>> We want this trade, because the time spent in the main loop is a
>>> problem: guest-visible downtime.  Correct?
>>> [...]
>>
>> Correct. Though whether or not we want this trade I suppose is subjective. But the 50-60% reduction in guest-visible downtime is pretty nice if we can stomach the initial startup costs.
> 
> I'll get back to this at the end.
> 
> [...]
> 
>>> Let me circle back to my question: Under what circumstances is QMP
>>> responsiveness affected?
>>>
>>> The answer seems to be "only when we're using a vhost-vDPA device".
>>> Correct?
>>
>> Correct, since using one of these guys causes us to do this memory pinning. If we're not using one, it's business as usual for Qemu.
> 
> Got it.
> 
>>> We're using one exactly when QEMU is running with one of its
>>> vhost-vdpa-device-pci* device models.  Correct?
>>
>> Yea, or something like:
>>
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
>> -device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \
> 
> I'll get back to this at the end.
> 
> [...]
> 
>>> Let me recap:
>>>
>>> * No change at all unless we're pinning memory early, and we're doing
>>>     that only when we're using a vhost-vDPA device.  Correct?
>>>
>>> * If we are using a vhost-vDPA device:
>>>     - Total startup time (until we're done pinning) increases.
>>
>> Correct.
>>
>>>     - QMP becomes available later.
>>
>> Correct.
>>
>>>     - Main loop behavior improves: less guest-visible downtime, QMP more
>>>       responsive (once it's available)
>>
>> Correct. Though the improvement is modest at best if we put aside the guest-visible downtime improvement.
>>
>>>     This is a tradeoff we want always.  There is no need to let users pick
>>>     "faster startup, worse main loop behavior."
>>>
>>
>> "Always" might be subjective here. For example, if there's no desire to perform live migration, then the user kinda just gets stuck with the cons.
>>
>> Whether or not we want to make this configurable though is another discussion.
>>
>>> Correct?
>>>
>>> [...]
> 
> I think I finally know enough to give you constructive feedback.
> 
> Your commit messages should answer the questions I had.  Specifically:
> 
> * Why are we doing this?  To shorten guest-visible downtime.
> 
> * How are we doing this?  We additionally pin memory before entering the
>    main loop.  This speeds up the pinning we still do in the main loop.
> 
> * Drawback: slower startup.  In particular, QMP becomes
>    available later.
> 
> * Secondary benefit: main loop responsiveness improves, in particular
>    QMP.
> 
> * What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
>    the ways to get vhost-vDPA, please.
> 
> * There's a tradeoff.  Show your numbers.  Discuss whether this needs to
>    be configurable.
> 
> If you can make a case for pinning memory this way always, do so.  If
> you believe making it configurable would be a good idea, do so.  If
> you're not sure, say so in the cover letter, and add a suitable TODO
> comment.
> 
> Questions?
> 

No questions, understood.

As I was writing the responses to your questions I was thinking to 
myself that this stuff should've been in the cover letter / commit 
messages in the first place.

Definitely a learning moment for me. Thanks for your time on this Markus!

Jonah



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/7] Move memory listener register to vhost_vdpa_init
  2025-07-09 19:57                           ` Jonah Palmer
@ 2025-07-10  5:31                             ` Markus Armbruster
  0 siblings, 0 replies; 44+ messages in thread
From: Markus Armbruster @ 2025-07-10  5:31 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: Jason Wang, qemu-devel, eperezma, peterx, mst, lvivier, dtatulea,
	leiyang, parav, sgarzare, lingshan.zhu, boris.ostrovsky,
	Si-Wei Liu

Jonah Palmer <jonah.palmer@oracle.com> writes:

[...]

>> I think I finally know enough to give you constructive feedback.
>> 
>> Your commit messages should answer the questions I had.  Specifically:
>> 
>> * Why are we doing this?  To shorten guest-visible downtime.
>> 
>> * How are we doing this?  We additionally pin memory before entering the
>>   main loop.  This speeds up the pinning we still do in the main loop.
>> 
>> * Drawback: slower startup.  In particular, QMP becomes
>>   available later.
>> 
>> * Secondary benefit: main loop responsiveness improves, in particular
>>   QMP.
>> 
>> * What uses of QEMU are affected?  Only with vhost-vDPA.  Spell out all
>>    the ways to get vhost-vDPA, please.
>> 
>> * There's a tradeoff.  Show your numbers.  Discuss whether this needs to
>>   be configurable.
>> 
>> If you can make a case for pinning memory this way always, do so.  If
>> you believe making it configurable would be a good idea, do so.  If
>> you're not sure, say so in the cover letter, and add a suitable TODO
>> comment.
>> 
>> Questions?
>
> No questions, understood.
>
> As I was writing the responses to your questions I was thinking to 
> myself that this stuff should've been in the cover letter / commit 
> messages in the first place.
>
> Definitely a learning moment for me. Thanks for your time on this Markus!

You're welcome!



^ permalink raw reply	[flat|nested] 44+ messages in thread
