From: "Eugenio Pérez" <eperezma@redhat.com>
To: qemu-devel@nongnu.org
Cc: Gautam Dawar <gdawar@xilinx.com>,
Jason Wang <jasowang@redhat.com>,
Zhu Lingshan <lingshan.zhu@intel.com>,
yin31149@gmail.com, Shannon Nelson <shannon.nelson@amd.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
Yajun Wu <yajunw@nvidia.com>, Juan Quintela <quintela@redhat.com>,
Laurent Vivier <lvivier@redhat.com>,
Stefano Garzarella <sgarzare@redhat.com>,
Parav Pandit <parav@mellanox.com>, Lei Yang <leiyang@redhat.com>,
si-wei.liu@oracle.com
Subject: [RFC PATCH v2 08/10] vdpa: add vhost_vdpa_load_setup
Date: Tue, 28 Nov 2023 11:43:01 +0100
Message-ID: <20231128104303.3314000-9-eperezma@redhat.com>
In-Reply-To: <20231128104303.3314000-1-eperezma@redhat.com>
Callers can use this function to set up the incoming migration thread. This
thread maps the guest memory while the migration is ongoing, without
blocking QMP or other important tasks. While this keeps the destination
QEMU from blocking, it spreads the mapping work across the migration
instead of completing it before the migration starts.
This thread is joined at vdpa backend device start, so if the guest memory
is large enough there may still be guest memory left to map by that time.
This can be improved in later iterations, once the destination device can
inform QEMU that it is not ready to complete the migration.
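For reference, the chosen approach boils down to the following
queue-plus-sentinel pattern. This is a minimal, self-contained sketch
against plain GLib rather than QEMU code; the names in it (worker, job_t)
are illustrative only:

    /* Sketch of the map thread pattern: a worker pops jobs from a
     * GAsyncQueue until it sees a sentinel pointer, then exits so the
     * producer can join it.  The patch pushes the VhostVDPAShared pointer
     * itself as the sentinel. */
    #include <glib.h>
    #include <stdio.h>

    typedef struct { int iova; } job_t;  /* stand-in for struct vhost_msg_v2 */

    static gpointer worker(gpointer opaque)
    {
        GAsyncQueue *q = opaque;

        for (;;) {
            gpointer item = g_async_queue_pop(q); /* blocks until a job arrives */
            if (item == q) {
                break;                            /* queue pointer == exit sentinel */
            }
            job_t *job = item;
            printf("mapping job iova=%d\n", job->iova); /* real code would call dma_map */
            g_free(job);
        }
        return NULL;
    }

    int main(void)
    {
        GAsyncQueue *q = g_async_queue_new();
        GThread *t = g_thread_new("map-sketch", worker, q);

        for (int i = 0; i < 3; i++) {
            job_t *job = g_new(job_t, 1);
            job->iova = i;
            g_async_queue_push(q, job);           /* producer: main thread */
        }

        g_async_queue_push(q, q);                 /* signal end of offloading maps */
        g_thread_join(t);                         /* like vhost_vdpa_join_maps_thread() */
        g_async_queue_unref(q);
        return 0;
    }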
If the device is not started, the cleanup of the mapped memory is done at
.load_cleanup. This is far from ideal, as the destination machine has mapped
all the guest RAM for nothing and now needs to unmap it. However, we have no
information about the state of the device, so this is the best we can do.
Once iterative migration is supported, this will improve, as we will know
the virtio state of the device.
TODO RFC: if the VM migrates before all the maps have finished, the source
will stop but the destination is still not ready to continue, and it will
wait until all guest RAM is mapped. This is still an improvement over doing
all the maps when the migration finishes, but it should be easy to keep the
source from stopping until a condition is met.
TODO RFC: The memory unmapping when the device is not started is awkward
too, as ideally nothing would have been mapped. This can be fixed once we
migrate the device state iteratively and know for sure whether the device is
started or not. At the moment we don't have such information, so there is no
better alternative.
Other options considered:
* Coroutines: Overkill? Which thread could they be assigned to, as vdpa does
not have a dedicated iothread for the moment?
* QemuEvent or Mutex + Cond: The list access would then need to be
synchronized as well, complicating the synchronization. As map operations
are heavy enough already, the extra complexity is not worth it.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2:
* Use a dedicated thread for the maps instead of doing everything in
.load_setup, which blocked QMP, etc.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
include/hw/virtio/vhost-vdpa.h | 25 +++++
hw/virtio/vhost-vdpa.c | 167 ++++++++++++++++++++++++++++++++-
2 files changed, 191 insertions(+), 1 deletion(-)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..6533ad211c 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,6 +17,7 @@
#include "hw/virtio/vhost-iova-tree.h"
#include "hw/virtio/vhost-shadow-virtqueue.h"
#include "hw/virtio/virtio.h"
+#include "qemu/thread.h"
#include "standard-headers/linux/vhost_types.h"
/*
@@ -43,8 +44,30 @@ typedef struct vhost_vdpa_shared {
/* Copy of backend features */
uint64_t backend_cap;
+ /*
+ * Thread to map memory in QEMU incoming migration.
+ *
+ * Incoming migration calls the devices' ->load_setup in the main thread, but
+ * map operations can take a long time. This prevents the main thread from
+ * serving other requests like QMP.
+ *
+ * It works by fetching jobs from map_queue until it receives the
+ * VhostVDPAShared pointer, which signals the end of the thread's job. From
+ * that point on, the thread is joined and map requests are synchronous again.
+ * These new maps are not served from the main thread, so there is no danger.
+ */
+ QemuThread map_thread;
+ GAsyncQueue *map_queue;
+ bool map_thread_enabled;
+
bool iotlb_batch_begin_sent;
+ /*
+ * The memory listener has been registered, so DMA maps have been sent to
+ * the device.
+ */
+ bool listener_registered;
+
/* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
bool shadow_data;
} VhostVDPAShared;
@@ -73,6 +96,8 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
hwaddr size, void *vaddr, bool readonly);
int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
+int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
typedef struct vdpa_iommu {
VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 43f7c382b1..24844b5dfa 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -101,6 +101,15 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
msg.iotlb.type = VHOST_IOTLB_UPDATE;
+ if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
+ struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+ *new_msg = msg;
+ g_async_queue_push(s->map_queue, new_msg);
+
+ return 0;
+ }
+
trace_vhost_vdpa_dma_map(s, fd, msg.type, msg.asid, msg.iotlb.iova,
msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
msg.iotlb.type);
@@ -131,6 +140,15 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
msg.iotlb.size = size;
msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
+ if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
+ struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+ *new_msg = msg;
+ g_async_queue_push(s->map_queue, new_msg);
+
+ return 0;
+ }
+
trace_vhost_vdpa_dma_unmap(s, fd, msg.type, msg.asid, msg.iotlb.iova,
msg.iotlb.size, msg.iotlb.type);
@@ -156,6 +174,15 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
return;
}
+ if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
+ struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+ *new_msg = msg;
+ g_async_queue_push(s->map_queue, new_msg);
+
+ return;
+ }
+
trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -180,6 +207,15 @@ static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
msg.type = VHOST_IOTLB_MSG_V2;
msg.iotlb.type = VHOST_IOTLB_BATCH_END;
+ if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
+ struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+ *new_msg = msg;
+ g_async_queue_push(s->map_queue, new_msg);
+
+ return;
+ }
+
trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -1288,6 +1324,94 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
vhost_vdpa_reset_device(dev);
}
+static void *vhost_vdpa_load_map(void *opaque)
+{
+ VhostVDPAShared *shared = opaque;
+ GPtrArray *ret = NULL;
+
+ while (true) {
+ g_autofree struct vhost_msg_v2 *msg = NULL;
+ void *job = g_async_queue_pop(shared->map_queue);
+ int r = 0;
+
+ if (job == shared) {
+ /* exit signal */
+ break;
+ }
+
+ msg = job;
+ switch (msg->iotlb.type) {
+ case VHOST_IOTLB_UPDATE:
+ r = vhost_vdpa_dma_map(shared, msg->asid, msg->iotlb.iova,
+ msg->iotlb.size,
+ (void *)(uintptr_t)msg->iotlb.uaddr,
+ msg->iotlb.perm == VHOST_ACCESS_RO);
+ break;
+ case VHOST_IOTLB_INVALIDATE:
+ r = vhost_vdpa_dma_unmap(shared, msg->asid, msg->iotlb.iova,
+ msg->iotlb.size);
+ break;
+ case VHOST_IOTLB_BATCH_BEGIN:
+ vhost_vdpa_iotlb_batch_begin_once(shared);
+ break;
+ case VHOST_IOTLB_BATCH_END:
+ vhost_vdpa_dma_end_batch(shared);
+ break;
+ default:
+ error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
+ break;
+ }
+
+ if (unlikely(r != 0)) {
+ /* Add to return value so we can remove it from iova_tree */
+ if (ret == NULL) {
+ ret = g_ptr_array_new_full(0, g_free);
+ }
+
+ g_ptr_array_add(ret, g_steal_pointer(&msg));
+ }
+ }
+
+ return ret;
+}
+
+static void vhost_vdpa_spawn_maps_thread(VhostVDPAShared *shared)
+{
+ shared->map_queue = g_async_queue_new();
+ qemu_thread_create(&shared->map_thread, "vdpa map thread",
+ vhost_vdpa_load_map, shared, QEMU_THREAD_JOINABLE);
+ shared->map_thread_enabled = true;
+}
+
+static bool vhost_vdpa_join_maps_thread(VhostVDPAShared *shared)
+{
+ g_autoptr(GPtrArray) failed_iova = NULL;
+
+ /* Signal end of offloading maps */
+ g_async_queue_push(shared->map_queue, shared);
+ failed_iova = qemu_thread_join(&shared->map_thread);
+ g_async_queue_unref(shared->map_queue);
+
+ shared->map_thread_enabled = false;
+
+ if (likely(!failed_iova)) {
+ return true;
+ }
+
+ /* If it is a failed IOVA, abort starting */
+ for (size_t i = 0; i < failed_iova->len; ++i) {
+ struct vhost_msg_v2 *msg = g_ptr_array_index(failed_iova, i);
+ DMAMap mem_region = {
+ .iova = msg->iotlb.iova,
+ .size = msg->iotlb.size - 1, /* Inclusive */
+ };
+
+ vhost_iova_tree_remove(shared->iova_tree, mem_region);
+ }
+
+ return false;
+}
+
static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
{
struct vhost_vdpa *v = dev->opaque;
@@ -1315,7 +1439,15 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
"IOMMU and try again");
return -1;
}
- memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+ if (!v->shared->listener_registered) {
+ memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+ v->shared->listener_registered = true;
+ } else {
+ ok = vhost_vdpa_join_maps_thread(v->shared);
+ if (unlikely(!ok)) {
+ goto out_stop;
+ }
+ }
return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
}
@@ -1340,6 +1472,8 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
VIRTIO_CONFIG_S_DRIVER);
memory_listener_unregister(&v->shared->listener);
+ v->shared->listener_registered = false;
+
}
static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1522,3 +1656,34 @@ const VhostOps vdpa_ops = {
.vhost_set_config_call = vhost_vdpa_set_config_call,
.vhost_reset_status = vhost_vdpa_reset_status,
};
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+ uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+ int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+ if (unlikely(r < 0)) {
+ return r;
+ }
+
+ vhost_vdpa_spawn_maps_thread(shared);
+ memory_listener_register(&shared->listener, dma_as);
+ shared->listener_registered = true;
+ return 0;
+}
+
+int vhost_vdpa_load_cleanup(VhostVDPAShared *shared, bool vhost_will_start)
+{
+ if (!shared->map_thread_enabled) {
+ return 0;
+ }
+
+ if (vhost_will_start) {
+ /*
+ * Delegate the join of map thread to vhost_vdpa_dev_start, as it runs
+ * out of main qemu lock.
+ */
+ return 0;
+ }
+
+ return vhost_vdpa_join_maps_thread(shared) ? 0 : -1;
+}
--
2.39.3
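For context, a hedged sketch of how a device's incoming migration hooks
(added later in this series) could drive the new API. The example_* names
are made up for illustration; only vhost_vdpa_load_setup() and
vhost_vdpa_load_cleanup() come from this patch:

    #include "qemu/osdep.h"
    #include "hw/virtio/vhost-vdpa.h"

    static int example_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
    {
        /*
         * Sets ACKNOWLEDGE|DRIVER on the device, spawns the map thread and
         * registers the memory listener, so guest memory is mapped while
         * the migration stream is still being received.
         */
        return vhost_vdpa_load_setup(shared, dma_as);
    }

    static int example_load_cleanup(VhostVDPAShared *shared, bool vhost_will_start)
    {
        /*
         * If vhost is about to start, joining the map thread is delegated
         * to vhost_vdpa_dev_start(), which runs outside the main QEMU lock.
         * Otherwise the thread is joined here; a negative return means some
         * map failed and its entry was removed from the IOVA tree.
         */
        return vhost_vdpa_load_cleanup(shared, vhost_will_start);
    }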