* [PATCH RFC 0/9] migration: Threadify loadvm process
@ 2025-08-27 20:59 Peter Xu
2025-08-27 20:59 ` [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start() Peter Xu
` (11 more replies)
0 siblings, 12 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
[this is an early RFC, not for merge, but to collect initial feedback]
Background
==========
Nowadays, live migration heavily depends on threads. Most of the major
features in use today (multifd, postcopy, mapped-ram, vfio, etc.) work
with threads internally.
But still, from time to time, we'll see some coroutines floating around the
migration context. The major one is precopy's loadvm, which internally runs
in a coroutine. It is still a critical path that every live migration
depends on.
A mixture of coroutines and threads is prone to issues. For examples, see
commit e65cec5e5d ("migration/ram: Yield periodically to the main loop") or
commit 7afbdada7e ("migration/postcopy: ensure preempt channel is ready
before loading states").
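For context, this is the kind of workaround a coroutine-based loadvm
needs, quoting the periodic-yield hack of commit e65cec5e5d that patch 8
of this series reverts:
    /*
     * Yield periodically to let main loop run, but an iteration of
     * the main loop is expensive, so do it each some iterations
     */
    if ((i & 32767) == 0 && qemu_in_coroutine()) {
        aio_co_schedule(qemu_get_current_aio_context(),
                        qemu_coroutine_self());
        qemu_coroutine_yield();
    }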
Overview
========
This series tries to move migration further into the thread-based model, by
allowing the loadvm process to happen in a thread rather than in the main
thread with a coroutine.
Luckily, since the qio channel code is always ready for both cases, IO
paths should all be fine.
Note that loadvm for postcopy already happens in a separate ram load
thread. However, RAM is just the simple case here; even though it has its
own challenges (around atomically updating the pgtables), that complexity
lives in the kernel.
For precopy, loadvm has quite a few operations that need the BQL. The
catch is that we can't take the BQL for the whole loadvm process, because
that would block the main thread (e.g. QMP would hang). The finer-grained
we can make the BQL critical sections, the better. This series so far
chooses somewhere in the middle, taking the BQL mainly in these two places:
- CPU synchronizations
- Device START/FULL sections
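Concretely, patch 5 wraps exactly these two spots in a conditional-BQL
helper; a minimal sketch of how they end up looking (WITH_BQL_HELD() is a
macro introduced in patch 5, taking whether the caller already holds the
BQL):
    /* run_on_cpu() requires BQL */
    WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
    /* Device START/FULL sections: post_load() may touch address spaces */
    WITH_BQL_HELD(bql_held,
                  ret = qemu_loadvm_section_start_full(f, section_type));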
After this series is applied, most of the remaining loadvm path runs
without the BQL. There is a more detailed discussion / todo in the commit
message of patch "migration: Thread-ify precopy vmstate load process"
explaining how to further split the BQL critical sections.
I tried to split the patches into smaller ones where possible, but it's
still quite challenging, so there's one major patch that does most of the
work. After the series is applied, the only leftover pieces in migration/
that still use a coroutine are the snapshot save/load/delete jobs.
Tests
=====
Default CI passes.
RDMA unit tests pass as usual. I also tried out cancellation / failure
tests over RDMA channels, making sure nothing is stuck.
I also roughly measured how long it takes to run the whole 80+ migration
qtest suite, and saw no measurable difference before / after this series.
Risks
=====
This series has the risk of breaking things. I would be surprised if it
didn't..
I confess I didn't test anything on COLO; I only did code observation and
analysis. COLO maintainers: could you add some unit tests to QEMU's
qtests?
The current way of taking the BQL during FULL section load may cause
issues: when the IO is unstable, we could be waiting for IO (in the new
migration incoming thread) with the BQL held. This is a low possibility,
though; it only happens when the network halts while flushing the device
states. Still, it is possible. One solution is to further break the BQL
critical sections into smaller ones, as mentioned in the TODO.
Anything is more than welcome: suggestions, questions, objections, tests..
Todo
====
- Test COLO?
- Finer grained BQL breakdown
- More..
Thanks,
Peter Xu (9):
migration/vfio: Remove BQL implication in
vfio_multifd_switchover_start()
migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
migration/rdma: Change io_create_watch() to return immediately
migration: Thread-ify precopy vmstate load process
migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
migration/postcopy: Remove workaround on wait preempt channel
migration/ram: Remove workaround on ram yield during load
migration/rdma: Remove rdma_cm_poll_handler
include/migration/colo.h | 6 +-
migration/migration.h | 52 +++++++--
migration/savevm.h | 5 +-
hw/vfio/migration-multifd.c | 9 +-
migration/channel.c | 7 +-
migration/colo-stubs.c | 2 +-
migration/colo.c | 23 +---
migration/migration.c | 62 ++++++++---
migration/ram.c | 13 +--
migration/rdma.c | 206 ++++++++----------------------------
migration/savevm.c | 85 +++++++--------
migration/trace-events | 4 +-
12 files changed, 196 insertions(+), 278 deletions(-)
--
2.50.1
* [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start()
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-08-28 18:05 ` Maciej S. Szmigiero
2025-09-16 21:34 ` Fabiano Rosas
2025-08-27 20:59 ` [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown() Peter Xu
` (10 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin, Cédric Le Goater,
Maciej S. Szmigiero
We may switch to a BQL-free loadvm model. Be prepared for it.
Cc: Cédric Le Goater <clg@redhat.com>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/migration-multifd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index e4785031a7..8dc8444f0d 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -763,16 +763,21 @@ int vfio_multifd_switchover_start(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
VFIOMultifd *multifd = migration->multifd;
+ bool bql_is_locked = bql_locked();
assert(multifd);
/* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
- bql_unlock();
+ if (bql_is_locked) {
+ bql_unlock();
+ }
WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
assert(!multifd->load_bufs_thread_running);
multifd->load_bufs_thread_running = true;
}
- bql_lock();
+ if (bql_is_locked) {
+ bql_lock();
+ }
qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
--
2.50.1
* [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
2025-08-27 20:59 ` [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start() Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-16 21:41 ` Fabiano Rosas
2025-09-26 1:01 ` Zhijian Li (Fujitsu)
2025-08-27 20:59 ` [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread Peter Xu
` (9 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin, Lidong Chen
The rdmaout should be a cache of rioc->rdmaout, not rioc->rdmain.
Cc: Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>
Cc: Lidong Chen <jemmy858585@gmail.com>
Fixes: 54db882f07 ("migration: implement the shutdown for RDMA QIOChannel")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/rdma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index 2d839fce6c..e6837184c8 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2986,7 +2986,7 @@ qio_channel_rdma_shutdown(QIOChannel *ioc,
RCU_READ_LOCK_GUARD();
rdmain = qatomic_rcu_read(&rioc->rdmain);
- rdmaout = qatomic_rcu_read(&rioc->rdmain);
+ rdmaout = qatomic_rcu_read(&rioc->rdmaout);
switch (how) {
case QIO_CHANNEL_SHUTDOWN_READ:
--
2.50.1
* [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
2025-08-27 20:59 ` [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start() Peter Xu
2025-08-27 20:59 ` [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown() Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-16 21:50 ` Fabiano Rosas
2025-09-26 1:02 ` Zhijian Li (Fujitsu)
2025-08-27 20:59 ` [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately Peter Xu
` (8 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
It's almost there, except that currently it relies on a global flag showing
that we're in an incoming migration.
Change it to detect the coroutine context instead.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/rdma.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index e6837184c8..ed4e20b988 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1357,7 +1357,8 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
* so don't yield unless we know we're running inside of a coroutine.
*/
if (rdma->migration_started_on_destination &&
- migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
+ migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
+ qemu_in_coroutine()) {
yield_until_fd_readable(comp_channel->fd);
} else {
/* This is the source side, we're in a separate thread
--
2.50.1
* [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (2 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-16 22:35 ` Fabiano Rosas
2025-09-26 2:39 ` Zhijian Li (Fujitsu)
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
` (7 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
The old RDMA io_create_watch() isn't really doing much work anyway. For
G_IO_OUT, it already returns immediately. For G_IO_IN, it tries to detect
a pending RDMA control length, but normally nobody will have set it at all.
Simplify the code so that RDMA iochannels always rely on synchronous reads
and writes. This is highly likely what commit 6ddd2d76ca6f86f was talking
about: the async model isn't really working well.
This helps because this was almost the only reason the migration core
needed a coroutine for RDMA channels.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/rdma.c | 69 +++---------------------------------------------
1 file changed, 3 insertions(+), 66 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index ed4e20b988..bcd7aae2f2 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2789,56 +2789,14 @@ static gboolean
qio_channel_rdma_source_prepare(GSource *source,
gint *timeout)
{
- QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
- RDMAContext *rdma;
- GIOCondition cond = 0;
*timeout = -1;
-
- RCU_READ_LOCK_GUARD();
- if (rsource->condition == G_IO_IN) {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
- } else {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
- }
-
- if (!rdma) {
- error_report("RDMAContext is NULL when prepare Gsource");
- return FALSE;
- }
-
- if (rdma->wr_data[0].control_len) {
- cond |= G_IO_IN;
- }
- cond |= G_IO_OUT;
-
- return cond & rsource->condition;
+ return TRUE;
}
static gboolean
qio_channel_rdma_source_check(GSource *source)
{
- QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
- RDMAContext *rdma;
- GIOCondition cond = 0;
-
- RCU_READ_LOCK_GUARD();
- if (rsource->condition == G_IO_IN) {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
- } else {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
- }
-
- if (!rdma) {
- error_report("RDMAContext is NULL when check Gsource");
- return FALSE;
- }
-
- if (rdma->wr_data[0].control_len) {
- cond |= G_IO_IN;
- }
- cond |= G_IO_OUT;
-
- return cond & rsource->condition;
+ return TRUE;
}
static gboolean
@@ -2848,29 +2806,8 @@ qio_channel_rdma_source_dispatch(GSource *source,
{
QIOChannelFunc func = (QIOChannelFunc)callback;
QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
- RDMAContext *rdma;
- GIOCondition cond = 0;
-
- RCU_READ_LOCK_GUARD();
- if (rsource->condition == G_IO_IN) {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
- } else {
- rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
- }
-
- if (!rdma) {
- error_report("RDMAContext is NULL when dispatch Gsource");
- return FALSE;
- }
-
- if (rdma->wr_data[0].control_len) {
- cond |= G_IO_IN;
- }
- cond |= G_IO_OUT;
- return (*func)(QIO_CHANNEL(rsource->rioc),
- (cond & rsource->condition),
- user_data);
+ return (*func)(QIO_CHANNEL(rsource->rioc), rsource->condition, user_data);
}
static void
--
2.50.1
* [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (3 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-08-27 23:51 ` Dr. David Alan Gilbert
` (3 more replies)
2025-08-27 20:59 ` [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel Peter Xu
` (6 subsequent siblings)
11 siblings, 4 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
The migration module has been there for 10+ years. Initially, it was
mostly based on coroutines. As more features were added into the
framework, like postcopy, multifd, etc., it became a mixture of threads
and coroutines. I'm guessing coroutines just can't solve all the issues
that migration wants to resolve.
After all these years, migration is now heavily based on a threaded model.
There's still one major part of the migration framework that is not
thread-based: precopy load. We have done postcopy load in a separate
thread since the first day postcopy was introduced, however that requires
a separate state transition after precopy first loads all the devices,
which still happens in a coroutine on the main thread.
This patch moves the migration incoming side to run inside a separate
thread (mig/dst/main), just like the src (mig/src/main). The entry point
is migration_incoming_thread().
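For orientation, migration_incoming_process() now spawns that thread as in
the hunk further below:
    mis->have_recv_thread = true;
    qemu_thread_create(&mis->recv_thread, "mig/dst/main",
                       migration_incoming_thread, mis,
                       QEMU_THREAD_JOINABLE);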
Quite a few things are needed to make it fly..
BQL Analysis
============
Firstly, when moving it over to a thread, the thread cannot take the BQL
during the whole process of loading anymore, because otherwise it would
block the main thread from using the BQL for all kinds of other concurrent
tasks (for example, processing QMP / HMP commands).
Here the first question to ask is: what needs BQL during precopy load, and
what doesn't?
Most of the load process shouldn't need the BQL, especially when it's
about RAM. After all, RAM is still the major chunk of data to move for a
live migration process. VFIO started to change that, but still, VFIO is
per-device, so it shouldn't need the BQL either in most cases.
Generic device loads will need the BQL, likely not when receiving VMSDs,
but when applying them. One example: any post_load() could potentially
inject memory regions, causing memory transactions to happen. Those need
to update the global address spaces, hence require the BQL. The other one
is CPU sync operations: even if the sync alone may not need the BQL (which
is still to be further justified), run_on_cpu() will need it.
For that, the qemu_loadvm_state() and qemu_loadvm_state_main() functions
now need to take a "bql_held" parameter saying whether the BQL is held.
We could use things like BQL_LOCK_GUARD(), but this patch goes with
explicit locking rather than relying on the bql_locked TLS variable. In
the case of migration, we always know whether the BQL is held in each
context, as long as we keep passing that information downwards.
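The flag is simply set at the entry points and passed down, e.g. (both
callers appear in the diff below):
    /* migration_incoming_thread(): loads without the BQL */
    ret = qemu_loadvm_state(mis->from_src_file, false);
    /* load_snapshot() / qmp_xen_load_devices_state(): BQL already held */
    ret = qemu_loadvm_state(f, true);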
COLO
====
COLO assumed the dest VM load happens in a coroutine. After this patch,
it doesn't anymore. Change that by invoking colo_incoming_co() directly
from migration_incoming_thread().
The name (colo_incoming_co()) isn't proper anymore. Change it to
colo_incoming_wait(), removing the coroutine annotation alongside.
Remove all the bql_lock() implications in COLO, e.g., colo_incoming_co()
used to release the lock for a short period while join()ing. Now that's
not needed.
Meanwhile, there's the colo_incoming_co variable that used to store the
COLO incoming coroutine, only to be kicked when a secondary failover
happens.
To recap, what happens in such a failover is (taking the example of the
QMP command x-colo-lost-heartbeat triggering on dest QEMU):
- The QMP command will kick off both the coroutine and the COLO
thread (colo_process_incoming_thread()), with something like:
/* Notify COLO incoming thread that failover work is finished */
qemu_event_set(&mis->colo_incoming_event);
qemu_coroutine_enter(mis->colo_incoming_co);
- The coroutine, which yielded itself before, now resumes after enter(),
then it'll wait for the join():
mis->colo_incoming_co = qemu_coroutine_self();
qemu_coroutine_yield();
mis->colo_incoming_co = NULL;
/* Wait checkpoint incoming thread exit before free resource */
qemu_thread_join(&th);
When switching to the thread model, it should be fine to remove the
colo_incoming_co variable completely: the incoming thread will (instead of
yielding a coroutine) wait at qemu_thread_join() until the COLO thread
completes execution (after receiving colo_incoming_event).
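After the change, the failover path therefore reduces to roughly this (see
the colo.c hunk below):
    /* QMP x-colo-lost-heartbeat, in secondary_vm_do_failover(): */
    qemu_event_set(&mis->colo_incoming_event);
    /* Incoming thread, in colo_incoming_wait(): */
    qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
                       colo_process_incoming_thread,
                       mis, QEMU_THREAD_JOINABLE);
    /* Wait checkpoint incoming thread exit before free resource */
    qemu_thread_join(&th);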
RDMA
====
With the prior patch making sure io_watch won't block for RDMA iochannels,
RDMA threads should only block in their io_readv/io_writev functions. When
a disconnection is detected (as in rdma_cm_poll_handler()), the update to
the "errored" field will be immediately reflected in the migration incoming
thread. Hence the coroutine is not needed anymore to kick the thread out.
TODO
====
Currently the BQL is taken during loading of a START|FULL section. When
the IO hangs (e.g. on a network issue) during this process, it could
potentially block others like the monitor servers. One solution is
breaking the BQL into smaller-granularity critical sections and leaving
the IOs always BQL-free. That'll need more justification.
For example, there are at least four things that need some closer
attention:
- SaveVMHandlers's load_state(): this likely DOES NOT need the BQL, but we
  need to justify all of them (not to mention, some of them look prone to
  being rewritten as VMSDs..)
- VMSD's pre_load(): in most cases, this DOES NOT really need the BQL, but
  sometimes maybe it will! Double checking on this will be needed.
- VMSD's post_load(): in many cases, this DOES need the BQL, for example
  for address space operations. Likely we should just take it for any
  post_load(); see the hypothetical sketch after this list.
- VMSD field's get(): this is tricky! It could internally be anything even
  if it is only a field. E.g. there are users that use a SINGLE field to
  load a whole VMSD, which can introduce even more possibilities.
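As a purely hypothetical sketch (not something this series implements),
the post_load case could take the BQL only around the callback itself,
keeping the QEMUFile reads BQL-free; names like vmsd/opaque/version_id are
assumed to be in scope at that point:
    /* Hypothetical finer-grained locking around post_load() only */
    if (vmsd->post_load) {
        WITH_BQL_HELD(bql_held,
                      ret = vmsd->post_load(opaque, version_id));
    }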
In general, the QEMUFile IOs should not need the BQL, that is, when
receiving the VMSD data and waiting for e.g. the socket buffer to get
refilled. But that's the easy part.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/migration/colo.h | 6 ++--
migration/migration.h | 52 ++++++++++++++++++++++++++------
migration/savevm.h | 5 ++--
migration/channel.c | 7 ++---
migration/colo-stubs.c | 2 +-
migration/colo.c | 23 ++++-----------
migration/migration.c | 62 ++++++++++++++++++++++++++++----------
migration/rdma.c | 5 ----
migration/savevm.c | 64 ++++++++++++++++++++++++----------------
migration/trace-events | 4 +--
10 files changed, 142 insertions(+), 88 deletions(-)
diff --git a/include/migration/colo.h b/include/migration/colo.h
index 43222ef5ae..bfb30eccf0 100644
--- a/include/migration/colo.h
+++ b/include/migration/colo.h
@@ -44,12 +44,10 @@ void colo_do_failover(void);
void colo_checkpoint_delay_set(void);
/*
- * Starts COLO incoming process. Called from process_incoming_migration_co()
+ * Starts COLO incoming process. Called from migration_incoming_thread()
* after loading the state.
- *
- * Called with BQL locked, may temporary release BQL.
*/
-void coroutine_fn colo_incoming_co(void);
+void colo_incoming_wait(void);
void colo_shutdown(void);
#endif
diff --git a/migration/migration.h b/migration/migration.h
index 01329bf824..c4a626eed4 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -42,6 +42,44 @@
#define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen"
#define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt"
+/**
+ * WITH_BQL_HELD(): Run a task, making sure BQL is held
+ *
+ * @bql_held: Whether BQL is already held
+ * @task: The task to run within BQL held
+ */
+#define WITH_BQL_HELD(bql_held, task) \
+ do { \
+ if (!bql_held) { \
+ bql_lock(); \
+ } else { \
+ assert(bql_locked()); \
+ } \
+ task; \
+ if (!bql_held) { \
+ bql_unlock(); \
+ } \
+ } while (0)
+
+/**
+ * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released
+ *
+ * @bql_held: Whether BQL is already held
+ * @task: The task to run making sure BQL released
+ */
+#define WITHOUT_BQL_HELD(bql_held, task) \
+ do { \
+ if (bql_held) { \
+ bql_unlock(); \
+ } else { \
+ assert(!bql_locked()); \
+ } \
+ task; \
+ if (bql_held) { \
+ bql_lock(); \
+ } \
+ } while (0)
+
struct PostcopyBlocktimeContext;
typedef struct ThreadPool ThreadPool;
@@ -119,6 +157,10 @@ struct MigrationIncomingState {
bool have_listen_thread;
QemuThread listen_thread;
+ /* Migration main recv thread */
+ bool have_recv_thread;
+ QemuThread recv_thread;
+
/* For the kernel to send us notifications */
int userfault_fd;
/* To notify the fault_thread to wake, e.g., when need to quit */
@@ -177,15 +219,7 @@ struct MigrationIncomingState {
MigrationStatus state;
- /*
- * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
- * Used to wake the migration incoming coroutine from rdma code. How much is
- * it safe - it's a question.
- */
- Coroutine *loadvm_co;
-
- /* The coroutine we should enter (back) after failover */
- Coroutine *colo_incoming_co;
+ /* Notify secondary VM to move on */
QemuEvent colo_incoming_event;
/* Optional load threads pool and its thread exit request flag */
diff --git a/migration/savevm.h b/migration/savevm.h
index 2d5e9c7166..c07e14f61a 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
void qemu_savevm_live_state(QEMUFile *f);
int qemu_save_device_state(QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, bool bql_held);
void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
-int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
+int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
+ bool bql_held);
int qemu_load_device_state(QEMUFile *f);
int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
diff --git a/migration/channel.c b/migration/channel.c
index a547b1fbfe..621f8a4a2a 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc,
}
/* 1ms sleep. */
- if (qemu_in_coroutine()) {
- qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
- } else {
- g_usleep(1000);
- }
+ assert(!qemu_in_coroutine());
+ g_usleep(1000);
}
return 0;
diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
index e22ce65234..ef77d1ab4b 100644
--- a/migration/colo-stubs.c
+++ b/migration/colo-stubs.c
@@ -9,7 +9,7 @@ void colo_shutdown(void)
{
}
-void coroutine_fn colo_incoming_co(void)
+void colo_incoming_wait(void)
{
}
diff --git a/migration/colo.c b/migration/colo.c
index e0f713c837..f5722d9d9d 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
}
/* Notify COLO incoming thread that failover work is finished */
qemu_event_set(&mis->colo_incoming_event);
-
- /* For Secondary VM, jump to incoming co */
- if (mis->colo_incoming_co) {
- qemu_coroutine_enter(mis->colo_incoming_co);
- }
}
static void primary_vm_do_failover(void)
@@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
bql_lock();
cpu_synchronize_all_states();
- ret = qemu_loadvm_state_main(mis->from_src_file, mis);
+ ret = qemu_loadvm_state_main(mis->from_src_file, mis, true);
bql_unlock();
if (ret < 0) {
@@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque)
goto out;
}
/*
- * Note: the communication between Primary side and Secondary side
- * should be sequential, we set the fd to unblocked in migration incoming
- * coroutine, and here we are in the COLO incoming thread, so it is ok to
- * set the fd back to blocked.
+ * Here we are in the COLO incoming thread, so it is ok to set the fd
+ * to blocked.
*/
qemu_file_set_blocking(mis->from_src_file, true);
@@ -930,26 +923,20 @@ out:
return NULL;
}
-void coroutine_fn colo_incoming_co(void)
+/* Wait for failover */
+void colo_incoming_wait(void)
{
MigrationIncomingState *mis = migration_incoming_get_current();
QemuThread th;
- assert(bql_locked());
assert(migration_incoming_colo_enabled());
qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
colo_process_incoming_thread,
mis, QEMU_THREAD_JOINABLE);
- mis->colo_incoming_co = qemu_coroutine_self();
- qemu_coroutine_yield();
- mis->colo_incoming_co = NULL;
-
- bql_unlock();
/* Wait checkpoint incoming thread exit before free resource */
qemu_thread_join(&th);
- bql_lock();
/* We hold the global BQL, so it is safe here */
colo_release_ram_cache();
diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..7e4d25b15c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
mis->postcopy_qemufile_dst = NULL;
}
+ if (mis->have_recv_thread) {
+ qemu_thread_join(&mis->recv_thread);
+ mis->have_recv_thread = false;
+ }
+
cpr_set_incoming_mode(MIG_MODE_NONE);
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
@@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque)
migration_incoming_state_destroy();
}
-static void coroutine_fn
-process_incoming_migration_co(void *opaque)
+static void migration_incoming_state_destroy_bh(void *opaque)
+{
+ struct MigrationIncomingState *mis = opaque;
+
+ if (mis->exit_on_error) {
+ /*
+ * NOTE: this exit() should better happen in the main thread, as
+ * the exit notifier may require BQL which can deadlock. See
+ * commit e7bc0204e57836 for example.
+ */
+ exit(EXIT_FAILURE);
+ }
+
+ migration_incoming_state_destroy();
+}
+
+static void *migration_incoming_thread(void *opaque)
{
MigrationState *s = migrate_get_current();
- MigrationIncomingState *mis = migration_incoming_get_current();
+ MigrationIncomingState *mis = opaque;
PostcopyState ps;
int ret;
Error *local_err = NULL;
+ rcu_register_thread();
+
assert(mis->from_src_file);
+ assert(!bql_locked());
mis->largest_page_size = qemu_ram_pagesize_largest();
postcopy_state_set(POSTCOPY_INCOMING_NONE);
migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
MIGRATION_STATUS_ACTIVE);
- mis->loadvm_co = qemu_coroutine_self();
- ret = qemu_loadvm_state(mis->from_src_file);
- mis->loadvm_co = NULL;
+ ret = qemu_loadvm_state(mis->from_src_file, false);
trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
ps = postcopy_state_get();
- trace_process_incoming_migration_co_end(ret, ps);
+ trace_process_incoming_migration_end(ret, ps);
if (ps != POSTCOPY_INCOMING_NONE) {
if (ps == POSTCOPY_INCOMING_ADVISE) {
/*
@@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque)
* Postcopy was started, cleanup should happen at the end of the
* postcopy thread.
*/
- trace_process_incoming_migration_co_postcopy_end_main();
+ trace_process_incoming_migration_postcopy_end_main();
goto out;
}
/* Else if something went wrong then just fall out of the normal exit */
@@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque)
}
if (migration_incoming_colo_enabled()) {
- /* yield until COLO exit */
- colo_incoming_co();
+ /* wait until COLO exits */
+ colo_incoming_wait();
}
migration_bh_schedule(process_incoming_migration_bh, mis);
@@ -926,19 +947,24 @@ fail:
migrate_set_error(s, local_err);
error_free(local_err);
- migration_incoming_state_destroy();
-
if (mis->exit_on_error) {
WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
error_report_err(s->error);
s->error = NULL;
}
-
- exit(EXIT_FAILURE);
}
+
+ /*
+ * There's some step of the destroy process that will need to happen in
+ * the main thread (e.g. joining this thread itself). Leave to a BH.
+ */
+ migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
+
out:
/* Pairs with the refcount taken in qmp_migrate_incoming() */
migrate_incoming_unref_outgoing_state();
+ rcu_unregister_thread();
+ return NULL;
}
/**
@@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f)
void migration_incoming_process(void)
{
- Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
- qemu_coroutine_enter(co);
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ mis->have_recv_thread = true;
+ qemu_thread_create(&mis->recv_thread, "mig/dst/main",
+ migration_incoming_thread, mis,
+ QEMU_THREAD_JOINABLE);
}
/* Returns true if recovered from a paused migration, otherwise false */
diff --git a/migration/rdma.c b/migration/rdma.c
index bcd7aae2f2..2b995513aa 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
{
RDMAContext *rdma = opaque;
struct rdma_cm_event *cm_event;
- MigrationIncomingState *mis = migration_incoming_get_current();
if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
error_report("get_cm_event failed %d", errno);
@@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
}
}
rdma_ack_cm_event(cm_event);
- if (mis->loadvm_co) {
- qemu_coroutine_enter(mis->loadvm_co);
- }
- return;
}
rdma_ack_cm_event(cm_event);
}
diff --git a/migration/savevm.c b/migration/savevm.c
index fabbeb296a..ad606c5425 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
}
static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
- MigrationIncomingState *mis)
+ MigrationIncomingState *mis,
+ bool bql_held)
{
- bql_unlock(); /* Let load threads do work requiring BQL */
- thread_pool_wait(mis->load_threads);
- bql_lock();
+ WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads));
return !migrate_has_error(s);
}
@@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque)
trace_postcopy_ram_listen_thread_start();
rcu_register_thread();
- /*
- * Because we're a thread and not a coroutine we can't yield
- * in qemu_file, and thus we must be blocking now.
- */
+ /* Because we're a thread, making sure to use blocking mode */
qemu_file_set_blocking(f, true);
/* TODO: sanity check that only postcopiable data will be loaded here */
- load_res = qemu_loadvm_state_main(f, mis);
+ load_res = qemu_loadvm_state_main(f, mis, false);
/*
* This is tricky, but, mis->from_src_file can change after it
@@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
* Immediately following this command is a blob of data containing an embedded
* chunk of migration stream; read it and load it.
*
- * @mis: Incoming state
- * @length: Length of packaged data to read
+ * @mis: Incoming state
+ * @bql_held: Whether BQL is held already
*
* Returns: Negative values on error
*
*/
-static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
+static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
+ bool bql_held)
{
int ret;
size_t length;
@@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
qemu_coroutine_yield();
} while (1);
- ret = qemu_loadvm_state_main(packf, mis);
+ ret = qemu_loadvm_state_main(packf, mis, bql_held);
trace_loadvm_handle_cmd_packaged_main(ret);
qemu_fclose(packf);
object_unref(OBJECT(bioc));
@@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void)
* LOADVM_QUIT All good, but exit the loop
* <0 Error
*/
-static int loadvm_process_command(QEMUFile *f)
+static int loadvm_process_command(QEMUFile *f, bool bql_held)
{
MigrationIncomingState *mis = migration_incoming_get_current();
uint16_t cmd;
@@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f)
break;
case MIG_CMD_PACKAGED:
- return loadvm_handle_cmd_packaged(mis);
+ return loadvm_handle_cmd_packaged(mis, bql_held);
case MIG_CMD_POSTCOPY_ADVISE:
return loadvm_postcopy_handle_advise(mis, len);
@@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
return true;
}
-int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
+int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
+ bool bql_held)
{
uint8_t section_type;
int ret = 0;
@@ -3046,7 +3044,15 @@ retry:
switch (section_type) {
case QEMU_VM_SECTION_START:
case QEMU_VM_SECTION_FULL:
- ret = qemu_loadvm_section_start_full(f, section_type);
+ /*
+ * FULL should normally require BQL, e.g. during post_load()
+ * there can be memory region updates. START may or may not
+ * require it, but just to keep it simple to always hold BQL
+ * for now.
+ */
+ WITH_BQL_HELD(
+ bql_held,
+ ret = qemu_loadvm_section_start_full(f, section_type));
if (ret < 0) {
goto out;
}
@@ -3059,7 +3065,11 @@ retry:
}
break;
case QEMU_VM_COMMAND:
- ret = loadvm_process_command(f);
+ /*
+ * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it
+ * may internally need BQL.
+ */
+ ret = loadvm_process_command(f, bql_held);
trace_qemu_loadvm_state_section_command(ret);
if ((ret < 0) || (ret == LOADVM_QUIT)) {
goto out;
@@ -3103,7 +3113,7 @@ out:
return ret;
}
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, bool bql_held)
{
MigrationState *s = migrate_get_current();
MigrationIncomingState *mis = migration_incoming_get_current();
@@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f)
qemu_loadvm_state_switchover_ack_needed(mis);
}
- cpu_synchronize_all_pre_loadvm();
+ /* run_on_cpu() requires BQL */
+ WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
- ret = qemu_loadvm_state_main(f, mis);
+ ret = qemu_loadvm_state_main(f, mis, bql_held);
qemu_event_set(&mis->main_thread_load_event);
trace_qemu_loadvm_state_post_main(ret);
@@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f)
/* When reaching here, it must be precopy */
if (ret == 0) {
if (migrate_has_error(migrate_get_current()) ||
- !qemu_loadvm_thread_pool_wait(s, mis)) {
+ !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) {
ret = -EINVAL;
} else {
ret = qemu_file_get_error(f);
@@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f)
}
}
- cpu_synchronize_all_post_init();
+ /* run_on_cpu() requires BQL */
+ WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init());
return ret;
}
@@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f)
int ret;
/* Load QEMU_VM_SECTION_FULL section */
- ret = qemu_loadvm_state_main(f, mis);
+ ret = qemu_loadvm_state_main(f, mis, true);
if (ret < 0) {
error_report("Failed to load device state: %d", ret);
return ret;
@@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
f = qemu_file_new_input(QIO_CHANNEL(ioc));
object_unref(OBJECT(ioc));
- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, true);
qemu_fclose(f);
if (ret < 0) {
error_setg(errp, "loading Xen device state failed");
@@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate,
ret = -EINVAL;
goto err_drain;
}
- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, true);
migration_incoming_state_destroy();
bdrv_drain_all_end();
diff --git a/migration/trace-events b/migration/trace-events
index 706db97def..eeb41e03f1 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
source_return_path_thread_switchover_acked(void) ""
migration_thread_low_pending(uint64_t pending) "%" PRIu64
migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
-process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
-process_incoming_migration_co_postcopy_end_main(void) ""
+process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
+process_incoming_migration_postcopy_end_main(void) ""
postcopy_preempt_enabled(bool value) "%d"
migration_precopy_complete(void) ""
--
2.50.1
* [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (4 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-16 22:39 ` Fabiano Rosas
2025-09-26 2:44 ` Zhijian Li (Fujitsu)
2025-08-27 20:59 ` [PATCH RFC 7/9] migration/postcopy: Remove workaround on wait preempt channel Peter Xu
` (5 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
Now, after threadifying the dest VM load during precopy, we will always be
in a thread context rather than within a coroutine. We can remove this
path now.
With that, migration_started_on_destination can go away too.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/rdma.c | 102 +++++++++++++++++++----------------------------
1 file changed, 41 insertions(+), 61 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index 2b995513aa..7751262460 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -29,7 +29,6 @@
#include "qemu/rcu.h"
#include "qemu/sockets.h"
#include "qemu/bitmap.h"
-#include "qemu/coroutine.h"
#include "system/memory.h"
#include <sys/socket.h>
#include <netdb.h>
@@ -357,13 +356,6 @@ typedef struct RDMAContext {
/* Index of the next RAMBlock received during block registration */
unsigned int next_src_index;
- /*
- * Migration on *destination* started.
- * Then use coroutine yield function.
- * Source runs in a thread, so we don't care.
- */
- int migration_started_on_destination;
-
int total_registrations;
int total_writes;
@@ -1353,66 +1345,55 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
struct rdma_cm_event *cm_event;
/*
- * Coroutine doesn't start until migration_fd_process_incoming()
- * so don't yield unless we know we're running inside of a coroutine.
+ * This is the source or dest side, either during precopy or
+ * postcopy. We're always in a separate thread when reaching here.
+ * Poll the fd. We need to be able to handle 'cancel' or an error
+ * without hanging forever.
*/
- if (rdma->migration_started_on_destination &&
- migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
- qemu_in_coroutine()) {
- yield_until_fd_readable(comp_channel->fd);
- } else {
- /* This is the source side, we're in a separate thread
- * or destination prior to migration_fd_process_incoming()
- * after postcopy, the destination also in a separate thread.
- * we can't yield; so we have to poll the fd.
- * But we need to be able to handle 'cancel' or an error
- * without hanging forever.
- */
- while (!rdma->errored && !rdma->received_error) {
- GPollFD pfds[2];
- pfds[0].fd = comp_channel->fd;
- pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
- pfds[0].revents = 0;
-
- pfds[1].fd = rdma->channel->fd;
- pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
- pfds[1].revents = 0;
-
- /* 0.1s timeout, should be fine for a 'cancel' */
- switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
- case 2:
- case 1: /* fd active */
- if (pfds[0].revents) {
- return 0;
- }
+ while (!rdma->errored && !rdma->received_error) {
+ GPollFD pfds[2];
+ pfds[0].fd = comp_channel->fd;
+ pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+ pfds[0].revents = 0;
+
+ pfds[1].fd = rdma->channel->fd;
+ pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+ pfds[1].revents = 0;
+
+ /* 0.1s timeout, should be fine for a 'cancel' */
+ switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
+ case 2:
+ case 1: /* fd active */
+ if (pfds[0].revents) {
+ return 0;
+ }
- if (pfds[1].revents) {
- if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
- return -1;
- }
+ if (pfds[1].revents) {
+ if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
+ return -1;
+ }
- if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
- cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
- rdma_ack_cm_event(cm_event);
- return -1;
- }
+ if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
+ cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
rdma_ack_cm_event(cm_event);
+ return -1;
}
- break;
+ rdma_ack_cm_event(cm_event);
+ }
+ break;
- case 0: /* Timeout, go around again */
- break;
+ case 0: /* Timeout, go around again */
+ break;
- default: /* Error of some type -
- * I don't trust errno from qemu_poll_ns
- */
- return -1;
- }
+ default: /* Error of some type -
+ * I don't trust errno from qemu_poll_ns
+ */
+ return -1;
+ }
- if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
- /* Bail out and let the cancellation happen */
- return -1;
- }
+ if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
+ /* Bail out and let the cancellation happen */
+ return -1;
}
}
@@ -3817,7 +3798,6 @@ static void rdma_accept_incoming_migration(void *opaque)
return;
}
- rdma->migration_started_on_destination = 1;
migration_fd_process_incoming(f);
}
--
2.50.1
* [PATCH RFC 7/9] migration/postcopy: Remove workaround on wait preempt channel
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (5 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-17 18:30 ` Fabiano Rosas
2025-08-27 20:59 ` [PATCH RFC 8/9] migration/ram: Remove workaround on ram yield during load Peter Xu
` (4 subsequent siblings)
11 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
This reverts commit 7afbdada7effbc2b97281bfbce0c6df351a3cf88.
Now, after switching to a thread for the loadvm process, the main thread
should be able to accept() even if loading the package causes a page fault
in the userfaultfd path.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.c | 21 ---------------------
1 file changed, 21 deletions(-)
diff --git a/migration/savevm.c b/migration/savevm.c
index ad606c5425..8018f7ad31 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2425,27 +2425,6 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
- /*
- * Before loading the guest states, ensure that the preempt channel has
- * been ready to use, as some of the states (e.g. via virtio_load) might
- * trigger page faults that will be handled through the preempt channel.
- * So yield to the main thread in the case that the channel create event
- * hasn't been dispatched.
- *
- * TODO: if we can move migration loadvm out of main thread, then we
- * won't block main thread from polling the accept() fds. We can drop
- * this as a whole when that is done.
- */
- do {
- if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
- mis->postcopy_qemufile_dst) {
- break;
- }
-
- aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
- qemu_coroutine_yield();
- } while (1);
-
ret = qemu_loadvm_state_main(packf, mis, bql_held);
trace_loadvm_handle_cmd_packaged_main(ret);
qemu_fclose(packf);
--
2.50.1
* [PATCH RFC 8/9] migration/ram: Remove workaround on ram yield during load
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (6 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 7/9] migration/postcopy: Remove workaround on wait preempt channel Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-17 18:31 ` Fabiano Rosas
2025-08-27 20:59 ` [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler Peter Xu
` (3 subsequent siblings)
11 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
This reverts e65cec5e5d97927d22b39167d3e8edeffc771788.
The RAM load path had a hack in the past to explicitly yield back to the
main loop when RAM load was spinning in a tight loop. It's not needed
anymore because precopy RAM load now happens outside the main thread.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.c | 13 +------------
1 file changed, 1 insertion(+), 12 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 7208bc114f..2d9a6d1095 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4168,7 +4168,7 @@ static int parse_ramblocks(QEMUFile *f, ram_addr_t total_ram_bytes)
static int ram_load_precopy(QEMUFile *f)
{
MigrationIncomingState *mis = migration_incoming_get_current();
- int flags = 0, ret = 0, invalid_flags = 0, i = 0;
+ int flags = 0, ret = 0, invalid_flags = 0;
if (migrate_mapped_ram()) {
invalid_flags |= (RAM_SAVE_FLAG_HOOK | RAM_SAVE_FLAG_MULTIFD_FLUSH |
@@ -4181,17 +4181,6 @@ static int ram_load_precopy(QEMUFile *f)
void *host = NULL, *host_bak = NULL;
uint8_t ch;
- /*
- * Yield periodically to let main loop run, but an iteration of
- * the main loop is expensive, so do it each some iterations
- */
- if ((i & 32767) == 0 && qemu_in_coroutine()) {
- aio_co_schedule(qemu_get_current_aio_context(),
- qemu_coroutine_self());
- qemu_coroutine_yield();
- }
- i++;
-
addr = qemu_get_be64(f);
ret = qemu_file_get_error(f);
if (ret) {
--
2.50.1
* [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (7 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 8/9] migration/ram: Remove workaround on ram yield during load Peter Xu
@ 2025-08-27 20:59 ` Peter Xu
2025-09-17 18:38 ` Fabiano Rosas
2025-09-26 3:38 ` Zhijian Li (Fujitsu)
2025-08-29 8:29 ` [PATCH RFC 0/9] migration: Threadify loadvm process Vladimir Sementsov-Ogievskiy
` (2 subsequent siblings)
11 siblings, 2 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-27 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin
This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
It was needed because the RDMA iochannel on dest QEMU used to only yield
without monitoring the fd. Now the fd is monitored by the same poll() as
on the src QEMU, in qemu_rdma_wait_comp_channel(). So even without the fd
handler, dest QEMU should be able to receive the events.
I tested this by initiating an RDMA migration, then doing one of two
things:
- Either migrate_cancel on src, or,
- Directly kill the destination QEMU
In both cases, the other side of QEMU will be able to receive the
disconnect event in qemu_rdma_wait_comp_channel() and properly cancel or
fail the migration.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/rdma.c | 29 +----------------------------
1 file changed, 1 insertion(+), 28 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index 7751262460..da7fd48bf3 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3045,32 +3045,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
static void rdma_accept_incoming_migration(void *opaque);
-static void rdma_cm_poll_handler(void *opaque)
-{
- RDMAContext *rdma = opaque;
- struct rdma_cm_event *cm_event;
-
- if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
- error_report("get_cm_event failed %d", errno);
- return;
- }
-
- if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
- cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
- if (!rdma->errored &&
- migration_incoming_get_current()->state !=
- MIGRATION_STATUS_COMPLETED) {
- error_report("receive cm event, cm event is %d", cm_event->event);
- rdma->errored = true;
- if (rdma->return_path) {
- rdma->return_path->errored = true;
- }
- }
- rdma_ack_cm_event(cm_event);
- }
- rdma_ack_cm_event(cm_event);
-}
-
static int qemu_rdma_accept(RDMAContext *rdma)
{
Error *err = NULL;
@@ -3188,8 +3162,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
NULL,
(void *)(intptr_t)rdma->return_path);
} else {
- qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
- NULL, rdma);
+ qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
}
ret = rdma_accept(rdma->cm_id, &conn_param);
--
2.50.1
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
@ 2025-08-27 23:51 ` Dr. David Alan Gilbert
2025-08-29 16:37 ` Peter Xu
2025-08-29 8:29 ` Vladimir Sementsov-Ogievskiy
` (2 subsequent siblings)
3 siblings, 1 reply; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2025-08-27 23:51 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kevin Wolf, Paolo Bonzini, Daniel P . Berrangé,
Fabiano Rosas, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
* Peter Xu (peterx@redhat.com) wrote:
> Migration module was there for 10+ years. Initially, it was in most cases
> based on coroutines. As more features were added into the framework, like
> postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>
> I'm guessing coroutines just can't fix all issues that migration want to
> resolve.
Yeh migration can happily eat a whole core.
> After all these years, migration is now heavily based on a threaded model.
>
> Now there's still a major part of migration framework that is still not
> thread-based, which is precopy load. We do load in a separate thread in
> postcopy since the 1st day postcopy was introduced, however that requires a
> separate state transition from precopy loading all devices first, which
> still happens in the main thread of a coroutine.
...
> COLO
> ====
If you can, I suggest splitting the COLO stuff out as a separate thread;
not many people understand it.
> TODO
> ====
>
> Currently the BQL is taken during loading of a START|FULL section. When
> the IO hangs (e.g. network issue) during this process, it could potentially
> block others like the monitor servers. One solution is breaking BQL to
> smaller granule and leave IOs to be always BQL-free. That'll need more
> justifications.
>
> For example, there are at least four things that need some closer
> attention:
>
> - SaveVMHandlers's load_state(): this likely DO NOT need BQL, but we need
> to justify all of them (not to mention, some of them look like prone to
> be rewritten as VMSDs..)
>
> - VMSD's pre_load(): in most cases, this DO NOT really need BQL, but
> sometimes maybe it will! Double checking on this will be needed.
>
> - VMSD's post_load(): in many cases, this DO need BQL, for example on
> address space operations. Likely we should just take it for any
> post_load().
>
> - VMSD field's get(): this is tricky! It could internally be anything
> even if it was only a field. E.g. there can be users to use a SINGLE
> field to load a whole VMSD, which can further introduce more
> possibilities.
Long long ago, I did convert some get()s to structure; I got stuck on
some though - some have pretty crazy hand-built lists and things.
> In general, QEMUFile IOs should not need BQL, that is when receiving the
> VMSD data and waiting for e.g. the socket buffer to get refilled. But
> that's the easy part.
It's probably generally a good thing to get rid of the BQL there, but I bet
it's going to throw some surprises; maybe something like devices doing
stuff before the migration has fully arrived or incoming socket connections
to non-migration stuff perhaps.
Dave
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/colo.h | 6 ++--
> migration/migration.h | 52 ++++++++++++++++++++++++++------
> migration/savevm.h | 5 ++--
> migration/channel.c | 7 ++---
> migration/colo-stubs.c | 2 +-
> migration/colo.c | 23 ++++-----------
> migration/migration.c | 62 ++++++++++++++++++++++++++++----------
> migration/rdma.c | 5 ----
> migration/savevm.c | 64 ++++++++++++++++++++++++----------------
> migration/trace-events | 4 +--
> 10 files changed, 142 insertions(+), 88 deletions(-)
>
> diff --git a/include/migration/colo.h b/include/migration/colo.h
> index 43222ef5ae..bfb30eccf0 100644
> --- a/include/migration/colo.h
> +++ b/include/migration/colo.h
> @@ -44,12 +44,10 @@ void colo_do_failover(void);
> void colo_checkpoint_delay_set(void);
>
> /*
> - * Starts COLO incoming process. Called from process_incoming_migration_co()
> + * Starts COLO incoming process. Called from migration_incoming_thread()
> * after loading the state.
> - *
> - * Called with BQL locked, may temporary release BQL.
> */
> -void coroutine_fn colo_incoming_co(void);
> +void colo_incoming_wait(void);
>
> void colo_shutdown(void);
> #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..c4a626eed4 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -42,6 +42,44 @@
> #define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen"
> #define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt"
>
> +/**
> + * WITH_BQL_HELD(): Run a task, making sure BQL is held
> + *
> + * @bql_held: Whether BQL is already held
> + * @task: The task to run within BQL held
> + */
> +#define WITH_BQL_HELD(bql_held, task) \
> + do { \
> + if (!bql_held) { \
> + bql_lock(); \
> + } else { \
> + assert(bql_locked()); \
> + } \
> + task; \
> + if (!bql_held) { \
> + bql_unlock(); \
> + } \
> + } while (0)
> +
> +/**
> + * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released
> + *
> + * @bql_held: Whether BQL is already held
> + * @task: The task to run making sure BQL released
> + */
> +#define WITHOUT_BQL_HELD(bql_held, task) \
> + do { \
> + if (bql_held) { \
> + bql_unlock(); \
> + } else { \
> + assert(!bql_locked()); \
> + } \
> + task; \
> + if (bql_held) { \
> + bql_lock(); \
> + } \
> + } while (0)
> +
> struct PostcopyBlocktimeContext;
> typedef struct ThreadPool ThreadPool;
>
> @@ -119,6 +157,10 @@ struct MigrationIncomingState {
> bool have_listen_thread;
> QemuThread listen_thread;
>
> + /* Migration main recv thread */
> + bool have_recv_thread;
> + QemuThread recv_thread;
> +
> /* For the kernel to send us notifications */
> int userfault_fd;
> /* To notify the fault_thread to wake, e.g., when need to quit */
> @@ -177,15 +219,7 @@ struct MigrationIncomingState {
>
> MigrationStatus state;
>
> - /*
> - * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> - * Used to wake the migration incoming coroutine from rdma code. How much is
> - * it safe - it's a question.
> - */
> - Coroutine *loadvm_co;
> -
> - /* The coroutine we should enter (back) after failover */
> - Coroutine *colo_incoming_co;
> + /* Notify secondary VM to move on */
> QemuEvent colo_incoming_event;
>
> /* Optional load threads pool and its thread exit request flag */
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 2d5e9c7166..c07e14f61a 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
> void qemu_savevm_live_state(QEMUFile *f);
> int qemu_save_device_state(QEMUFile *f);
>
> -int qemu_loadvm_state(QEMUFile *f);
> +int qemu_loadvm_state(QEMUFile *f, bool bql_held);
> void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
> -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> + bool bql_held);
> int qemu_load_device_state(QEMUFile *f);
> int qemu_loadvm_approve_switchover(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> diff --git a/migration/channel.c b/migration/channel.c
> index a547b1fbfe..621f8a4a2a 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc,
> }
>
> /* 1ms sleep. */
> - if (qemu_in_coroutine()) {
> - qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> - } else {
> - g_usleep(1000);
> - }
> + assert(!qemu_in_coroutine());
> + g_usleep(1000);
> }
>
> return 0;
> diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> index e22ce65234..ef77d1ab4b 100644
> --- a/migration/colo-stubs.c
> +++ b/migration/colo-stubs.c
> @@ -9,7 +9,7 @@ void colo_shutdown(void)
> {
> }
>
> -void coroutine_fn colo_incoming_co(void)
> +void colo_incoming_wait(void)
> {
> }
>
> diff --git a/migration/colo.c b/migration/colo.c
> index e0f713c837..f5722d9d9d 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
> }
> /* Notify COLO incoming thread that failover work is finished */
> qemu_event_set(&mis->colo_incoming_event);
> -
> - /* For Secondary VM, jump to incoming co */
> - if (mis->colo_incoming_co) {
> - qemu_coroutine_enter(mis->colo_incoming_co);
> - }
> }
>
> static void primary_vm_do_failover(void)
> @@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>
> bql_lock();
> cpu_synchronize_all_states();
> - ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> + ret = qemu_loadvm_state_main(mis->from_src_file, mis, true);
> bql_unlock();
>
> if (ret < 0) {
> @@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque)
> goto out;
> }
> /*
> - * Note: the communication between Primary side and Secondary side
> - * should be sequential, we set the fd to unblocked in migration incoming
> - * coroutine, and here we are in the COLO incoming thread, so it is ok to
> - * set the fd back to blocked.
> + * Here we are in the COLO incoming thread, so it is ok to set the fd
> + * to blocked.
> */
> qemu_file_set_blocking(mis->from_src_file, true);
>
> @@ -930,26 +923,20 @@ out:
> return NULL;
> }
>
> -void coroutine_fn colo_incoming_co(void)
> +/* Wait for failover */
> +void colo_incoming_wait(void)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> QemuThread th;
>
> - assert(bql_locked());
> assert(migration_incoming_colo_enabled());
>
> qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
> colo_process_incoming_thread,
> mis, QEMU_THREAD_JOINABLE);
>
> - mis->colo_incoming_co = qemu_coroutine_self();
> - qemu_coroutine_yield();
> - mis->colo_incoming_co = NULL;
> -
> - bql_unlock();
> /* Wait checkpoint incoming thread exit before free resource */
> qemu_thread_join(&th);
> - bql_lock();
>
> /* We hold the global BQL, so it is safe here */
> colo_release_ram_cache();
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..7e4d25b15c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
> mis->postcopy_qemufile_dst = NULL;
> }
>
> + if (mis->have_recv_thread) {
> + qemu_thread_join(&mis->recv_thread);
> + mis->have_recv_thread = false;
> + }
> +
> cpr_set_incoming_mode(MIG_MODE_NONE);
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
> @@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque)
> migration_incoming_state_destroy();
> }
>
> -static void coroutine_fn
> -process_incoming_migration_co(void *opaque)
> +static void migration_incoming_state_destroy_bh(void *opaque)
> +{
> + struct MigrationIncomingState *mis = opaque;
> +
> + if (mis->exit_on_error) {
> + /*
> + * NOTE: this exit() should better happen in the main thread, as
> + * the exit notifier may require BQL which can deadlock. See
> + * commit e7bc0204e57836 for example.
> + */
> + exit(EXIT_FAILURE);
> + }
> +
> + migration_incoming_state_destroy();
> +}
> +
> +static void *migration_incoming_thread(void *opaque)
> {
> MigrationState *s = migrate_get_current();
> - MigrationIncomingState *mis = migration_incoming_get_current();
> + MigrationIncomingState *mis = opaque;
> PostcopyState ps;
> int ret;
> Error *local_err = NULL;
>
> + rcu_register_thread();
> +
> assert(mis->from_src_file);
> + assert(!bql_locked());
>
> mis->largest_page_size = qemu_ram_pagesize_largest();
> postcopy_state_set(POSTCOPY_INCOMING_NONE);
> migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
> MIGRATION_STATUS_ACTIVE);
>
> - mis->loadvm_co = qemu_coroutine_self();
> - ret = qemu_loadvm_state(mis->from_src_file);
> - mis->loadvm_co = NULL;
> + ret = qemu_loadvm_state(mis->from_src_file, false);
>
> trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>
> ps = postcopy_state_get();
> - trace_process_incoming_migration_co_end(ret, ps);
> + trace_process_incoming_migration_end(ret, ps);
> if (ps != POSTCOPY_INCOMING_NONE) {
> if (ps == POSTCOPY_INCOMING_ADVISE) {
> /*
> @@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque)
> * Postcopy was started, cleanup should happen at the end of the
> * postcopy thread.
> */
> - trace_process_incoming_migration_co_postcopy_end_main();
> + trace_process_incoming_migration_postcopy_end_main();
> goto out;
> }
> /* Else if something went wrong then just fall out of the normal exit */
> @@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque)
> }
>
> if (migration_incoming_colo_enabled()) {
> - /* yield until COLO exit */
> - colo_incoming_co();
> + /* wait until COLO exits */
> + colo_incoming_wait();
> }
>
> migration_bh_schedule(process_incoming_migration_bh, mis);
> @@ -926,19 +947,24 @@ fail:
> migrate_set_error(s, local_err);
> error_free(local_err);
>
> - migration_incoming_state_destroy();
> -
> if (mis->exit_on_error) {
> WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> error_report_err(s->error);
> s->error = NULL;
> }
> -
> - exit(EXIT_FAILURE);
> }
> +
> + /*
> + * There's some step of the destroy process that will need to happen in
> + * the main thread (e.g. joining this thread itself). Leave to a BH.
> + */
> + migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
> +
> out:
> /* Pairs with the refcount taken in qmp_migrate_incoming() */
> migrate_incoming_unref_outgoing_state();
> + rcu_unregister_thread();
> + return NULL;
> }
>
> /**
> @@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f)
>
> void migration_incoming_process(void)
> {
> - Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> - qemu_coroutine_enter(co);
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + mis->have_recv_thread = true;
> + qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> + migration_incoming_thread, mis,
> + QEMU_THREAD_JOINABLE);
> }
>
> /* Returns true if recovered from a paused migration, otherwise false */
> diff --git a/migration/rdma.c b/migration/rdma.c
> index bcd7aae2f2..2b995513aa 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> {
> RDMAContext *rdma = opaque;
> struct rdma_cm_event *cm_event;
> - MigrationIncomingState *mis = migration_incoming_get_current();
>
> if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> error_report("get_cm_event failed %d", errno);
> @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> }
> }
> rdma_ack_cm_event(cm_event);
> - if (mis->loadvm_co) {
> - qemu_coroutine_enter(mis->loadvm_co);
> - }
> - return;
> }
> rdma_ack_cm_event(cm_event);
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fabbeb296a..ad606c5425 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
> }
>
> static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
> - MigrationIncomingState *mis)
> + MigrationIncomingState *mis,
> + bool bql_held)
> {
> - bql_unlock(); /* Let load threads do work requiring BQL */
> - thread_pool_wait(mis->load_threads);
> - bql_lock();
> + WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads));
>
> return !migrate_has_error(s);
> }
> @@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque)
> trace_postcopy_ram_listen_thread_start();
>
> rcu_register_thread();
> - /*
> - * Because we're a thread and not a coroutine we can't yield
> - * in qemu_file, and thus we must be blocking now.
> - */
> + /* Because we're a thread, making sure to use blocking mode */
> qemu_file_set_blocking(f, true);
>
> /* TODO: sanity check that only postcopiable data will be loaded here */
> - load_res = qemu_loadvm_state_main(f, mis);
> + load_res = qemu_loadvm_state_main(f, mis, false);
>
> /*
> * This is tricky, but, mis->from_src_file can change after it
> @@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> * Immediately following this command is a blob of data containing an embedded
> * chunk of migration stream; read it and load it.
> *
> - * @mis: Incoming state
> - * @length: Length of packaged data to read
> + * @mis: Incoming state
> + * @bql_held: Whether BQL is held already
> *
> * Returns: Negative values on error
> *
> */
> -static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> + bool bql_held)
> {
> int ret;
> size_t length;
> @@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> qemu_coroutine_yield();
> } while (1);
>
> - ret = qemu_loadvm_state_main(packf, mis);
> + ret = qemu_loadvm_state_main(packf, mis, bql_held);
> trace_loadvm_handle_cmd_packaged_main(ret);
> qemu_fclose(packf);
> object_unref(OBJECT(bioc));
> @@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void)
> * LOADVM_QUIT All good, but exit the loop
> * <0 Error
> */
> -static int loadvm_process_command(QEMUFile *f)
> +static int loadvm_process_command(QEMUFile *f, bool bql_held)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> uint16_t cmd;
> @@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f)
> break;
>
> case MIG_CMD_PACKAGED:
> - return loadvm_handle_cmd_packaged(mis);
> + return loadvm_handle_cmd_packaged(mis, bql_held);
>
> case MIG_CMD_POSTCOPY_ADVISE:
> return loadvm_postcopy_handle_advise(mis, len);
> @@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
> return true;
> }
>
> -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> + bool bql_held)
> {
> uint8_t section_type;
> int ret = 0;
> @@ -3046,7 +3044,15 @@ retry:
> switch (section_type) {
> case QEMU_VM_SECTION_START:
> case QEMU_VM_SECTION_FULL:
> - ret = qemu_loadvm_section_start_full(f, section_type);
> + /*
> + * FULL should normally require BQL, e.g. during post_load()
> + * there can be memory region updates. START may or may not
> + * require it, but just to keep it simple to always hold BQL
> + * for now.
> + */
> + WITH_BQL_HELD(
> + bql_held,
> + ret = qemu_loadvm_section_start_full(f, section_type));
> if (ret < 0) {
> goto out;
> }
> @@ -3059,7 +3065,11 @@ retry:
> }
> break;
> case QEMU_VM_COMMAND:
> - ret = loadvm_process_command(f);
> + /*
> + * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it
> + * may internally need BQL.
> + */
> + ret = loadvm_process_command(f, bql_held);
> trace_qemu_loadvm_state_section_command(ret);
> if ((ret < 0) || (ret == LOADVM_QUIT)) {
> goto out;
> @@ -3103,7 +3113,7 @@ out:
> return ret;
> }
>
> -int qemu_loadvm_state(QEMUFile *f)
> +int qemu_loadvm_state(QEMUFile *f, bool bql_held)
> {
> MigrationState *s = migrate_get_current();
> MigrationIncomingState *mis = migration_incoming_get_current();
> @@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f)
> qemu_loadvm_state_switchover_ack_needed(mis);
> }
>
> - cpu_synchronize_all_pre_loadvm();
> + /* run_on_cpu() requires BQL */
> + WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
>
> - ret = qemu_loadvm_state_main(f, mis);
> + ret = qemu_loadvm_state_main(f, mis, bql_held);
> qemu_event_set(&mis->main_thread_load_event);
>
> trace_qemu_loadvm_state_post_main(ret);
> @@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f)
> /* When reaching here, it must be precopy */
> if (ret == 0) {
> if (migrate_has_error(migrate_get_current()) ||
> - !qemu_loadvm_thread_pool_wait(s, mis)) {
> + !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) {
> ret = -EINVAL;
> } else {
> ret = qemu_file_get_error(f);
> @@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f)
> }
> }
>
> - cpu_synchronize_all_post_init();
> + /* run_on_cpu() requires BQL */
> + WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init());
>
> return ret;
> }
> @@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f)
> int ret;
>
> /* Load QEMU_VM_SECTION_FULL section */
> - ret = qemu_loadvm_state_main(f, mis);
> + ret = qemu_loadvm_state_main(f, mis, true);
> if (ret < 0) {
> error_report("Failed to load device state: %d", ret);
> return ret;
> @@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
> f = qemu_file_new_input(QIO_CHANNEL(ioc));
> object_unref(OBJECT(ioc));
>
> - ret = qemu_loadvm_state(f);
> + ret = qemu_loadvm_state(f, true);
> qemu_fclose(f);
> if (ret < 0) {
> error_setg(errp, "loading Xen device state failed");
> @@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate,
> ret = -EINVAL;
> goto err_drain;
> }
> - ret = qemu_loadvm_state(f);
> + ret = qemu_loadvm_state(f, true);
> migration_incoming_state_destroy();
>
> bdrv_drain_all_end();
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97def..eeb41e03f1 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> source_return_path_thread_switchover_acked(void) ""
> migration_thread_low_pending(uint64_t pending) "%" PRIu64
> migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -process_incoming_migration_co_postcopy_end_main(void) ""
> +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> +process_incoming_migration_postcopy_end_main(void) ""
> postcopy_preempt_enabled(bool value) "%d"
> migration_precopy_complete(void) ""
>
> --
> 2.50.1
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start()
2025-08-27 20:59 ` [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start() Peter Xu
@ 2025-08-28 18:05 ` Maciej S. Szmigiero
2025-09-16 21:34 ` Fabiano Rosas
1 sibling, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-28 18:05 UTC (permalink / raw)
To: Peter Xu
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Li Zhijian, Juraj Marcin, Cédric Le Goater,
qemu-devel
On 27.08.2025 22:59, Peter Xu wrote:
> We may switch to a BQL-free loadvm model. Be prepared for it.
>
> Cc: Cédric Le Goater <clg@redhat.com>
> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/migration-multifd.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index e4785031a7..8dc8444f0d 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -763,16 +763,21 @@ int vfio_multifd_switchover_start(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
> VFIOMultifd *multifd = migration->multifd;
> + bool bql_is_locked = bql_locked();
>
> assert(multifd);
>
> /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> - bql_unlock();
> + if (bql_is_locked) {
> + bql_unlock();
> + }
> WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> assert(!multifd->load_bufs_thread_running);
> multifd->load_bufs_thread_running = true;
> }
> - bql_lock();
> + if (bql_is_locked) {
> + bql_lock();
> + }
>
> qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
>
This patch makes sense to me - I don't see anything obviously wrong here.
In general, thank you for your series, Peter.
I am actually looking at a similar subject - how to make
vfio_pci_load_config() and its sub-calls use more fine-grained locking
than the BQL, so that device configuration loading for multiple VFIO
devices can happen in parallel instead of being serialized by the BQL.
Don't have an ETA for this yet, but it's good that other people are also
working on improving live migration scalability.
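(For illustration, a rough sketch of that direction with hypothetical
names - not code from hw/vfio: each device serializes its own config load
on a per-device lock, so devices no longer contend on the BQL.)

typedef struct DemoVFIODevice {
    QemuMutex cfg_lock;                  /* protects this device's config */
    /* ... per-device state ... */
} DemoVFIODevice;

static int demo_load_config(DemoVFIODevice *dev, QEMUFile *f)
{
    int ret;

    /* per-device lock instead of the global BQL: several devices can
     * run this concurrently from different load threads */
    qemu_mutex_lock(&dev->cfg_lock);
    ret = demo_apply_config(dev, f);     /* hypothetical helper */
    qemu_mutex_unlock(&dev->cfg_lock);

    return ret;
}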
Thanks,
Maciej
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (8 preceding siblings ...)
2025-08-27 20:59 ` [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler Peter Xu
@ 2025-08-29 8:29 ` Vladimir Sementsov-Ogievskiy
2025-08-29 17:18 ` Peter Xu
2025-09-04 8:27 ` Zhang Chen
2025-09-16 21:32 ` Fabiano Rosas
11 siblings, 1 reply; 45+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-08-29 8:29 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Prasad Pandit, Zhang Chen, Li Zhijian, Juraj Marcin
On 27.08.25 23:59, Peter Xu wrote:
> split the patches into smaller ones if possible
Support for the bql_held parameter in some functions could also be
moved to separate preparation patches, which would simplify the
main patch.
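(A sketch of the preparation-patch shape - it mirrors the savevm.h hunk in
this series, with demo_caller() being a hypothetical stand-in for the
existing call sites: thread the parameter through first with no behavior
change, so the main patch only flips the callers that actually drop BQL.)

/* Before:
 *   int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 *
 * After the preparation patch: */
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
                           bool bql_held);

static int demo_caller(QEMUFile *f, MigrationIncomingState *mis)
{
    /* every existing caller updated mechanically to pass true */
    return qemu_loadvm_state_main(f, mis, true);
}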
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
2025-08-27 23:51 ` Dr. David Alan Gilbert
@ 2025-08-29 8:29 ` Vladimir Sementsov-Ogievskiy
2025-08-29 17:17 ` Peter Xu
2025-09-17 18:23 ` Fabiano Rosas
2025-09-26 3:41 ` Zhijian Li (Fujitsu)
3 siblings, 1 reply; 45+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-08-29 8:29 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Prasad Pandit, Zhang Chen, Li Zhijian, Juraj Marcin
On 27.08.25 23:59, Peter Xu wrote:
> The migration module has been there for 10+ years. Initially, it was in most cases
> based on coroutines. As more features were added into the framework, like
> postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>
> I'm guessing coroutines just can't fix all issues that migration wants to
> resolve.
>
> After all these years, migration is now heavily based on a threaded model.
>
> Now there's still a major part of the migration framework that is not
> thread-based, which is precopy load.  We have done the load in a separate
> thread in postcopy since the 1st day postcopy was introduced; however,
> that requires a separate state transition from precopy loading all
> devices first, which still happens in a coroutine on the main thread.
>
> This patch tries to move the migration incoming side to be run inside a
> separate thread (mig/dst/main) just like the src (mig/src/main).  The
> entry point is migration_incoming_thread().
>
> Quite a few things are needed to make it fly..
>
> BQL Analysis
> ============
>
> Firstly, when moving it over to the thread, it means the thread cannot take
> BQL during the whole process of loading anymore, because otherwise it can
> block the main thread from using the BQL for all kinds of other concurrent
> tasks (for example, processing QMP / HMP commands).
>
> Here the first question to ask is: what needs BQL during precopy load, and
> what doesn't?
>
> Most of the load process shouldn't need BQL, especially when it's about
> RAM. After all, RAM is still the major chunk of data to move for a live
> migration process.  VFIO started to change that, though; still, VFIO is
> per-device, so that shouldn't need BQL either in most cases.
>
> Generic device loads will need BQL, likely not when receiving VMSDs, but
> when applying them.  One example is that any post_load() could potentially
> inject memory regions, causing memory transactions to happen.  That'll need
> to update the global address spaces, hence requires BQL.  The other one is
> CPU sync operations: even if the sync alone may not need BQL (which is
> still to be further justified), run_on_cpu() will need it.
>
> For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
> to now take a "bql_held" parameter saying whether bql is held. We could
> use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> locking rather than relying on the bql_locked TLS variable.  In case of
> migration, we always know whether BQL is held in different contexts as long
> as we can still pass that information downwards.
Agree, but I think it's better to make the new macros follow the same pattern, i.e.
WITH_BQL_HELD(bql_held) {
action();
}
instead of
WITH_BQL_HELD(bql_held, actions());
..
Or am I missing something and we already have a precedent for the latter
notation?
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 23:51 ` Dr. David Alan Gilbert
@ 2025-08-29 16:37 ` Peter Xu
2025-09-04 1:38 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2025-08-29 16:37 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: qemu-devel, Kevin Wolf, Paolo Bonzini, Daniel P . Berrangé,
Fabiano Rosas, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Wed, Aug 27, 2025 at 11:51:06PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > The migration module has been there for 10+ years. Initially, it was in most cases
> > based on coroutines. As more features were added into the framework, like
> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> >
> > I'm guessing coroutines just can't fix all issues that migration wants to
> > resolve.
>
> Yeh migration can happily eat a whole core.
>
> > After all these years, migration is now heavily based on a threaded model.
> >
> > Now there's still a major part of the migration framework that is not
> > thread-based, which is precopy load.  We have done the load in a separate
> > thread in postcopy since the 1st day postcopy was introduced; however,
> > that requires a separate state transition from precopy loading all
> > devices first, which still happens in a coroutine on the main thread.
>
> ...
>
> > COLO
> > ====
>
> If you can, I suggest splitting the COLO stuff out as a separate thread;
> not many people understand it.
I can try this one, but then it'll be a bunch of "if (qemu_in_coroutine())"
checks all over the place.
For example, this change in this patch:
- assert(bql_locked());
assert(migration_incoming_colo_enabled());
qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
colo_process_incoming_thread,
mis, QEMU_THREAD_JOINABLE);
- mis->colo_incoming_co = qemu_coroutine_self();
- qemu_coroutine_yield();
- mis->colo_incoming_co = NULL;
-
- bql_unlock();
/* Wait checkpoint incoming thread exit before free resource */
qemu_thread_join(&th);
- bql_lock();
Will become:
- assert(bql_locked());
assert(migration_incoming_colo_enabled());
qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
colo_process_incoming_thread,
mis, QEMU_THREAD_JOINABLE);
- mis->colo_incoming_co = qemu_coroutine_self();
- qemu_coroutine_yield();
- mis->colo_incoming_co = NULL;
+ if (qemu_in_coroutine()) {
+ assert(bql_locked());
+ mis->colo_incoming_co = qemu_coroutine_self();
+ qemu_coroutine_yield();
+ mis->colo_incoming_co = NULL;
+ bql_unlock();
+ }
- bql_unlock();
/* Wait checkpoint incoming thread exit before free resource */
qemu_thread_join(&th);
- bql_lock();
+
+ if (qemu_in_coroutine()) {
+ bql_lock();
+ }
Then I'll add one more patch at the end to remove all these "if" blocks.
Which one is better?
For the rest, I can still try to move things; the migration_channel_read_peek()
change can be a separate patch after this one, but that's pretty small..  Most
of the rest isn't like that: normally we'll still need such "if"s added prior
to this patch, then apply this patch, then remove those "if"s in another later patch.
>
> > TODO
> > ====
> >
> > Currently the BQL is taken during loading of a START|FULL section. When
> > the IO hangs (e.g. network issue) during this process, it could potentially
> > block others like the monitor servers. One solution is breaking BQL to
> > smaller granules and leaving IOs always BQL-free. That'll need more
> > justification.
> >
> > For example, there are at least four things that need some closer
> > attention:
> >
> > - SaveVMHandlers's load_state(): this likely DOES NOT need BQL, but we need
> > to justify all of them (not to mention, some of them look prone to
> > being rewritten as VMSDs..)
> >
> > - VMSD's pre_load(): in most cases, this DOES NOT really need BQL, but
> > sometimes maybe it will!  Double-checking on this will be needed.
> >
> > - VMSD's post_load(): in many cases, this DOES need BQL, for example on
> > address space operations. Likely we should just take it for any
> > post_load().
> >
> > - VMSD field's get(): this is tricky!  It could internally be anything
> > even if it is only a field.  E.g. there can be users that use a SINGLE
> > field to load a whole VMSD, which can further introduce more
> > possibilities.
>
> Long long ago, I did convert some get()s to structures; I got stuck on some
> though - some have pretty crazy hand-built lists and things.
Yeah, I can feel it even though I haven't looked into each of them yet. :)
Looks like they're all explicit VMS_SINGLE users; we have 22 instances.
Unfortunately, I still see new ones being added, the latest one in
5d56bff11e3d.  I wonder whether pre_save() + post_load() would have worked
there..
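(A rough sketch of that alternative, with hypothetical names: migrate a
plain scratch field and rebuild the real state in post_load(), instead of
a custom VMS_SINGLE get()/put() pair:)

typedef struct DemoState {
    uint32_t mig_count;      /* scratch field, only used for migration */
    /* ... real runtime representation ... */
} DemoState;

static int demo_pre_save(void *opaque)
{
    DemoState *s = opaque;

    s->mig_count = demo_compute_count(s);    /* hypothetical helper */
    return 0;
}

static int demo_post_load(void *opaque, int version_id)
{
    DemoState *s = opaque;

    return demo_rebuild_from_count(s, s->mig_count);  /* hypothetical */
}

static const VMStateDescription vmstate_demo = {
    .name = "demo",
    .version_id = 1,
    .pre_save = demo_pre_save,
    .post_load = demo_post_load,
    .fields = (const VMStateField[]) {
        VMSTATE_UINT32(mig_count, DemoState),
        VMSTATE_END_OF_LIST()
    },
};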
>
> > In general, QEMUFile IOs should not need BQL, that is, when receiving the
> > VMSD data and waiting for e.g. the socket buffer to get refilled.  But
> > that's the easy part.
>
> It's probably generally a good thing to get rid of the BQL there, but I bet
> it's going to throw some surprises; maybe something like devices doing
> stuff before the migration has fully arrived
Is that pre_load() or.. maybe something else?
I should still look into each of them, but only if we want to further push
the BQL down to the post_load() level.  I am not sure if some pre_load() would
assume BQL won't be released until post_load(); if so, that'll be an issue,
and that will need some closer code observation...
> or incoming socket connections to non-migration stuff.
Any example for this one?
Thanks!
>
> Dave
>
> [snip]
--
Peter Xu
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-29 8:29 ` Vladimir Sementsov-Ogievskiy
@ 2025-08-29 17:17 ` Peter Xu
2025-09-01 9:35 ` Vladimir Sementsov-Ogievskiy
0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2025-08-29 17:17 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Prasad Pandit, Zhang Chen, Li Zhijian, Juraj Marcin
On Fri, Aug 29, 2025 at 11:29:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
> > to now take a "bql_held" parameter saying whether bql is held. We could
> > use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> > locking rather than relying on the bql_locked TLS variable.  In case of
> > migration, we always know whether BQL is held in different contexts as long
> > as we can still pass that information downwards.
>
> Agree, but I think it's better to make the new macros follow the same pattern, i.e.
>
> WITH_BQL_HELD(bql_held) {
> action();
> }
>
> instead of
>
> WITH_BQL_HELD(bql_held, actions());
>
> ..
>
> Or am I missing something and we already have a precedent for the latter
> notation?
Nope.. it's just that when initially working on that I didn't try as hard to
achieve such a pattern.  Here we need to recover the BQL status after the
block, so I didn't immediately see how autoptr would work there.
But I tried slightly harder; I think the below should achieve the same
pattern, based on some for() magic.
Thanks for raising this; early comments are still welcome, or I'll go with
that.
===8<===
static inline void
with_bql_held_lock(bool bql_held, const char *file, int line)
{
assert(bql_held == bql_locked());
if (!bql_held) {
bql_lock_impl(file, line);
}
}
static inline void
with_bql_held_unlock(bool bql_held)
{
assert(bql_locked());
if (!bql_held) {
bql_unlock();
}
}
/**
* WITH_BQL_HELD(): Run a block of code, making sure BQL is held
* @bql_held: Whether BQL is already held
*
* Example use case:
*
* WITH_BQL_HELD(bql_held) {
* // BQL is guaranteed to be held within this block,
* // if it wasn't held, will be released when the block finishes.
* }
*/
#define WITH_BQL_HELD(bql_held) \
for (bool _bql_once = \
(with_bql_held_lock(bql_held, __FILE__, __LINE__), true); \
_bql_once; \
_bql_once = (with_bql_held_unlock(bql_held), false)) \
static inline void
with_bql_released_unlock(bool bql_held)
{
assert(bql_held == bql_locked());
if (bql_held) {
bql_unlock();
}
}
static inline void
with_bql_released_lock(bool bql_held, const char *file, int line)
{
assert(!bql_locked());
if (bql_held) {
bql_lock_impl(file, line);
}
}
/**
* WITH_BQL_RELEASED(): Run a task, making sure BQL is released
* @bql_held: Whether BQL is already held
*
* Example use case:
*
* WITH_BQL_RELEASED(bql_held) {
* // BQL is guaranteed to be released within this block,
* // if it was held, will be re-taken when the block finishes.
* }
*/
#define WITH_BQL_RELEASED(bql_held) \
for (bool _bql_once = (with_bql_released_unlock(bql_held), true); \
_bql_once; \
_bql_once = \
(with_bql_released_lock(bql_held, __FILE__, __LINE__), false)) \
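(For instance, the qemu_loadvm_thread_pool_wait() hunk from the patch
could then read - sketch only:)

static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
                                         MigrationIncomingState *mis,
                                         bool bql_held)
{
    /* Let load threads do work requiring BQL */
    WITH_BQL_RELEASED(bql_held) {
        thread_pool_wait(mis->load_threads);
    }

    return !migrate_has_error(s);
}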
--
Peter Xu
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-08-29 8:29 ` [PATCH RFC 0/9] migration: Threadify loadvm process Vladimir Sementsov-Ogievskiy
@ 2025-08-29 17:18 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-08-29 17:18 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Prasad Pandit, Zhang Chen, Li Zhijian, Juraj Marcin
On Fri, Aug 29, 2025 at 11:29:37AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 27.08.25 23:59, Peter Xu wrote:
> > split the patches into smaller ones if possible
>
> Support for bql_held parameter for some functions may also be
> moved to separate preparation patches, which will simplify the
> main patch.
Sure, I can do that.
--
Peter Xu
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-29 17:17 ` Peter Xu
@ 2025-09-01 9:35 ` Vladimir Sementsov-Ogievskiy
0 siblings, 0 replies; 45+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-09-01 9:35 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P. Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Prasad Pandit, Zhang Chen, Li Zhijian, Juraj Marcin
On 29.08.25 20:17, Peter Xu wrote:
> On Fri, Aug 29, 2025 at 11:29:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
>>> to now take a "bql_held" parameter saying whether bql is held. We could
>>> use things like BQL_LOCK_GUARD(), but this patch goes with explicit
>>> locking rather than relying on the bql_locked TLS variable.  In case of
>>> migration, we always know whether BQL is held in different contexts as long
>>> as we can still pass that information downwards.
>>
>> Agree, but I think it's better to make the new macros follow the same pattern, i.e.
>>
>> WITH_BQL_HELD(bql_held) {
>> action();
>> }
>>
>> instead of
>>
>> WITH_BQL_HELD(bql_held, actions());
>>
>> ..
>>
>> Or am I missing something and we already have a precedent for the latter
>> notation?
>
> Nope.. it's just that when initially working on that I didn't try as hard to
> achieve such a pattern.  Here we need to recover the BQL status after the
> block, so I didn't immediately see how autoptr would work there.
>
> But I tried slightly harder; I think the below should achieve the same
> pattern, based on some for() magic.
>
> Thanks for raising this; early comments are still welcome, or I'll go with
> that.
>
> ===8<===
>
> static inline void
> with_bql_held_lock(bool bql_held, const char *file, int line)
> {
>     assert(bql_held == bql_locked());
>     if (!bql_held) {
>         bql_lock_impl(file, line);
>     }
> }
>
> static inline void
> with_bql_held_unlock(bool bql_held)
> {
>     assert(bql_locked());
>     if (!bql_held) {
>         bql_unlock();
>     }
> }
>
> /**
>  * WITH_BQL_HELD(): Run a block of code, making sure BQL is held
>  * @bql_held: Whether BQL is already held
>  *
>  * Example use case:
>  *
>  *     WITH_BQL_HELD(bql_held) {
>  *         // BQL is guaranteed to be held within this block;
>  *         // if it wasn't held before, it is released when the block finishes.
>  *     }
>  */
> #define WITH_BQL_HELD(bql_held)                                         \
>     for (bool _bql_once =                                               \
>              (with_bql_held_lock(bql_held, __FILE__, __LINE__), true);  \
>          _bql_once;                                                     \
>          _bql_once = (with_bql_held_unlock(bql_held), false))
>
> static inline void
> with_bql_released_unlock(bool bql_held)
> {
>     assert(bql_held == bql_locked());
>     if (bql_held) {
>         bql_unlock();
>     }
> }
>
> static inline void
> with_bql_released_lock(bool bql_held, const char *file, int line)
> {
>     assert(!bql_locked());
>     if (bql_held) {
>         bql_lock_impl(file, line);
>     }
> }
>
> /**
>  * WITH_BQL_RELEASED(): Run a task, making sure BQL is released
>  * @bql_held: Whether BQL is already held
>  *
>  * Example use case:
>  *
>  *     WITH_BQL_RELEASED(bql_held) {
>  *         // BQL is guaranteed to be released within this block;
>  *         // if it was held before, it is re-taken when the block finishes.
>  *     }
>  */
> #define WITH_BQL_RELEASED(bql_held)                                        \
>     for (bool _bql_once = (with_bql_released_unlock(bql_held), true);      \
>          _bql_once;                                                        \
>          _bql_once =                                                       \
>              (with_bql_released_lock(bql_held, __FILE__, __LINE__), false))
>
Hm, it still doesn't achieve the same magic as WITH_QEMU_LOCK_GUARD, as we
can't use "return" inside this for-loop (maybe not critical, as you don't use it anyway..)
Something like this should work I think:
/*
 * Assumed declaration: an opaque token type, only used to carry the
 * lock/no-lock state into the g_autoptr cleanup function.
 */
typedef struct BQLLockAutoCond BQLLockAutoCond;

static inline BQLLockAutoCond *bql_auto_lock_cond(bool bql_held,
                                                  const char *file, int line)
{
    assert(bql_held == bql_locked());
    if (bql_held) {
        /* Caller already holds the lock: nothing to undo at cleanup */
        return (BQLLockAutoCond *)(uintptr_t)2;
    }
    bql_lock_impl(file, line);
    /* We took the lock ourselves: cleanup must drop it */
    return (BQLLockAutoCond *)(uintptr_t)1;
}

static inline void bql_auto_unlock_cond(BQLLockAutoCond *l)
{
    if (l == (BQLLockAutoCond *)(uintptr_t)1) {
        bql_unlock();
    }
}

G_DEFINE_AUTOPTR_CLEANUP_FUNC(BQLLockAutoCond, bql_auto_unlock_cond)

#define WITH_BQL_HELD_(bql_held, var)                              \
    for (g_autoptr(BQLLockAutoCond) var =                          \
             bql_auto_lock_cond(bql_held, __FILE__, __LINE__);     \
         var;                                                      \
         bql_auto_unlock_cond(var), var = NULL)

#define WITH_BQL_HELD(bql_held) \
    WITH_BQL_HELD_((bql_held), glue(bql_held_cond_auto, __COUNTER__))
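
With this variant an early return is safe, as the cleanup runs at scope exit
via g_autoptr (again a sketch, with load_section() as a placeholder):

    WITH_BQL_HELD(bql_held) {
        if (load_section() < 0) {
            /* bql_auto_unlock_cond() still runs through g_autoptr */
            return -1;
        }
    }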
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-29 16:37 ` Peter Xu
@ 2025-09-04 1:38 ` Dr. David Alan Gilbert
2025-10-08 21:02 ` Peter Xu
0 siblings, 1 reply; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2025-09-04 1:38 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kevin Wolf, Paolo Bonzini, Daniel P . Berrangé,
Fabiano Rosas, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Aug 27, 2025 at 11:51:06PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > The migration module has been there for 10+ years. Initially, it was in
> > > most cases based on coroutines. As more features were added into the
> > > framework, like postcopy, multifd, etc., it became a mixture of threads
> > > and coroutines.
> > >
> > > I'm guessing coroutines just can't fix all the issues that migration
> > > wants to resolve.
> >
> > Yeh migration can happily eat a whole core.
> >
> > > After all these years, migration is now heavily based on a threaded model.
> > >
> > > There's still one major part of the migration framework that is not
> > > thread-based, which is precopy load. We have done the load in a separate
> > > thread in postcopy since the first day postcopy was introduced; however,
> > > that requires a separate state transition after precopy first loads all
> > > devices, which still happens in the main thread in a coroutine.
> >
> > ...
> >
> > > COLO
> > > ====
> >
> > If you can I suggest splitting the COLO stuff out as a separate thread,
> > not many people understand it.
>
> I can try this one, but then it'll be a bunch of "if (qemu_in_coroutine())"
> checks all over the place.
>
> For example, this change in this patch:
>
> -    assert(bql_locked());
>      assert(migration_incoming_colo_enabled());
>
>      qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
>                         colo_process_incoming_thread,
>                         mis, QEMU_THREAD_JOINABLE);
>
> -    mis->colo_incoming_co = qemu_coroutine_self();
> -    qemu_coroutine_yield();
> -    mis->colo_incoming_co = NULL;
> -
> -    bql_unlock();
>      /* Wait checkpoint incoming thread exit before free resource */
>      qemu_thread_join(&th);
> -    bql_lock();
>
> Will become:
>
> -    assert(bql_locked());
>      assert(migration_incoming_colo_enabled());
>
>      qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
>                         colo_process_incoming_thread,
>                         mis, QEMU_THREAD_JOINABLE);
>
> -    mis->colo_incoming_co = qemu_coroutine_self();
> -    qemu_coroutine_yield();
> -    mis->colo_incoming_co = NULL;
> +    if (qemu_in_coroutine()) {
> +        assert(bql_locked());
> +        mis->colo_incoming_co = qemu_coroutine_self();
> +        qemu_coroutine_yield();
> +        mis->colo_incoming_co = NULL;
> +        bql_unlock();
> +    }
>
> -    bql_unlock();
>      /* Wait checkpoint incoming thread exit before free resource */
>      qemu_thread_join(&th);
> -    bql_lock();
> +
> +    if (qemu_in_coroutine()) {
> +        bql_lock();
> +    }
>
> Then I'll add one more patch at last to remove all these "if" blocks.
>
> Which one is better?
Not much difference is there.
>
> For the rest, I can still try to move things; the migration_channel_read_peek()
> change can be a separate patch after this one, but that's pretty small.. most
> of the rest isn't like that: normally we'd still need such "if"s added prior
> to this patch, then apply this patch, then remove those "if"s in a later patch.
>
> >
> > > TODO
> > > ====
> > >
> > > Currently the BQL is taken during loading of a START|FULL section. When
> > > the IO hangs (e.g. network issue) during this process, it could potentially
> > > block others like the monitor servers. One solution is breaking BQL to
> > > smaller granule and leave IOs to be always BQL-free. That'll need more
> > > justifications.
> > >
> > > For example, there are at least four things that need some closer
> > > attention:
> > >
> > > - SaveVMHandlers's load_state(): this likely DO NOT need BQL, but we need
> > > to justify all of them (not to mention, some of them look prone to
> > > being rewritten as VMSDs..)
> > >
> > > - VMSD's pre_load(): in most cases, this DO NOT really need BQL, but
> > > sometimes maybe it will! Double checking on this will be needed.
> > >
> > > - VMSD's post_load(): in many cases, this DO need BQL, for example on
> > > address space operations. Likely we should just take it for any
> > > post_load().
> > >
> > > - VMSD field's get(): this is tricky! It could internally be anything
> > > even if it was only a field. E.g. there can be users that use a SINGLE
> > > field to load a whole VMSD, which can further introduce more
> > > possibilities.
> >
> > Long long ago, I did convert some get's to structure; I got stuck on some
> > though - some have pretty crazy hand built lists and things.
>
> Yeah, I can feel it even though I didn't look into each of them yet. :)
>
> Looks like they're all explicit VMS_SINGLE users; we have 22 instances.
> Unfortunately, I still see new ones being added, latest one in
> 5d56bff11e3d. I wonder whether pre_save() + post_load() would have worked
> there..
I seem to remember the virtio stuff being particularly complicated, but I
remember other lists as well.
> >
> > > In general, QEMUFile IOs should not need BQL, that is when receiving the
> > > VMSD data and waiting for e.g. the socket buffer to get refilled. But
> > > that's the easy part.
> >
> > It's probably generally a good thing to get rid of the BQL there, but I bet
> > it's going to throw some surprises; maybe something like devices doing
> > stuff before the migration has fully arrived
>
> Is that pre_load() or.. maybe something else?
>
> I should still look into each of them, but only if we want to further push
> the bql to be at post_load() level. I am not sure if some pre_load() would
> assume BQL won't be released until post_load(), if so that'll be an issue,
> and that will need some closer code observation...
Well, maybe pre_load; but anything that might start happening once the
state has been loaded that shouldn't start happening until migration ends.
I think there are some devices that do it properly and wait for the end of migration.
> > or incoming socket connections to non-migration stuff perhaps.
>
> Any example for this one?
I was just thinking aloud; I was thinking of NIC activity or maybe
UI stuff. But those are just guesses.
Dave
> Thanks!
>
> >
> > Dave
> >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > > include/migration/colo.h | 6 ++--
> > > migration/migration.h | 52 ++++++++++++++++++++++++++------
> > > migration/savevm.h | 5 ++--
> > > migration/channel.c | 7 ++---
> > > migration/colo-stubs.c | 2 +-
> > > migration/colo.c | 23 ++++-----------
> > > migration/migration.c | 62 ++++++++++++++++++++++++++++----------
> > > migration/rdma.c | 5 ----
> > > migration/savevm.c | 64 ++++++++++++++++++++++++----------------
> > > migration/trace-events | 4 +--
> > > 10 files changed, 142 insertions(+), 88 deletions(-)
> > >
> > > diff --git a/include/migration/colo.h b/include/migration/colo.h
> > > index 43222ef5ae..bfb30eccf0 100644
> > > --- a/include/migration/colo.h
> > > +++ b/include/migration/colo.h
> > > @@ -44,12 +44,10 @@ void colo_do_failover(void);
> > > void colo_checkpoint_delay_set(void);
> > >
> > > /*
> > > - * Starts COLO incoming process. Called from process_incoming_migration_co()
> > > + * Starts COLO incoming process. Called from migration_incoming_thread()
> > > * after loading the state.
> > > - *
> > > - * Called with BQL locked, may temporary release BQL.
> > > */
> > > -void coroutine_fn colo_incoming_co(void);
> > > +void colo_incoming_wait(void);
> > >
> > > void colo_shutdown(void);
> > > #endif
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index 01329bf824..c4a626eed4 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -42,6 +42,44 @@
> > > #define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen"
> > > #define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt"
> > >
> > > +/**
> > > + * WITH_BQL_HELD(): Run a task, making sure BQL is held
> > > + *
> > > + * @bql_held: Whether BQL is already held
> > > + * @task: The task to run within BQL held
> > > + */
> > > +#define WITH_BQL_HELD(bql_held, task) \
> > > + do { \
> > > + if (!bql_held) { \
> > > + bql_lock(); \
> > > + } else { \
> > > + assert(bql_locked()); \
> > > + } \
> > > + task; \
> > > + if (!bql_held) { \
> > > + bql_unlock(); \
> > > + } \
> > > + } while (0)
> > > +
> > > +/**
> > > + * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released
> > > + *
> > > + * @bql_held: Whether BQL is already held
> > > + * @task: The task to run making sure BQL released
> > > + */
> > > +#define WITHOUT_BQL_HELD(bql_held, task) \
> > > + do { \
> > > + if (bql_held) { \
> > > + bql_unlock(); \
> > > + } else { \
> > > + assert(!bql_locked()); \
> > > + } \
> > > + task; \
> > > + if (bql_held) { \
> > > + bql_lock(); \
> > > + } \
> > > + } while (0)
> > > +
> > > struct PostcopyBlocktimeContext;
> > > typedef struct ThreadPool ThreadPool;
> > >
> > > @@ -119,6 +157,10 @@ struct MigrationIncomingState {
> > > bool have_listen_thread;
> > > QemuThread listen_thread;
> > >
> > > + /* Migration main recv thread */
> > > + bool have_recv_thread;
> > > + QemuThread recv_thread;
> > > +
> > > /* For the kernel to send us notifications */
> > > int userfault_fd;
> > > /* To notify the fault_thread to wake, e.g., when need to quit */
> > > @@ -177,15 +219,7 @@ struct MigrationIncomingState {
> > >
> > > MigrationStatus state;
> > >
> > > - /*
> > > - * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> > > - * Used to wake the migration incoming coroutine from rdma code. How much is
> > > - * it safe - it's a question.
> > > - */
> > > - Coroutine *loadvm_co;
> > > -
> > > - /* The coroutine we should enter (back) after failover */
> > > - Coroutine *colo_incoming_co;
> > > + /* Notify secondary VM to move on */
> > > QemuEvent colo_incoming_event;
> > >
> > > /* Optional load threads pool and its thread exit request flag */
> > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > index 2d5e9c7166..c07e14f61a 100644
> > > --- a/migration/savevm.h
> > > +++ b/migration/savevm.h
> > > @@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
> > > void qemu_savevm_live_state(QEMUFile *f);
> > > int qemu_save_device_state(QEMUFile *f);
> > >
> > > -int qemu_loadvm_state(QEMUFile *f);
> > > +int qemu_loadvm_state(QEMUFile *f, bool bql_held);
> > > void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
> > > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> > > + bool bql_held);
> > > int qemu_load_device_state(QEMUFile *f);
> > > int qemu_loadvm_approve_switchover(void);
> > > int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > diff --git a/migration/channel.c b/migration/channel.c
> > > index a547b1fbfe..621f8a4a2a 100644
> > > --- a/migration/channel.c
> > > +++ b/migration/channel.c
> > > @@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc,
> > > }
> > >
> > > /* 1ms sleep. */
> > > - if (qemu_in_coroutine()) {
> > > - qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> > > - } else {
> > > - g_usleep(1000);
> > > - }
> > > + assert(!qemu_in_coroutine());
> > > + g_usleep(1000);
> > > }
> > >
> > > return 0;
> > > diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> > > index e22ce65234..ef77d1ab4b 100644
> > > --- a/migration/colo-stubs.c
> > > +++ b/migration/colo-stubs.c
> > > @@ -9,7 +9,7 @@ void colo_shutdown(void)
> > > {
> > > }
> > >
> > > -void coroutine_fn colo_incoming_co(void)
> > > +void colo_incoming_wait(void)
> > > {
> > > }
> > >
> > > diff --git a/migration/colo.c b/migration/colo.c
> > > index e0f713c837..f5722d9d9d 100644
> > > --- a/migration/colo.c
> > > +++ b/migration/colo.c
> > > @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
> > > }
> > > /* Notify COLO incoming thread that failover work is finished */
> > > qemu_event_set(&mis->colo_incoming_event);
> > > -
> > > - /* For Secondary VM, jump to incoming co */
> > > - if (mis->colo_incoming_co) {
> > > - qemu_coroutine_enter(mis->colo_incoming_co);
> > > - }
> > > }
> > >
> > > static void primary_vm_do_failover(void)
> > > @@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> > >
> > > bql_lock();
> > > cpu_synchronize_all_states();
> > > - ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> > > + ret = qemu_loadvm_state_main(mis->from_src_file, mis, true);
> > > bql_unlock();
> > >
> > > if (ret < 0) {
> > > @@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque)
> > > goto out;
> > > }
> > > /*
> > > - * Note: the communication between Primary side and Secondary side
> > > - * should be sequential, we set the fd to unblocked in migration incoming
> > > - * coroutine, and here we are in the COLO incoming thread, so it is ok to
> > > - * set the fd back to blocked.
> > > + * Here we are in the COLO incoming thread, so it is ok to set the fd
> > > + * to blocked.
> > > */
> > > qemu_file_set_blocking(mis->from_src_file, true);
> > >
> > > @@ -930,26 +923,20 @@ out:
> > > return NULL;
> > > }
> > >
> > > -void coroutine_fn colo_incoming_co(void)
> > > +/* Wait for failover */
> > > +void colo_incoming_wait(void)
> > > {
> > > MigrationIncomingState *mis = migration_incoming_get_current();
> > > QemuThread th;
> > >
> > > - assert(bql_locked());
> > > assert(migration_incoming_colo_enabled());
> > >
> > > qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
> > > colo_process_incoming_thread,
> > > mis, QEMU_THREAD_JOINABLE);
> > >
> > > - mis->colo_incoming_co = qemu_coroutine_self();
> > > - qemu_coroutine_yield();
> > > - mis->colo_incoming_co = NULL;
> > > -
> > > - bql_unlock();
> > > /* Wait checkpoint incoming thread exit before free resource */
> > > qemu_thread_join(&th);
> > > - bql_lock();
> > >
> > > /* We hold the global BQL, so it is safe here */
> > > colo_release_ram_cache();
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 10c216d25d..7e4d25b15c 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
> > > mis->postcopy_qemufile_dst = NULL;
> > > }
> > >
> > > + if (mis->have_recv_thread) {
> > > + qemu_thread_join(&mis->recv_thread);
> > > + mis->have_recv_thread = false;
> > > + }
> > > +
> > > cpr_set_incoming_mode(MIG_MODE_NONE);
> > > yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > > }
> > > @@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque)
> > > migration_incoming_state_destroy();
> > > }
> > >
> > > -static void coroutine_fn
> > > -process_incoming_migration_co(void *opaque)
> > > +static void migration_incoming_state_destroy_bh(void *opaque)
> > > +{
> > > + struct MigrationIncomingState *mis = opaque;
> > > +
> > > + if (mis->exit_on_error) {
> > > + /*
> > > + * NOTE: this exit() should better happen in the main thread, as
> > > + * the exit notifier may require BQL which can deadlock. See
> > > + * commit e7bc0204e57836 for example.
> > > + */
> > > + exit(EXIT_FAILURE);
> > > + }
> > > +
> > > + migration_incoming_state_destroy();
> > > +}
> > > +
> > > +static void *migration_incoming_thread(void *opaque)
> > > {
> > > MigrationState *s = migrate_get_current();
> > > - MigrationIncomingState *mis = migration_incoming_get_current();
> > > + MigrationIncomingState *mis = opaque;
> > > PostcopyState ps;
> > > int ret;
> > > Error *local_err = NULL;
> > >
> > > + rcu_register_thread();
> > > +
> > > assert(mis->from_src_file);
> > > + assert(!bql_locked());
> > >
> > > mis->largest_page_size = qemu_ram_pagesize_largest();
> > > postcopy_state_set(POSTCOPY_INCOMING_NONE);
> > > migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
> > > MIGRATION_STATUS_ACTIVE);
> > >
> > > - mis->loadvm_co = qemu_coroutine_self();
> > > - ret = qemu_loadvm_state(mis->from_src_file);
> > > - mis->loadvm_co = NULL;
> > > + ret = qemu_loadvm_state(mis->from_src_file, false);
> > >
> > > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> > >
> > > ps = postcopy_state_get();
> > > - trace_process_incoming_migration_co_end(ret, ps);
> > > + trace_process_incoming_migration_end(ret, ps);
> > > if (ps != POSTCOPY_INCOMING_NONE) {
> > > if (ps == POSTCOPY_INCOMING_ADVISE) {
> > > /*
> > > @@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque)
> > > * Postcopy was started, cleanup should happen at the end of the
> > > * postcopy thread.
> > > */
> > > - trace_process_incoming_migration_co_postcopy_end_main();
> > > + trace_process_incoming_migration_postcopy_end_main();
> > > goto out;
> > > }
> > > /* Else if something went wrong then just fall out of the normal exit */
> > > @@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque)
> > > }
> > >
> > > if (migration_incoming_colo_enabled()) {
> > > - /* yield until COLO exit */
> > > - colo_incoming_co();
> > > + /* wait until COLO exits */
> > > + colo_incoming_wait();
> > > }
> > >
> > > migration_bh_schedule(process_incoming_migration_bh, mis);
> > > @@ -926,19 +947,24 @@ fail:
> > > migrate_set_error(s, local_err);
> > > error_free(local_err);
> > >
> > > - migration_incoming_state_destroy();
> > > -
> > > if (mis->exit_on_error) {
> > > WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > > error_report_err(s->error);
> > > s->error = NULL;
> > > }
> > > -
> > > - exit(EXIT_FAILURE);
> > > }
> > > +
> > > + /*
> > > + * There's some step of the destroy process that will need to happen in
> > > + * the main thread (e.g. joining this thread itself). Leave to a BH.
> > > + */
> > > + migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
> > > +
> > > out:
> > > /* Pairs with the refcount taken in qmp_migrate_incoming() */
> > > migrate_incoming_unref_outgoing_state();
> > > + rcu_unregister_thread();
> > > + return NULL;
> > > }
> > >
> > > /**
> > > @@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f)
> > >
> > > void migration_incoming_process(void)
> > > {
> > > - Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> > > - qemu_coroutine_enter(co);
> > > + MigrationIncomingState *mis = migration_incoming_get_current();
> > > +
> > > + mis->have_recv_thread = true;
> > > + qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> > > + migration_incoming_thread, mis,
> > > + QEMU_THREAD_JOINABLE);
> > > }
> > >
> > > /* Returns true if recovered from a paused migration, otherwise false */
> > > diff --git a/migration/rdma.c b/migration/rdma.c
> > > index bcd7aae2f2..2b995513aa 100644
> > > --- a/migration/rdma.c
> > > +++ b/migration/rdma.c
> > > @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > > {
> > > RDMAContext *rdma = opaque;
> > > struct rdma_cm_event *cm_event;
> > > - MigrationIncomingState *mis = migration_incoming_get_current();
> > >
> > > if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> > > error_report("get_cm_event failed %d", errno);
> > > @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > > }
> > > }
> > > rdma_ack_cm_event(cm_event);
> > > - if (mis->loadvm_co) {
> > > - qemu_coroutine_enter(mis->loadvm_co);
> > > - }
> > > - return;
> > > }
> > > rdma_ack_cm_event(cm_event);
> > > }
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index fabbeb296a..ad606c5425 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
> > > }
> > >
> > > static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
> > > - MigrationIncomingState *mis)
> > > + MigrationIncomingState *mis,
> > > + bool bql_held)
> > > {
> > > - bql_unlock(); /* Let load threads do work requiring BQL */
> > > - thread_pool_wait(mis->load_threads);
> > > - bql_lock();
> > > + WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads));
> > >
> > > return !migrate_has_error(s);
> > > }
> > > @@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > > trace_postcopy_ram_listen_thread_start();
> > >
> > > rcu_register_thread();
> > > - /*
> > > - * Because we're a thread and not a coroutine we can't yield
> > > - * in qemu_file, and thus we must be blocking now.
> > > - */
> > > + /* Because we're a thread, making sure to use blocking mode */
> > > qemu_file_set_blocking(f, true);
> > >
> > > /* TODO: sanity check that only postcopiable data will be loaded here */
> > > - load_res = qemu_loadvm_state_main(f, mis);
> > > + load_res = qemu_loadvm_state_main(f, mis, false);
> > >
> > > /*
> > > * This is tricky, but, mis->from_src_file can change after it
> > > @@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> > > * Immediately following this command is a blob of data containing an embedded
> > > * chunk of migration stream; read it and load it.
> > > *
> > > - * @mis: Incoming state
> > > - * @length: Length of packaged data to read
> > > + * @mis: Incoming state
> > > + * @bql_held: Whether BQL is held already
> > > *
> > > * Returns: Negative values on error
> > > *
> > > */
> > > -static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > > +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> > > + bool bql_held)
> > > {
> > > int ret;
> > > size_t length;
> > > @@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > > qemu_coroutine_yield();
> > > } while (1);
> > >
> > > - ret = qemu_loadvm_state_main(packf, mis);
> > > + ret = qemu_loadvm_state_main(packf, mis, bql_held);
> > > trace_loadvm_handle_cmd_packaged_main(ret);
> > > qemu_fclose(packf);
> > > object_unref(OBJECT(bioc));
> > > @@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void)
> > > * LOADVM_QUIT All good, but exit the loop
> > > * <0 Error
> > > */
> > > -static int loadvm_process_command(QEMUFile *f)
> > > +static int loadvm_process_command(QEMUFile *f, bool bql_held)
> > > {
> > > MigrationIncomingState *mis = migration_incoming_get_current();
> > > uint16_t cmd;
> > > @@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f)
> > > break;
> > >
> > > case MIG_CMD_PACKAGED:
> > > - return loadvm_handle_cmd_packaged(mis);
> > > + return loadvm_handle_cmd_packaged(mis, bql_held);
> > >
> > > case MIG_CMD_POSTCOPY_ADVISE:
> > > return loadvm_postcopy_handle_advise(mis, len);
> > > @@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
> > > return true;
> > > }
> > >
> > > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> > > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> > > + bool bql_held)
> > > {
> > > uint8_t section_type;
> > > int ret = 0;
> > > @@ -3046,7 +3044,15 @@ retry:
> > > switch (section_type) {
> > > case QEMU_VM_SECTION_START:
> > > case QEMU_VM_SECTION_FULL:
> > > - ret = qemu_loadvm_section_start_full(f, section_type);
> > > + /*
> > > + * FULL should normally require BQL, e.g. during post_load()
> > > + * there can be memory region updates. START may or may not
> > > + * require it, but just to keep it simple to always hold BQL
> > > + * for now.
> > > + */
> > > + WITH_BQL_HELD(
> > > + bql_held,
> > > + ret = qemu_loadvm_section_start_full(f, section_type));
> > > if (ret < 0) {
> > > goto out;
> > > }
> > > @@ -3059,7 +3065,11 @@ retry:
> > > }
> > > break;
> > > case QEMU_VM_COMMAND:
> > > - ret = loadvm_process_command(f);
> > > + /*
> > > + * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it
> > > + * may internally need BQL.
> > > + */
> > > + ret = loadvm_process_command(f, bql_held);
> > > trace_qemu_loadvm_state_section_command(ret);
> > > if ((ret < 0) || (ret == LOADVM_QUIT)) {
> > > goto out;
> > > @@ -3103,7 +3113,7 @@ out:
> > > return ret;
> > > }
> > >
> > > -int qemu_loadvm_state(QEMUFile *f)
> > > +int qemu_loadvm_state(QEMUFile *f, bool bql_held)
> > > {
> > > MigrationState *s = migrate_get_current();
> > > MigrationIncomingState *mis = migration_incoming_get_current();
> > > @@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f)
> > > qemu_loadvm_state_switchover_ack_needed(mis);
> > > }
> > >
> > > - cpu_synchronize_all_pre_loadvm();
> > > + /* run_on_cpu() requires BQL */
> > > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
> > >
> > > - ret = qemu_loadvm_state_main(f, mis);
> > > + ret = qemu_loadvm_state_main(f, mis, bql_held);
> > > qemu_event_set(&mis->main_thread_load_event);
> > >
> > > trace_qemu_loadvm_state_post_main(ret);
> > > @@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f)
> > > /* When reaching here, it must be precopy */
> > > if (ret == 0) {
> > > if (migrate_has_error(migrate_get_current()) ||
> > > - !qemu_loadvm_thread_pool_wait(s, mis)) {
> > > + !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) {
> > > ret = -EINVAL;
> > > } else {
> > > ret = qemu_file_get_error(f);
> > > @@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f)
> > > }
> > > }
> > >
> > > - cpu_synchronize_all_post_init();
> > > + /* run_on_cpu() requires BQL */
> > > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init());
> > >
> > > return ret;
> > > }
> > > @@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f)
> > > int ret;
> > >
> > > /* Load QEMU_VM_SECTION_FULL section */
> > > - ret = qemu_loadvm_state_main(f, mis);
> > > + ret = qemu_loadvm_state_main(f, mis, true);
> > > if (ret < 0) {
> > > error_report("Failed to load device state: %d", ret);
> > > return ret;
> > > @@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
> > > f = qemu_file_new_input(QIO_CHANNEL(ioc));
> > > object_unref(OBJECT(ioc));
> > >
> > > - ret = qemu_loadvm_state(f);
> > > + ret = qemu_loadvm_state(f, true);
> > > qemu_fclose(f);
> > > if (ret < 0) {
> > > error_setg(errp, "loading Xen device state failed");
> > > @@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate,
> > > ret = -EINVAL;
> > > goto err_drain;
> > > }
> > > - ret = qemu_loadvm_state(f);
> > > + ret = qemu_loadvm_state(f, true);
> > > migration_incoming_state_destroy();
> > >
> > > bdrv_drain_all_end();
> > > diff --git a/migration/trace-events b/migration/trace-events
> > > index 706db97def..eeb41e03f1 100644
> > > --- a/migration/trace-events
> > > +++ b/migration/trace-events
> > > @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> > > source_return_path_thread_switchover_acked(void) ""
> > > migration_thread_low_pending(uint64_t pending) "%" PRIu64
> > > migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> > > -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > > -process_incoming_migration_co_postcopy_end_main(void) ""
> > > +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > > +process_incoming_migration_postcopy_end_main(void) ""
> > > postcopy_preempt_enabled(bool value) "%d"
> > > migration_precopy_complete(void) ""
> > >
> > > --
> > > 2.50.1
> > >
> > --
> > -----Open up your eyes, open up your mind, open up your code -------
> > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> > \ dave @ treblig.org | | In Hex /
> > \ _________________________|_____ http://www.treblig.org |_______/
> >
>
> --
> Peter Xu
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (9 preceding siblings ...)
2025-08-29 8:29 ` [PATCH RFC 0/9] migration: Threadify loadvm process Vladimir Sementsov-Ogievskiy
@ 2025-09-04 8:27 ` Zhang Chen
2025-10-08 21:26 ` Peter Xu
2025-09-16 21:32 ` Fabiano Rosas
11 siblings, 1 reply; 45+ messages in thread
From: Zhang Chen @ 2025-09-04 8:27 UTC (permalink / raw)
To: Peter Xu, Hailiang Zhang
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Li Zhijian,
Juraj Marcin
On Thu, Aug 28, 2025 at 4:59 AM Peter Xu <peterx@redhat.com> wrote:
>
> [this is an early RFC, not for merge, but to collect initial feedbacks]
>
> Background
> ==========
>
> Nowadays, live migration heavily depends on threads. For example, most of
> the major features that will be used nowadays in live migration (multifd,
> postcopy, mapped-ram, vfio, etc.) all work with threads internally.
>
> But still, from time to time, we'll see some coroutines floating around the
> migration context. The major one is precopy's loadvm, which is internally
> a coroutine. It is still a critical path that any live migration depends on.
>
> A mixture of using both coroutines and threads is prone to issues. Some
> examples can refer to commit e65cec5e5d ("migration/ram: Yield periodically
> to the main loop") or commit 7afbdada7e ("migration/postcopy: ensure
> preempt channel is ready before loading states").
>
> Overview
> ========
>
> This series tries to move migration further into the thread-based model, by
> allowing the loadvm process to happen in a thread rather than in the main
> thread with a coroutine.
>
> Luckily, since the qio channel code is always ready for both cases, IO
> paths should all be fine.
>
> Note that loadvm for postcopy already happens in a ram load thread which is
> separate. However, RAM is just the simple case here; even if it has its own
> challenges (around atomically updating the pgtables), its complexity lies in
> the kernel.
>
> For precopy, loadvm has quite a few operations that will need BQL. The
> question is that we can't take BQL for the whole process of loadvm, because
> that would block the main thread from executing (e.g. QMP hangs). Here, the
> finer the granularity at which we can push BQL down, the better. This series
> so far chose somewhere in the middle, by taking BQL in mainly these two places:
>
> - CPU synchronizations
> - Device START/FULL sections
>
> After this series is applied, most of the remaining loadvm path will run
> without BQL anymore. There is a more detailed discussion / todo in the commit
> message of patch "migration: Thread-ify precopy vmstate load process"
> explaining how to further split the BQL critical sections.
>
> I was trying to split the patches into smaller ones if possible, but it's
> still quite challenging so there's one major patch that does the work.
>
> After the series is applied, the only leftover pieces in migration/ that
> would use a coroutine are the snapshot save/load/delete jobs.
>
> Tests
> =====
>
> Default CI passes.
>
> RDMA unit tests pass as usual. I also tried out cancellation / failure
> tests over RDMA channels, making sure nothing is stuck.
>
> I also roughly measured how long it takes to run the whole 80+ migration
> qtest suite, and see no measurable difference before / after this series.
>
> Risks
> =====
>
> This series has the risk of breaking things. I would be surprised if it
> didn't..
>
> I confess I didn't test anything on COLO but only from code observations
> and analysis. COLO maintainers: could you add some unit tests to QEMU's
> qtests?
For the COLO part, I think removing the coroutine-related code is OK for me,
because the original coroutine still needs to call
"colo_process_incoming_thread".
Hi Hailiang, any comments on this part?
Thanks
Chen
>
> The current way of taking BQL during FULL section load may cause issues: it
> means that when the IOs are unstable we could be waiting for IO (in the new
> migration incoming thread) with BQL held. This is a low possibility, though;
> it only happens when the network halts while flushing the device states.
> However, it is still possible. One solution is to further break down the BQL
> critical sections into smaller sections, as mentioned in TODO.
>
> Anything more than welcomed: suggestions, questions, objections, tests..
>
> Todo
> ====
>
> - Test COLO?
> - Finer grained BQL breakdown
> - More..
>
> Thanks,
>
> Peter Xu (9):
> migration/vfio: Remove BQL implication in
> vfio_multifd_switchover_start()
> migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
> migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
> migration/rdma: Change io_create_watch() to return immediately
> migration: Thread-ify precopy vmstate load process
> migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
> migration/postcopy: Remove workaround on wait preempt channel
> migration/ram: Remove workaround on ram yield during load
> migration/rdma: Remove rdma_cm_poll_handler
>
> include/migration/colo.h | 6 +-
> migration/migration.h | 52 +++++++--
> migration/savevm.h | 5 +-
> hw/vfio/migration-multifd.c | 9 +-
> migration/channel.c | 7 +-
> migration/colo-stubs.c | 2 +-
> migration/colo.c | 23 +---
> migration/migration.c | 62 ++++++++---
> migration/ram.c | 13 +--
> migration/rdma.c | 206 ++++++++----------------------------
> migration/savevm.c | 85 +++++++--------
> migration/trace-events | 4 +-
> 12 files changed, 196 insertions(+), 278 deletions(-)
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-08-27 20:59 [PATCH RFC 0/9] migration: Threadify loadvm process Peter Xu
` (10 preceding siblings ...)
2025-09-04 8:27 ` Zhang Chen
@ 2025-09-16 21:32 ` Fabiano Rosas
2025-10-09 16:58 ` Peter Xu
11 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 21:32 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> [this is an early RFC, not for merge, but to collect initial feedbacks]
>
> Background
> ==========
>
> Nowadays, live migration heavily depends on threads. For example, most of
> the major features that will be used nowadays in live migration (multifd,
> postcopy, mapped-ram, vfio, etc.) all work with threads internally.
>
> But still, from time to time, we'll see some coroutines floating around the
> migration context. The major one is precopy's loadvm, which is internally
> a coroutine. It is still a critical path that any live migration depends on.
>
I always wanted to be an archaeologist:
https://lists.gnu.org/archive/html/qemu-devel//2012-08/msg01136.html
I was expecting to find some complicated chain of events leading to the
choice of using a coroutine, but no.
> A mixture of using both coroutines and threads is prone to issues. Some
> examples can refer to commit e65cec5e5d ("migration/ram: Yield periodically
> to the main loop") or commit 7afbdada7e ("migration/postcopy: ensure
> preempt channel is ready before loading states").
>
> Overview
> ========
>
> This series tries to move migration further into the thread-based model, by
> allowing the loadvm process to happen in a thread rather than in the main
> thread with a coroutine.
>
> Luckily, since the qio channel code is always ready for both cases, IO
> paths should all be fine.
>
> Note that loadvm for postcopy already happens in a ram load thread which is
> separate. However, RAM is just the simple case here; even if it has its own
> challenges (around atomically updating the pgtables), its complexity lies in
> the kernel.
>
> For precopy, loadvm has quite a few operations that will need BQL. The
> question is that we can't take BQL for the whole process of loadvm, because
> that would block the main thread from executing (e.g. QMP hangs). Here, the
> finer the granularity at which we can push BQL down, the better. This series
> so far chose somewhere in the middle, by taking BQL in mainly these two places:
>
> - CPU synchronizations
> - Device START/FULL sections
>
> After this series is applied, most of the remaining loadvm path will run
> without BQL anymore. There is a more detailed discussion / todo in the commit
> message of patch "migration: Thread-ify precopy vmstate load process"
> explaining how to further split the BQL critical sections.
>
> I was trying to split the patches into smaller ones if possible, but it's
> still quite challenging so there's one major patch that does the work.
>
> After the series is applied, the only leftover pieces in migration/ that
> would use a coroutine are the snapshot save/load/delete jobs.
>
Which are then fine because the work itself runs on the main loop,
right? So the bottom-half scheduling could be left as a coroutine.
> Tests
> =====
>
> Default CI passes.
>
> RDMA unit tests pass as usual. I also tried out cancellation / failure
> tests over RDMA channels, making sure nothing is stuck.
>
> I also roughly measured how long it takes to run the whole 80+ migration
> qtest suite, and see no measurable difference before / after this series.
>
> Risks
> =====
>
> This series has the risk of breaking things. I would be surprised if it
> didn't..
>
> I confess I didn't test anything on COLO but only from code observations
> and analysis. COLO maintainers: could you add some unit tests to QEMU's
> qtests?
>
> The current way of taking BQL during FULL section load may cause issues: it
> means that when the IOs are unstable we could be waiting for IO (in the new
> migration incoming thread) with BQL held. This is a low possibility, though;
> it only happens when the network halts while flushing the device states.
> However, it is still possible. One solution is to further break down the BQL
> critical sections into smaller sections, as mentioned in TODO.
>
> Anything more than welcomed: suggestions, questions, objections, tests..
>
> Todo
> ====
>
> - Test COLO?
> - Finer grained BQL breakdown
> - More..
>
> Thanks,
>
> Peter Xu (9):
> migration/vfio: Remove BQL implication in
> vfio_multifd_switchover_start()
> migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
> migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
> migration/rdma: Change io_create_watch() to return immediately
> migration: Thread-ify precopy vmstate load process
> migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
> migration/postcopy: Remove workaround on wait preempt channel
> migration/ram: Remove workaround on ram yield during load
> migration/rdma: Remove rdma_cm_poll_handler
>
> include/migration/colo.h | 6 +-
> migration/migration.h | 52 +++++++--
> migration/savevm.h | 5 +-
> hw/vfio/migration-multifd.c | 9 +-
> migration/channel.c | 7 +-
> migration/colo-stubs.c | 2 +-
> migration/colo.c | 23 +---
> migration/migration.c | 62 ++++++++---
> migration/ram.c | 13 +--
> migration/rdma.c | 206 ++++++++----------------------------
> migration/savevm.c | 85 +++++++--------
> migration/trace-events | 4 +-
> 12 files changed, 196 insertions(+), 278 deletions(-)
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start()
2025-08-27 20:59 ` [PATCH RFC 1/9] migration/vfio: Remove BQL implication in vfio_multifd_switchover_start() Peter Xu
2025-08-28 18:05 ` Maciej S. Szmigiero
@ 2025-09-16 21:34 ` Fabiano Rosas
1 sibling, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 21:34 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin, Cédric Le Goater,
Maciej S. Szmigiero
Peter Xu <peterx@redhat.com> writes:
> We may switch to a BQL-free loadvm model. Be prepared with it.
>
> Cc: Cédric Le Goater <clg@redhat.com>
> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/migration-multifd.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index e4785031a7..8dc8444f0d 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -763,16 +763,21 @@ int vfio_multifd_switchover_start(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
> VFIOMultifd *multifd = migration->multifd;
> + bool bql_is_locked = bql_locked();
>
> assert(multifd);
>
> /* The lock order is load_bufs_mutex -> BQL so unlock BQL here first */
> - bql_unlock();
> + if (bql_is_locked) {
> + bql_unlock();
> + }
> WITH_QEMU_LOCK_GUARD(&multifd->load_bufs_mutex) {
> assert(!multifd->load_bufs_thread_running);
> multifd->load_bufs_thread_running = true;
> }
> - bql_lock();
> + if (bql_is_locked) {
> + bql_lock();
> + }
>
> qemu_loadvm_start_load_thread(vfio_load_bufs_thread, vbasedev);
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
2025-08-27 20:59 ` [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown() Peter Xu
@ 2025-09-16 21:41 ` Fabiano Rosas
2025-09-26 1:01 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 21:41 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin, Lidong Chen
Peter Xu <peterx@redhat.com> writes:
> The rdmaout should be a cache of rioc->rdmaout, not rioc->rdmain.
>
> Cc: Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>
> Cc: Lidong Chen <jemmy858585@gmail.com>
> Fixes: 54db882f07 ("migration: implement the shutdown for RDMA QIOChannel")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 2d839fce6c..e6837184c8 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2986,7 +2986,7 @@ qio_channel_rdma_shutdown(QIOChannel *ioc,
> RCU_READ_LOCK_GUARD();
>
> rdmain = qatomic_rcu_read(&rioc->rdmain);
> - rdmaout = qatomic_rcu_read(&rioc->rdmain);
> + rdmaout = qatomic_rcu_read(&rioc->rdmaout);
>
> switch (how) {
> case QIO_CHANNEL_SHUTDOWN_READ:
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
2025-08-27 20:59 ` [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread Peter Xu
@ 2025-09-16 21:50 ` Fabiano Rosas
2025-09-26 1:02 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 21:50 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> It's almost there, except that currently it relies on a global flag showing
> that it's in incoming migration.
>
> Change it to detect coroutine instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index e6837184c8..ed4e20b988 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1357,7 +1357,8 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> * so don't yield unless we know we're running inside of a coroutine.
> */
> if (rdma->migration_started_on_destination &&
> - migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
> + migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
> + qemu_in_coroutine()) {
> yield_until_fd_readable(comp_channel->fd);
> } else {
> /* This is the source side, we're in a separate thread
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately
2025-08-27 20:59 ` [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately Peter Xu
@ 2025-09-16 22:35 ` Fabiano Rosas
2025-10-08 20:34 ` Peter Xu
2025-09-26 2:39 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 22:35 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> The old RDMA's io_create_watch() isn't really doing much work anyway. For
> G_IO_OUT, it already does return immediately. For G_IO_IN, it will try to
> detect some RDMA context length however normally nobody will be able to set
> it at all.
>
> Simplify the code so that RDMA iochannels simply always rely on synchronous
> reads and writes. It is highly likely what 6ddd2d76ca6f86f was talking
> about, that the async model isn't really working well.
>
> This helps because this is almost the only dependency that the migration
> core would need a coroutine for rdma channels.
>
I don't understand this. How does this code require a coroutine? Isn't
the io_watch exactly the strategy used when there is no coroutine?
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 69 +++---------------------------------------------
> 1 file changed, 3 insertions(+), 66 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index ed4e20b988..bcd7aae2f2 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2789,56 +2789,14 @@ static gboolean
> qio_channel_rdma_source_prepare(GSource *source,
> gint *timeout)
> {
> - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> *timeout = -1;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when prepare Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
> -
> - return cond & rsource->condition;
> + return TRUE;
> }
>
> static gboolean
> qio_channel_rdma_source_check(GSource *source)
> {
> - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when check Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
> -
> - return cond & rsource->condition;
> + return TRUE;
These are fine if we want the source to run as soon as possible, I
think. But then...
> }
>
> static gboolean
> @@ -2848,29 +2806,8 @@ qio_channel_rdma_source_dispatch(GSource *source,
> {
> QIOChannelFunc func = (QIOChannelFunc)callback;
> QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when dispatch Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
>
> - return (*func)(QIO_CHANNEL(rsource->rioc),
> - (cond & rsource->condition),
> - user_data);
> + return (*func)(QIO_CHANNEL(rsource->rioc), rsource->condition, user_data);
No idea who even calls g_source_set_callback() in this case. What is func?
> }
>
> static void
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
2025-08-27 20:59 ` [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel Peter Xu
@ 2025-09-16 22:39 ` Fabiano Rosas
2025-10-08 21:18 ` Peter Xu
2025-09-26 2:44 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-16 22:39 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> Now after threadified dest VM load during precopy, we will always in a
> thread context rather than within a coroutine. We can remove this path
> now.
>
> With that, migration_started_on_destination can go away too.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 102 +++++++++++++++++++----------------------------
> 1 file changed, 41 insertions(+), 61 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 2b995513aa..7751262460 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -29,7 +29,6 @@
> #include "qemu/rcu.h"
> #include "qemu/sockets.h"
> #include "qemu/bitmap.h"
> -#include "qemu/coroutine.h"
> #include "system/memory.h"
> #include <sys/socket.h>
> #include <netdb.h>
> @@ -357,13 +356,6 @@ typedef struct RDMAContext {
> /* Index of the next RAMBlock received during block registration */
> unsigned int next_src_index;
>
> - /*
> - * Migration on *destination* started.
> - * Then use coroutine yield function.
> - * Source runs in a thread, so we don't care.
> - */
> - int migration_started_on_destination;
> -
> int total_registrations;
> int total_writes;
>
> @@ -1353,66 +1345,55 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> struct rdma_cm_event *cm_event;
>
> /*
> - * Coroutine doesn't start until migration_fd_process_incoming()
> - * so don't yield unless we know we're running inside of a coroutine.
> + * This is the source or dest side, either during precopy or
> + * postcopy. We're always in a separate thread when reaching here.
> + * Poll the fd. We need to be able to handle 'cancel' or an error
> + * without hanging forever.
> */
> - if (rdma->migration_started_on_destination &&
> - migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
> - qemu_in_coroutine()) {
> - yield_until_fd_readable(comp_channel->fd);
> - } else {
> - /* This is the source side, we're in a separate thread
> - * or destination prior to migration_fd_process_incoming()
> - * after postcopy, the destination also in a separate thread.
> - * we can't yield; so we have to poll the fd.
> - * But we need to be able to handle 'cancel' or an error
> - * without hanging forever.
> - */
> - while (!rdma->errored && !rdma->received_error) {
> - GPollFD pfds[2];
> - pfds[0].fd = comp_channel->fd;
> - pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> - pfds[0].revents = 0;
> -
> - pfds[1].fd = rdma->channel->fd;
> - pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> - pfds[1].revents = 0;
> -
> - /* 0.1s timeout, should be fine for a 'cancel' */
> - switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
> - case 2:
> - case 1: /* fd active */
> - if (pfds[0].revents) {
> - return 0;
> - }
> + while (!rdma->errored && !rdma->received_error) {
> + GPollFD pfds[2];
> + pfds[0].fd = comp_channel->fd;
> + pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> + pfds[0].revents = 0;
> +
> + pfds[1].fd = rdma->channel->fd;
> + pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> + pfds[1].revents = 0;
> +
> + /* 0.1s timeout, should be fine for a 'cancel' */
> + switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
Doesn't glib have facilities for polling? Isn't this what
qio_channel_rdma_create_watch() is for already?
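
E.g. plain g_poll() over the same two fds would look roughly like this (an
untested sketch; the timeout is in milliseconds):

    GPollFD pfds[2] = {
        { .fd = comp_channel->fd, .events = G_IO_IN | G_IO_HUP | G_IO_ERR },
        { .fd = rdma->channel->fd, .events = G_IO_IN | G_IO_HUP | G_IO_ERR },
    };
    g_poll(pfds, 2, 100); /* 0.1s, same timeout as the qemu_poll_ns() above */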
> + case 2:
> + case 1: /* fd active */
> + if (pfds[0].revents) {
> + return 0;
> + }
>
> - if (pfds[1].revents) {
> - if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> - return -1;
> - }
> + if (pfds[1].revents) {
> + if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> + return -1;
> + }
>
> - if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> - cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> - rdma_ack_cm_event(cm_event);
> - return -1;
> - }
> + if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> + cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> rdma_ack_cm_event(cm_event);
> + return -1;
> }
> - break;
> + rdma_ack_cm_event(cm_event);
> + }
> + break;
>
> - case 0: /* Timeout, go around again */
> - break;
> + case 0: /* Timeout, go around again */
> + break;
>
> - default: /* Error of some type -
> - * I don't trust errno from qemu_poll_ns
> - */
> - return -1;
> - }
> + default: /* Error of some type -
> + * I don't trust errno from qemu_poll_ns
> + */
> + return -1;
> + }
>
> - if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
> - /* Bail out and let the cancellation happen */
> - return -1;
> - }
> + if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
> + /* Bail out and let the cancellation happen */
> + return -1;
> }
> }
>
> @@ -3817,7 +3798,6 @@ static void rdma_accept_incoming_migration(void *opaque)
> return;
> }
>
> - rdma->migration_started_on_destination = 1;
> migration_fd_process_incoming(f);
> }
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
2025-08-27 23:51 ` Dr. David Alan Gilbert
2025-08-29 8:29 ` Vladimir Sementsov-Ogievskiy
@ 2025-09-17 18:23 ` Fabiano Rosas
2025-10-09 21:41 ` Peter Xu
2025-09-26 3:41 ` Zhijian Li (Fujitsu)
3 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-17 18:23 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> The migration module has been there for 10+ years. Initially, it was in most
> cases based on coroutines. As more features were added into the framework,
> like postcopy, multifd, etc., it became a mixture of threads and coroutines.
>
> I'm guessing coroutines just can't fix all the issues that migration wants
> to resolve.
>
> After all these years, migration is now heavily based on a threaded model.
>
> There's still one major part of the migration framework that is not
> thread-based, which is precopy load. We have done the load in a separate
> thread in postcopy since the first day postcopy was introduced; however,
> that requires a separate state transition after precopy first loads all
> devices, which still happens in the main thread in a coroutine.
>
> This patch tries to move the migration incoming side to be run inside a
> separate thread (mig/dst/main) just like the src (mig/src/main). The
> entrance to be migration_incoming_thread().
>
> Quite a few things are needed to make it fly..
>
> BQL Analysis
> ============
>
> Firstly, when moving it over to a thread, the thread cannot take the BQL
> during the whole process of loading anymore, because otherwise it would
> block the main thread from using the BQL for all kinds of other concurrent
> tasks (for example, processing QMP / HMP commands).
>
Pure question: how does load_snapshot avoid these issues if it doesn't
already use a coroutine? I feel I'm missing something.
Not that I disagree with the concerns around threading + BQL, I'm just
wondering if the issues may be fairly small since snapshots work fine.
> Here the first question to ask is: what needs BQL during precopy load, and
> what doesn't?
>
> Most of the load process shouldn't need BQL, especially when it's about
> RAM. After all, RAM is still the major chunk of data to move for a live
> migration process. VFIO has started to change that, but still, VFIO is
> per-device, so it shouldn't need BQL either in most cases.
>
> Generic device loads will need BQL, likely not when receiving VMSDs, but
> when applying them. One example is that any post_load() could potentially
> inject memory regions, causing memory transactions to happen. Those need
> to update the global address spaces, hence require BQL. The other one is
> CPU sync operations: even if the sync alone may not need BQL (which is
> still to be further justified), run_on_cpu() will need it.
>
> For that, the qemu_loadvm_state() and qemu_loadvm_state_main() functions
> now need to take a "bql_held" parameter saying whether the BQL is held. We
> could use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> locking rather than relying on the bql_locked TLS variable.
Why exactly? Seems redundant to plumb the variable through when we have
bql_locked and the macros around it.
At first sight I'd say we could already add BQL macros around code that
we're sure needs it, which would maybe simplify the patch a bit.
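For reference, a minimal sketch of that alternative, assuming
BQL_LOCK_GUARD() keeps its current semantics (take the lock only when
bql_locked() is false, release at scope exit only if taken here):

    /* hypothetical wrapper name, for illustration only */
    static int loadvm_section_start_full_locked(QEMUFile *f, uint8_t type)
    {
        /* no-op if the caller already holds the BQL */
        BQL_LOCK_GUARD();
        return qemu_loadvm_section_start_full(f, type);
    }

No bql_held flag would need to be plumbed through in that case.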
> In the case of
> migration, we always know whether the BQL is held in each context, as long
> as we can keep passing that information downwards.
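(For illustration, the resulting pattern in the hunks below is roughly:

    /* incoming thread entry: the BQL is never held here */
    ret = qemu_loadvm_state(mis->from_src_file, false);

    /* deeper in the load path, take the BQL only around what needs it */
    WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());

with both lines taken from the diff further down.)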
>
> COLO
> ====
>
> COLO assumed the dest VM load happens in a coroutine. After this patch,
> it no longer does. Change that by invoking colo_incoming_co() directly
> from migration_incoming_thread().
>
> The name (colo_incoming_co()) isn't appropriate anymore. Change it to
> colo_incoming_wait(), removing the coroutine annotation along the way.
>
> Remove all the bql_lock() implications in COLO; e.g., colo_incoming_co()
> used to release the lock for a short period while joining the COLO thread.
> Now that's not needed.
>
> Meanwhile, there's the colo_incoming_co variable, which used to store the
> COLO incoming coroutine, only so that it could be kicked when a secondary
> failover happens.
>
> To recap, what should happen for such a failover is (taking the example
> of the QMP command x-colo-lost-heartbeat triggering on dest QEMU):
>
> - The QMP command will kick off both the coroutine and the COLO
> thread (colo_process_incoming_thread()), with something like:
>
> /* Notify COLO incoming thread that failover work is finished */
> qemu_event_set(&mis->colo_incoming_event);
>
> qemu_coroutine_enter(mis->colo_incoming_co);
>
> - The coroutine, which yielded itself before, now resumes after enter(),
> then it'll wait for the join():
>
> mis->colo_incoming_co = qemu_coroutine_self();
> qemu_coroutine_yield();
> mis->colo_incoming_co = NULL;
>
> /* Wait checkpoint incoming thread exit before free resource */
> qemu_thread_join(&th);
>
> Here, when switching to the thread model, it should be fine to remove the
> colo_incoming_co variable completely: the incoming thread will (instead of
> yielding the coroutine) wait at qemu_thread_join() until the COLO thread
> completes execution (after receiving colo_incoming_event).
>
> RDMA
> ====
>
> With the prior patch making sure io_watch won't block for RDMA iochannels,
> RDMA threads should only block in their io_readv/io_writev functions. When
> a disconnection is detected (as in rdma_cm_poll_handler()), the update to
> the "errored" field will be immediately reflected in the migration incoming
> thread. Hence the coroutine is not needed anymore to kick the thread out
> for RDMA.
>
> TODO
> ====
>
> Currently the BQL is taken while loading a START|FULL section. When the
> IO hangs (e.g. a network issue) during this process, it could potentially
> block others like the monitor servers. One solution is breaking the BQL
> into smaller granularity and leaving the IOs always BQL-free. That'll need
> more justification.
>
> For example, there are at least four things that need some closer
> attention:
>
> - SaveVMHandlers's load_state(): this likely DOES NOT need BQL, but we need
> to justify all of them (not to mention, some of them look prone to being
> rewritten as VMSDs..)
>
> - VMSD's pre_load(): in most cases, this DOES NOT really need BQL, but
> sometimes maybe it will! Double-checking on this will be needed.
>
> - VMSD's post_load(): in many cases, this DOES need BQL, for example on
> address space operations. Likely we should just take it for any
> post_load().
>
> - VMSD field's get(): this is tricky! It could internally be anything,
> even if it was only a field. E.g. there can be users that use a SINGLE
> field to load a whole VMSD, which can further introduce more
> possibilities.
>
> In general, QEMUFile IOs should not need BQL; that is, when receiving the
> VMSD data and waiting for e.g. the socket buffer to get refilled. But
> that's the easy part.
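As a rough illustration of that split, with hypothetical helper names that
do not exist in QEMU today:

    /*
     * Hypothetical sketch only: receive the VMSD bytes with the BQL
     * released, then take the BQL just for the apply step.
     */
    WITHOUT_BQL_HELD(bql_held, ret = vmsd_receive_buffer(f, &buf));
    if (ret == 0) {
        WITH_BQL_HELD(bql_held, ret = vmsd_apply_buffer(se, &buf));
    }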
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/colo.h | 6 ++--
> migration/migration.h | 52 ++++++++++++++++++++++++++------
> migration/savevm.h | 5 ++--
> migration/channel.c | 7 ++---
> migration/colo-stubs.c | 2 +-
> migration/colo.c | 23 ++++-----------
> migration/migration.c | 62 ++++++++++++++++++++++++++++----------
> migration/rdma.c | 5 ----
> migration/savevm.c | 64 ++++++++++++++++++++++++----------------
> migration/trace-events | 4 +--
> 10 files changed, 142 insertions(+), 88 deletions(-)
>
> diff --git a/include/migration/colo.h b/include/migration/colo.h
> index 43222ef5ae..bfb30eccf0 100644
> --- a/include/migration/colo.h
> +++ b/include/migration/colo.h
> @@ -44,12 +44,10 @@ void colo_do_failover(void);
> void colo_checkpoint_delay_set(void);
>
> /*
> - * Starts COLO incoming process. Called from process_incoming_migration_co()
> + * Starts COLO incoming process. Called from migration_incoming_thread()
> * after loading the state.
> - *
> - * Called with BQL locked, may temporary release BQL.
> */
> -void coroutine_fn colo_incoming_co(void);
> +void colo_incoming_wait(void);
>
> void colo_shutdown(void);
> #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..c4a626eed4 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -42,6 +42,44 @@
> #define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen"
> #define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt"
>
> +/**
> + * WITH_BQL_HELD(): Run a task, making sure BQL is held
> + *
> + * @bql_held: Whether BQL is already held
> + * @task: The task to run within BQL held
> + */
> +#define WITH_BQL_HELD(bql_held, task) \
> + do { \
> + if (!bql_held) { \
> + bql_lock(); \
> + } else { \
> + assert(bql_locked()); \
> + } \
> + task; \
> + if (!bql_held) { \
> + bql_unlock(); \
> + } \
> + } while (0)
> +
> +/**
> + * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released
> + *
> + * @bql_held: Whether BQL is already held
> + * @task: The task to run making sure BQL released
> + */
> +#define WITHOUT_BQL_HELD(bql_held, task) \
> + do { \
> + if (bql_held) { \
> + bql_unlock(); \
> + } else { \
> + assert(!bql_locked()); \
> + } \
> + task; \
> + if (bql_held) { \
> + bql_lock(); \
> + } \
> + } while (0)
> +
> struct PostcopyBlocktimeContext;
> typedef struct ThreadPool ThreadPool;
>
> @@ -119,6 +157,10 @@ struct MigrationIncomingState {
> bool have_listen_thread;
> QemuThread listen_thread;
>
> + /* Migration main recv thread */
> + bool have_recv_thread;
> + QemuThread recv_thread;
> +
> /* For the kernel to send us notifications */
> int userfault_fd;
> /* To notify the fault_thread to wake, e.g., when need to quit */
> @@ -177,15 +219,7 @@ struct MigrationIncomingState {
>
> MigrationStatus state;
>
> - /*
> - * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> - * Used to wake the migration incoming coroutine from rdma code. How much is
> - * it safe - it's a question.
> - */
> - Coroutine *loadvm_co;
> -
> - /* The coroutine we should enter (back) after failover */
> - Coroutine *colo_incoming_co;
> + /* Notify secondary VM to move on */
> QemuEvent colo_incoming_event;
>
> /* Optional load threads pool and its thread exit request flag */
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 2d5e9c7166..c07e14f61a 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
> void qemu_savevm_live_state(QEMUFile *f);
> int qemu_save_device_state(QEMUFile *f);
>
> -int qemu_loadvm_state(QEMUFile *f);
> +int qemu_loadvm_state(QEMUFile *f, bool bql_held);
> void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
> -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> + bool bql_held);
> int qemu_load_device_state(QEMUFile *f);
> int qemu_loadvm_approve_switchover(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> diff --git a/migration/channel.c b/migration/channel.c
> index a547b1fbfe..621f8a4a2a 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc,
> }
>
> /* 1ms sleep. */
> - if (qemu_in_coroutine()) {
> - qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> - } else {
> - g_usleep(1000);
> - }
> + assert(!qemu_in_coroutine());
> + g_usleep(1000);
What bug is this hiding? =)
> }
>
> return 0;
> diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> index e22ce65234..ef77d1ab4b 100644
> --- a/migration/colo-stubs.c
> +++ b/migration/colo-stubs.c
> @@ -9,7 +9,7 @@ void colo_shutdown(void)
> {
> }
>
> -void coroutine_fn colo_incoming_co(void)
> +void colo_incoming_wait(void)
> {
> }
>
> diff --git a/migration/colo.c b/migration/colo.c
> index e0f713c837..f5722d9d9d 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
> }
> /* Notify COLO incoming thread that failover work is finished */
> qemu_event_set(&mis->colo_incoming_event);
> -
> - /* For Secondary VM, jump to incoming co */
> - if (mis->colo_incoming_co) {
> - qemu_coroutine_enter(mis->colo_incoming_co);
> - }
> }
>
> static void primary_vm_do_failover(void)
> @@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
>
> bql_lock();
> cpu_synchronize_all_states();
> - ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> + ret = qemu_loadvm_state_main(mis->from_src_file, mis, true);
> bql_unlock();
>
> if (ret < 0) {
> @@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque)
> goto out;
> }
> /*
> - * Note: the communication between Primary side and Secondary side
> - * should be sequential, we set the fd to unblocked in migration incoming
> - * coroutine, and here we are in the COLO incoming thread, so it is ok to
> - * set the fd back to blocked.
> + * Here we are in the COLO incoming thread, so it is ok to set the fd
> + * to blocked.
nit: s/blocked/blocking/
> */
> qemu_file_set_blocking(mis->from_src_file, true);
>
> @@ -930,26 +923,20 @@ out:
> return NULL;
> }
>
> -void coroutine_fn colo_incoming_co(void)
> +/* Wait for failover */
> +void colo_incoming_wait(void)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> QemuThread th;
>
> - assert(bql_locked());
> assert(migration_incoming_colo_enabled());
>
> qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
> colo_process_incoming_thread,
> mis, QEMU_THREAD_JOINABLE);
>
> - mis->colo_incoming_co = qemu_coroutine_self();
> - qemu_coroutine_yield();
> - mis->colo_incoming_co = NULL;
> -
> - bql_unlock();
What does the BQL protect from colo_do_failover() until here? Could we
have a preliminary patch reducing the BQL scope? I'm thinking about
which changes we can merge upfront so we're already testing our
assumptions before the whole series completes.
> /* Wait checkpoint incoming thread exit before free resource */
> qemu_thread_join(&th);
> - bql_lock();
>
> /* We hold the global BQL, so it is safe here */
> colo_release_ram_cache();
Maybe a candidate for WITH_BQL_HELD?
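(i.e., something like the sketch below, noting that in the new model
colo_incoming_wait() always runs without the BQL, so the flag would be a
constant false:

    WITH_BQL_HELD(false, colo_release_ram_cache());

)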
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..7e4d25b15c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
> mis->postcopy_qemufile_dst = NULL;
> }
>
> + if (mis->have_recv_thread) {
Help me out here, is this read race-free with the write at
migration_incoming_process() because...?
This read is reached from qemu_bh_schedule() while the write comes from
the qio_channel_add_watch_full() callback. Are those (potentially)
different AioContexts and thus the BQL is what's really doing the work
(instead of the event serialization caused by the main loop)?
> + qemu_thread_join(&mis->recv_thread);
> + mis->have_recv_thread = false;
> + }
> +
> cpr_set_incoming_mode(MIG_MODE_NONE);
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
> @@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque)
BTW, I don't think we need this BH anymore since effd60c8781. This
process_incoming_migration_bh was originally introduced (0aa6aefc9) to
move bdrv_activate_all out of coroutines due to a bug in the block
layer, basically this:
bdrv_activate_all ->
  bdrv_invalidate_cache ->
    coroutine_fn qcow2_co_invalidate_cache
    {
        ...
        BDRVQcow2State *s = bs->opaque;
        ...
        memset(s, 0, sizeof(BDRVQcow2State));  <-- clears memory
        ...
        flags &= ~BDRV_O_INACTIVE;
        qemu_co_mutex_lock(&s->lock);
        ret = qcow2_do_open(bs, options, flags, false, errp);
        ^ may yield before repopulating all of BDRVQcow2State;
          "info block" or something else reads 's' and BOOM
        qemu_co_mutex_unlock(&s->lock);
        ...
    }
Note that 2 years after the BH creation (0aa6aefc9), 2b148f392b2 moved
the invalidate function back into a coroutine anyway.
> migration_incoming_state_destroy();
> }
>
> -static void coroutine_fn
> -process_incoming_migration_co(void *opaque)
> +static void migration_incoming_state_destroy_bh(void *opaque)
I only mention all of the above because it would allow merging the two
paths that call migration_incoming_state_destroy() and avoid this new
BH.
> +{
> + struct MigrationIncomingState *mis = opaque;
> +
> + if (mis->exit_on_error) {
> + /*
> + * NOTE: this exit() should better happen in the main thread, as
> + * the exit notifier may require BQL which can deadlock. See
> + * commit e7bc0204e57836 for example.
> + */
> + exit(EXIT_FAILURE);
> + }
> +
> + migration_incoming_state_destroy();
> +}
> +
> +static void *migration_incoming_thread(void *opaque)
> {
> MigrationState *s = migrate_get_current();
> - MigrationIncomingState *mis = migration_incoming_get_current();
> + MigrationIncomingState *mis = opaque;
> PostcopyState ps;
> int ret;
> Error *local_err = NULL;
>
> + rcu_register_thread();
> +
> assert(mis->from_src_file);
> + assert(!bql_locked());
>
> mis->largest_page_size = qemu_ram_pagesize_largest();
> postcopy_state_set(POSTCOPY_INCOMING_NONE);
> migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
> MIGRATION_STATUS_ACTIVE);
>
> - mis->loadvm_co = qemu_coroutine_self();
> - ret = qemu_loadvm_state(mis->from_src_file);
> - mis->loadvm_co = NULL;
> + ret = qemu_loadvm_state(mis->from_src_file, false);
>
> trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>
> ps = postcopy_state_get();
> - trace_process_incoming_migration_co_end(ret, ps);
> + trace_process_incoming_migration_end(ret, ps);
> if (ps != POSTCOPY_INCOMING_NONE) {
> if (ps == POSTCOPY_INCOMING_ADVISE) {
> /*
> @@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque)
> * Postcopy was started, cleanup should happen at the end of the
> * postcopy thread.
> */
> - trace_process_incoming_migration_co_postcopy_end_main();
> + trace_process_incoming_migration_postcopy_end_main();
> goto out;
> }
> /* Else if something went wrong then just fall out of the normal exit */
> @@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque)
> }
>
> if (migration_incoming_colo_enabled()) {
> - /* yield until COLO exit */
> - colo_incoming_co();
> + /* wait until COLO exits */
> + colo_incoming_wait();
> }
>
> migration_bh_schedule(process_incoming_migration_bh, mis);
> @@ -926,19 +947,24 @@ fail:
> migrate_set_error(s, local_err);
> error_free(local_err);
>
> - migration_incoming_state_destroy();
> -
Moving this below the exit will affect the source, I think; for instance:

migration_incoming_state_destroy
{
    ...
    /* Tell source that we are done */
    migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
> if (mis->exit_on_error) {
> WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> error_report_err(s->error);
> s->error = NULL;
> }
> -
> - exit(EXIT_FAILURE);
> }
> +
> + /*
> + * There's some step of the destroy process that will need to happen in
> + * the main thread (e.g. joining this thread itself). Leave to a BH.
> + */
> + migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
> +
> out:
> /* Pairs with the refcount taken in qmp_migrate_incoming() */
> migrate_incoming_unref_outgoing_state();
> + rcu_unregister_thread();
> + return NULL;
> }
>
> /**
> @@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f)
>
> void migration_incoming_process(void)
> {
> - Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> - qemu_coroutine_enter(co);
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + mis->have_recv_thread = true;
> + qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> + migration_incoming_thread, mis,
> + QEMU_THREAD_JOINABLE);
> }
>
> /* Returns true if recovered from a paused migration, otherwise false */
> diff --git a/migration/rdma.c b/migration/rdma.c
> index bcd7aae2f2..2b995513aa 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> {
> RDMAContext *rdma = opaque;
> struct rdma_cm_event *cm_event;
> - MigrationIncomingState *mis = migration_incoming_get_current();
>
> if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> error_report("get_cm_event failed %d", errno);
> @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> }
> }
> rdma_ack_cm_event(cm_event);
> - if (mis->loadvm_co) {
> - qemu_coroutine_enter(mis->loadvm_co);
> - }
> - return;
> }
> rdma_ack_cm_event(cm_event);
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fabbeb296a..ad606c5425 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
> }
>
> static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
> - MigrationIncomingState *mis)
> + MigrationIncomingState *mis,
> + bool bql_held)
> {
> - bql_unlock(); /* Let load threads do work requiring BQL */
> - thread_pool_wait(mis->load_threads);
> - bql_lock();
> + WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads));
>
> return !migrate_has_error(s);
> }
> @@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque)
> trace_postcopy_ram_listen_thread_start();
>
> rcu_register_thread();
> - /*
> - * Because we're a thread and not a coroutine we can't yield
> - * in qemu_file, and thus we must be blocking now.
> - */
> + /* Because we're a thread, making sure to use blocking mode */
> qemu_file_set_blocking(f, true);
>
> /* TODO: sanity check that only postcopiable data will be loaded here */
> - load_res = qemu_loadvm_state_main(f, mis);
> + load_res = qemu_loadvm_state_main(f, mis, false);
>
> /*
> * This is tricky, but, mis->from_src_file can change after it
> @@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> * Immediately following this command is a blob of data containing an embedded
> * chunk of migration stream; read it and load it.
> *
> - * @mis: Incoming state
> - * @length: Length of packaged data to read
> + * @mis: Incoming state
> + * @bql_held: Whether BQL is held already
> *
> * Returns: Negative values on error
> *
> */
> -static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> + bool bql_held)
> {
> int ret;
> size_t length;
> @@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> qemu_coroutine_yield();
> } while (1);
>
> - ret = qemu_loadvm_state_main(packf, mis);
> + ret = qemu_loadvm_state_main(packf, mis, bql_held);
> trace_loadvm_handle_cmd_packaged_main(ret);
> qemu_fclose(packf);
> object_unref(OBJECT(bioc));
> @@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void)
> * LOADVM_QUIT All good, but exit the loop
> * <0 Error
> */
> -static int loadvm_process_command(QEMUFile *f)
> +static int loadvm_process_command(QEMUFile *f, bool bql_held)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> uint16_t cmd;
> @@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f)
> break;
>
> case MIG_CMD_PACKAGED:
> - return loadvm_handle_cmd_packaged(mis);
> + return loadvm_handle_cmd_packaged(mis, bql_held);
>
> case MIG_CMD_POSTCOPY_ADVISE:
> return loadvm_postcopy_handle_advise(mis, len);
> @@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
> return true;
> }
>
> -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> + bool bql_held)
> {
> uint8_t section_type;
> int ret = 0;
> @@ -3046,7 +3044,15 @@ retry:
> switch (section_type) {
> case QEMU_VM_SECTION_START:
> case QEMU_VM_SECTION_FULL:
> - ret = qemu_loadvm_section_start_full(f, section_type);
> + /*
> + * FULL should normally require BQL, e.g. during post_load()
> + * there can be memory region updates. START may or may not
> + * require it, but just to keep it simple to always hold BQL
> + * for now.
> + */
> + WITH_BQL_HELD(
> + bql_held,
> + ret = qemu_loadvm_section_start_full(f, section_type));
> if (ret < 0) {
> goto out;
> }
> @@ -3059,7 +3065,11 @@ retry:
> }
> break;
> case QEMU_VM_COMMAND:
> - ret = loadvm_process_command(f);
> + /*
> + * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it
> + * may internally need BQL.
> + */
> + ret = loadvm_process_command(f, bql_held);
> trace_qemu_loadvm_state_section_command(ret);
> if ((ret < 0) || (ret == LOADVM_QUIT)) {
> goto out;
> @@ -3103,7 +3113,7 @@ out:
> return ret;
> }
>
> -int qemu_loadvm_state(QEMUFile *f)
> +int qemu_loadvm_state(QEMUFile *f, bool bql_held)
> {
> MigrationState *s = migrate_get_current();
> MigrationIncomingState *mis = migration_incoming_get_current();
> @@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f)
> qemu_loadvm_state_switchover_ack_needed(mis);
> }
>
> - cpu_synchronize_all_pre_loadvm();
> + /* run_on_cpu() requires BQL */
> + WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
>
> - ret = qemu_loadvm_state_main(f, mis);
> + ret = qemu_loadvm_state_main(f, mis, bql_held);
> qemu_event_set(&mis->main_thread_load_event);
>
> trace_qemu_loadvm_state_post_main(ret);
> @@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f)
> /* When reaching here, it must be precopy */
> if (ret == 0) {
> if (migrate_has_error(migrate_get_current()) ||
> - !qemu_loadvm_thread_pool_wait(s, mis)) {
> + !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) {
> ret = -EINVAL;
> } else {
> ret = qemu_file_get_error(f);
> @@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f)
> }
> }
>
> - cpu_synchronize_all_post_init();
> + /* run_on_cpu() requires BQL */
> + WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init());
>
> return ret;
> }
> @@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f)
> int ret;
>
> /* Load QEMU_VM_SECTION_FULL section */
> - ret = qemu_loadvm_state_main(f, mis);
> + ret = qemu_loadvm_state_main(f, mis, true);
> if (ret < 0) {
> error_report("Failed to load device state: %d", ret);
> return ret;
> @@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
> f = qemu_file_new_input(QIO_CHANNEL(ioc));
> object_unref(OBJECT(ioc));
>
> - ret = qemu_loadvm_state(f);
> + ret = qemu_loadvm_state(f, true);
> qemu_fclose(f);
> if (ret < 0) {
> error_setg(errp, "loading Xen device state failed");
> @@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate,
> ret = -EINVAL;
> goto err_drain;
> }
> - ret = qemu_loadvm_state(f);
> + ret = qemu_loadvm_state(f, true);
> migration_incoming_state_destroy();
>
> bdrv_drain_all_end();
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97def..eeb41e03f1 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> source_return_path_thread_switchover_acked(void) ""
> migration_thread_low_pending(uint64_t pending) "%" PRIu64
> migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -process_incoming_migration_co_postcopy_end_main(void) ""
> +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> +process_incoming_migration_postcopy_end_main(void) ""
> postcopy_preempt_enabled(bool value) "%d"
> migration_precopy_complete(void) ""
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 7/9] migration/postcopy: Remove workaround on wait preempt channel
2025-08-27 20:59 ` [PATCH RFC 7/9] migration/postcopy: Remove workaround on wait preempt channel Peter Xu
@ 2025-09-17 18:30 ` Fabiano Rosas
0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-17 18:30 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> This reverts commit 7afbdada7effbc2b97281bfbce0c6df351a3cf88.
>
> Now, after switching to a thread for the loadvm process, the main thread
> should be able to accept() even if loading the package could cause a page
> fault in the userfaultfd path.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/savevm.c | 21 ---------------------
> 1 file changed, 21 deletions(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index ad606c5425..8018f7ad31 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2425,27 +2425,6 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
>
> QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
>
> - /*
> - * Before loading the guest states, ensure that the preempt channel has
> - * been ready to use, as some of the states (e.g. via virtio_load) might
> - * trigger page faults that will be handled through the preempt channel.
> - * So yield to the main thread in the case that the channel create event
> - * hasn't been dispatched.
> - *
> - * TODO: if we can move migration loadvm out of main thread, then we
> - * won't block main thread from polling the accept() fds. We can drop
> - * this as a whole when that is done.
> - */
> - do {
> - if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
> - mis->postcopy_qemufile_dst) {
> - break;
> - }
> -
> - aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
> - qemu_coroutine_yield();
> - } while (1);
> -
> ret = qemu_loadvm_state_main(packf, mis, bql_held);
> trace_loadvm_handle_cmd_packaged_main(ret);
> qemu_fclose(packf);
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 8/9] migration/ram: Remove workaround on ram yield during load
2025-08-27 20:59 ` [PATCH RFC 8/9] migration/ram: Remove workaround on ram yield during load Peter Xu
@ 2025-09-17 18:31 ` Fabiano Rosas
0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-17 18:31 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> This reverts e65cec5e5d97927d22b39167d3e8edeffc771788.
>
> The RAM load path used to have a hack to explicitly yield the thread to
> the main coroutine when the RAM load was spinning in a tight loop. That's
> not needed anymore because precopy RAM load no longer happens in the main
> thread.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/ram.c | 13 +------------
> 1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 7208bc114f..2d9a6d1095 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4168,7 +4168,7 @@ static int parse_ramblocks(QEMUFile *f, ram_addr_t total_ram_bytes)
> static int ram_load_precopy(QEMUFile *f)
> {
> MigrationIncomingState *mis = migration_incoming_get_current();
> - int flags = 0, ret = 0, invalid_flags = 0, i = 0;
> + int flags = 0, ret = 0, invalid_flags = 0;
>
> if (migrate_mapped_ram()) {
> invalid_flags |= (RAM_SAVE_FLAG_HOOK | RAM_SAVE_FLAG_MULTIFD_FLUSH |
> @@ -4181,17 +4181,6 @@ static int ram_load_precopy(QEMUFile *f)
> void *host = NULL, *host_bak = NULL;
> uint8_t ch;
>
> - /*
> - * Yield periodically to let main loop run, but an iteration of
> - * the main loop is expensive, so do it each some iterations
> - */
> - if ((i & 32767) == 0 && qemu_in_coroutine()) {
> - aio_co_schedule(qemu_get_current_aio_context(),
> - qemu_coroutine_self());
> - qemu_coroutine_yield();
> - }
> - i++;
> -
> addr = qemu_get_be64(f);
> ret = qemu_file_get_error(f);
> if (ret) {
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler
2025-08-27 20:59 ` [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler Peter Xu
@ 2025-09-17 18:38 ` Fabiano Rosas
2025-10-08 21:22 ` Peter Xu
2025-09-26 3:38 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2025-09-17 18:38 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Dr . David Alan Gilbert, peterx, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
Peter Xu <peterx@redhat.com> writes:
> This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
>
> It was needed because the RDMA iochannel on the dest QEMU used to only
> yield without monitoring the fd. Now the fd is monitored by the same
> poll() as on the src QEMU, in qemu_rdma_wait_comp_channel(). So even
> without the fd handler, the dest QEMU should be able to receive the
> events.
>
> I tested this by initiating an RDMA migration, then doing one of two
> things:
>
> - Either do migrate_cancel on src, or,
> - Directly kill the destination QEMU
>
> In both cases, the QEMU on the other side will be able to receive the
> disconnect event in qemu_rdma_wait_comp_channel() and properly cancel or
> fail the migration.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 29 +----------------------------
> 1 file changed, 1 insertion(+), 28 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 7751262460..da7fd48bf3 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3045,32 +3045,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>
> static void rdma_accept_incoming_migration(void *opaque);
>
> -static void rdma_cm_poll_handler(void *opaque)
> -{
> - RDMAContext *rdma = opaque;
> - struct rdma_cm_event *cm_event;
> -
> - if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> - error_report("get_cm_event failed %d", errno);
> - return;
> - }
> -
> - if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> - cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> - if (!rdma->errored &&
> - migration_incoming_get_current()->state !=
> - MIGRATION_STATUS_COMPLETED) {
> - error_report("receive cm event, cm event is %d", cm_event->event);
> - rdma->errored = true;
> - if (rdma->return_path) {
> - rdma->return_path->errored = true;
> - }
> - }
> - rdma_ack_cm_event(cm_event);
> - }
> - rdma_ack_cm_event(cm_event);
> -}
> -
> static int qemu_rdma_accept(RDMAContext *rdma)
> {
> Error *err = NULL;
> @@ -3188,8 +3162,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
> NULL,
> (void *)(intptr_t)rdma->return_path);
> } else {
> - qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
> - NULL, rdma);
> + qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
I'm not familiar with this code, but is this left here to remove the
handler? Can't we remove this line altogether?
> }
>
> ret = rdma_accept(rdma->cm_id, &conn_param);
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
2025-08-27 20:59 ` [PATCH RFC 2/9] migration/rdma: Fix wrong context in qio_channel_rdma_shutdown() Peter Xu
2025-09-16 21:41 ` Fabiano Rosas
@ 2025-09-26 1:01 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 1:01 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin, Lidong Chen
On 28/08/2025 04:59, Peter Xu wrote:
> The rdmaout should be a cache of rioc->rdmaout, not rioc->rdmain.
>
> Cc: Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>
> Cc: Lidong Chen <jemmy858585@gmail.com>
> Fixes: 54db882f07 ("migration: implement the shutdown for RDMA QIOChannel")
> Signed-off-by: Peter Xu <peterx@redhat.com>
Good catch.
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
> migration/rdma.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 2d839fce6c..e6837184c8 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2986,7 +2986,7 @@ qio_channel_rdma_shutdown(QIOChannel *ioc,
> RCU_READ_LOCK_GUARD();
>
> rdmain = qatomic_rcu_read(&rioc->rdmain);
> - rdmaout = qatomic_rcu_read(&rioc->rdmain);
> + rdmaout = qatomic_rcu_read(&rioc->rdmaout);
>
> switch (how) {
> case QIO_CHANNEL_SHUTDOWN_READ:
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
2025-08-27 20:59 ` [PATCH RFC 3/9] migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread Peter Xu
2025-09-16 21:50 ` Fabiano Rosas
@ 2025-09-26 1:02 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 1:02 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin
On 28/08/2025 04:59, Peter Xu wrote:
> It's almost there, except that currently it relies on a global flag
> showing that we're in an incoming migration.
>
> Change it to detect a coroutine instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
> migration/rdma.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index e6837184c8..ed4e20b988 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1357,7 +1357,8 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> * so don't yield unless we know we're running inside of a coroutine.
> */
> if (rdma->migration_started_on_destination &&
> - migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
> + migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
> + qemu_in_coroutine()) {
> yield_until_fd_readable(comp_channel->fd);
> } else {
> /* This is the source side, we're in a separate thread
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately
2025-08-27 20:59 ` [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately Peter Xu
2025-09-16 22:35 ` Fabiano Rosas
@ 2025-09-26 2:39 ` Zhijian Li (Fujitsu)
2025-10-08 20:42 ` Peter Xu
1 sibling, 1 reply; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 2:39 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin
On 28/08/2025 04:59, Peter Xu wrote:
> The old RDMA's io_create_watch() isn't really doing much work anyway. For
> G_IO_OUT, it already returns immediately. For G_IO_IN, it will try to
> detect some RDMA context length; however, normally nobody will be able to
> set it at all.
>
First, RDMA migration works well with this patch applied.
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
I have a small question. While testing, I didn't observe any callers to
qio_channel_rdma_create_watch() during a complete RDMA migration using
the default capabilities and parameters.
I was wondering in which cases this function is expected to be called.
(I see io_create_watch() is mandatory for QIOChannelClass)
Thanks
Zhijian
> Simplify the code so that RDMA iochannels simply always rely on synchronous
> reads and writes. This is highly likely what 6ddd2d76ca6f86f was talking
> about: that the async model isn't really working well.
>
> This helps because this is almost the only reason the migration core would
> need a coroutine for rdma channels.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 69 +++---------------------------------------------
> 1 file changed, 3 insertions(+), 66 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index ed4e20b988..bcd7aae2f2 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -2789,56 +2789,14 @@ static gboolean
> qio_channel_rdma_source_prepare(GSource *source,
> gint *timeout)
> {
> - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> *timeout = -1;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when prepare Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
> -
> - return cond & rsource->condition;
> + return TRUE;
> }
>
> static gboolean
> qio_channel_rdma_source_check(GSource *source)
> {
> - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when check Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
> -
> - return cond & rsource->condition;
> + return TRUE;
> }
>
> static gboolean
> @@ -2848,29 +2806,8 @@ qio_channel_rdma_source_dispatch(GSource *source,
> {
> QIOChannelFunc func = (QIOChannelFunc)callback;
> QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> - RDMAContext *rdma;
> - GIOCondition cond = 0;
> -
> - RCU_READ_LOCK_GUARD();
> - if (rsource->condition == G_IO_IN) {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> - } else {
> - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> - }
> -
> - if (!rdma) {
> - error_report("RDMAContext is NULL when dispatch Gsource");
> - return FALSE;
> - }
> -
> - if (rdma->wr_data[0].control_len) {
> - cond |= G_IO_IN;
> - }
> - cond |= G_IO_OUT;
>
> - return (*func)(QIO_CHANNEL(rsource->rioc),
> - (cond & rsource->condition),
> - user_data);
> + return (*func)(QIO_CHANNEL(rsource->rioc), rsource->condition, user_data);
> }
>
> static void
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
2025-08-27 20:59 ` [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel Peter Xu
2025-09-16 22:39 ` Fabiano Rosas
@ 2025-09-26 2:44 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 2:44 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin
On 28/08/2025 04:59, Peter Xu wrote:
> Now, after threadifying the dest VM load during precopy, we will always
> be in a thread context rather than within a coroutine. We can remove this
> path now.
>
> With that, migration_started_on_destination can go away too.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Thanks
Zhijian
> ---
> migration/rdma.c | 102 +++++++++++++++++++----------------------------
> 1 file changed, 41 insertions(+), 61 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 2b995513aa..7751262460 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -29,7 +29,6 @@
> #include "qemu/rcu.h"
> #include "qemu/sockets.h"
> #include "qemu/bitmap.h"
> -#include "qemu/coroutine.h"
> #include "system/memory.h"
> #include <sys/socket.h>
> #include <netdb.h>
> @@ -357,13 +356,6 @@ typedef struct RDMAContext {
> /* Index of the next RAMBlock received during block registration */
> unsigned int next_src_index;
>
> - /*
> - * Migration on *destination* started.
> - * Then use coroutine yield function.
> - * Source runs in a thread, so we don't care.
> - */
> - int migration_started_on_destination;
> -
> int total_registrations;
> int total_writes;
>
> @@ -1353,66 +1345,55 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> struct rdma_cm_event *cm_event;
>
> /*
> - * Coroutine doesn't start until migration_fd_process_incoming()
> - * so don't yield unless we know we're running inside of a coroutine.
> + * This is the source or dest side, either during precopy or
> + * postcopy. We're always in a separate thread when reaching here.
> + * Poll the fd. We need to be able to handle 'cancel' or an error
> + * without hanging forever.
> */
> - if (rdma->migration_started_on_destination &&
> - migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
> - qemu_in_coroutine()) {
> - yield_until_fd_readable(comp_channel->fd);
> - } else {
> - /* This is the source side, we're in a separate thread
> - * or destination prior to migration_fd_process_incoming()
> - * after postcopy, the destination also in a separate thread.
> - * we can't yield; so we have to poll the fd.
> - * But we need to be able to handle 'cancel' or an error
> - * without hanging forever.
> - */
> - while (!rdma->errored && !rdma->received_error) {
> - GPollFD pfds[2];
> - pfds[0].fd = comp_channel->fd;
> - pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> - pfds[0].revents = 0;
> -
> - pfds[1].fd = rdma->channel->fd;
> - pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> - pfds[1].revents = 0;
> -
> - /* 0.1s timeout, should be fine for a 'cancel' */
> - switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
> - case 2:
> - case 1: /* fd active */
> - if (pfds[0].revents) {
> - return 0;
> - }
> + while (!rdma->errored && !rdma->received_error) {
> + GPollFD pfds[2];
> + pfds[0].fd = comp_channel->fd;
> + pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> + pfds[0].revents = 0;
> +
> + pfds[1].fd = rdma->channel->fd;
> + pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> + pfds[1].revents = 0;
> +
> + /* 0.1s timeout, should be fine for a 'cancel' */
> + switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
> + case 2:
> + case 1: /* fd active */
> + if (pfds[0].revents) {
> + return 0;
> + }
>
> - if (pfds[1].revents) {
> - if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> - return -1;
> - }
> + if (pfds[1].revents) {
> + if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> + return -1;
> + }
>
> - if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> - cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> - rdma_ack_cm_event(cm_event);
> - return -1;
> - }
> + if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> + cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> rdma_ack_cm_event(cm_event);
> + return -1;
> }
> - break;
> + rdma_ack_cm_event(cm_event);
> + }
> + break;
>
> - case 0: /* Timeout, go around again */
> - break;
> + case 0: /* Timeout, go around again */
> + break;
>
> - default: /* Error of some type -
> - * I don't trust errno from qemu_poll_ns
> - */
> - return -1;
> - }
> + default: /* Error of some type -
> + * I don't trust errno from qemu_poll_ns
> + */
> + return -1;
> + }
>
> - if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
> - /* Bail out and let the cancellation happen */
> - return -1;
> - }
> + if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
> + /* Bail out and let the cancellation happen */
> + return -1;
> }
> }
>
> @@ -3817,7 +3798,6 @@ static void rdma_accept_incoming_migration(void *opaque)
> return;
> }
>
> - rdma->migration_started_on_destination = 1;
> migration_fd_process_incoming(f);
> }
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler
2025-08-27 20:59 ` [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler Peter Xu
2025-09-17 18:38 ` Fabiano Rosas
@ 2025-09-26 3:38 ` Zhijian Li (Fujitsu)
1 sibling, 0 replies; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 3:38 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin
On 28/08/2025 04:59, Peter Xu wrote:
> This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
>
> It was needed because the RDMA iochannel on the dest QEMU used to only
> yield without monitoring the fd. Now the fd is monitored by the same
> poll() as on the src QEMU, in qemu_rdma_wait_comp_channel(). So even
> without the fd handler, the dest QEMU should be able to receive the
> events.
Agree
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
>
> I tested this by initiating an RDMA migration, then doing one of two
> things:
>
> - Either do migrate_cancel on src, or,
> - Directly kill the destination QEMU
>
> In both cases, the QEMU on the other side will be able to receive the
> disconnect event in qemu_rdma_wait_comp_channel() and properly cancel or
> fail the migration.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/rdma.c | 29 +----------------------------
> 1 file changed, 1 insertion(+), 28 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 7751262460..da7fd48bf3 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3045,32 +3045,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>
> static void rdma_accept_incoming_migration(void *opaque);
>
> -static void rdma_cm_poll_handler(void *opaque)
> -{
> - RDMAContext *rdma = opaque;
> - struct rdma_cm_event *cm_event;
> -
> - if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> - error_report("get_cm_event failed %d", errno);
> - return;
> - }
> -
> - if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> - cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> - if (!rdma->errored &&
> - migration_incoming_get_current()->state !=
> - MIGRATION_STATUS_COMPLETED) {
> - error_report("receive cm event, cm event is %d", cm_event->event);
> - rdma->errored = true;
> - if (rdma->return_path) {
> - rdma->return_path->errored = true;
> - }
> - }
> - rdma_ack_cm_event(cm_event);
> - }
> - rdma_ack_cm_event(cm_event);
> -}
> -
> static int qemu_rdma_accept(RDMAContext *rdma)
> {
> Error *err = NULL;
> @@ -3188,8 +3162,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
> NULL,
> (void *)(intptr_t)rdma->return_path);
> } else {
> - qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
> - NULL, rdma);
> + qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
> }
>
> ret = rdma_accept(rdma->cm_id, &conn_param);
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-08-27 20:59 ` [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process Peter Xu
` (2 preceding siblings ...)
2025-09-17 18:23 ` Fabiano Rosas
@ 2025-09-26 3:41 ` Zhijian Li (Fujitsu)
2025-10-08 21:10 ` Peter Xu
3 siblings, 1 reply; 45+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-09-26 3:41 UTC (permalink / raw)
To: Peter Xu, qemu-devel@nongnu.org
Cc: Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Fabiano Rosas, Hailiang Zhang,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Zhang Chen, Juraj Marcin
On 28/08/2025 04:59, Peter Xu wrote:
> diff --git a/migration/rdma.c b/migration/rdma.c
> index bcd7aae2f2..2b995513aa 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> {
> RDMAContext *rdma = opaque;
> struct rdma_cm_event *cm_event;
> - MigrationIncomingState *mis = migration_incoming_get_current();
>
> if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> error_report("get_cm_event failed %d", errno);
> @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> }
> }
> rdma_ack_cm_event(cm_event);
The line above should be removed as well; otherwise it will cause a double cm_event free.
> - if (mis->loadvm_co) {
> - qemu_coroutine_enter(mis->loadvm_co);
> - }
> - return;
> }
> rdma_ack_cm_event(cm_event);
> }
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately
2025-09-16 22:35 ` Fabiano Rosas
@ 2025-10-08 20:34 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 20:34 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Tue, Sep 16, 2025 at 07:35:45PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > The old RDMA's io_create_watch() isn't really doing much work anyway. For
> > G_IO_OUT, it already returns immediately. For G_IO_IN, it will try to
> > detect some RDMA context length; however, normally nobody will be able to
> > set it at all.
> >
> > Simplify the code so that RDMA iochannels simply always rely on synchronous
> > reads and writes. This is highly likely what 6ddd2d76ca6f86f was talking
> > about: that the async model isn't really working well.
> >
> > This helps because this is almost the only reason the migration core would
> > need a coroutine for rdma channels.
> >
>
> I don't understand this. How does this code require a coroutine? Isn't
> the io_watch exactly the strategy used when there is no coroutine?
Good question. I can't remember what I was picturing when writing it.
Here the rationale should be: RDMA works slightly differently from other
iochannels, because its async model doesn't really work
asynchronously.. instead, no matter whether the channel is in sync/async
mode, it always only works in a sync manner.
Here, when I was saying async, I meant that we currently always set
NONBLOCK for the incoming main channel.
For non-RDMA channels, what happens with the current master branch is that
when we have nothing to read, we yield at qemu_fill_buffer().
For RDMA channels, what I see is that it always polls on its own and yields
at qemu_rdma_wait_comp_channel(). A sample stack:
#0 qemu_coroutine_yield
#1 0x0000562e46e51f77 in yield_until_fd_readable
#2 0x0000562e46927823 in qemu_rdma_wait_comp_channel
#3 0x0000562e46927b35 in qemu_rdma_block_for_wrid
#4 0x0000562e46927e6f in qemu_rdma_post_send_control
#5 0x0000562e4692857f in qemu_rdma_exchange_recv
#6 0x0000562e4692ab5e in qio_channel_rdma_readv
#7 0x0000562e46c1f2d7 in qio_channel_readv_full
#8 0x0000562e46c13a6e in qemu_fill_buffer
#9 0x0000562e46c14ba8 in qemu_peek_byte
#10 0x0000562e46c14c09 in qemu_get_byte
#11 0x0000562e46c14e2a in qemu_get_be32
#12 0x0000562e46c14e8a in qemu_get_be64
#13 0x0000562e46913f08 in ram_load_precopy
#14 0x0000562e46914448 in ram_load
#15 0x0000562e469186e3 in vmstate_load
#16 0x0000562e4691ce6d in qemu_loadvm_section_part_end
#17 0x0000562e4691d99b in qemu_loadvm_state_main
#18 0x0000562e4691db87 in qemu_loadvm_state
#19 0x0000562e468f2e87 in process_incoming_migration_co
AFAICT, this is the only channel that does explicit yields internally,
rather than relying on the iochannel/qemufile framework, aka,
qemu_fill_buffer().
IOW, I don't even know when RDMA's qemu_fill_buffer() will internally get a
retval of QIO_CHANNEL_ERR_BLOCK from its qio_channel_readv_full(), because
rdma's io_readv() always ignores NONBLOCK.. AFAIU.
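(For contrast, a nonblocking io_readv in other channels reports a
would-block condition roughly like the sketch below; AFAICT the RDMA
channel never takes this path:

    /* sketch: e.g. how the socket channel signals "no data yet" */
    if (errno == EAGAIN) {
        return QIO_CHANNEL_ERR_BLOCK;
    }

)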
Now, going back to this patch: since I never hit QIO_CHANNEL_ERR_BLOCK
before, I don't think I know when I'll need this patch, but I had it to
make sure that after we switch to the thread model, we will never go
into qio_channel_wait(), because IIUC it's fundamentally broken. With
this patch applied, it'll reliably retry immediately. Again, I don't know
when it'll become useful, but I'm trying to make sure we stick with the
sole place (qemu_rdma_wait_comp_channel) for polling things.
So I plan to remove this sentence, which looks misleading. Meanwhile I can
add some of the above into it.
>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/rdma.c | 69 +++---------------------------------------------
> > 1 file changed, 3 insertions(+), 66 deletions(-)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index ed4e20b988..bcd7aae2f2 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -2789,56 +2789,14 @@ static gboolean
> > qio_channel_rdma_source_prepare(GSource *source,
> > gint *timeout)
> > {
> > - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > *timeout = -1;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when prepare Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> > -
> > - return cond & rsource->condition;
> > + return TRUE;
> > }
> >
> > static gboolean
> > qio_channel_rdma_source_check(GSource *source)
> > {
> > - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when check Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> > -
> > - return cond & rsource->condition;
> > + return TRUE;
>
> These are fine if we want the source to run as soon as possible, I
> think. But then...
>
> > }
> >
> > static gboolean
> > @@ -2848,29 +2806,8 @@ qio_channel_rdma_source_dispatch(GSource *source,
> > {
> > QIOChannelFunc func = (QIOChannelFunc)callback;
> > QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when dispatch Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> >
> > - return (*func)(QIO_CHANNEL(rsource->rioc),
> > - (cond & rsource->condition),
> > - user_data);
> > + return (*func)(QIO_CHANNEL(rsource->rioc), rsource->condition, user_data);
>
> No idea who even calls g_source_set_callback() in this case. What is func?
In terms of qio_channel_wait(), func is qio_channel_wait_complete().
After this patch, qio_channel_wait_complete() will be invoked immediately,
hence qio_channel_wait() will reliably return immediately for rdma channels.
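For reference, here is roughly what qio_channel_wait() does, paraphrased
from memory rather than quoted from the exact code:

    void qio_channel_wait(QIOChannel *ioc, GIOCondition condition)
    {
        GMainContext *ctxt = g_main_context_new();
        GMainLoop *loop = g_main_loop_new(ctxt, TRUE);
        GSource *source = qio_channel_create_watch(ioc, condition);

        /* the callback is qio_channel_wait_complete(): it quits the loop */
        g_source_set_callback(source, (GSourceFunc)qio_channel_wait_complete,
                              loop, NULL);
        g_source_attach(source, ctxt);
        g_main_loop_run(loop);  /* returns once dispatch ran the callback */

        g_source_unref(source);
        g_main_loop_unref(loop);
        g_main_context_unref(ctxt);
    }

With prepare/check unconditionally returning TRUE, dispatch fires on the
very first iteration, so g_main_loop_run() returns right away.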
--
Peter Xu
* Re: [PATCH RFC 4/9] migration/rdma: Change io_create_watch() to return immediately
2025-09-26 2:39 ` Zhijian Li (Fujitsu)
@ 2025-10-08 20:42 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 20:42 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: qemu-devel@nongnu.org, Dr . David Alan Gilbert, Kevin Wolf,
Paolo Bonzini, Daniel P . Berrangé, Fabiano Rosas,
Hailiang Zhang, Yury Kotov, Vladimir Sementsov-Ogievskiy,
Prasad Pandit, Zhang Chen, Juraj Marcin
On Fri, Sep 26, 2025 at 02:39:43AM +0000, Zhijian Li (Fujitsu) wrote:
>
>
> On 28/08/2025 04:59, Peter Xu wrote:
> > RDMA's old io_create_watch() isn't really doing much work anyway. For
> > G_IO_OUT, it already returns immediately. For G_IO_IN, it will try to
> > detect some RDMA context length; however, normally nobody will be able
> > to set it at all.
> >
>
>
> First, RDMA migration works well with this patch applied.
>
> Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Thanks a lot, Zhijian.
>
>
> I have a small question. While testing, I didn't observe any callers to
> qio_channel_rdma_create_watch() during a complete RDMA migration using
> the default capabilities and parameters.
> I was wondering in which case this function is expected to be called?
> (I see io_create_watch() is mandatory for QIOChannelClass)
Yes, that's also my observation. See my reply to Fabiano on the same patch
for some information.
A summary of what I said there, but more focused on what you're asking:
IIUC we currently almost always rely on qemu_rdma_wait_comp_channel() to
poll the two rdma fds, yielding if necessary when in a coroutine.
IOW, I don't know when qio_channel_rdma_create_watch(), or in most cases
qio_channel_wait(), will be used at all. I had a feeling that if it's used
it might get stuck forever (as the gsource will be monitoring control_len,
see below [1], while IIUC only the thread itself can update it, or am I
wrong?). But I'm not fluent with the RDMA codebase. Maybe you'll have a
better picture after seeing what I said here and there.
This patch mostly exists to guarantee that won't happen: for whatever could
return QIO_CHANNEL_ERR_BLOCK on rdma channels, I want to make sure we
immediately retry instead of hanging forever in the temporary main loop of
qio_channel_wait().
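A minimal sketch of the retry pattern I mean, assuming a thread context
(illustrative only, not one specific QEMU function):

    /* buf/size: destination buffer, assumed provided by the caller */
    struct iovec iov = { .iov_base = buf, .iov_len = size };
    ssize_t len;

    do {
        len = qio_channel_readv(ioc, &iov, 1, NULL);
        if (len == QIO_CHANNEL_ERR_BLOCK) {
            /* Spins a temporary main loop until readable; with this
             * patch, for RDMA channels it returns immediately, so we
             * simply retry the read right away. */
            qio_channel_wait(ioc, G_IO_IN);
        }
    } while (len == QIO_CHANNEL_ERR_BLOCK);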
>
>
> Thanks
> Zhijian
>
>
> > Simplify the code so that RDMA iochannels simply always rely on synchronous
> > reads and writes. It is highly likely this is what 6ddd2d76ca6f86f was
> > talking about: that the async model isn't really working well.
> >
> > This helps because this is almost the only dependency that the migration
> > core would need a coroutine for rdma channels.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/rdma.c | 69 +++---------------------------------------------
> > 1 file changed, 3 insertions(+), 66 deletions(-)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index ed4e20b988..bcd7aae2f2 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -2789,56 +2789,14 @@ static gboolean
> > qio_channel_rdma_source_prepare(GSource *source,
> > gint *timeout)
> > {
> > - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > *timeout = -1;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when prepare Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> > -
> > - return cond & rsource->condition;
> > + return TRUE;
> > }
> >
> > static gboolean
> > qio_channel_rdma_source_check(GSource *source)
> > {
> > - QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when check Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
[1]
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> > -
> > - return cond & rsource->condition;
> > + return TRUE;
> > }
> >
> > static gboolean
> > @@ -2848,29 +2806,8 @@ qio_channel_rdma_source_dispatch(GSource *source,
> > {
> > QIOChannelFunc func = (QIOChannelFunc)callback;
> > QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
> > - RDMAContext *rdma;
> > - GIOCondition cond = 0;
> > -
> > - RCU_READ_LOCK_GUARD();
> > - if (rsource->condition == G_IO_IN) {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
> > - } else {
> > - rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
> > - }
> > -
> > - if (!rdma) {
> > - error_report("RDMAContext is NULL when dispatch Gsource");
> > - return FALSE;
> > - }
> > -
> > - if (rdma->wr_data[0].control_len) {
> > - cond |= G_IO_IN;
> > - }
> > - cond |= G_IO_OUT;
> >
> > - return (*func)(QIO_CHANNEL(rsource->rioc),
> > - (cond & rsource->condition),
> > - user_data);
> > + return (*func)(QIO_CHANNEL(rsource->rioc), rsource->condition, user_data);
> > }
> >
> > static void
--
Peter Xu
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-09-04 1:38 ` Dr. David Alan Gilbert
@ 2025-10-08 21:02 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 21:02 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: qemu-devel, Kevin Wolf, Paolo Bonzini, Daniel P . Berrangé,
Fabiano Rosas, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Thu, Sep 04, 2025 at 01:38:14AM +0000, Dr. David Alan Gilbert wrote:
> > > > In general, QEMUFile IOs should not need BQL, that is when receiving the
> > > > VMSD data and waiting for e.g. the socket buffer to get refilled. But
> > > > that's the easy part.
> > >
> > > It's probably generally a good thing to get rid of the BQL there, but I bet
> > > it's going to throw some surprises; maybe something like devices doing
> > > stuff before the migration has fully arrived
> >
> > Is that pre_load() or.. maybe something else?
> >
> > I should still look into each of them, but only if we want to further push
> > the bql to be at post_load() level. I am not sure if some pre_load() would
> > assume BQL won't be released until post_load(), if so that'll be an issue,
> > and that will need some closer code observation...
>
> Well maybe pre_load; but anything that might start happening once the
> state has been loaded that shouldn't start happening until migration ends;
> I think there are some devices that do it properly and wait for end of migration.
>
> > > or incoming socket connections to non-migration stuff perhaps.
> >
> > Any example for this one?
>
> I was just thinking aloud; but was thinking of NIC activity or maybe
> UI stuff? But just guesses.
I'll see if I can get some more test coverage to be slightly more
confident.. If you have any special setups / devices in mind that might be
prone to issues with this series, please let me know! I can test them even
earlier.
We also always have the option to provide a knob, so that whenever
necessary we can still let users fall back to the coroutine way, until the
thread model is proven solid..
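Purely as a hypothetical sketch of such a knob (not part of this series,
and migrate_incoming_use_coroutine() is a made-up helper for illustration),
the incoming setup could branch like:

    if (migrate_incoming_use_coroutine()) {
        /* legacy path: load in a coroutine on the main thread */
        Coroutine *co = qemu_coroutine_create(process_incoming_migration_co,
                                              NULL);
        qemu_coroutine_enter(co);
    } else {
        /* new path: load in a dedicated thread */
        qemu_thread_create(&mis->recv_thread, "mig/dst/main",
                           migration_incoming_thread, mis,
                           QEMU_THREAD_JOINABLE);
    }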
Thanks,
--
Peter Xu
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-09-26 3:41 ` Zhijian Li (Fujitsu)
@ 2025-10-08 21:10 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 21:10 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: qemu-devel@nongnu.org, Dr . David Alan Gilbert, Kevin Wolf,
Paolo Bonzini, Daniel P . Berrangé, Fabiano Rosas,
Hailiang Zhang, Yury Kotov, Vladimir Sementsov-Ogievskiy,
Prasad Pandit, Zhang Chen, Juraj Marcin
On Fri, Sep 26, 2025 at 03:41:42AM +0000, Zhijian Li (Fujitsu) wrote:
>
>
> On 28/08/2025 04:59, Peter Xu wrote:
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index bcd7aae2f2..2b995513aa 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > {
> > RDMAContext *rdma = opaque;
> > struct rdma_cm_event *cm_event;
> > - MigrationIncomingState *mis = migration_incoming_get_current();
> >
> > if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> > error_report("get_cm_event failed %d", errno);
> > @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > }
> > }
> > rdma_ack_cm_event(cm_event);
>
>
> This above line should be removed as well, otherwise it will cause a double cm_event free.
Good catch, thanks. This function is completely removed in the last patch,
hence it'll only be an internal difference within the series.
That said, I wonder if I should squash the last patch into this one
instead, because after this patch is applied we would be polling the fd in
two threads (main, and the loadvm thread), and operating on it concurrently
too.. which looks risky, if not already racy..
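To spell out the intermediate fix as a sketch (keeping the function alive
until the last patch removes it), the event should be acked exactly once:

    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
        /* ... mark rdma->errored (and the return path) ... */
    }
    /* single ack on all paths, avoiding the double free */
    rdma_ack_cm_event(cm_event);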
>
>
>
>
> > - if (mis->loadvm_co) {
> > - qemu_coroutine_enter(mis->loadvm_co);
> > - }
> > - return;
> > }
> > rdma_ack_cm_event(cm_event);
> > }
--
Peter Xu
* Re: [PATCH RFC 6/9] migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
2025-09-16 22:39 ` Fabiano Rosas
@ 2025-10-08 21:18 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 21:18 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Tue, Sep 16, 2025 at 07:39:30PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Now after threadified dest VM load during precopy, we will always in a
> > thread context rather than within a coroutine. We can remove this path
> > now.
> >
> > With that, migration_started_on_destination can go away too.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/rdma.c | 102 +++++++++++++++++++----------------------------
> > 1 file changed, 41 insertions(+), 61 deletions(-)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 2b995513aa..7751262460 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -29,7 +29,6 @@
> > #include "qemu/rcu.h"
> > #include "qemu/sockets.h"
> > #include "qemu/bitmap.h"
> > -#include "qemu/coroutine.h"
> > #include "system/memory.h"
> > #include <sys/socket.h>
> > #include <netdb.h>
> > @@ -357,13 +356,6 @@ typedef struct RDMAContext {
> > /* Index of the next RAMBlock received during block registration */
> > unsigned int next_src_index;
> >
> > - /*
> > - * Migration on *destination* started.
> > - * Then use coroutine yield function.
> > - * Source runs in a thread, so we don't care.
> > - */
> > - int migration_started_on_destination;
> > -
> > int total_registrations;
> > int total_writes;
> >
> > @@ -1353,66 +1345,55 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
> > struct rdma_cm_event *cm_event;
> >
> > /*
> > - * Coroutine doesn't start until migration_fd_process_incoming()
> > - * so don't yield unless we know we're running inside of a coroutine.
> > + * This is the source or dest side, either during precopy or
> > + * postcopy. We're always in a separate thread when reaching here.
> > + * Poll the fd. We need to be able to handle 'cancel' or an error
> > + * without hanging forever.
> > */
> > - if (rdma->migration_started_on_destination &&
> > - migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE &&
> > - qemu_in_coroutine()) {
> > - yield_until_fd_readable(comp_channel->fd);
> > - } else {
> > - /* This is the source side, we're in a separate thread
> > - * or destination prior to migration_fd_process_incoming()
> > - * after postcopy, the destination also in a separate thread.
> > - * we can't yield; so we have to poll the fd.
> > - * But we need to be able to handle 'cancel' or an error
> > - * without hanging forever.
> > - */
> > - while (!rdma->errored && !rdma->received_error) {
> > - GPollFD pfds[2];
> > - pfds[0].fd = comp_channel->fd;
> > - pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > - pfds[0].revents = 0;
> > -
> > - pfds[1].fd = rdma->channel->fd;
> > - pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > - pfds[1].revents = 0;
> > -
> > - /* 0.1s timeout, should be fine for a 'cancel' */
> > - switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
> > - case 2:
> > - case 1: /* fd active */
> > - if (pfds[0].revents) {
> > - return 0;
> > - }
> > + while (!rdma->errored && !rdma->received_error) {
> > + GPollFD pfds[2];
> > + pfds[0].fd = comp_channel->fd;
> > + pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > + pfds[0].revents = 0;
> > +
> > + pfds[1].fd = rdma->channel->fd;
> > + pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > + pfds[1].revents = 0;
> > +
> > + /* 0.1s timeout, should be fine for a 'cancel' */
> > + switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
>
> Doesn't glib have facilities for polling? Isn't this what
> qio_channel_rdma_create_watch() is for already?
Yes. I don't know why the RDMA channel is done like this; I didn't dig
deeper. I bet Dan has more clues (as the author of 6ddd2d76ca6f). The hope
is that I don't need to dig into it either, if all I want is to make loadvm
work in a thread. :)
I also replied to your other email; that should have some more info
regarding why I think rdma's io_create_watch() isn't used.. or seems
broken.
For this patch alone, it almost only removes the "if ()" section; these
lines are untouched except for indentation changes.
--
Peter Xu
* Re: [PATCH RFC 9/9] migration/rdma: Remove rdma_cm_poll_handler
2025-09-17 18:38 ` Fabiano Rosas
@ 2025-10-08 21:22 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 21:22 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Wed, Sep 17, 2025 at 03:38:35PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
> >
> > It was needed because the RDMA iochannel on dest QEMU used to only yield
> > without monitoring the fd. Now it should be monitored by the same poll()
> > similarly on the src QEMU in qemu_rdma_wait_comp_channel(). So even
> > without the fd handler, dest QEMU should be able to receive the events.
> >
> > I tested this by initiating an RDMA migration, then do two things:
> >
> > - Either does migrate_cancel on src, or,
> > - Directly kill destination QEMU
> >
> > In both cases, the other side of QEMU will be able to receive the
> > disconnect event in qemu_rdma_wait_comp_channel() and properly cancel or
> > fail the migration.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/rdma.c | 29 +----------------------------
> > 1 file changed, 1 insertion(+), 28 deletions(-)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 7751262460..da7fd48bf3 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -3045,32 +3045,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
> >
> > static void rdma_accept_incoming_migration(void *opaque);
> >
> > -static void rdma_cm_poll_handler(void *opaque)
> > -{
> > - RDMAContext *rdma = opaque;
> > - struct rdma_cm_event *cm_event;
> > -
> > - if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> > - error_report("get_cm_event failed %d", errno);
> > - return;
> > - }
> > -
> > - if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> > - cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> > - if (!rdma->errored &&
> > - migration_incoming_get_current()->state !=
> > - MIGRATION_STATUS_COMPLETED) {
> > - error_report("receive cm event, cm event is %d", cm_event->event);
> > - rdma->errored = true;
> > - if (rdma->return_path) {
> > - rdma->return_path->errored = true;
> > - }
> > - }
> > - rdma_ack_cm_event(cm_event);
> > - }
> > - rdma_ack_cm_event(cm_event);
> > -}
> > -
> > static int qemu_rdma_accept(RDMAContext *rdma)
> > {
> > Error *err = NULL;
> > @@ -3188,8 +3162,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
> > NULL,
> > (void *)(intptr_t)rdma->return_path);
> > } else {
> > - qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
> > - NULL, rdma);
> > + qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
>
> I'm not familiar with this code, but is this left here to remove the
> handler? Can't we remove this line altogether?
Fair question. I was just lazy, because I know it's safe to call it like
that no matter what: it unregisters anything if we registered something,
otherwise this qemu_set_fd_handler() is a no-op.
I am just not confident enough in the RDMA code to say we can remove it.
IOW, before 923709896b1 we did the same, so I kept it as-is.
--
Peter Xu
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-09-04 8:27 ` Zhang Chen
@ 2025-10-08 21:26 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-08 21:26 UTC (permalink / raw)
To: Zhang Chen
Cc: Hailiang Zhang, qemu-devel, Dr . David Alan Gilbert, Kevin Wolf,
Paolo Bonzini, Daniel P . Berrangé, Fabiano Rosas,
Yury Kotov, Vladimir Sementsov-Ogievskiy, Prasad Pandit,
Li Zhijian, Juraj Marcin
On Thu, Sep 04, 2025 at 04:27:39PM +0800, Zhang Chen wrote:
> > I confess I didn't test anything on COLO but only from code observations
> > and analysis. COLO maintainers: could you add some unit tests to QEMU's
> > qtests?
>
> For the COLO part, I think removing the coroutine-related code is OK for
> me, because the original coroutine still needs to call
> colo_process_incoming_thread().
Chen, thanks for the comment. It's still reassuring.
>
> Hi Hailiang, any comments for this part?
Any further comments on this series would always be helpful.
It'll also be great if anyone can come up with a selftest for COLO.
Nowadays any new migration feature needs both a unit test and documentation
to get merged. COLO was merged earlier so it doesn't have to, but these
will definitely help make sure COLO won't be easily broken.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 0/9] migration: Threadify loadvm process
2025-09-16 21:32 ` Fabiano Rosas
@ 2025-10-09 16:58 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-09 16:58 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Tue, Sep 16, 2025 at 06:32:59PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > [this is an early RFC, not for merge, but to collect initial feedbacks]
> >
> > Background
> > ==========
> >
> > Nowadays, live migration heavily depends on threads. For example, most of
> > the major features that will be used nowadays in live migration (multifd,
> > postcopy, mapped-ram, vfio, etc.) all work with threads internally.
> >
> > But still, from time to time, we'll see some coroutines floating around the
> > migration context. The major one is precopy's loadvm, which is internally
> > a coroutine. It is still a critical path that any live migration depends on.
> >
>
> I always wanted to be an archaeologist:
>
> https://lists.gnu.org/archive/html/qemu-devel//2012-08/msg01136.html
>
> I was expecting to find some complicated chain of events leading to the
> choice of using a coroutine, but no.
I actually didn't see that previously.. I'll add this link to the commit
message of that major patch, to make future archaeology work easier.
>
> > A mixture of using both coroutines and threads is prone to issues. Some
> > examples can refer to commit e65cec5e5d ("migration/ram: Yield periodically
> > to the main loop") or commit 7afbdada7e ("migration/postcopy: ensure
> > preempt channel is ready before loading states").
> >
> > Overview
> > ========
> >
> > This series tries to move migration further into the thread-based model, by
> > allowing the loadvm process to happen in a thread rather than in the main
> > thread with a coroutine.
> >
> > Luckily, since the qio channel code is always ready for both cases, IO
> > paths should all be fine.
> >
> > Note that loadvm for postcopy already happens in a ram load thread which is
> > separate. However, RAM is just the simple case here, even it has its own
> > challenges (on atomically update of the pgtables), its complexity lies in
> > the kernel.
> >
> > For precopy, loadvm has quite a few operations that will need BQL. The
> > question is we can't take BQL for the whole process of loadvm, because
> > that'll block the main thread from executions (e.g. QMP hangs). Here, the
> > finer granule we can push BQL the better. This series so far chose
> > somewhere in the middle, by taking BQL on majorly these two places:
> >
> > - CPU synchronizations
> > - Device START/FULL sections
> >
> > After this series applied, most of the rest loadvm path will run without
> > BQL anymore. There is a more detailed discussion / todo in the commit
> > message of patch "migration: Thread-ify precopy vmstate load process"
> > explaning how to further split the BQL critical sections.
> >
> > I was trying to split the patches into smaller ones if possible, but it's
> > still quite challenging so there's one major patch that does the work.
> >
> > After the series applied, the only leftover pieces in migration/ that would
> > use a coroutine is snapshot save/load/delete jobs.
> >
>
> Which are then fine because the work itself runs on the main loop,
> right? So the bottom-half scheduling could be left as a coroutine.
Correct, the iochannel works for both cases.
For coroutines, it can properly register the fd and yield like before for
snapshot save/load. It used to do the same for live loadvm, but now, after
moving to a thread, it will start to use qio_channel_wait() instead.
I think we could also move the live migration incoming side back to
blocking mode after making it a thread, which might be slightly more
efficient: directly block in recvmsg() rather than return+poll. But that is
trivial compared to the "moving to thread" change, and it can be done later
even if this works.
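As a sketch of that possible follow-up (not in this series): once loadvm
runs in a thread, the incoming channel could simply be switched to
blocking mode, so reads sleep inside recvmsg() instead of returning
QIO_CHANNEL_ERR_BLOCK:

    /* in the incoming thread, after the channel is set up */
    qemu_file_set_blocking(mis->from_src_file, true);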
>
> > Tests
> > =====
> >
> > Default CI passes.
> >
> > RDMA unit tests pass as usual. I also tried out cancellation / failure
> > tests over RDMA channels, making sure nothing is stuck.
> >
> > I also roughly measured how long it takes to run the whole 80+ migration
> > qtest suite, and see no measurable difference before / after this series.
> >
> > Risks
> > =====
> >
> > This series has the risk of breaking things. I would be surprised if it
> > didn't..
> >
> > I confess I didn't test anything on COLO but only from code observations
> > and analysis. COLO maintainers: could you add some unit tests to QEMU's
> > qtests?
> >
> > The current way of taking BQL during FULL section load may cause issues, it
> > means when the IOs are unstable we could be waiting for IO (in the new
> > migration incoming thread) with BQL held. This is low possibility, though,
> > only happens when the network halts during flushing the device states.
> > However still possible. One solution is to further breakdown the BQL
> > critical sections to smaller sections, as mentioned in TODO.
> >
> > Anything more than welcomed: suggestions, questions, objections, tests..
> >
> > Todo
> > ====
> >
> > - Test COLO?
> > - Finer grained BQL breakdown
> > - More..
> >
> > Thanks,
> >
> > Peter Xu (9):
> > migration/vfio: Remove BQL implication in
> > vfio_multifd_switchover_start()
> > migration/rdma: Fix wrong context in qio_channel_rdma_shutdown()
> > migration/rdma: Allow qemu_rdma_wait_comp_channel work with thread
> > migration/rdma: Change io_create_watch() to return immediately
> > migration: Thread-ify precopy vmstate load process
> > migration/rdma: Remove coroutine path in qemu_rdma_wait_comp_channel
> > migration/postcopy: Remove workaround on wait preempt channel
> > migration/ram: Remove workaround on ram yield during load
> > migration/rdma: Remove rdma_cm_poll_handler
> >
> > include/migration/colo.h | 6 +-
> > migration/migration.h | 52 +++++++--
> > migration/savevm.h | 5 +-
> > hw/vfio/migration-multifd.c | 9 +-
> > migration/channel.c | 7 +-
> > migration/colo-stubs.c | 2 +-
> > migration/colo.c | 23 +---
> > migration/migration.c | 62 ++++++++---
> > migration/ram.c | 13 +--
> > migration/rdma.c | 206 ++++++++----------------------------
> > migration/savevm.c | 85 +++++++--------
> > migration/trace-events | 4 +-
> > 12 files changed, 196 insertions(+), 278 deletions(-)
>
--
Peter Xu
* Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
2025-09-17 18:23 ` Fabiano Rosas
@ 2025-10-09 21:41 ` Peter Xu
0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2025-10-09 21:41 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Dr . David Alan Gilbert, Kevin Wolf, Paolo Bonzini,
Daniel P . Berrangé, Hailiang Zhang, Yury Kotov,
Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen,
Li Zhijian, Juraj Marcin
On Wed, Sep 17, 2025 at 03:23:53PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Migration module was there for 10+ years. Initially, it was in most cases
> > based on coroutines. As more features were added into the framework, like
> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> >
> > I'm guessing coroutines just can't fix all issues that migration want to
> > resolve.
> >
> > After all these years, migration is now heavily based on a threaded model.
> >
> > Now there's still a major part of migration framework that is still not
> > thread-based, which is precopy load. We do load in a separate thread in
> > postcopy since the 1st day postcopy was introduced, however that requires a
> > separate state transition from precopy loading all devices first, which
> > still happens in the main thread of a coroutine.
> >
> > This patch tries to move the migration incoming side to be run inside a
> > separate thread (mig/dst/main) just like the src (mig/src/main). The
> > entrance to be migration_incoming_thread().
> >
> > Quite a few things are needed to make it fly..
> >
> > BQL Analysis
> > ============
> >
> > Firstly, when moving it over to the thread, it means the thread cannot take
> > BQL during the whole process of loading anymore, because otherwise it can
> > block main thread from using the BQL for all kinds of other concurrent
> > tasks (for example, processing QMP / HMP commands).
> >
>
> Pure question, how does load_snapshot avoid these issues if it doesn't
> already use a coroutine? I feel I'm missing something.
Take my answer with a grain of salt, but this is how I read it..
When snapshots load/save, they operate on the disks with the block
qiochannel in a BH context. IIUC it has its own internal processing via the
block layer code, and at least when reaching bdrv_readv_vmstate() (taking
load as an example) there must be a coroutine context.
Then, in the proper places of the block/ code, it'll yield when needed, so
that monitor fds can still be polled and active monitor code can still run.
I feel like you have a better understanding of block/ than me. Correct me
if something is wrong!
>
> Not that I disagree with the concerns around threading + BQL, I'm just
> wondering if the issues may be fairly small since snapshots work fine.
>
> > Here the first question to ask is: what needs BQL during precopy load, and
> > what doesn't?
> >
> > Most of the load process shouldn't need BQL, especially when it's about
> > RAM. After all, RAM is still the major chunk of data to move for a live
> > migration process. VFIO started to change that, though, but still, VFIO is
> > per-device so that shouldn't need BQL either in most cases.
> >
> > Generic device loads will need BQL, likely not when receiving VMSDs, but
> > when applying them. One example is any post_load() could potentially
> > inject memory regions causing memory transactions to happen. That'll need
> > to update the global address spaces, hence requires BQL. The other one is
> > CPU sync operations, even if the sync alone may not need BQL (which is
> > still to be further justified), run_on_cpu() will need it.
> >
> > For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
> > to now take a "bql_held" parameter saying whether bql is held. We could
> > use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> > lockings rather than relying on bql_locked TLS variable.
>
> Why exactly? Seems redundant to plumb the variable through when we have
> bql_locked and the macros around it.
>
> At first sight I'd say we could already add BQL macros around code that
> we're sure needs it, which would maybe simplify the patch a bit.
Yeah, I don't have a strong feeling on this one; I just preferred to be
explicit for now, before I get a better grasp of what happens if I'm not.
Firstly, I found at least one spot where the bql_locked TLS variable might
report the wrong thing (when waiting on a condvar; I sent that patch
elsewhere). So I'm concerned about what happens if it goes wrong while we
add more dependencies on it, even if in our case it's still manageable
(which is what I did currently).
OTOH, I also prefer explicit knowledge of who took the BQL whenever
possible, instead of only knowing "BQL is taken, but I don't know who took
it". BQL_LOCK_GUARD() is very handy; handy enough that it never needs to
know the answer to that second question. The explicit parameter still lets
the migration code be aware of which layer took the BQL, and once we know
that exactly, we can either (1) remove a BQL_LOCK_GUARD() use, or (2)
convert it to bql_lock() when we know all callers changed so the BQL is
never held there.
Let me know if you think otherwise; I'm open to opinions.
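To make the comparison concrete, a sketch of the two styles
(WITH_BQL_HELD() is the macro from this patch; BQL_LOCK_GUARD() relies on
the bql_locked() TLS state):

    /* explicit: the caller passes down what it knows */
    WITH_BQL_HELD(bql_held,
                  ret = qemu_loadvm_section_start_full(f, section_type));

    /* implicit: take the lock iff this thread doesn't already hold it */
    BQL_LOCK_GUARD();
    ret = qemu_loadvm_section_start_full(f, section_type);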
>
> > In case of
> > migration, we always know whether BQL is held in different context as long
> > as we can still pass that information downwards.
> >
> > COLO
> > ====
> >
> > COLO assumed the dest VM load happens in a coroutine. After this patch,
> > it's not anymore. Change that by invoking colo_incoming_co() directly from
> > the migration_incoming_thread().
> >
> > The name (colo_incoming_co()) isn't proper anymore. Change it to
> > colo_incoming_wait(), removing the coroutine annotation alongside.
> >
> > Remove all the bql_lock() implications in COLO, e.g., colo_incoming_co()
> > used to release the lock for a short period while join(). Now it's not
> > needed.
> >
> > At the meantime, there's colo_incoming_co variable that used to store the
> > COLO incoming coroutine, only to be kicked off when a secondary failover
> > happens.
> >
> > To recap, what should happen for such failover should be (taking example of
> > a QMP command x-colo-lost-heartbeat triggering on dest QEMU):
> >
> > - The QMP command will kick off both the coroutine and the COLO
> > thread (colo_process_incoming_thread()), with something like:
> >
> > /* Notify COLO incoming thread that failover work is finished */
> > qemu_event_set(&mis->colo_incoming_event);
> >
> > qemu_coroutine_enter(mis->colo_incoming_co);
> >
> > - The coroutine, which yielded itself before, now resumes after enter(),
> > then it'll wait for the join():
> >
> > mis->colo_incoming_co = qemu_coroutine_self();
> > qemu_coroutine_yield();
> > mis->colo_incoming_co = NULL;
> >
> > /* Wait checkpoint incoming thread exit before free resource */
> > qemu_thread_join(&th);
> >
> > Here, when switching to a thread model, it should be fine removing
> > colo_incoming_co variable completely, because if so, the incoming thread
> > will (instead of yielding the coroutine) wait at qemu_thread_join() until
> > the colo thread completes execution (after receiving colo_incoming_event).
> >
> > RDMA
> > ====
> >
> > With the prior patch making sure io_watch won't block for RDMA iochannels,
> > RDMA threads should only block at its io_readv/io_writev functions. When a
> > disconnection is detected (as in rdma_cm_poll_handler()), the update to
> > "errored" field will be immediately reflected in the migration incoming
> > thread. Hence the coroutine for RDMA is not needed anymore to kick the
> > thread out.
> >
> > TODO
> > ====
> >
> > Currently the BQL is taken during loading of a START|FULL section. When
> > the IO hangs (e.g. network issue) during this process, it could potentially
> > block others like the monitor servers. One solution is breaking BQL to
> > smaller granule and leave IOs to be always BQL-free. That'll need more
> > justifications.
> >
> > For example, there are at least four things that need some closer
> > attention:
> >
> > - SaveVMHandlers's load_state(): this likely DO NOT need BQL, but we need
> > to justify all of them (not to mention, some of them look like prone to
> > be rewritten as VMSDs..)
> >
> > - VMSD's pre_load(): in most cases, this DO NOT really need BQL, but
> > sometimes maybe it will! Double checking on this will be needed.
> >
> > - VMSD's post_load(): in many cases, this DO need BQL, for example on
> > address space operations. Likely we should just take it for any
> > post_load().
> >
> > - VMSD field's get(): this is tricky! It could internally be anything
> > even if it was only a field. E.g. there can be users to use a SINGLE
> > field to load a whole VMSD, which can further introduce more
> > possibilities.
> >
> > In general, QEMUFile IOs should not need BQL, that is when receiving the
> > VMSD data and waiting for e.g. the socket buffer to get refilled. But
> > that's the easy part.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/migration/colo.h | 6 ++--
> > migration/migration.h | 52 ++++++++++++++++++++++++++------
> > migration/savevm.h | 5 ++--
> > migration/channel.c | 7 ++---
> > migration/colo-stubs.c | 2 +-
> > migration/colo.c | 23 ++++-----------
> > migration/migration.c | 62 ++++++++++++++++++++++++++++----------
> > migration/rdma.c | 5 ----
> > migration/savevm.c | 64 ++++++++++++++++++++++++----------------
> > migration/trace-events | 4 +--
> > 10 files changed, 142 insertions(+), 88 deletions(-)
> >
> > diff --git a/include/migration/colo.h b/include/migration/colo.h
> > index 43222ef5ae..bfb30eccf0 100644
> > --- a/include/migration/colo.h
> > +++ b/include/migration/colo.h
> > @@ -44,12 +44,10 @@ void colo_do_failover(void);
> > void colo_checkpoint_delay_set(void);
> >
> > /*
> > - * Starts COLO incoming process. Called from process_incoming_migration_co()
> > + * Starts COLO incoming process. Called from migration_incoming_thread()
> > * after loading the state.
> > - *
> > - * Called with BQL locked, may temporary release BQL.
> > */
> > -void coroutine_fn colo_incoming_co(void);
> > +void colo_incoming_wait(void);
> >
> > void colo_shutdown(void);
> > #endif
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 01329bf824..c4a626eed4 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -42,6 +42,44 @@
> > #define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen"
> > #define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt"
> >
> > +/**
> > + * WITH_BQL_HELD(): Run a task, making sure BQL is held
> > + *
> > + * @bql_held: Whether BQL is already held
> > + * @task: The task to run within BQL held
> > + */
> > +#define WITH_BQL_HELD(bql_held, task) \
> > + do { \
> > + if (!bql_held) { \
> > + bql_lock(); \
> > + } else { \
> > + assert(bql_locked()); \
> > + } \
> > + task; \
> > + if (!bql_held) { \
> > + bql_unlock(); \
> > + } \
> > + } while (0)
> > +
> > +/**
> > + * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released
> > + *
> > + * @bql_held: Whether BQL is already held
> > + * @task: The task to run making sure BQL released
> > + */
> > +#define WITHOUT_BQL_HELD(bql_held, task) \
> > + do { \
> > + if (bql_held) { \
> > + bql_unlock(); \
> > + } else { \
> > + assert(!bql_locked()); \
> > + } \
> > + task; \
> > + if (bql_held) { \
> > + bql_lock(); \
> > + } \
> > + } while (0)
> > +
> > struct PostcopyBlocktimeContext;
> > typedef struct ThreadPool ThreadPool;
> >
> > @@ -119,6 +157,10 @@ struct MigrationIncomingState {
> > bool have_listen_thread;
> > QemuThread listen_thread;
> >
> > + /* Migration main recv thread */
> > + bool have_recv_thread;
> > + QemuThread recv_thread;
> > +
> > /* For the kernel to send us notifications */
> > int userfault_fd;
> > /* To notify the fault_thread to wake, e.g., when need to quit */
> > @@ -177,15 +219,7 @@ struct MigrationIncomingState {
> >
> > MigrationStatus state;
> >
> > - /*
> > - * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> > - * Used to wake the migration incoming coroutine from rdma code. How much is
> > - * it safe - it's a question.
> > - */
> > - Coroutine *loadvm_co;
> > -
> > - /* The coroutine we should enter (back) after failover */
> > - Coroutine *colo_incoming_co;
> > + /* Notify secondary VM to move on */
> > QemuEvent colo_incoming_event;
> >
> > /* Optional load threads pool and its thread exit request flag */
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index 2d5e9c7166..c07e14f61a 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
> > void qemu_savevm_live_state(QEMUFile *f);
> > int qemu_save_device_state(QEMUFile *f);
> >
> > -int qemu_loadvm_state(QEMUFile *f);
> > +int qemu_loadvm_state(QEMUFile *f, bool bql_held);
> > void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
> > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> > + bool bql_held);
> > int qemu_load_device_state(QEMUFile *f);
> > int qemu_loadvm_approve_switchover(void);
> > int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > diff --git a/migration/channel.c b/migration/channel.c
> > index a547b1fbfe..621f8a4a2a 100644
> > --- a/migration/channel.c
> > +++ b/migration/channel.c
> > @@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc,
> > }
> >
> > /* 1ms sleep. */
> > - if (qemu_in_coroutine()) {
> > - qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> > - } else {
> > - g_usleep(1000);
> > - }
> > + assert(!qemu_in_coroutine());
> > + g_usleep(1000);
>
> What bug is this hiding? =)
I believe it was just laziness when developing 6720c2b32725. AFAIU we
could do better with similar qio_channel_yield() / qio_channel_wait()
handling like what we do elsewhere. Here it's a peek, but it's essentially
the same kind of read request; it just doesn't consume the data.
I will prepare a prerequisite patch to remove the sleep.
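Roughly what I have in mind for that prerequisite patch (an untested
sketch): instead of a fixed 1ms sleep, wait for the channel to become
readable again, coroutine-aware or not:

    if (qemu_in_coroutine()) {
        qio_channel_yield(ioc, G_IO_IN);  /* let the main loop run */
    } else {
        qio_channel_wait(ioc, G_IO_IN);   /* temp main loop in this thread */
    }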
>
> > }
> >
> > return 0;
> > diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> > index e22ce65234..ef77d1ab4b 100644
> > --- a/migration/colo-stubs.c
> > +++ b/migration/colo-stubs.c
> > @@ -9,7 +9,7 @@ void colo_shutdown(void)
> > {
> > }
> >
> > -void coroutine_fn colo_incoming_co(void)
> > +void colo_incoming_wait(void)
> > {
> > }
> >
> > diff --git a/migration/colo.c b/migration/colo.c
> > index e0f713c837..f5722d9d9d 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
> > }
> > /* Notify COLO incoming thread that failover work is finished */
> > qemu_event_set(&mis->colo_incoming_event);
> > -
> > - /* For Secondary VM, jump to incoming co */
> > - if (mis->colo_incoming_co) {
> > - qemu_coroutine_enter(mis->colo_incoming_co);
> > - }
> > }
> >
> > static void primary_vm_do_failover(void)
> > @@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
> >
> > bql_lock();
> > cpu_synchronize_all_states();
> > - ret = qemu_loadvm_state_main(mis->from_src_file, mis);
> > + ret = qemu_loadvm_state_main(mis->from_src_file, mis, true);
> > bql_unlock();
> >
> > if (ret < 0) {
> > @@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque)
> > goto out;
> > }
> > /*
> > - * Note: the communication between Primary side and Secondary side
> > - * should be sequential, we set the fd to unblocked in migration incoming
> > - * coroutine, and here we are in the COLO incoming thread, so it is ok to
> > - * set the fd back to blocked.
> > + * Here we are in the COLO incoming thread, so it is ok to set the fd
> > + * to blocked.
>
> nit: s/blocked/blocking/
I kept it as-is, but yeah I'll go and touch it.
>
> > */
> > qemu_file_set_blocking(mis->from_src_file, true);
> >
> > @@ -930,26 +923,20 @@ out:
> > return NULL;
> > }
> >
> > -void coroutine_fn colo_incoming_co(void)
> > +/* Wait for failover */
> > +void colo_incoming_wait(void)
> > {
> > MigrationIncomingState *mis = migration_incoming_get_current();
> > QemuThread th;
> >
> > - assert(bql_locked());
> > assert(migration_incoming_colo_enabled());
> >
> > qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
> > colo_process_incoming_thread,
> > mis, QEMU_THREAD_JOINABLE);
> >
> > - mis->colo_incoming_co = qemu_coroutine_self();
> > - qemu_coroutine_yield();
> > - mis->colo_incoming_co = NULL;
> > -
> > - bql_unlock();
>
> What does the BQL protects from colo_do_failover() until here? Could we
> have a preliminary patch reducing the BQL scope? I'm thinking about
> which changes we can merge upfront so we're already testing our
> assumptions before the whole series completes.
I think qemu_coroutine_yield() should at least need it.
My very limited understanding is that coroutines running in the main
thread must yield with the BQL held: all coroutines in a thread share the
same bql_locked TLS variable (it is per-thread, not per-coroutine), and it
doesn't look right if the BQL status could change across a yield from the
point of view of a coroutine that held the BQL when it yielded.
>
> > /* Wait checkpoint incoming thread exit before free resource */
> > qemu_thread_join(&th);
> > - bql_lock();
> >
> > /* We hold the global BQL, so it is safe here */
> > colo_release_ram_cache();
>
> Maybe a candidate for WITH_BQL_HELD?
Heh, I believe we need the BQL here, thanks for catching it. Logically I
should have tested this path, but you know..
I'll directly use bql_lock() / bql_unlock() here, because we explicitly
know we don't hold the BQL.
>
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..7e4d25b15c 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
> > mis->postcopy_qemufile_dst = NULL;
> > }
> >
> > + if (mis->have_recv_thread) {
>
> Help me out here, is this read race-free with the write at
> migration_incoming_process() because...?
>
> This read is reached from qemu_bh_schedule() while the write comes from
> the qio_channel_add_watch_full() callback. Are those (potentially)
> different AioContexts and thus the BQL is what's really doing the work
> (instead of the event serialization caused by the main loop)?
Yes, I believe both the setter and this read need the BQL, so it's
race-free.
>
> > + qemu_thread_join(&mis->recv_thread);
> > + mis->have_recv_thread = false;
> > + }
> > +
> > cpr_set_incoming_mode(MIG_MODE_NONE);
> > yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > }
> > @@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque)
>
> BTW, I don't think we need this BH anymore since effd60c8781. This
> process_incoming_migration_bh was originally introduced (0aa6aefc9) to
> move the brdv_activate_all out of coroutines due to a bug in the block
> layer, basically this:
>
> bdrv_activate_all ->
> bdrv_invalidate_cache ->
> coroutine_fn qcow2_co_invalidate_cache
> {
> ...
> BDRVQcow2State *s = bs->opaque;
> ...
> memset(s, 0, sizeof(BDRVQcow2State)); <-- clears memory
> ...
> flags &= ~BDRV_O_INACTIVE;
> qemu_co_mutex_lock(&s->lock);
> ret = qcow2_do_open(bs, options, flags, false, errp);
> ^ may yield before repopulating all of BDRVQcow2State
> info block or something else reads 's' and BOOM
> qemu_co_mutex_unlock(&s->lock);
> ...
> }
>
> Note that 2 years after the BH creation (0aa6aefc9), 2b148f392b2 has
> moved the invalidate function back into a coroutine anyway.
I'm not sure whether it was needed before this series, but we'll at least
start to need it now? After this series, process_incoming_migration_co()
becomes migration_incoming_thread(), which will not hold the BQL anymore..
while much of what process_incoming_migration_bh() does needs the BQL?
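As a sketch of the pattern: the incoming thread defers the BQL-needing
teardown to the main loop, where BHs always run with the BQL held:

    /* called from migration_incoming_thread(), no BQL held here */
    migration_bh_schedule(process_incoming_migration_bh, mis);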
>
> > migration_incoming_state_destroy();
> > }
> >
> > -static void coroutine_fn
> > -process_incoming_migration_co(void *opaque)
> > +static void migration_incoming_state_destroy_bh(void *opaque)
>
> I only mention all of the above because it would allow merging the two
> paths that call migration_incoming_state_destroy() and avoid this new
> BH.
>
> > +{
> > + struct MigrationIncomingState *mis = opaque;
> > +
> > + if (mis->exit_on_error) {
> > + /*
> > + * NOTE: this exit() should better happen in the main thread, as
> > + * the exit notifier may require BQL which can deadlock. See
> > + * commit e7bc0204e57836 for example.
> > + */
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + migration_incoming_state_destroy();
> > +}
> > +
> > +static void *migration_incoming_thread(void *opaque)
> > {
> > MigrationState *s = migrate_get_current();
> > - MigrationIncomingState *mis = migration_incoming_get_current();
> > + MigrationIncomingState *mis = opaque;
> > PostcopyState ps;
> > int ret;
> > Error *local_err = NULL;
> >
> > + rcu_register_thread();
> > +
> > assert(mis->from_src_file);
> > + assert(!bql_locked());
> >
> > mis->largest_page_size = qemu_ram_pagesize_largest();
> > postcopy_state_set(POSTCOPY_INCOMING_NONE);
> > migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
> > MIGRATION_STATUS_ACTIVE);
> >
> > - mis->loadvm_co = qemu_coroutine_self();
> > - ret = qemu_loadvm_state(mis->from_src_file);
> > - mis->loadvm_co = NULL;
> > + ret = qemu_loadvm_state(mis->from_src_file, false);
> >
> > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> >
> > ps = postcopy_state_get();
> > - trace_process_incoming_migration_co_end(ret, ps);
> > + trace_process_incoming_migration_end(ret, ps);
> > if (ps != POSTCOPY_INCOMING_NONE) {
> > if (ps == POSTCOPY_INCOMING_ADVISE) {
> > /*
> > @@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque)
> > * Postcopy was started, cleanup should happen at the end of the
> > * postcopy thread.
> > */
> > - trace_process_incoming_migration_co_postcopy_end_main();
> > + trace_process_incoming_migration_postcopy_end_main();
> > goto out;
> > }
> > /* Else if something went wrong then just fall out of the normal exit */
> > @@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque)
> > }
> >
> > if (migration_incoming_colo_enabled()) {
> > - /* yield until COLO exit */
> > - colo_incoming_co();
> > + /* wait until COLO exits */
> > + colo_incoming_wait();
> > }
> >
> > migration_bh_schedule(process_incoming_migration_bh, mis);
> > @@ -926,19 +947,24 @@ fail:
> > migrate_set_error(s, local_err);
> > error_free(local_err);
> >
> > - migration_incoming_state_destroy();
> > -
>
> Moving this below the exit will affect the source I think, for instance:
>
> migration_incoming_state_destroy
> {
> ...
> /* Tell source that we are done */
> migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
Indeed. I would expect there's no real difference, because reaching here
already means the migration failed.. However, I agree it wasn't my
intention to change the ordering either; that was pretty much an
oversight.
>
> > if (mis->exit_on_error) {
> > WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > error_report_err(s->error);
> > s->error = NULL;
> > }
> > -
> > - exit(EXIT_FAILURE);
> > }
> > +
> > + /*
> > + * There's some step of the destroy process that will need to happen in
> > + * the main thread (e.g. joining this thread itself). Leave to a BH.
> > + */
> > + migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
> > +
> > out:
> > /* Pairs with the refcount taken in qmp_migrate_incoming() */
> > migrate_incoming_unref_outgoing_state();
> > + rcu_unregister_thread();
> > + return NULL;
> > }
> >
> > /**
> > @@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f)
> >
> > void migration_incoming_process(void)
> > {
> > - Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> > - qemu_coroutine_enter(co);
> > + MigrationIncomingState *mis = migration_incoming_get_current();
> > +
> > + mis->have_recv_thread = true;
> > + qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> > + migration_incoming_thread, mis,
> > + QEMU_THREAD_JOINABLE);
> > }
> >
> > /* Returns true if recovered from a paused migration, otherwise false */
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index bcd7aae2f2..2b995513aa 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > {
> > RDMAContext *rdma = opaque;
> > struct rdma_cm_event *cm_event;
> > - MigrationIncomingState *mis = migration_incoming_get_current();
> >
> > if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> > error_report("get_cm_event failed %d", errno);
> > @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque)
> > }
> > }
> > rdma_ack_cm_event(cm_event);
> > - if (mis->loadvm_co) {
> > - qemu_coroutine_enter(mis->loadvm_co);
> > - }
> > - return;
> > }
> > rdma_ack_cm_event(cm_event);
> > }
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index fabbeb296a..ad606c5425 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis)
> > }
> >
> > static bool qemu_loadvm_thread_pool_wait(MigrationState *s,
> > - MigrationIncomingState *mis)
> > + MigrationIncomingState *mis,
> > + bool bql_held)
> > {
> > - bql_unlock(); /* Let load threads do work requiring BQL */
> > - thread_pool_wait(mis->load_threads);
> > - bql_lock();
> > + WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads));
> >
> > return !migrate_has_error(s);
> > }
> > @@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > trace_postcopy_ram_listen_thread_start();
> >
> > rcu_register_thread();
> > - /*
> > - * Because we're a thread and not a coroutine we can't yield
> > - * in qemu_file, and thus we must be blocking now.
> > - */
> > + /* Because we're a thread, making sure to use blocking mode */
> > qemu_file_set_blocking(f, true);
> >
> > /* TODO: sanity check that only postcopiable data will be loaded here */
> > - load_res = qemu_loadvm_state_main(f, mis);
> > + load_res = qemu_loadvm_state_main(f, mis, false);
> >
> > /*
> > * This is tricky, but, mis->from_src_file can change after it
> > @@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
> > * Immediately following this command is a blob of data containing an embedded
> > * chunk of migration stream; read it and load it.
> > *
> > - * @mis: Incoming state
> > - * @length: Length of packaged data to read
> > + * @mis: Incoming state
> > + * @bql_held: Whether BQL is held already
> > *
> > * Returns: Negative values on error
> > *
> > */
> > -static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> > + bool bql_held)
> > {
> > int ret;
> > size_t length;
> > @@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > qemu_coroutine_yield();
> > } while (1);
> >
> > - ret = qemu_loadvm_state_main(packf, mis);
> > + ret = qemu_loadvm_state_main(packf, mis, bql_held);
> > trace_loadvm_handle_cmd_packaged_main(ret);
> > qemu_fclose(packf);
> > object_unref(OBJECT(bioc));
> > @@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void)
> > * LOADVM_QUIT All good, but exit the loop
> > * <0 Error
> > */
> > -static int loadvm_process_command(QEMUFile *f)
> > +static int loadvm_process_command(QEMUFile *f, bool bql_held)
> > {
> > MigrationIncomingState *mis = migration_incoming_get_current();
> > uint16_t cmd;
> > @@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f)
> > break;
> >
> > case MIG_CMD_PACKAGED:
> > - return loadvm_handle_cmd_packaged(mis);
> > + return loadvm_handle_cmd_packaged(mis, bql_held);
> >
> > case MIG_CMD_POSTCOPY_ADVISE:
> > return loadvm_postcopy_handle_advise(mis, len);
> > @@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
> > return true;
> > }
> >
> > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
> > + bool bql_held)
> > {
> > uint8_t section_type;
> > int ret = 0;
> > @@ -3046,7 +3044,15 @@ retry:
> > switch (section_type) {
> > case QEMU_VM_SECTION_START:
> > case QEMU_VM_SECTION_FULL:
> > - ret = qemu_loadvm_section_start_full(f, section_type);
> > + /*
> > + * FULL should normally require BQL, e.g. during post_load()
> > + * there can be memory region updates. START may or may not
> > + * require it, but just to keep it simple to always hold BQL
> > + * for now.
> > + */
> > + WITH_BQL_HELD(
> > + bql_held,
> > + ret = qemu_loadvm_section_start_full(f, section_type));
> > if (ret < 0) {
> > goto out;
> > }
> > @@ -3059,7 +3065,11 @@ retry:
> > }
> > break;
> > case QEMU_VM_COMMAND:
> > - ret = loadvm_process_command(f);
> > + /*
> > + * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it
> > + * may internally need BQL.
> > + */
> > + ret = loadvm_process_command(f, bql_held);
> > trace_qemu_loadvm_state_section_command(ret);
> > if ((ret < 0) || (ret == LOADVM_QUIT)) {
> > goto out;
> > @@ -3103,7 +3113,7 @@ out:
> > return ret;
> > }
> >
> > -int qemu_loadvm_state(QEMUFile *f)
> > +int qemu_loadvm_state(QEMUFile *f, bool bql_held)
> > {
> > MigrationState *s = migrate_get_current();
> > MigrationIncomingState *mis = migration_incoming_get_current();
> > @@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f)
> > qemu_loadvm_state_switchover_ack_needed(mis);
> > }
> >
> > - cpu_synchronize_all_pre_loadvm();
> > + /* run_on_cpu() requires BQL */
> > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm());
> >
> > - ret = qemu_loadvm_state_main(f, mis);
> > + ret = qemu_loadvm_state_main(f, mis, bql_held);
> > qemu_event_set(&mis->main_thread_load_event);
> >
> > trace_qemu_loadvm_state_post_main(ret);
> > @@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f)
> > /* When reaching here, it must be precopy */
> > if (ret == 0) {
> > if (migrate_has_error(migrate_get_current()) ||
> > - !qemu_loadvm_thread_pool_wait(s, mis)) {
> > + !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) {
> > ret = -EINVAL;
> > } else {
> > ret = qemu_file_get_error(f);
> > @@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f)
> > }
> > }
> >
> > - cpu_synchronize_all_post_init();
> > + /* run_on_cpu() requires BQL */
> > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init());
> >
> > return ret;
> > }
> > @@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f)
> > int ret;
> >
> > /* Load QEMU_VM_SECTION_FULL section */
> > - ret = qemu_loadvm_state_main(f, mis);
> > + ret = qemu_loadvm_state_main(f, mis, true);
> > if (ret < 0) {
> > error_report("Failed to load device state: %d", ret);
> > return ret;
> > @@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
> > f = qemu_file_new_input(QIO_CHANNEL(ioc));
> > object_unref(OBJECT(ioc));
> >
> > - ret = qemu_loadvm_state(f);
> > + ret = qemu_loadvm_state(f, true);
> > qemu_fclose(f);
> > if (ret < 0) {
> > error_setg(errp, "loading Xen device state failed");
> > @@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate,
> > ret = -EINVAL;
> > goto err_drain;
> > }
> > - ret = qemu_loadvm_state(f);
> > + ret = qemu_loadvm_state(f, true);
> > migration_incoming_state_destroy();
> >
> > bdrv_drain_all_end();
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 706db97def..eeb41e03f1 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> > source_return_path_thread_switchover_acked(void) ""
> > migration_thread_low_pending(uint64_t pending) "%" PRIu64
> > migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> > -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > -process_incoming_migration_co_postcopy_end_main(void) ""
> > +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> > +process_incoming_migration_postcopy_end_main(void) ""
> > postcopy_preempt_enabled(bool value) "%d"
> > migration_precopy_complete(void) ""
>
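For readers following along: the WITH_BQL_HELD() helper used above is
introduced earlier in the patch.  A minimal sketch of the idea (the
series' actual definition may differ) is a variadic macro that runs a
statement with the BQL held, taking the lock only when the caller does
not already hold it; bql_lock()/bql_unlock() are QEMU's Big QEMU Lock
primitives:

    /*
     * Sketch only -- the real macro is defined by the patch itself.
     * Variadic so that statements containing commas, such as
     * "ret = qemu_loadvm_section_start_full(f, section_type)",
     * expand as a single argument.
     */
    #define WITH_BQL_HELD(bql_held, ...)        \
        do {                                    \
            if (bql_held) {                     \
                /* Caller already owns BQL */   \
                __VA_ARGS__;                    \
            } else {                            \
                bql_lock();                     \
                __VA_ARGS__;                    \
                bql_unlock();                   \
            }                                   \
        } while (0)

With that shape, main-thread callers such as load_snapshot() and
qmp_xen_load_devices_state() pass bql_held=true and take no extra
locks, while the new migration incoming thread would pass
bql_held=false and only acquire BQL around the CPU synchronizations
and the START/FULL section loads.
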
--
Peter Xu