From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu, qemu-devel@nongnu.org
Cc: "Dr. David Alan Gilbert", peterx@redhat.com, Kevin Wolf, Paolo Bonzini,
    Daniel P. Berrangé, Hailiang Zhang, Yury Kotov,
    Vladimir Sementsov-Ogievskiy, Prasad Pandit, Zhang Chen, Li Zhijian,
    Juraj Marcin
Subject: Re: [PATCH RFC 5/9] migration: Thread-ify precopy vmstate load process
In-Reply-To: <20250827205949.364606-6-peterx@redhat.com>
References: <20250827205949.364606-1-peterx@redhat.com> <20250827205949.364606-6-peterx@redhat.com>
Date: Wed, 17 Sep 2025 15:23:53 -0300
Message-ID: <87frclvsxy.fsf@suse.de>
Peter Xu writes:

> Migration module was there for 10+ years. Initially, it was in most cases
> based on coroutines. As more features were added into the framework, like
> postcopy, multifd, etc., it became a mixture of threads and coroutines.
>
> I'm guessing coroutines just can't fix all the issues that migration wants
> to resolve.
>
> After all these years, migration is now heavily based on a threaded model.
>
> There's still a major part of the migration framework that is not
> thread-based, which is precopy load. We have done the load in a separate
> thread in postcopy since the first day postcopy was introduced; however,
> that requires a separate state transition from precopy loading all devices
> first, which still happens in a coroutine on the main thread.
>
> This patch moves the migration incoming side to run inside a separate
> thread (mig/dst/main), just like the src (mig/src/main). The entry point
> is migration_incoming_thread().
>
> Quite a few things are needed to make it fly..
>
> BQL Analysis
> ============
>
> Firstly, when moving it over to the thread, it means the thread cannot
> take the BQL during the whole process of loading anymore, because
> otherwise it can block the main thread from using the BQL for all kinds
> of other concurrent tasks (for example, processing QMP / HMP commands).

Pure question: how does load_snapshot avoid these issues if it doesn't
already use a coroutine? I feel I'm missing something.

Not that I disagree with the concerns around threading + BQL, I'm just
wondering if the issues may be fairly small since snapshots work fine.

> Here the first question to ask is: what needs the BQL during precopy
> load, and what doesn't?
>
> Most of the load process shouldn't need the BQL, especially when it's
> about RAM. After all, RAM is still the major chunk of data to move for a
> live migration process.
> VFIO started to change that, though, but still, VFIO is per-device, so
> that shouldn't need the BQL either in most cases.
>
> Generic device loads will need the BQL, likely not when receiving VMSDs,
> but when applying them. One example is that any post_load() could
> potentially inject memory regions, causing memory transactions to happen.
> That'll need to update the global address spaces, hence requires the BQL.
> The other one is CPU sync operations; even if the sync alone may not need
> the BQL (which is still to be further justified), run_on_cpu() will need
> it.
>
> For that, the qemu_loadvm_state() and qemu_loadvm_state_main() functions
> now need to take a "bql_held" parameter saying whether the BQL is held.
> We could use things like BQL_LOCK_GUARD(), but this patch goes with
> explicit locking rather than relying on the bql_locked TLS variable.

Why exactly? It seems redundant to plumb the variable through when we have
bql_locked and the macros around it. At first sight I'd say we could
already add BQL macros around code that we're sure needs it, which would
maybe simplify the patch a bit.

> In case of migration, we always know whether the BQL is held in a given
> context, as long as we can still pass that information downwards.
>
> COLO
> ====
>
> COLO assumed the dest VM load happens in a coroutine. After this patch,
> it doesn't anymore. Change that by invoking colo_incoming_co() directly
> from migration_incoming_thread().
>
> The name (colo_incoming_co()) isn't proper anymore. Change it to
> colo_incoming_wait(), removing the coroutine annotation alongside.
>
> Remove all the bql_lock() implications in COLO, e.g., colo_incoming_co()
> used to release the lock for a short period during join(). Now that's not
> needed.
>
> Meanwhile, there's the colo_incoming_co variable that used to store the
> COLO incoming coroutine, only to be kicked off when a secondary failover
> happens.
>
> To recap, what should happen for such a failover is (taking the example
> of a QMP command x-colo-lost-heartbeat triggering on dest QEMU):
>
> - The QMP command will kick off both the coroutine and the COLO
>   thread (colo_process_incoming_thread()), with something like:
>
>       /* Notify COLO incoming thread that failover work is finished */
>       qemu_event_set(&mis->colo_incoming_event);
>
>       qemu_coroutine_enter(mis->colo_incoming_co);
>
> - The coroutine, which yielded itself before, now resumes after enter(),
>   then it'll wait for the join():
>
>       mis->colo_incoming_co = qemu_coroutine_self();
>       qemu_coroutine_yield();
>       mis->colo_incoming_co = NULL;
>
>       /* Wait checkpoint incoming thread exit before free resource */
>       qemu_thread_join(&th);
>
> Here, when switching to a thread model, it should be fine to remove the
> colo_incoming_co variable completely, because then the incoming thread
> will (instead of yielding the coroutine) wait at qemu_thread_join() until
> the COLO thread completes execution (after receiving
> colo_incoming_event).
>
> RDMA
> ====
>
> With the prior patch making sure io_watch won't block for RDMA
> iochannels, RDMA threads should only block at their io_readv/io_writev
> functions. When a disconnection is detected (as in
> rdma_cm_poll_handler()), the update to the "errored" field will be
> immediately reflected in the migration incoming thread. Hence the
> coroutine for RDMA is not needed anymore to kick the thread out.
>
> TODO
> ====
>
> Currently the BQL is taken during loading of a START|FULL section. When
> the IO hangs (e.g. a network issue) during this process, it could
> potentially block others like the monitor servers. One solution is
> breaking the BQL into a smaller granularity and leaving IO always
> BQL-free. That'll need more justification.
> > For example, there are at least four things that need some closer > attention: > > - SaveVMHandlers's load_state(): this likely DO NOT need BQL, but we need > to justify all of them (not to mention, some of them look like prone to > be rewritten as VMSDs..) > > - VMSD's pre_load(): in most cases, this DO NOT really need BQL, but > sometimes maybe it will! Double checking on this will be needed. > > - VMSD's post_load(): in many cases, this DO need BQL, for example on > address space operations. Likely we should just take it for any > post_load(). > > - VMSD field's get(): this is tricky! It could internally be anything > even if it was only a field. E.g. there can be users to use a SINGLE > field to load a whole VMSD, which can further introduce more > possibilities. > > In general, QEMUFile IOs should not need BQL, that is when receiving the > VMSD data and waiting for e.g. the socket buffer to get refilled. But > that's the easy part. > > Signed-off-by: Peter Xu > --- > include/migration/colo.h | 6 ++-- > migration/migration.h | 52 ++++++++++++++++++++++++++------ > migration/savevm.h | 5 ++-- > migration/channel.c | 7 ++--- > migration/colo-stubs.c | 2 +- > migration/colo.c | 23 ++++----------- > migration/migration.c | 62 ++++++++++++++++++++++++++++---------- > migration/rdma.c | 5 ---- > migration/savevm.c | 64 ++++++++++++++++++++++++---------------- > migration/trace-events | 4 +-- > 10 files changed, 142 insertions(+), 88 deletions(-) > > diff --git a/include/migration/colo.h b/include/migration/colo.h > index 43222ef5ae..bfb30eccf0 100644 > --- a/include/migration/colo.h > +++ b/include/migration/colo.h > @@ -44,12 +44,10 @@ void colo_do_failover(void); > void colo_checkpoint_delay_set(void); > > /* > - * Starts COLO incoming process. Called from process_incoming_migration_co() > + * Starts COLO incoming process. Called from migration_incoming_thread() > * after loading the state. > - * > - * Called with BQL locked, may temporary release BQL. 
> */ > -void coroutine_fn colo_incoming_co(void); > +void colo_incoming_wait(void); > > void colo_shutdown(void); > #endif > diff --git a/migration/migration.h b/migration/migration.h > index 01329bf824..c4a626eed4 100644 > --- a/migration/migration.h > +++ b/migration/migration.h > @@ -42,6 +42,44 @@ > #define MIGRATION_THREAD_DST_LISTEN "mig/dst/listen" > #define MIGRATION_THREAD_DST_PREEMPT "mig/dst/preempt" > > +/** > + * WITH_BQL_HELD(): Run a task, making sure BQL is held > + * > + * @bql_held: Whether BQL is already held > + * @task: The task to run within BQL held > + */ > +#define WITH_BQL_HELD(bql_held, task) \ > + do { \ > + if (!bql_held) { \ > + bql_lock(); \ > + } else { \ > + assert(bql_locked()); \ > + } \ > + task; \ > + if (!bql_held) { \ > + bql_unlock(); \ > + } \ > + } while (0) > + > +/** > + * WITHOUT_BQL_HELD(): Run a task, making sure BQL is released > + * > + * @bql_held: Whether BQL is already held > + * @task: The task to run making sure BQL released > + */ > +#define WITHOUT_BQL_HELD(bql_held, task) \ > + do { \ > + if (bql_held) { \ > + bql_unlock(); \ > + } else { \ > + assert(!bql_locked()); \ > + } \ > + task; \ > + if (bql_held) { \ > + bql_lock(); \ > + } \ > + } while (0) > + > struct PostcopyBlocktimeContext; > typedef struct ThreadPool ThreadPool; > > @@ -119,6 +157,10 @@ struct MigrationIncomingState { > bool have_listen_thread; > QemuThread listen_thread; > > + /* Migration main recv thread */ > + bool have_recv_thread; > + QemuThread recv_thread; > + > /* For the kernel to send us notifications */ > int userfault_fd; > /* To notify the fault_thread to wake, e.g., when need to quit */ > @@ -177,15 +219,7 @@ struct MigrationIncomingState { > > MigrationStatus state; > > - /* > - * The incoming migration coroutine, non-NULL during qemu_loadvm_state(). > - * Used to wake the migration incoming coroutine from rdma code. How much is > - * it safe - it's a question. 
> - */ > - Coroutine *loadvm_co; > - > - /* The coroutine we should enter (back) after failover */ > - Coroutine *colo_incoming_co; > + /* Notify secondary VM to move on */ > QemuEvent colo_incoming_event; > > /* Optional load threads pool and its thread exit request flag */ > diff --git a/migration/savevm.h b/migration/savevm.h > index 2d5e9c7166..c07e14f61a 100644 > --- a/migration/savevm.h > +++ b/migration/savevm.h > @@ -64,9 +64,10 @@ void qemu_savevm_send_colo_enable(QEMUFile *f); > void qemu_savevm_live_state(QEMUFile *f); > int qemu_save_device_state(QEMUFile *f); > > -int qemu_loadvm_state(QEMUFile *f); > +int qemu_loadvm_state(QEMUFile *f, bool bql_held); > void qemu_loadvm_state_cleanup(MigrationIncomingState *mis); > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis); > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis, > + bool bql_held); > int qemu_load_device_state(QEMUFile *f); > int qemu_loadvm_approve_switchover(void); > int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, > diff --git a/migration/channel.c b/migration/channel.c > index a547b1fbfe..621f8a4a2a 100644 > --- a/migration/channel.c > +++ b/migration/channel.c > @@ -136,11 +136,8 @@ int migration_channel_read_peek(QIOChannel *ioc, > } > > /* 1ms sleep. */ > - if (qemu_in_coroutine()) { > - qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000); > - } else { > - g_usleep(1000); > - } > + assert(!qemu_in_coroutine()); > + g_usleep(1000); What bug is this hiding? 
=) > } > > return 0; > diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c > index e22ce65234..ef77d1ab4b 100644 > --- a/migration/colo-stubs.c > +++ b/migration/colo-stubs.c > @@ -9,7 +9,7 @@ void colo_shutdown(void) > { > } > > -void coroutine_fn colo_incoming_co(void) > +void colo_incoming_wait(void) > { > } > > diff --git a/migration/colo.c b/migration/colo.c > index e0f713c837..f5722d9d9d 100644 > --- a/migration/colo.c > +++ b/migration/colo.c > @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void) > } > /* Notify COLO incoming thread that failover work is finished */ > qemu_event_set(&mis->colo_incoming_event); > - > - /* For Secondary VM, jump to incoming co */ > - if (mis->colo_incoming_co) { > - qemu_coroutine_enter(mis->colo_incoming_co); > - } > } > > static void primary_vm_do_failover(void) > @@ -686,7 +681,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis, > > bql_lock(); > cpu_synchronize_all_states(); > - ret = qemu_loadvm_state_main(mis->from_src_file, mis); > + ret = qemu_loadvm_state_main(mis->from_src_file, mis, true); > bql_unlock(); > > if (ret < 0) { > @@ -854,10 +849,8 @@ static void *colo_process_incoming_thread(void *opaque) > goto out; > } > /* > - * Note: the communication between Primary side and Secondary side > - * should be sequential, we set the fd to unblocked in migration incoming > - * coroutine, and here we are in the COLO incoming thread, so it is ok to > - * set the fd back to blocked. > + * Here we are in the COLO incoming thread, so it is ok to set the fd > + * to blocked. 
nit: s/blocked/blocking/

>      */
>      qemu_file_set_blocking(mis->from_src_file, true);
>  
> @@ -930,26 +923,20 @@ out:
>      return NULL;
>  }
>  
> -void coroutine_fn colo_incoming_co(void)
> +/* Wait for failover */
> +void colo_incoming_wait(void)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      QemuThread th;
>  
> -    assert(bql_locked());
>      assert(migration_incoming_colo_enabled());
>  
>      qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
>                         colo_process_incoming_thread,
>                         mis, QEMU_THREAD_JOINABLE);
>  
> -    mis->colo_incoming_co = qemu_coroutine_self();
> -    qemu_coroutine_yield();
> -    mis->colo_incoming_co = NULL;
> -
> -    bql_unlock();

What does the BQL protect from colo_do_failover() until here? Could we
have a preliminary patch reducing the BQL scope? I'm thinking about which
changes we can merge upfront so we're already testing our assumptions
before the whole series completes.

>      /* Wait checkpoint incoming thread exit before free resource */
>      qemu_thread_join(&th);
> -    bql_lock();
>  
>      /* We hold the global BQL, so it is safe here */
>      colo_release_ram_cache();

Maybe a candidate for WITH_BQL_HELD?

> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..7e4d25b15c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -494,6 +494,11 @@ void migration_incoming_state_destroy(void)
>          mis->postcopy_qemufile_dst = NULL;
>      }
>  
> +    if (mis->have_recv_thread) {

Help me out here: is this read race-free with the write at
migration_incoming_process() because...? This read is reached from
qemu_bh_schedule() while the write comes from the
qio_channel_add_watch_full() callback. Are those (potentially) different
AioContexts, and thus the BQL is what's really doing the work (instead of
the event serialization caused by the main loop)?
> + qemu_thread_join(&mis->recv_thread); > + mis->have_recv_thread = false; > + } > + > cpr_set_incoming_mode(MIG_MODE_NONE); > yank_unregister_instance(MIGRATION_YANK_INSTANCE); > } > @@ -864,30 +869,46 @@ static void process_incoming_migration_bh(void *opaque) BTW, I don't think we need this BH anymore since effd60c8781. This process_incoming_migration_bh was originally introduced (0aa6aefc9) to move the brdv_activate_all out of coroutines due to a bug in the block layer, basically this: bdrv_activate_all -> bdrv_invalidate_cache -> coroutine_fn qcow2_co_invalidate_cache { ... BDRVQcow2State *s = bs->opaque; ... memset(s, 0, sizeof(BDRVQcow2State)); <-- clears memory ... flags &= ~BDRV_O_INACTIVE; qemu_co_mutex_lock(&s->lock); ret = qcow2_do_open(bs, options, flags, false, errp); ^ may yield before repopulating all of BDRVQcow2State info block or something else reads 's' and BOOM qemu_co_mutex_unlock(&s->lock); ... } Note that 2 years after the BH creation (0aa6aefc9), 2b148f392b2 has moved the invalidate function back into a coroutine anyway. > migration_incoming_state_destroy(); > } > > -static void coroutine_fn > -process_incoming_migration_co(void *opaque) > +static void migration_incoming_state_destroy_bh(void *opaque) I only mention all of the above because it would allow merging the two paths that call migration_incoming_state_destroy() and avoid this new BH. > +{ > + struct MigrationIncomingState *mis = opaque; > + > + if (mis->exit_on_error) { > + /* > + * NOTE: this exit() should better happen in the main thread, as > + * the exit notifier may require BQL which can deadlock. See > + * commit e7bc0204e57836 for example. 
> + */ > + exit(EXIT_FAILURE); > + } > + > + migration_incoming_state_destroy(); > +} > + > +static void *migration_incoming_thread(void *opaque) > { > MigrationState *s = migrate_get_current(); > - MigrationIncomingState *mis = migration_incoming_get_current(); > + MigrationIncomingState *mis = opaque; > PostcopyState ps; > int ret; > Error *local_err = NULL; > > + rcu_register_thread(); > + > assert(mis->from_src_file); > + assert(!bql_locked()); > > mis->largest_page_size = qemu_ram_pagesize_largest(); > postcopy_state_set(POSTCOPY_INCOMING_NONE); > migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP, > MIGRATION_STATUS_ACTIVE); > > - mis->loadvm_co = qemu_coroutine_self(); > - ret = qemu_loadvm_state(mis->from_src_file); > - mis->loadvm_co = NULL; > + ret = qemu_loadvm_state(mis->from_src_file, false); > > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed"); > > ps = postcopy_state_get(); > - trace_process_incoming_migration_co_end(ret, ps); > + trace_process_incoming_migration_end(ret, ps); > if (ps != POSTCOPY_INCOMING_NONE) { > if (ps == POSTCOPY_INCOMING_ADVISE) { > /* > @@ -901,7 +922,7 @@ process_incoming_migration_co(void *opaque) > * Postcopy was started, cleanup should happen at the end of the > * postcopy thread. 
> */ > - trace_process_incoming_migration_co_postcopy_end_main(); > + trace_process_incoming_migration_postcopy_end_main(); > goto out; > } > /* Else if something went wrong then just fall out of the normal exit */ > @@ -913,8 +934,8 @@ process_incoming_migration_co(void *opaque) > } > > if (migration_incoming_colo_enabled()) { > - /* yield until COLO exit */ > - colo_incoming_co(); > + /* wait until COLO exits */ > + colo_incoming_wait(); > } > > migration_bh_schedule(process_incoming_migration_bh, mis); > @@ -926,19 +947,24 @@ fail: > migrate_set_error(s, local_err); > error_free(local_err); > > - migration_incoming_state_destroy(); > - Moving this below the exit will affect the source I think, for instance: migration_incoming_state_destroy { ... /* Tell source that we are done */ migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0); > if (mis->exit_on_error) { > WITH_QEMU_LOCK_GUARD(&s->error_mutex) { > error_report_err(s->error); > s->error = NULL; > } > - > - exit(EXIT_FAILURE); > } > + > + /* > + * There's some step of the destroy process that will need to happen in > + * the main thread (e.g. joining this thread itself). Leave to a BH. 
> + */ > + migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis); > + > out: > /* Pairs with the refcount taken in qmp_migrate_incoming() */ > migrate_incoming_unref_outgoing_state(); > + rcu_unregister_thread(); > + return NULL; > } > > /** > @@ -956,8 +982,12 @@ static void migration_incoming_setup(QEMUFile *f) > > void migration_incoming_process(void) > { > - Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL); > - qemu_coroutine_enter(co); > + MigrationIncomingState *mis = migration_incoming_get_current(); > + > + mis->have_recv_thread = true; > + qemu_thread_create(&mis->recv_thread, "mig/dst/main", > + migration_incoming_thread, mis, > + QEMU_THREAD_JOINABLE); > } > > /* Returns true if recovered from a paused migration, otherwise false */ > diff --git a/migration/rdma.c b/migration/rdma.c > index bcd7aae2f2..2b995513aa 100644 > --- a/migration/rdma.c > +++ b/migration/rdma.c > @@ -3068,7 +3068,6 @@ static void rdma_cm_poll_handler(void *opaque) > { > RDMAContext *rdma = opaque; > struct rdma_cm_event *cm_event; > - MigrationIncomingState *mis = migration_incoming_get_current(); > > if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) { > error_report("get_cm_event failed %d", errno); > @@ -3087,10 +3086,6 @@ static void rdma_cm_poll_handler(void *opaque) > } > } > rdma_ack_cm_event(cm_event); > - if (mis->loadvm_co) { > - qemu_coroutine_enter(mis->loadvm_co); > - } > - return; > } > rdma_ack_cm_event(cm_event); > } > diff --git a/migration/savevm.c b/migration/savevm.c > index fabbeb296a..ad606c5425 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -154,11 +154,10 @@ static void qemu_loadvm_thread_pool_destroy(MigrationIncomingState *mis) > } > > static bool qemu_loadvm_thread_pool_wait(MigrationState *s, > - MigrationIncomingState *mis) > + MigrationIncomingState *mis, > + bool bql_held) > { > - bql_unlock(); /* Let load threads do work requiring BQL */ > - thread_pool_wait(mis->load_threads); > - 
bql_lock(); > + WITHOUT_BQL_HELD(bql_held, thread_pool_wait(mis->load_threads)); > > return !migrate_has_error(s); > } > @@ -2091,14 +2090,11 @@ static void *postcopy_ram_listen_thread(void *opaque) > trace_postcopy_ram_listen_thread_start(); > > rcu_register_thread(); > - /* > - * Because we're a thread and not a coroutine we can't yield > - * in qemu_file, and thus we must be blocking now. > - */ > + /* Because we're a thread, making sure to use blocking mode */ > qemu_file_set_blocking(f, true); > > /* TODO: sanity check that only postcopiable data will be loaded here */ > - load_res = qemu_loadvm_state_main(f, mis); > + load_res = qemu_loadvm_state_main(f, mis, false); > > /* > * This is tricky, but, mis->from_src_file can change after it > @@ -2392,13 +2388,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis) > * Immediately following this command is a blob of data containing an embedded > * chunk of migration stream; read it and load it. > * > - * @mis: Incoming state > - * @length: Length of packaged data to read > + * @mis: Incoming state > + * @bql_held: Whether BQL is held already > * > * Returns: Negative values on error > * > */ > -static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis) > +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis, > + bool bql_held) > { > int ret; > size_t length; > @@ -2449,7 +2446,7 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis) > qemu_coroutine_yield(); > } while (1); > > - ret = qemu_loadvm_state_main(packf, mis); > + ret = qemu_loadvm_state_main(packf, mis, bql_held); > trace_loadvm_handle_cmd_packaged_main(ret); > qemu_fclose(packf); > object_unref(OBJECT(bioc)); > @@ -2539,7 +2536,7 @@ static int loadvm_postcopy_handle_switchover_start(void) > * LOADVM_QUIT All good, but exit the loop > * <0 Error > */ > -static int loadvm_process_command(QEMUFile *f) > +static int loadvm_process_command(QEMUFile *f, bool bql_held) > { > MigrationIncomingState *mis 
= migration_incoming_get_current(); > uint16_t cmd; > @@ -2609,7 +2606,7 @@ static int loadvm_process_command(QEMUFile *f) > break; > > case MIG_CMD_PACKAGED: > - return loadvm_handle_cmd_packaged(mis); > + return loadvm_handle_cmd_packaged(mis, bql_held); > > case MIG_CMD_POSTCOPY_ADVISE: > return loadvm_postcopy_handle_advise(mis, len); > @@ -3028,7 +3025,8 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis) > return true; > } > > -int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis) > +int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis, > + bool bql_held) > { > uint8_t section_type; > int ret = 0; > @@ -3046,7 +3044,15 @@ retry: > switch (section_type) { > case QEMU_VM_SECTION_START: > case QEMU_VM_SECTION_FULL: > - ret = qemu_loadvm_section_start_full(f, section_type); > + /* > + * FULL should normally require BQL, e.g. during post_load() > + * there can be memory region updates. START may or may not > + * require it, but just to keep it simple to always hold BQL > + * for now. > + */ > + WITH_BQL_HELD( > + bql_held, > + ret = qemu_loadvm_section_start_full(f, section_type)); > if (ret < 0) { > goto out; > } > @@ -3059,7 +3065,11 @@ retry: > } > break; > case QEMU_VM_COMMAND: > - ret = loadvm_process_command(f); > + /* > + * Be careful; QEMU_VM_COMMAND can embed FULL sections, so it > + * may internally need BQL. 
> + */ > + ret = loadvm_process_command(f, bql_held); > trace_qemu_loadvm_state_section_command(ret); > if ((ret < 0) || (ret == LOADVM_QUIT)) { > goto out; > @@ -3103,7 +3113,7 @@ out: > return ret; > } > > -int qemu_loadvm_state(QEMUFile *f) > +int qemu_loadvm_state(QEMUFile *f, bool bql_held) > { > MigrationState *s = migrate_get_current(); > MigrationIncomingState *mis = migration_incoming_get_current(); > @@ -3131,9 +3141,10 @@ int qemu_loadvm_state(QEMUFile *f) > qemu_loadvm_state_switchover_ack_needed(mis); > } > > - cpu_synchronize_all_pre_loadvm(); > + /* run_on_cpu() requires BQL */ > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_pre_loadvm()); > > - ret = qemu_loadvm_state_main(f, mis); > + ret = qemu_loadvm_state_main(f, mis, bql_held); > qemu_event_set(&mis->main_thread_load_event); > > trace_qemu_loadvm_state_post_main(ret); > @@ -3149,7 +3160,7 @@ int qemu_loadvm_state(QEMUFile *f) > /* When reaching here, it must be precopy */ > if (ret == 0) { > if (migrate_has_error(migrate_get_current()) || > - !qemu_loadvm_thread_pool_wait(s, mis)) { > + !qemu_loadvm_thread_pool_wait(s, mis, bql_held)) { > ret = -EINVAL; > } else { > ret = qemu_file_get_error(f); > @@ -3196,7 +3207,8 @@ int qemu_loadvm_state(QEMUFile *f) > } > } > > - cpu_synchronize_all_post_init(); > + /* run_on_cpu() requires BQL */ > + WITH_BQL_HELD(bql_held, cpu_synchronize_all_post_init()); > > return ret; > } > @@ -3207,7 +3219,7 @@ int qemu_load_device_state(QEMUFile *f) > int ret; > > /* Load QEMU_VM_SECTION_FULL section */ > - ret = qemu_loadvm_state_main(f, mis); > + ret = qemu_loadvm_state_main(f, mis, true); > if (ret < 0) { > error_report("Failed to load device state: %d", ret); > return ret; > @@ -3438,7 +3450,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp) > f = qemu_file_new_input(QIO_CHANNEL(ioc)); > object_unref(OBJECT(ioc)); > > - ret = qemu_loadvm_state(f); > + ret = qemu_loadvm_state(f, true); > qemu_fclose(f); > if (ret < 0) { > error_setg(errp, 
"loading Xen device state failed"); > @@ -3512,7 +3524,7 @@ bool load_snapshot(const char *name, const char *vmstate, > ret = -EINVAL; > goto err_drain; > } > - ret = qemu_loadvm_state(f); > + ret = qemu_loadvm_state(f, true); > migration_incoming_state_destroy(); > > bdrv_drain_all_end(); > diff --git a/migration/trace-events b/migration/trace-events > index 706db97def..eeb41e03f1 100644 > --- a/migration/trace-events > +++ b/migration/trace-events > @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32 > source_return_path_thread_switchover_acked(void) "" > migration_thread_low_pending(uint64_t pending) "%" PRIu64 > migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64 > -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d" > -process_incoming_migration_co_postcopy_end_main(void) "" > +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d" > +process_incoming_migration_postcopy_end_main(void) "" > postcopy_preempt_enabled(bool value) "%d" > migration_precopy_complete(void) ""