From: Lukas Straub <lukasstraub2@web.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	Laurent Vivier <lvivier@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Zhang Chen <zhangckid@gmail.com>,
	Hailiang Zhang <zhanghailiang@xfusion.com>,
	Markus Armbruster <armbru@redhat.com>,
	Li Zhijian <lizhijian@fujitsu.com>,
	"Dr. David Alan Gilbert" <dave@treblig.org>
Subject: Re: [PATCH v3 06/10] migration-test: Add COLO migration unit test
Date: Fri, 6 Feb 2026 20:11:22 +0100
Message-ID: <20260206201050.6a692a34@penguin>
In-Reply-To: <aYJmwfQgw0dD7CjD@x1.local>
On Tue, 3 Feb 2026 16:21:05 -0500
Peter Xu <peterx@redhat.com> wrote:
> On Tue, Feb 03, 2026 at 10:18:22AM +0100, Lukas Straub wrote:
> > On Mon, 2 Feb 2026 09:26:06 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > On Fri, Jan 30, 2026 at 11:24:02AM +0100, Lukas Straub wrote:
> > > > On Tue, 27 Jan 2026 15:49:31 -0500
> > > > Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > > On Sun, Jan 25, 2026 at 09:40:11PM +0100, Lukas Straub wrote:
> > > > > > > +void migration_test_add_colo(MigrationTestEnv *env)
> > > > > > > +{
> > > > > > > +    if (!env->has_kvm) {
> > > > > > > +        g_test_skip("COLO requires KVM accelerator");
> > > > > > > +        return;
> > > > > > > +    }
> > > > >
> > > > > > I'm OK if you want to explicitly bypass others, but could you explain
> > > > > > why?
> > > > >
> > > > > Thanks,
> > > > >
> > > >
> > > > It used to hang with TCG. Now it crashes, since
> > > > migration_bitmap_sync_precopy assumes bql is held. Something for later.
> > >
> > > If we want to keep COLO around and be serious, let's try to make COLO the
> > > same standard we target for migration in general whenever possible. We
> > > shouldn't randomly workaround bugs. We should fix it.
> > >
> > > It looks to me there's some locking issue instead.
> > >
> > > > Iterator's complete() requires BQL. Would a patch like the one below make
> > > > sense to you?
> > >
> > > diff --git a/migration/colo.c b/migration/colo.c
> > > index db783f6fa7..b3ea137120 100644
> > > --- a/migration/colo.c
> > > +++ b/migration/colo.c
> > > @@ -458,8 +458,8 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> > > >      /* Note: device state is saved into buffer */
> > > >      ret = qemu_save_device_state(fb);
> > > >
> > > > -    bql_unlock();
> > > >      if (ret < 0) {
> > > > +        bql_unlock();
> > > >          goto out;
> > > >      }
> > > >
> > > > @@ -473,6 +473,9 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
> > > >       */
> > > >      qemu_savevm_live_state(s->to_dst_file);
> > > >
> > > > +    /* Save live state requires BQL */
> > > > +    bql_unlock();
> > > > +
> > > >      qemu_fflush(fb);
> > > >
> > > >      /*
> >
> > I already tested that and it works. However, we have to be very careful
> > around the locking here and I don't think it is safe to take the bql on
> > the primary here:
> >
> > The secondary has the bql held at this point:
>
> This is definitely an interesting piece of code... one question:
>
> >
> > colo_receive_check_message(mis->from_src_file,
> > COLO_MESSAGE_VMSTATE_SEND, &local_err);
> > ...
> > bql_lock();
> > cpu_synchronize_all_states();
>
> Why this is needed at all? ^^^^^^^^^^^^^^^
>
> The qemu_loadvm_state_main() line right below should only load RAM. I
> don't see how it has anything to do with CPU register states..

You are right, we don't need this and the lock is needed here. Then I'm
fine with removing the lock here and adding one on the primary side.

>
> > ret = qemu_loadvm_state_main(mis->from_src_file, mis, errp);
> > bql_unlock();
> >
> > On the primary there is a filter-mirror mirroring incoming packets to
> > the secondary filter-redirector. However since the secondary migration
> > holds bql the receiving filter is blocked and will not receive anything
> > from the socket. Thus filter-mirror on the primary also may get blocked
> > during send and block the mainloop (It uses blocking IO).
>
> Hmm... could you explain why a blocking IO operation to mirror some packets
> requires holding BQL? This sounds wrong on its own.

Yes, there is no need for the BQL; it is simply wrong. The tap fd gets a
POLLIN event, the main loop takes the BQL and calls the tap fd callback. Tap
reads a packet from the fd and calls qemu_send_packet_async(), which
puts it through the net-filters, and filter-mirror does a blocking send,
blocking the main loop while the BQL is held.
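
To illustrate why the migration thread then cannot make progress, here is
a minimal standalone sketch. It is not QEMU code: a plain pthread mutex
(fake_bql) stands in for the BQL, and the names migration_thread_try_lock()
and migration_gets_lock() are hypothetical. The "main loop" side simply
keeps the mutex locked to model a callback stuck in a blocking send, and
the "migration thread" uses trylock instead of lock so the demo cannot
hang.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the BQL (assumption: not QEMU's real lock). */
static pthread_mutex_t fake_bql = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical "migration thread". It only *tries* to take the lock,
 * so this demo reports the contention instead of deadlocking.
 */
static void *migration_thread_try_lock(void *opaque)
{
    bool *acquired = opaque;
    int rc = pthread_mutex_trylock(&fake_bql);

    *acquired = (rc == 0);
    if (rc == 0) {
        pthread_mutex_unlock(&fake_bql);
    }
    return NULL;
}

/*
 * Returns true if the migration thread could take the lock.
 * With main_loop_stuck_in_send set, the caller plays the main loop
 * that took the lock for the tap callback and never got out of the
 * blocking send.
 */
static bool migration_gets_lock(bool main_loop_stuck_in_send)
{
    pthread_t tid;
    bool acquired = false;

    if (main_loop_stuck_in_send) {
        pthread_mutex_lock(&fake_bql);
    }

    if (pthread_create(&tid, NULL, migration_thread_try_lock,
                       &acquired) == 0) {
        pthread_join(tid, NULL);
    }

    if (main_loop_stuck_in_send) {
        pthread_mutex_unlock(&fake_bql);
    }
    return acquired;
}
```

With a real blocking lock instead of trylock, the second case is the
point where the migration thread would wait forever, as long as the peer
never drains the socket.
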
>
> >
> > Now if the primary migration thread wants to take the bql it will
> > deadlock.
> >
> > So I think this is something to fix in a separate series since it is
> > more involved.
>
> Yes it might be involved, but this is really not something like "let's make
> it simple for now and improve it later". This is "OK this function
> _requires_ this lock, but let's not take this lock and leave it for
> later". It's not something we can put aside, afaiu. We should really fix
> it..
>
> How far do you think we can fix it? Could you explain the problem better?
>
> It might be helpful if you can reproduce the hang, then attach the logs
> from both QEMU on a full thread backtrace dump. I'll see what I can help.
>
> Thanks,
>