From: "Dr. David Alan Gilbert" <dave@treblig.org>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fabiano Rosas <farosas@suse.de>,
Lukas Straub <lukasstraub2@web.de>,
qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>,
Zhang Chen <zhangckid@gmail.com>,
Hailiang Zhang <zhanghailiang@xfusion.com>,
Li Zhijian <lizhijian@fujitsu.com>
Subject: Re: COLO concurrency issues
Date: Fri, 20 Feb 2026 02:04:38 +0000
Message-ID: <aZfBNug85Xn_j4o_@gallifrey>
In-Reply-To: <20260219143620.GC817358@fedora>
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Feb 13, 2026 at 09:13:49AM -0300, Fabiano Rosas wrote:
> > Hi, I've been following the qemu-colo.rst steps to test COLO and
> > encountered a couple of issues. Unfortunately, I don't have cycles to
> > investigate further. Happens with QEMU master (also tested some versions
> > back until the COLO fix 0b5bf4ea76).
> >
> > 1) Deadlock at fdmon_io_uring_wait:
> >
> > (steps from qemu-colo.rst)
> > - Secondary Failover
> > - Secondary resume replication
> > - Start the new Secondary
> > - Sync
> > - Wait until disk is synced, then:
> >
> > {"execute": "stop"}
> > {"execute": "block-job-cancel", "arguments":{ "device": "resync" } }
> >
> > The above results in the old secondary hanging indefinitely at:
> >
> > do {
> >     ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
> > } while (ret == -EINTR);
> >
> > (gdb) bt
> > #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> > #1 0x00007f5519e0204e in ??? () at //usr/lib64/liburing.so.2
> > #2 0x00007f5519e01b00 in ??? () at //usr/lib64/liburing.so.2
> > #3 0x0000563c2dc06cc9 in fdmon_io_uring_wait (ctx=0x563c30411b00, ready_list=0x7ffd0bad8f58, timeout=575708467831) at ../util/fdmon-io_uring.c:416
> > #4 0x0000563c2dc00976 in aio_poll (ctx=0x563c30411b00, blocking=true) at ../util/aio-posix.c:699
> > #5 0x0000563c2daa01c6 in bdrv_drain_all_begin () at ../block/io.c:529
> > #6 0x0000563c2daa03d8 in bdrv_drain_all () at ../block/io.c:574
> > #7 0x0000563c2d764aae in do_vm_stop (state=RUN_STATE_PAUSED, send_stop=true) at ../system/cpus.c:312
> > #8 0x0000563c2d765964 in vm_stop (state=RUN_STATE_PAUSED) at ../system/cpus.c:754
> > #9 0x0000563c2d7f3378 in qmp_stop (errp=0x7ffd0bad9080) at ../monitor/qmp-cmds.c:62
> > #10 0x0000563c2dba7a72 in qmp_marshal_stop (args=0x563c306ac070, ret=0x7f5518dffda8, errp=0x7f5518dffda0) at qapi/qapi-commands-misc.c:197
> > #11 0x0000563c2dbf1316 in do_qmp_dispatch_bh (opaque=0x7f5518dffe40) at ../qapi/qmp-dispatch.c:128
> > #12 0x0000563c2dc1de48 in aio_bh_call (bh=0x563c3040fef0) at ../util/async.c:173
> > #13 0x0000563c2dc1df64 in aio_bh_poll (ctx=0x563c3040c070) at ../util/async.c:220
> > #14 0x0000563c2dbffff0 in aio_dispatch (ctx=0x563c3040c070) at ../util/aio-posix.c:389
> > #15 0x0000563c2dc1e3cd in aio_ctx_dispatch (source=0x563c3040c070, callback=0x0, user_data=0x0) at ../util/async.c:365
> > #16 0x00007f551b114f4c in g_main_dispatch (context=0x563c304120f0) at ../glib/gmain.c:3476
> > #17 g_main_context_dispatch_unlocked (context=context@entry=0x563c304120f0) at ../glib/gmain.c:4284
> > #18 0x00007f551b1170c9 in g_main_context_dispatch (context=0x563c304120f0) at ../glib/gmain.c:4272
> > #19 0x0000563c2dc1fa0b in glib_pollfds_poll () at ../util/main-loop.c:290
> > #20 0x0000563c2dc1fa85 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:313
> > #21 0x0000563c2dc1fb8a in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> > #22 0x0000563c2d78eb60 in qemu_main_loop () at ../system/runstate.c:903
> > #23 0x0000563c2db412fc in qemu_default_main (opaque=0x0) at ../system/main.c:50
> > #24 0x0000563c2db413ab in main (argc=40, argv=0x7ffd0bad94d8) at ../system/main.c:93
>
> Two ideas on how to debug further:
>
> 1. Attach to the hung QEMU process with a debugger and inspect the
> block_backends global variable (see block/block-backend.c). The
> question is why bdrv_drain_all_begin() is not making progress. There are
> probably in-flight requests that can be observed in the
> BlockDriverState->tracked_requests list. Also check the BlockDriverState
> and BlockBackend in_flight fields. This will let you identify which
> block device is causing the hang and what it's doing during the hang.
(I've not looked at this for ages)
I'd guess it's probably the network block sync. The fun part of this test
is that it's where the secondary is being restarted after a failure, so is
this blocking on the old sync connection or the new one?
And if it's blocking on the new one, is that because the secondary
is blocked?
Dave
> 2. Try disabling io_uring on the host via
> `sudo sysctl kernel.io_uring_disabled=2` and then run QEMU again. If this
> is an issue with QEMU's recently-enabled io_uring event loop, then there
> will be no hang with io_uring disabled.
>
> Stefan
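
For Stefan's second suggestion, it may help to confirm what the host's
setting actually is before and after the change. A minimal sketch, assuming
a kernel new enough to have the sysctl (as far as I know it appeared in
Linux 6.6). The function names are made up for illustration, and the value
meanings are paraphrased from the kernel sysctl documentation:

```shell
#!/bin/sh
# Hypothetical helpers, not part of QEMU or of any distro tooling.

# Translate a kernel.io_uring_disabled value into a human-readable meaning.
io_uring_disabled_meaning() {
    case "$1" in
        0) echo "io_uring enabled for all processes" ;;
        1) echo "io_uring disabled for unprivileged processes outside io_uring_group" ;;
        2) echo "io_uring disabled for all processes" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Report the current host setting, if this kernel exposes the sysctl at all.
report_io_uring_state() {
    f=/proc/sys/kernel/io_uring_disabled
    if [ -r "$f" ]; then
        io_uring_disabled_meaning "$(cat "$f")"
    else
        echo "kernel.io_uring_disabled not available on this kernel"
    fi
}
```

Calling report_io_uring_state before and after the sysctl change shows
whether the setting took effect. The sysctl only blocks creation of new
io_uring instances, so a QEMU that is already running keeps its ring, which
is why QEMU has to be started again for the test to mean anything.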
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/