* [BUG] Block graph assertion failure on blockdev-add
@ 2025-11-12 16:41 andrey.drobyshev
2025-11-12 16:41 ` [RFC PATCH 1/1] block: avoid global drain on graph subtrees manipulation andrey.drobyshev
2025-11-18 22:42 ` [BUG] Block graph assertion failure on blockdev-add Kevin Wolf
0 siblings, 2 replies; 3+ messages in thread
From: andrey.drobyshev @ 2025-11-12 16:41 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, kwolf, f.ebner, hreitz, den, andrey.drobyshev
From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Hi all,
There's a bug in the block layer which leads to an assertion failure and crash.
The reproducer is flaky, which suggests there's a race involved. Here's how
it's reproduced:
1. Run QEMU with 2 disks: one system disk (with an actual guest) and one empty;
   attach them to an iothread;
2. In the guest, format the empty disk and start sequential IO on it;
3. On the host, start growing a chain of snapshots on top of the empty image.
Here are the scripts:
1. Running QEMU:
QMPSOCK=/var/run/qmp.sock
SERIALSOCK=/var/run/serial.sock
ROOTIMG=/path/to/rootimg.qcow2
EMPTYIMG=/path/to/emptyimg.qcow2
rm -f $EMPTYIMG
qemu-img create -f qcow2 \
-o cluster_size=1M,extended_l2=on,lazy_refcounts=on \
$EMPTYIMG 64G
/path/to/qemu/build/qemu-system-x86_64 -enable-kvm \
-machine q35 -cpu Nehalem \
-name guest=guestvm,debug-threads=on \
-m 2g -smp 2 \
-nographic \
\
-serial unix:$SERIALSOCK,server=on,wait=off \
-qmp unix:$QMPSOCK,server=on,wait=off \
\
-object iothread,id=iothread1 \
\
-blockdev node-name=disk-storage,driver=file,filename=$ROOTIMG,aio=native,cache.direct=on,cache.no-flush=off,auto-read-only=true,discard=unmap \
-blockdev node-name=disk-prealloc,driver=preallocate,file=disk-storage,discard=unmap \
-blockdev node-name=disk-format,driver=qcow2,file=disk-prealloc,read-only=false,discard=unmap \
\
-blockdev node-name=empty-storage,driver=file,filename=$EMPTYIMG,aio=native,cache.direct=on,cache.no-flush=off,auto-read-only=true,discard=unmap \
-blockdev node-name=empty-prealloc,driver=preallocate,file=empty-storage,discard=unmap \
-blockdev node-name=empty-format,driver=qcow2,file=empty-prealloc,read-only=false,discard=unmap \
\
-device pcie-root-port,id=root_port0,slot=1,bus=pcie.0 \
-device pcie-root-port,id=root_port1,slot=2,bus=pcie.0 \
-device virtio-blk-pci,drive=disk-format,id=virtio-disk0,bus=root_port0,bootindex=1 \
-device virtio-blk-pci,drive=empty-format,iothread=iothread1,id=virtio-disk1,bus=root_port1
2. Formatting the disk in the guest and running sequential IO:
BLKDEV=$(ls /dev/[sv]d[b-z] | sort | sed -n '1p')
sgdisk -Z -n 1:0:0 -t 1:8300 $BLKDEV
mkfs.ext4 -F "${BLKDEV}1"
mkdir -p /mnt/empty
mount "${BLKDEV}1" /mnt/empty
fio --name=grow-disk --filename=/mnt/empty/growing.data \
--rw=write --bs=4K --size=60G --time_based --runtime=900 \
--ioengine=libaio --iodepth=32 --numjobs=4 --direct=0
3. Grow a chain of snapshots:
TESTDIR=/path/to/testdir
QMPSOCK=/var/run/qmp.sock
QMPSHELL=/path/to/qemu/scripts/qmp/qmp-shell
rm -f $TESTDIR/snap*.qcow2
SNAPNODE=empty-format
for i in {1..200} ; do
SNAPIMG=$TESTDIR/snap$i.qcow2
qemu-img create -f qcow2 \
-o cluster_size=1M,extended_l2=on,lazy_refcounts=on \
$SNAPIMG 64G
$QMPSHELL -p $QMPSOCK <<EOF
blockdev-add node-name=snap${i}-storage driver=file filename=$SNAPIMG auto-read-only=true aio=native cache={"no-flush":false,"direct":true} discard=unmap
blockdev-add node-name=snap${i}-prealloc driver=preallocate discard=unmap file=snap${i}-storage
blockdev-add node-name=snap${i}-format driver=qcow2 read-only=false cache={"no-flush":false,"direct":true} discard=unmap file=snap${i}-prealloc
blockdev-snapshot node=$SNAPNODE overlay=snap${i}-format
EOF
SNAPNODE=snap${i}-format
done
Then, at some point, we get:
qemu-system-x86_64: ../block/io.c:441: bdrv_drain_assert_idle: Assertion `qatomic_read(&bs->in_flight) == 0' failed.
Stack trace:
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f92e52a15a3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007f92e5254d06 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f92e52287f3 in __GI_abort () at abort.c:79
#4 0x00007f92e522871b in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
#5 0x00007f92e524dca6 in __assert_fail (assertion=0x555c50966200 "qatomic_read(&bs->in_flight) == 0", file=0x555c509660eb "../block/io.c", line=441, function=0x555c50966e60 <__PRETTY_FUNCTION__.33> "bdrv_drain_assert_idle") at assert.c:101
#6 0x0000555c505cb3f7 in bdrv_drain_assert_idle (bs=0x555c5384de20) at ../block/io.c:441
#7 0x0000555c505cb757 in bdrv_drain_all_begin () at ../block/io.c:531
#8 0x0000555c505c93c3 in bdrv_graph_wrlock_drained () at ../block/graph-lock.c:168
#9 0x0000555c505b83e3 in blk_remove_bs (blk=0x555c535b89f0) at ../block/block-backend.c:892
#10 0x0000555c505b73cb in blk_delete (blk=0x555c535b89f0) at ../block/block-backend.c:487
#11 0x0000555c505b77ac in blk_unref (blk=0x555c535b89f0) at ../block/block-backend.c:547
#12 0x0000555c5058f7f9 in bdrv_open_inherit (filename=0x0, reference=0x0, options=0x555c5386a530, flags=8194, parent=0x0, child_class=0x0, child_role=0, parse_filename=true, errp=0x7ffd1d8cba68) at ../block.c:4196
#13 0x0000555c5058fd89 in bdrv_open (filename=0x0, reference=0x0, options=0x555c53687680, flags=0, errp=0x7ffd1d8cba68) at ../block.c:4285
#14 0x0000555c5057ade1 in bds_tree_init (bs_opts=0x555c53687680, errp=0x7ffd1d8cba68) at ../blockdev.c:680
#15 0x0000555c50581c24 in qmp_blockdev_add (options=0x7ffd1d8cbaa0, errp=0x7ffd1d8cba68) at ../blockdev.c:3434
#16 0x0000555c506b28f8 in qmp_marshal_blockdev_add (args=0x7f92c400a9e0, ret=0x7f92d9fe3da8, errp=0x7f92d9fe3da0) at qapi/qapi-commands-block-core.c:1417
#17 0x0000555c5073d311 in do_qmp_dispatch_bh (opaque=0x7f92d9fe3e40) at ../qapi/qmp-dispatch.c:128
#18 0x0000555c5076a318 in aio_bh_call (bh=0x555c53640580) at ../util/async.c:172
#19 0x0000555c5076a43b in aio_bh_poll (ctx=0x555c52f635f0) at ../util/async.c:219
#20 0x0000555c5074aac2 in aio_dispatch (ctx=0x555c52f635f0) at ../util/aio-posix.c:436
#21 0x0000555c5076a8d5 in aio_ctx_dispatch (source=0x555c52f635f0, callback=0x0, user_data=0x0) at ../util/async.c:364
#22 0x00007f92e6913e3f in g_main_dispatch (context=0x555c52f69c10) at ../glib/gmain.c:3364
#23 g_main_context_dispatch (context=0x555c52f69c10) at ../glib/gmain.c:4079
#24 0x0000555c5076c13c in glib_pollfds_poll () at ../util/main-loop.c:290
#25 0x0000555c5076c1c3 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:313
#26 0x0000555c5076c2e3 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
#27 0x0000555c502126b2 in qemu_main_loop () at ../system/runstate.c:903
#28 0x0000555c50681fb1 in qemu_default_main (opaque=0x0) at ../system/main.c:50
#29 0x0000555c50682060 in main (argc=39, argv=0x7ffd1d8cbf68) at ../system/main.c:93
(gdb) frame 6
#6 0x0000555c505cb3f7 in bdrv_drain_assert_idle (bs=0x555c5384de20) at ../block/io.c:441
441 {
(gdb) printf "%s (%s): %d\n", bs->node_name, bs->drv->format_name, bs->in_flight
snap40-storage (file): 1
So draining doesn't work as expected: in-flight requests are supposed to be
polled right before calling bdrv_drain_assert_idle(), but there's new IO
arriving:
void coroutine_mixed_fn bdrv_drain_all_begin(void)
{
    ...
    bdrv_drain_all_begin_nopoll();

    /* Now poll the in-flight requests */
    AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());    <---------

    while ((bs = bdrv_next_all_states(bs))) {
        bdrv_drain_assert_idle(bs);
    }
}
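For reference, bdrv_drain_assert_idle() itself just asserts that the node has
nothing in flight and recurses into its children (sketch from a recent tree,
block/io.c; the exact shape may differ slightly):

static void bdrv_drain_assert_idle(BlockDriverState *bs)
{
    BdrvChild *child, *next;

    /* Every node is expected to be quiescent at this point */
    assert(qatomic_read(&bs->in_flight) == 0);
    QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
        bdrv_drain_assert_idle(child->bs);
    }
}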
I've bisected the issue, and it seems to be introduced by Fiona's series [0]
for fixing a deadlock. Namely, before the commit b13f546545 ("block: move
drain outside of bdrv_root_unref_child()") my reproducer above produces a
deadlock similar to what we had in [1]. And after commit b13f546545 is
applied, we get the crash.
Attached is a naive fix which simply eliminates global draining (i.e. draining
of the entire block graph) in the problematic code paths. While global draining
might indeed be redundant there, the real problem, i.e. the race, still remains.
Any comments and suggestions are welcome. Thanks!
[0] https://lore.kernel.org/qemu-devel/20250530151125.955508-1-f.ebner@proxmox.com/
[1] https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/
Andrey Drobyshev (1):
block: avoid global drain on graph subtrees manipulation
block.c | 8 +++++++-
block/block-backend.c | 11 ++++++++++-
2 files changed, 17 insertions(+), 2 deletions(-)
--
2.43.5
* [RFC PATCH 1/1] block: avoid global drain on graph subtrees manipulation
2025-11-12 16:41 [BUG] Block graph assertion failure on blockdev-add andrey.drobyshev
@ 2025-11-12 16:41 ` andrey.drobyshev
2025-11-18 22:42 ` [BUG] Block graph assertion failure on blockdev-add Kevin Wolf
1 sibling, 0 replies; 3+ messages in thread
From: andrey.drobyshev @ 2025-11-12 16:41 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, kwolf, f.ebner, hreitz, den, andrey.drobyshev
From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
The block layer's graph drain series [0] moved many drain operations outside
of graph locks, but introduced a regression: some functions now unconditionally
use bdrv_graph_wrlock_drained(), which drains all BDS in the system and
asserts that no I/O is in flight.
When 'blockdev-add' creates a new node, it creates a temporary
BlockBackend which is used only for probing and is deleted later.
Before deletion, it also attaches the child 'file' BDS to that same
BlockBackend in bdrv_open_child_common().
Both blk_remove_bs() and bdrv_open_child_common() call
bdrv_graph_wrlock_drained() unconditionally. If there is incoming IO
at the same time, this might lead to an assertion failure:
qemu-system-x86_64: ../block/io.c:441: bdrv_drain_assert_idle: \
Assertion `qatomic_read(&bs->in_flight) == 0' failed.
The global drain is unnecessary for temporary nodes created during
'blockdev-add'. Only live device roots require full graph draining.
[0] https://lore.kernel.org/qemu-devel/20250530151125.955508-1-f.ebner@proxmox.com/
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
block.c | 8 +++++++-
block/block-backend.c | 11 ++++++++++-
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index cf08e64add..26dfc624b6 100644
--- a/block.c
+++ b/block.c
@@ -3796,10 +3796,16 @@ bdrv_open_child_common(const char *filename, QDict *options,
         return NULL;
     }

-    bdrv_graph_wrlock_drained();
+
+    /* We only need this parent's subtree to be drained, not the entire graph.
+     * So we avoid global drain here.
+     */
+    bdrv_drained_begin(parent);
+    bdrv_graph_wrlock();
     child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
                               errp);
     bdrv_graph_wrunlock();
+    bdrv_drained_end(parent);

     return child;
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index f8d6ba65c1..d8110cbaff 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -889,7 +889,16 @@ void blk_remove_bs(BlockBackend *blk)
     root = blk->root;
     blk->root = NULL;

-    bdrv_graph_wrlock_drained();
+    if (blk->dev) {
+        bdrv_graph_wrlock_drained();
+    } else {
+        /* We suppose that blk with an unattached dev is temporary (e.g.
+         * used for probing in bdrv_open_inherit()) and therefore global
+         * drain is unnecessary. Draining of this blk's subtree is done above
+         * in blk_drain().
+         */
+        bdrv_graph_wrlock();
+    }
     bdrv_root_unref_child(root);
     bdrv_graph_wrunlock();
 }
--
2.43.5
* Re: [BUG] Block graph assertion failure on blockdev-add
2025-11-12 16:41 [BUG] Block graph assertion failure on blockdev-add andrey.drobyshev
2025-11-12 16:41 ` [RFC PATCH 1/1] block: avoid global drain on graph subtrees manipulation andrey.drobyshev
@ 2025-11-18 22:42 ` Kevin Wolf
1 sibling, 0 replies; 3+ messages in thread
From: Kevin Wolf @ 2025-11-18 22:42 UTC (permalink / raw)
To: andrey.drobyshev; +Cc: qemu-block, qemu-devel, f.ebner, hreitz, den
On 12.11.2025 at 17:41, andrey.drobyshev@virtuozzo.com wrote:
> From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>
> Hi all,
>
> There's a bug in the block layer which leads to an assertion failure and crash.
> The reproducer is flaky, which suggests there's a race involved. Here's how
> it's reproduced:
>
> 1. Run QEMU with 2 disks: one system disk (with an actual guest) and one empty;
>    attach them to an iothread;
> 2. In the guest, format the empty disk and start sequential IO on it;
> 3. On the host, start growing a chain of snapshots on top of the empty image.
>
> Here are the scripts:
> [...]
> So draining doesn't work as expected: in-flight requests are supposed to be
> polled right before calling bdrv_drain_assert_idle(), but there's new IO
> arriving:
>
> void coroutine_mixed_fn bdrv_drain_all_begin(void)
> {
>     ...
>     bdrv_drain_all_begin_nopoll();
>
>     /* Now poll the in-flight requests */
>     AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());    <---------
>
>     while ((bs = bdrv_next_all_states(bs))) {
>         bdrv_drain_assert_idle(bs);
>     }
> }
Took me a bit, but I can reproduce the problem. Yes, "new" I/O arrives
between AIO_WAIT_WHILE_UNLOCKED() and the assertion, which means that
drain itself doesn't work correctly.
After some debugging, I ended up adding some new assertions that would
trigger when the new I/O is added during this period, and what I caught
is this (same in multiple runs):
(gdb) qemu bt
#0 __GI_abort () at abort.c:95
#1 0x00007f99cb823639 in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:118
#2 0x00007f99cb8340af in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:127
#3 0x0000557c2a0d64c6 in blk_wait_while_drained (blk=0x557c69d18020) at ../block/block-backend.c:1325
#4 0x0000557c2a0d5069 in blk_co_do_pwritev_part (blk=0x1, offset=140292968848728, bytes=0, qiov=0x7f99b4004568, qiov_offset=qiov_offset@entry=0, flags=3416308576)
at ../block/block-backend.c:1417
#5 0x0000557c2a0d55ae in blk_aio_write_entry (opaque=0x7f99b4002020) at ../block/block-backend.c:1635
#6 0x0000557c2a30a376 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:175
#7 0x00007f99cb856f70 in ??? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66
#8 0x00007f99c35f0260 in ??? ()
#9 0x0000000000000000 in ??? ()
[Thread 0x7f98a94fc6c0 (LWP 321021) exited]
[Thread 0x7f99c2dfd6c0 (LWP 320958) exited]
Coroutine at 0x7f99c35fe588:
#0 qemu_coroutine_switch (from_=from_@entry=0x7f99c35fe588, to_=to_@entry=0x7f99b40051f0, action=action@entry=COROUTINE_ENTER) at ../util/coroutine-ucontext.c:322
#1 0x0000557c2a3089e8 in qemu_aio_coroutine_enter (ctx=ctx@entry=0x557c68d5cae0, co=<optimized out>) at ../util/qemu-coroutine.c:293
#2 0x0000557c2a3071f7 in co_schedule_bh_cb (opaque=0x557c68d5cae0) at ../util/async.c:547
#3 0x0000557c2a306b79 in aio_bh_call (bh=0x557c68d822d0) at ../util/async.c:172
#4 aio_bh_poll (ctx=ctx@entry=0x557c68d5cae0) at ../util/async.c:219
#5 0x0000557c2a2ee9e5 in aio_poll (ctx=0x557c68d5cae0, blocking=<optimized out>) at ../util/aio-posix.c:719
#6 0x0000557c2a0ac032 in iothread_run (opaque=opaque@entry=0x557c68d810a0) at ../iothread.c:63
#7 0x0000557c2a2f2aee in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:393
#8 0x00007f99cb893f54 in start_thread (arg=<optimized out>) at pthread_create.c:448
#9 0x00007f99cb91732c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
This suggests that the following happens:
1. Drain section starts
2. A new request still arrives at the BlockBackend from the iothread and
is queued in blk_wait_while_drained(). This alone is already
interesting because we thought that the BlockBackend queuing was only
necessary for IDE and friends, not for virtio-blk. Probably only true
in the main thread.
3. Drain section ends, blk_root_drained_end() calls qemu_co_enter_next()
and that is effectively aio_co_wake(). As the request coroutine is in
the iothread and blk_root_drained_end() in the main thread, the
coroutine is only scheduled in its AioContext, but doesn't run yet.
4. Another drain section starts
4a. bdrv_drain_all_begin_nopoll()
4b. AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());
5. The iothread processes its BH list and reenters the queued request,
which increases blk->in_flight and bs->in_flight and runs the actual
request even though we're drained again. Oops.
4c. bdrv_drain_assert_idle() fails
After a simple s/if/while/ in blk_wait_while_drained(), I can't
reproduce the bug any more. I don't think it's a fully correct fix
because blk->in_flight is still increased for a short time and that
could in theory still trigger the assertion. It's just that the window
is now much smaller because the request doesn't actually execute but is
requeued right away.
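For reference, that s/if/while/ amounts to roughly the following shape of
blk_wait_while_drained() (sketched from memory of block/block-backend.c; field
and helper names may not match the current tree exactly):

static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
    assert(blk->in_flight > 0);

    /* Re-check the quiesce counter after being woken up: another drained
     * section may have started while this coroutine was still queued, so
     * requeue instead of running the request in that case. */
    while (qatomic_read(&blk->quiesce_counter) &&
           !blk->disable_request_queuing) {
        qemu_mutex_lock(&blk->queued_requests_lock);
        blk_dec_in_flight(blk);
        qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
        blk_inc_in_flight(blk);
        qemu_mutex_unlock(&blk->queued_requests_lock);
    }
}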
I need to think a bit more about the right synchronisation tomorrow.
> I've bisected the issue, and it seems to be introduced by Fiona's series [0]
> for fixing a deadlock. Namely, before the commit b13f546545 ("block: move
> drain outside of bdrv_root_unref_child()") my reproducer above produces a
> deadlock similar to what we had in [1]. And after commit b13f546545 is
> applied, we get the crash.
>
> Attached is a naive fix which simply eliminates global draining (i.e. draining
> of the entire block graph) in the problematic code paths. While global draining
> might indeed be redundant there, the real problem, i.e. the race, still remains.
Yes, we don't want to just paper over the bug.
Kevin