* [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
2026-04-24 10:39 [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev via qemu development
@ 2026-04-24 10:39 ` Denis V. Lunev via qemu development
2026-05-15 13:17 ` Kevin Wolf
2026-04-24 10:39 ` [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation Denis V. Lunev via qemu development
2026-05-11 21:53 ` [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev
2 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-04-24 10:39 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, Denis V. Lunev, Kevin Wolf, Hanna Reitz,
Stefan Hajnoczi, Fiona Ebner
tests/qemu-iotests/tests/iothreads-create reproduces the hang on
master under `stress-ng --cpu $(nproc) --timeout 0`. The iotest's
vm.run_job() times out and qemu stays permanently stuck in
ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
during qemu_cleanup(). The timing window is narrow on modern
bare-metal hardware and much wider in a VM guest; downstream trees
that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
on the first iteration under the same stress.
bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
so that callbacks dispatched by aio_poll() can still take the read
lock on the fast path. The rdunlock side, however, only kicks a
waiting writer when has_writer is observed set; a reader that drops
its lock inside the polling window silently returns and nothing ever
wakes the writer:
  main thread                        iothread0 coroutine
  -----------                        -------------------
  bdrv_graph_wrlock:                 rdlock held, reader_count=1
    bdrv_drain_all_begin_nopoll
    has_writer = 0
    AIO_WAIT_WHILE_UNLOCKED(
        NULL, reader_count >= 1):
      num_waiters++
      smp_mb
      aio_poll(main_ctx, true)  -->  bdrv_graph_co_rdunlock:
        (ppoll, blocked)               reader_count-- -> 0
                                       smp_mb
                                       read has_writer = 0
                                       skip aio_wait_kick()
                                       return
reader_count is now 0 and num_waiters is still 1, but no BH, fd or
timer on the main AioContext will fire -- the only entity that could
kick just decided it did not have to. The main thread stays in ppoll()
holding the BQL, so RCU, VCPUs and any iothread path that needs the BQL
stall behind it. The hang is permanent: there is no timeout, no forward
progress and no recovery, because nothing else inside qemu_cleanup()
can wake the main thread.
bdrv_drain_all_begin() does not close the race on its own: it
quiesces in-flight I/O, but graph readers also include non-I/O
coroutines (block-job cleanup, virtio-scsi polling) that drain does
not evict. The bdrv_graph_wrlock_drained() wrapper narrows the
window but does not eliminate it; every plain bdrv_graph_wrlock()
site is exposed on the same basis.
Drop the has_writer check in bdrv_graph_co_rdunlock() and call
aio_wait_kick() unconditionally. The helper itself loads num_waiters
atomically and only schedules a dummy BH when a waiter exists, so the
change is a no-op on the no-writer path and closes the missed-wakeup
on the writer path.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fiona Ebner <f.ebner@proxmox.com>
---
block/graph-lock.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
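The race in the commit message can be illustrated with a small
single-threaded model. This is only a sketch, not QEMU code:
GraphLockModel and the model_* functions are hypothetical stand-ins,
plain ints replace the real atomics and barriers, and the interleaving
from the diagram is replayed in one fixed order.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the state in block/graph-lock.c and the
 * AIO_WAIT_WHILE machinery. Plain ints are enough here because the
 * model is single-threaded and replays one fixed interleaving.
 */
typedef struct {
    int has_writer;      /* zeroed by the writer while it polls */
    int reader_count;    /* readers currently holding the lock */
    int num_waiters;     /* threads parked in AIO_WAIT_WHILE */
    int kicks_scheduled; /* wakeups queued for the parked waiter */
} GraphLockModel;

/* Models aio_wait_kick(): schedule a wakeup only if someone waits. */
static void model_aio_wait_kick(GraphLockModel *m)
{
    if (m->num_waiters > 0) {
        m->kicks_scheduled++;
    }
}

/* Writer side, up to the point where it blocks in ppoll(). */
static void model_writer_begin_poll(GraphLockModel *m)
{
    m->has_writer = 0;
    m->num_waiters++;
}

/* Reader unlock; check_has_writer=true is the pre-patch behaviour. */
static void model_rdunlock(GraphLockModel *m, bool check_has_writer)
{
    assert(m->reader_count > 0);
    m->reader_count--;
    if (!check_has_writer || m->has_writer) {
        model_aio_wait_kick(m);
    }
}

/* Replays the diagram; returns true if the parked writer gets a kick. */
static bool model_race(bool check_has_writer)
{
    GraphLockModel m = { .reader_count = 1 };

    model_writer_begin_poll(&m);          /* writer parks in ppoll() */
    model_rdunlock(&m, check_has_writer); /* last reader drops the lock */
    return m.kicks_scheduled > 0;
}

/* Unconditional kick with no waiter: counts the wakeups scheduled. */
static int model_unlock_without_waiter(void)
{
    GraphLockModel m = { .reader_count = 1 };

    model_rdunlock(&m, false);
    return m.kicks_scheduled;
}
```

In this model, model_race(true) replays the pre-patch check and leaves
the writer parked with no kick queued, while model_race(false) queues
one. model_unlock_without_waiter() shows that the unconditional call
schedules nothing when num_waiters is 0.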
diff --git a/block/graph-lock.c b/block/graph-lock.c
index b7319473a1..f2501d75fb 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -278,14 +278,12 @@ void coroutine_fn bdrv_graph_co_rdunlock(void)
     smp_mb();
 
     /*
-     * has_writer == 0: this means reader will read reader_count decreased
-     * has_writer == 1: we don't know if writer read reader_count old or
-     *                  new. Therefore, kick again so on next iteration
-     *                  writer will for sure read the updated value.
+     * Always kick: bdrv_graph_wrlock() zeroes has_writer while polling (to
+     * let callbacks take the reader lock via the fast path), so we cannot
+     * rely on has_writer to detect a waiting writer. aio_wait_kick() is a
+     * no-op when no one is waiting, so it is cheap in the common case.
      */
-    if (qatomic_read(&has_writer)) {
-        aio_wait_kick();
-    }
+    aio_wait_kick();
 }
 
 void bdrv_graph_rdlock_main_loop(void)
--
2.51.0
* Re: [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
2026-04-24 10:39 ` [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock() Denis V. Lunev via qemu development
@ 2026-05-15 13:17 ` Kevin Wolf
0 siblings, 0 replies; 8+ messages in thread
From: Kevin Wolf @ 2026-05-15 13:17 UTC (permalink / raw)
To: Denis V. Lunev
Cc: qemu-devel, qemu-block, qemu-stable, Hanna Reitz, Stefan Hajnoczi,
Fiona Ebner
Am 24.04.2026 um 12:39 hat Denis V. Lunev geschrieben:
> tests/qemu-iotests/tests/iothreads-create reproduces the hang on
> master under `stress-ng --cpu $(nproc) --timeout 0`. The iotest's
> vm.run_job() times out and qemu stays permanently stuck in
> ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
> during qemu_cleanup(). The timing window is narrow on modern
> bare-metal hardware and much wider in a VM guest; downstream trees
> that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
> on the first iteration under the same stress.
>
> bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
> so that callbacks dispatched by aio_poll() can still take the read
> lock on the fast path. The rdunlock side, however, only kicks a
> waiting writer when has_writer is observed set; a reader that drops
> its lock inside the polling window silently returns and nothing ever
> wakes the writer:
>
>   main thread                        iothread0 coroutine
>   -----------                        -------------------
>   bdrv_graph_wrlock:                 rdlock held, reader_count=1
>     bdrv_drain_all_begin_nopoll
>     has_writer = 0
>     AIO_WAIT_WHILE_UNLOCKED(
>         NULL, reader_count >= 1):
>       num_waiters++
>       smp_mb
>       aio_poll(main_ctx, true)  -->  bdrv_graph_co_rdunlock:
>         (ppoll, blocked)               reader_count-- -> 0
>                                        smp_mb
>                                        read has_writer = 0
>                                        skip aio_wait_kick()
>                                        return
>
> reader_count is now 0 and num_waiters is still 1, but no BH, fd or
> timer on the main AioContext will fire -- the only entity that could
> kick just decided it did not have to. The main thread stays in ppoll()
> holding the BQL, so RCU, VCPUs and any iothread path that needs the BQL
> stall behind it. The hang is permanent: there is no timeout, no forward
> progress and no recovery, because nothing else inside qemu_cleanup()
> can wake the main thread.
>
> bdrv_drain_all_begin() does not close the race on its own: it
> quiesces in-flight I/O, but graph readers also include non-I/O
> coroutines (block-job cleanup, virtio-scsi polling) that drain does
> not evict. The bdrv_graph_wrlock_drained() wrapper narrows the
> window but does not eliminate it; every plain bdrv_graph_wrlock()
> site is exposed on the same basis.
>
> Drop the has_writer check in bdrv_graph_co_rdunlock() and call
> aio_wait_kick() unconditionally. The helper itself loads num_waiters
> atomically and only schedules a dummy BH when a waiter exists, so the
> change is a no-op on the no-writer path and closes the missed-wakeup
> on the writer path.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Hanna Reitz <hreitz@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Thanks, applied to the block branch.
Kevin
* [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation
2026-04-24 10:39 [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev via qemu development
2026-04-24 10:39 ` [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock() Denis V. Lunev via qemu development
@ 2026-04-24 10:39 ` Denis V. Lunev via qemu development
2026-05-15 12:58 ` Kevin Wolf
2026-05-11 21:53 ` [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev
2 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-04-24 10:39 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, Denis V. Lunev, Hanna Czenczek,
Kevin Wolf
cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
fire-and-forget: it reads w->to_wake and silently returns when it is
NULL. A sleeper that is between two iterations -- has just released
s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
the wake:
  iothread0 timer coroutine          main thread (qcow2 close)
  -------------------------          -------------------------
  while-body (holding s->lock):
    read interval = 600
    wait_ns = 600 * NS
  release s->lock
                                     take s->lock
                                     interval = 0
                                     qemu_co_sleep_wake(w):
                                       w->to_wake == NULL -> skip
                                       return
                                     qemu_co_queue_wait(exit, s->lock):
                                       release s->lock
                                       yield
  qemu_co_sleep_ns_wakeable:
    aio_timer_init(+600 s)
    qemu_co_sleep:
      cas scheduled NULL -> "qsns"
      w->to_wake = co
      yield  [sleeps 600 s]
cache_clean_timer_del_and_wait() is now stuck waiting for
cache_clean_timer_exit; the timer will not signal it until its
original 600 s expiry fires. qcow2_close() is on the main thread
holding BQL, so RCU, VCPUs and every iothread path that needs BQL
stall behind it.
qemu_co_sleep_wake() has always been a hint: it has no way to
rendezvous with a sleeper still arming. Rather than mutate it (which
would change semantics for every other user -- mirror, stream,
backup), fix the caller.
Split the sleep in cache_clean_timer() into steps of at most one
second and move the s->cache_clean_interval check to the top of the
loop so it is re-evaluated under s->lock between steps. The
loop/wait structure itself is unchanged. The stop decision is now
made under the same lock that the teardown caller holds to set
cache_clean_interval = 0, so it cannot be missed.
qemu_co_sleep_wake() is still called opportunistically to cut short
the current step; if it misses, the next 1 s tick catches the change.
Worst-case cancellation latency is bounded at 1 s, independent of
cache_clean_interval.
Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Hanna Czenczek <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
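The latency bound stated in the commit message can be checked with a
virtual-time model. This is a sketch under stated assumptions, not the
qcow2 code: interval_at(), unpatched_exit_time() and
patched_exit_time() are hypothetical helpers, time is a plain counter
rather than QEMU_CLOCK_REALTIME, every qemu_co_sleep_wake() is assumed
to be lost, and cache_clean_interval is assumed to drop to 0 at
stop_at_ns.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)

/* Value of cache_clean_interval (seconds) as seen under s->lock at 'now'. */
static uint64_t interval_at(uint64_t interval_s, uint64_t stop_at_ns,
                            uint64_t now)
{
    return now >= stop_at_ns ? 0 : interval_s;
}

/*
 * Pre-patch shape: the interval is re-read only after one full sleep.
 * Returns the virtual time at which the loop exits.
 */
static uint64_t unpatched_exit_time(uint64_t interval_s, uint64_t stop_at_ns)
{
    uint64_t now = 0;
    uint64_t wait_ns = interval_at(interval_s, stop_at_ns, now) * NS_PER_SEC;

    while (wait_ns > 0) {
        now += wait_ns; /* full sleep, the wake was lost */
        wait_ns = interval_at(interval_s, stop_at_ns, now) * NS_PER_SEC;
    }
    return now;
}

/*
 * Patched shape: sleep in steps of at most one second and re-check the
 * interval before every step. Returns the virtual exit time.
 */
static uint64_t patched_exit_time(uint64_t interval_s, uint64_t stop_at_ns)
{
    uint64_t now = 0;
    uint64_t remaining_ns = 0;

    for (;;) {
        uint64_t cur = interval_at(interval_s, stop_at_ns, now);
        uint64_t step;

        if (cur == 0) {
            return now; /* stop observed under the lock */
        }
        if (remaining_ns == 0) {
            /* the caches would be cleaned here */
            remaining_ns = cur * NS_PER_SEC;
        }
        step = remaining_ns < NS_PER_SEC ? remaining_ns : NS_PER_SEC;
        now += step; /* one step of sleep, the wake was lost */
        remaining_ns -= step;
    }
}
```

With a 600 s interval and a stop request 1 ns after the first check,
the pre-patch model exits only after the full 600 s, while the patched
model exits at the 1 s mark. In the patched model the gap between
stop_at_ns and the exit time never exceeds one second.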
diff --git a/block/qcow2.c b/block/qcow2.c
index f6461743d2..3e249970d6 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -838,24 +838,30 @@ static const char *overlap_bool_option_names[QCOW2_OL_MAX_BITNR] = {
 static void coroutine_fn cache_clean_timer(void *opaque)
 {
     BDRVQcow2State *s = opaque;
-    uint64_t wait_ns;
+    uint64_t remaining_ns = 0;
 
-    WITH_QEMU_LOCK_GUARD(&s->lock) {
-        wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
-    }
-
-    while (wait_ns > 0) {
-        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
-                                  QEMU_CLOCK_REALTIME, wait_ns);
+    for (;;) {
+        bool stop = false;
+        uint64_t step;
 
         WITH_QEMU_LOCK_GUARD(&s->lock) {
-            if (s->cache_clean_interval > 0) {
+            if (s->cache_clean_interval == 0) {
+                stop = true;
+            } else if (remaining_ns == 0) {
                 qcow2_cache_clean_unused(s->l2_table_cache);
                 qcow2_cache_clean_unused(s->refcount_block_cache);
+                remaining_ns = s->cache_clean_interval
+                               * (uint64_t)NANOSECONDS_PER_SECOND;
             }
-
-            wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
         }
+        if (stop) {
+            break;
+        }
+
+        step = MIN(remaining_ns, (uint64_t)NANOSECONDS_PER_SECOND);
+        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
+                                  QEMU_CLOCK_REALTIME, step);
+        remaining_ns -= step;
     }
 
     WITH_QEMU_LOCK_GUARD(&s->lock) {
--
2.51.0
* Re: [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation
2026-04-24 10:39 ` [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation Denis V. Lunev via qemu development
@ 2026-05-15 12:58 ` Kevin Wolf
2026-05-18 17:26 ` Denis V. Lunev
0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2026-05-15 12:58 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: qemu-devel, qemu-block, qemu-stable, Hanna Czenczek
Am 24.04.2026 um 12:39 hat Denis V. Lunev geschrieben:
> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
> fire-and-forget: it reads w->to_wake and silently returns when it is
> NULL. A sleeper that is between two iterations -- has just released
> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
> the wake:
>
>   iothread0 timer coroutine          main thread (qcow2 close)
>   -------------------------          -------------------------
>   while-body (holding s->lock):
>     read interval = 600
>     wait_ns = 600 * NS
>   release s->lock
>                                      take s->lock
>                                      interval = 0
>                                      qemu_co_sleep_wake(w):
>                                        w->to_wake == NULL -> skip
>                                        return
>                                      qemu_co_queue_wait(exit, s->lock):
>                                        release s->lock
>                                        yield
>   qemu_co_sleep_ns_wakeable:
>     aio_timer_init(+600 s)
>     qemu_co_sleep:
>       cas scheduled NULL -> "qsns"
>       w->to_wake = co
>       yield  [sleeps 600 s]
>
> cache_clean_timer_del_and_wait() is now stuck waiting for
> cache_clean_timer_exit; the timer will not signal it until its
> original 600 s expiry fires. qcow2_close() is on the main thread
> holding BQL, so RCU, VCPUs and every iothread path that needs BQL
> stall behind it.
>
> qemu_co_sleep_wake() has always been a hint: it has no way to
> rendezvous with a sleeper still arming. Rather than mutate it (which
> would change semantics for every other user -- mirror, stream,
> backup), fix the caller.
Would changing it be a problem for any caller?
I don't see the calls in mirror and stream, but for the backup job there
seem to be two cases: cancelling the block job and updating the speed.
Cancelling is essentially the exact same case as closing qcow2, so
changing it fixes an unwanted delay. Whether updating the speed should
always wake up the job is more debatable, but once we've taken the
decision that it should do so, doing that always (even under the race
condition) doesn't seem like a problem either.
I'm asking because the workaround in this patch both makes the qcow2
code more complicated and still doesn't fully solve the problem - you
still get a delay, it's just shortened to what we think is acceptable.
If we fixed the qemu_co_sleep_wake() interface, the wakeup would be
instantaneous and the code in the callers would be simpler. (And I don't
think it would be much worse in the implementation either.)
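To make the alternative concrete, here is a toy model of one possible
way a wake could be made sticky. It is purely an illustration and an
assumption, not a proposed implementation: the wake_pending field and
the model_* functions are hypothetical and are not part of the current
QemuCoSleep interface, and real code would need the appropriate
atomics or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void *to_wake;     /* armed sleeper, NULL when none is armed */
    bool wake_pending; /* hypothetical: wake that found no sleeper */
} SleepModel;

/*
 * Models the waker. With sticky=false a wake that finds no armed
 * sleeper is dropped, which is the behaviour described in the commit
 * message. With sticky=true it is latched for the next sleeper.
 * Returns true if the wake was delivered or latched.
 */
static bool model_wake(SleepModel *w, bool sticky)
{
    if (w->to_wake) {
        w->to_wake = NULL; /* the sleeper resumes */
        return true;
    }
    if (sticky) {
        w->wake_pending = true;
        return true;
    }
    return false;
}

/*
 * Models the sleeper arming itself. Returns true if it really goes to
 * sleep, false if a latched wake made it return immediately.
 */
static bool model_arm(SleepModel *w, void *co)
{
    if (w->wake_pending) {
        w->wake_pending = false; /* consume the latched wake */
        return false;
    }
    w->to_wake = co;
    return true;
}
```

Replaying the race (wake first, arm second): without the latch the
sleeper still arms and sleeps for the full interval; with the latch
model_arm() returns false and the sleep is skipped, so the caller
would see the stop request without the extra delay.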
> Split the sleep in cache_clean_timer() into steps of at most one
> second and move the s->cache_clean_interval check to the top of the
> loop so it is re-evaluated under s->lock between steps. The
> loop/wait structure itself is unchanged. The stop decision is now
> made under the same lock that the teardown caller holds to set
> cache_clean_interval = 0, so it cannot be missed.
> qemu_co_sleep_wake() is still called opportunistically to cut short
> the current step; if it misses, the next 1 s tick catches the change.
> Worst-case cancellation latency is bounded at 1 s, independent of
> cache_clean_interval.
>
> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Hanna Czenczek <hreitz@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
Kevin
* Re: [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation
2026-05-15 12:58 ` Kevin Wolf
@ 2026-05-18 17:26 ` Denis V. Lunev
0 siblings, 0 replies; 8+ messages in thread
From: Denis V. Lunev @ 2026-05-18 17:26 UTC (permalink / raw)
To: Kevin Wolf, Denis V. Lunev
Cc: qemu-devel, qemu-block, qemu-stable, Hanna Czenczek
On 5/15/26 14:58, Kevin Wolf wrote:
> Am 24.04.2026 um 12:39 hat Denis V. Lunev geschrieben:
>> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
>> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
>> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
>> fire-and-forget: it reads w->to_wake and silently returns when it is
>> NULL. A sleeper that is between two iterations -- has just released
>> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
>> the wake:
>>
>>   iothread0 timer coroutine          main thread (qcow2 close)
>>   -------------------------          -------------------------
>>   while-body (holding s->lock):
>>     read interval = 600
>>     wait_ns = 600 * NS
>>   release s->lock
>>                                      take s->lock
>>                                      interval = 0
>>                                      qemu_co_sleep_wake(w):
>>                                        w->to_wake == NULL -> skip
>>                                        return
>>                                      qemu_co_queue_wait(exit, s->lock):
>>                                        release s->lock
>>                                        yield
>>   qemu_co_sleep_ns_wakeable:
>>     aio_timer_init(+600 s)
>>     qemu_co_sleep:
>>       cas scheduled NULL -> "qsns"
>>       w->to_wake = co
>>       yield  [sleeps 600 s]
>>
>> cache_clean_timer_del_and_wait() is now stuck waiting for
>> cache_clean_timer_exit; the timer will not signal it until its
>> original 600 s expiry fires. qcow2_close() is on the main thread
>> holding BQL, so RCU, VCPUs and every iothread path that needs BQL
>> stall behind it.
>>
>> qemu_co_sleep_wake() has always been a hint: it has no way to
>> rendezvous with a sleeper still arming. Rather than mutate it (which
>> would change semantics for every other user -- mirror, stream,
>> backup), fix the caller.
> Would changing it be a problem for any caller?
>
> I don't see the calls in mirror and stream, but for the backup job there
> seem to be two cases: cancelling the block job and updating the speed.
> Cancelling is essentially the exact same case as closing qcow2, so
> changing it fixes an unwanted delay. Whether updating the speed should
> always wake up the job is more debatable, but once we've taken the
> decision that it should do so, doing that always (even under the race
> condition) doesn't seem like a problem either.
>
> I'm asking because the workaround in this patch both makes the qcow2
> code more complicated and still doesn't fully solve the problem - you
> still get a delay, it's just shortened to what we think is acceptable.
> If we fixed the qemu_co_sleep_wake() interface, the wakeup would be
> instantaneous and the code in the callers would be simpler. (And I don't
> think it would be much worse in the implementation either.)
>
>> Split the sleep in cache_clean_timer() into steps of at most one
>> second and move the s->cache_clean_interval check to the top of the
>> loop so it is re-evaluated under s->lock between steps. The
>> loop/wait structure itself is unchanged. The stop decision is now
>> made under the same lock that the teardown caller holds to set
>> cache_clean_interval = 0, so it cannot be missed.
>> qemu_co_sleep_wake() is still called opportunistically to cut short
>> the current step; if it misses, the next 1 s tick catches the change.
>> Worst-case cancellation latency is bounded at 1 s, independent of
>> cache_clean_interval.
>>
>> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Cc: Hanna Czenczek <hreitz@redhat.com>
>> Cc: Kevin Wolf <kwolf@redhat.com>
> Kevin
>
OK, let me try to write this.
Den
* Re: [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path
2026-04-24 10:39 [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev via qemu development
2026-04-24 10:39 ` [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock() Denis V. Lunev via qemu development
2026-04-24 10:39 ` [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation Denis V. Lunev via qemu development
@ 2026-05-11 21:53 ` Denis V. Lunev
2026-05-12 18:28 ` Stefan Hajnoczi
2 siblings, 1 reply; 8+ messages in thread
From: Denis V. Lunev @ 2026-05-11 21:53 UTC (permalink / raw)
To: Denis V. Lunev, qemu-devel
Cc: qemu-block, qemu-stable, Kevin Wolf, Hanna Reitz, Stefan Hajnoczi,
Fiona Ebner
On 4/24/26 12:39, Denis V. Lunev wrote:
> Problem
> -------
>
> The qemu shutdown / blockdev-close path can deadlock permanently on
> upstream master. The main thread enters ppoll(timeout=-1) holding
> BQL, no other thread has a wake source that points back at it, and
> qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> iothread path that needs BQL stall with it.
>
> Two independent missed-wakeup races in the block layer contribute.
> Both share the same shape: a waiter arms on one side, the waker
> reads stale state on its fast path and silently skips the kick, and
> nothing else on the AioContext will fire to recover. They are
> different bugs in different subsystems and each patch stands on its
> own; they are posted together because they surface through the same
> test and the same symptom and are easiest to diagnose side by side.
>
> Depending on which race fires, the main thread backtrace at the
> moment of hang is one of:
>
> ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
> (patch 1 -- block/graph-lock)
>
> ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
> (patch 2 -- block/qcow2 cache_clean_timer)
>
> Race diagrams and the exact stale-state read are in each patch's
> commit message.
>
> Reproducer
> ----------
>
> Environment used for the numbers below: 4-vCPU VM guest,
> kernel 6.12.x, upstream master at bb230769b4. On modern bare-metal
> the window is narrow enough that the hangs rarely reproduce without
> a VM -- a VM guest under full CPU saturation is what makes the
> timing reliable. Downstream trees that still use plain
> bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
> the first iteration without any stress at all.
>
> # reproducer
> stress-ng --cpu "$(nproc)" --timeout 0 &
> for r in $(seq 20); do
> timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
> done
> kill %1
>
> With `stress-ng --cpu $(nproc)` both races surface. With
> `stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
> reproduces reliably across 20 iterations.
>
> When a race fires, the Python QMP client times out on vm.run_job()
> after 5 s, the qemu process keeps running but never makes forward
> progress, and the outer `timeout 120` eventually kills it. Attach
> gdb before the timeout kills qemu to capture the stack and
> distinguish which of the two races fired.
>
> Results
> -------
>
> Same guest, 20 iterations of the loop above:
>
> upstream master: 10/20 FAIL (first fail at iter #2)
> master + both patches: 20/20 PASS
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Hanna Reitz <hreitz@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Fiona Ebner <f.ebner@proxmox.com>
> Cc: Hanna Czenczek <hreitz@redhat.com>
>
> Denis V. Lunev (2):
> block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
> block/qcow2: fix hangup in cache_clean_timer cancellation
>
> block/graph-lock.c | 12 +++++-------
> block/qcow2.c | 28 +++++++++++++++++-----------
> 2 files changed, 22 insertions(+), 18 deletions(-)
>
> --
> 2.51.0
ping
* Re: [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path
2026-05-11 21:53 ` [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev
@ 2026-05-12 18:28 ` Stefan Hajnoczi
0 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2026-05-12 18:28 UTC (permalink / raw)
To: kwolf
Cc: Denis V. Lunev, qemu-devel, qemu-block, qemu-stable, Hanna Reitz,
Fiona Ebner, Denis V. Lunev
On Mon, May 11, 2026 at 11:53:37PM +0200, Denis V. Lunev wrote:
> On 4/24/26 12:39, Denis V. Lunev wrote:
> > Problem
> > -------
> >
> > The qemu shutdown / blockdev-close path can deadlock permanently on
> > upstream master. The main thread enters ppoll(timeout=-1) holding
> > BQL, no other thread has a wake source that points back at it, and
> > qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> > deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> > iothread path that needs BQL stall with it.
> >
> > Two independent missed-wakeup races in the block layer contribute.
> > Both share the same shape: a waiter arms on one side, the waker
> > reads stale state on its fast path and silently skips the kick, and
> > nothing else on the AioContext will fire to recover. They are
> > different bugs in different subsystems and each patch stands on its
> > own; they are posted together because they surface through the same
> > test and the same symptom and are easiest to diagnose side by side.
> >
> > Depending on which race fires, the main thread backtrace at the
> > moment of hang is one of:
> >
> > ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
> > (patch 1 -- block/graph-lock)
> >
> > ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
> > (patch 2 -- block/qcow2 cache_clean_timer)
> >
> > Race diagrams and the exact stale-state read are in each patch's
> > commit message.
> >
> > Reproducer
> > ----------
> >
> > Environment used for the numbers below: 4-vCPU VM guest,
> > kernel 6.12.x, upstream master at bb230769b4. On modern bare-metal
> > the window is narrow enough that the hangs rarely reproduce without
> > a VM -- a VM guest under full CPU saturation is what makes the
> > timing reliable. Downstream trees that still use plain
> > bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
> > the first iteration without any stress at all.
> >
> > # reproducer
> > stress-ng --cpu "$(nproc)" --timeout 0 &
> > for r in $(seq 20); do
> > timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
> > done
> > kill %1
> >
> > With `stress-ng --cpu $(nproc)` both races surface. With
> > `stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
> > reproduces reliably across 20 iterations.
> >
> > When a race fires, the Python QMP client times out on vm.run_job()
> > after 5 s, the qemu process keeps running but never makes forward
> > progress, and the outer `timeout 120` eventually kills it. Attach
> > gdb before the timeout kills qemu to capture the stack and
> > distinguish which of the two races fired.
> >
> > Results
> > -------
> >
> > Same guest, 20 iterations of the loop above:
> >
> > upstream master: 10/20 FAIL (first fail at iter #2)
> > master + both patches: 20/20 PASS
> >
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Hanna Reitz <hreitz@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Cc: Fiona Ebner <f.ebner@proxmox.com>
> > Cc: Hanna Czenczek <hreitz@redhat.com>
> >
> > Denis V. Lunev (2):
> > block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
> > block/qcow2: fix hangup in cache_clean_timer cancellation
> >
> > block/graph-lock.c | 12 +++++-------
> > block/qcow2.c | 28 +++++++++++++++++-----------
> > 2 files changed, 22 insertions(+), 18 deletions(-)
> >
> > --
> > 2.51.0
> ping
Hi Kevin,
This looks like a series for your block tree. If I can help in some way,
please let me know.
Stefan