From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@openvz.org>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
qemu-stable@nongnu.org, Hanna Czenczek <hreitz@redhat.com>
Subject: Re: [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation
Date: Fri, 15 May 2026 14:58:58 +0200 [thread overview]
Message-ID: <agcYkmTWud8euTds@redhat.com> (raw)
In-Reply-To: <20260424103917.248668-3-den@openvz.org>
Am 24.04.2026 um 12:39 hat Denis V. Lunev geschrieben:
> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
> fire-and-forget: it reads w->to_wake and silently returns when it is
> NULL. A sleeper that is between two iterations -- has just released
> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
> the wake:
>
> iothread0 timer coroutine main thread (qcow2 close)
> ------------------------- -------------------------
> while-body (holding s->lock):
> read interval = 600
> wait_ns = 600 * NS
> release s->lock
> take s->lock
> interval = 0
> qemu_co_sleep_wake(w):
> w->to_wake == NULL -> skip
> return
> qemu_co_queue_wait(exit, s->lock):
> release s->lock
> yield
> qemu_co_sleep_ns_wakeable:
> aio_timer_init(+600 s)
> qemu_co_sleep:
> cas scheduled NULL -> "qsns"
> w->to_wake = co
> yield [sleeps 600 s]
>
> cache_clean_timer_del_and_wait() is now stuck waiting for
> cache_clean_timer_exit; the timer will not signal it until its
> original 600 s expiry fires. qcow2_close() is on the main thread
> holding BQL, so RCU, VCPUs and every iothread path that needs BQL
> stall behind it.
>
> qemu_co_sleep_wake() has always been a hint: it has no way to
> rendezvous with a sleeper still arming. Rather than mutate it (which
> would change semantics for every other user -- mirror, stream,
> backup), fix the caller.
Would changing it be a problem for any caller?
I don't see the calls in mirror and stream, but for the backup job it
seems to be two cases: Cancelling the block job and updating the speed.
Cancelling is essentially the exact same case as closing qcow2, so
changing it fixes an unwanted delay. If updating the speed should always
wake up the job is more debatable, but once we've taken the decision
that it should do so, doing that always (even under the race condition)
doesn't seem like a problem either.
I'm asking because the workaround in this patch both makes the qcow2
code more complicated and still doesn't fully solve the problem - you
still get a delay, it's just shortened to what we think is acceptable.
If we fixed the qemu_co_sleep_wake() interface, the wakeup would be
instantaneous and the code in the callers would be simpler. (And I don't
think it would be much worse in the implementation either.)
> Split the sleep in cache_clean_timer() into steps of at most one
> second and move the s->cache_clean_interval check to the top of the
> loop so it is re-evaluated under s->lock between steps. The
> loop/wait structure itself is unchanged. The stop decision is now
> made under the same lock that the teardown caller holds to set
> cache_clean_interval = 0, so it cannot be missed.
> qemu_co_sleep_wake() is still called opportunistically to cut short
> the current step; if it misses, the next 1 s tick catches the change.
> Worst-case cancellation latency is bounded at 1 s, independent of
> cache_clean_interval.
>
> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Hanna Czenczek <hreitz@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
Kevin
next prev parent reply other threads:[~2026-05-15 12:59 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-24 10:39 [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev via qemu development
2026-04-24 10:39 ` [PATCH 1/2] block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock() Denis V. Lunev via qemu development
2026-05-15 13:17 ` Kevin Wolf
2026-04-24 10:39 ` [PATCH 2/2] block/qcow2: fix hangup in cache_clean_timer cancellation Denis V. Lunev via qemu development
2026-05-15 12:58 ` Kevin Wolf [this message]
2026-05-18 17:26 ` Denis V. Lunev
2026-05-11 21:53 ` [PATCH 0/2] block: fix two missed-wakeup hangs on shutdown path Denis V. Lunev
2026-05-12 18:28 ` Stefan Hajnoczi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agcYkmTWud8euTds@redhat.com \
--to=kwolf@redhat.com \
--cc=den@openvz.org \
--cc=hreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-stable@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.