From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@openvz.org>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
qemu-stable@nongnu.org, hreitz@redhat.com, stefanha@redhat.com
Subject: Re: [PATCH v2 1/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
Date: Fri, 29 May 2026 16:40:41 +0200 [thread overview]
Message-ID: <ahmlab5emsQcYcr9@redhat.com> (raw)
In-Reply-To: <20260520193849.467936-2-den@openvz.org>
Am 20.05.2026 um 21:38 hat Denis V. Lunev geschrieben:
> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
> fire-and-forget: it reads w->to_wake and silently returns when it is
> NULL. A sleeper that is between two iterations -- has just released
> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
> the wake:
>
> iothread0 timer coroutine main thread (qcow2 close)
> ------------------------- -------------------------
> while-body (holding s->lock):
> read interval = 600
> wait_ns = 600 * NS
> release s->lock
> take s->lock
> interval = 0
> qemu_co_sleep_wake(w):
> w->to_wake == NULL -> skip
> return
> qemu_co_queue_wait(exit, s->lock):
> release s->lock
> yield
> qemu_co_sleep_ns_wakeable:
> aio_timer_init(+600 s)
> qemu_co_sleep:
> cas scheduled NULL -> "qsns"
> w->to_wake = co
> yield [sleeps 600 s]
>
> cache_clean_timer_del_and_wait() then blocks on cache_clean_timer_exit
> until the original 600 s expiry fires, and qcow2_close() holds BQL the
> whole time so the VM stalls behind it.
>
> block_copy_kick() has the same shape. Fix the primitive once instead
> of working around it in each caller.
>
> Use a tri-state for QemuCoSleep::to_wake:
>
> NULL - idle
> co - sleeper parked
> PENDING - wake delivered, no sleeper yet (sticky)
>
> qemu_co_sleep_wake() xchgs PENDING into to_wake: a real sleeper is
> woken, NULL/PENDING is left untouched so the wake stays sticky.
> qemu_co_sleep() consumes a sticky PENDING on entry; otherwise it
> cmpxchg-publishes itself, and a wake racing in between is caught
> because the cmpxchg observes PENDING and returns without yielding.
> On normal resume qemu_co_sleep() clears the PENDING the waker left
> behind so the next sleep starts clean.
>
> A double-fire (real wake plus timer callback) is harmless: the first
> xchg returns the coroutine and wakes it; the second returns PENDING
> and is a no-op. Cancellation latency through qemu_co_sleep_wake() is
> now bounded by aio_co_wake() rather than by the sleep duration.
>
> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Hanna Czenczek <hreitz@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> ---
> include/qemu/coroutine.h | 17 ++++++++---
> util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
> 2 files changed, 58 insertions(+), 19 deletions(-)
>
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index e545bbf620..1c31de60f9 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -260,10 +260,19 @@ int coroutine_fn qemu_co_timeout(CoroutineEntry *entry, void *opaque,
> uint64_t timeout_ns, CleanupFunc clean);
>
> /**
> - * Wake a coroutine if it is sleeping in qemu_co_sleep_ns. The timer will be
> - * deleted. @sleep_state must be the variable whose address was given to
> - * qemu_co_sleep_ns() and should be checked to be non-NULL before calling
> - * qemu_co_sleep_wake().
> + * Wake a coroutine sleeping in qemu_co_sleep() or qemu_co_sleep_ns_wakeable().
> + * The timer set up by the latter is deleted on wakeup.
> + *
> + * The wake is sticky: if no sleeper is parked on @w at the time of the call,
> + * the wake is recorded on @w and consumed by the next qemu_co_sleep() on the
> + * same @w, which then returns without yielding. This closes the lost-wakeup
> + * window between two sleeps and is the documented behavior callers should
> + * rely on -- e.g. a cancellation signal raised between iterations of a
> + * sleep/work loop will shorten the next sleep instead of being dropped.
> + *
> + * The state persists until consumed: if no further qemu_co_sleep() is ever
> + * called on @w, the pending wake is harmlessly discarded when @w goes away.
> + * Multiple wakes coalesce -- the next sleep consumes at most one.
> */
> void qemu_co_sleep_wake(QemuCoSleep *w);
>
> diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
> index edef117284..a5734a8eb7 100644
> --- a/util/qemu-coroutine-sleep.c
> +++ b/util/qemu-coroutine-sleep.c
> @@ -18,20 +18,29 @@
>
> static const char *qemu_co_sleep_ns__scheduled = "qemu_co_sleep_ns";
>
> +/*
> + * Sentinel stored in QemuCoSleep::to_wake by qemu_co_sleep_wake() when no
> + * sleeper has parked yet. The next qemu_co_sleep() consumes it and returns
> + * without yielding, so a wake that races the arming of a sleep is never
> + * lost.
> + */
> +#define QEMU_CO_SLEEP_PENDING ((Coroutine *)(uintptr_t)1)
> +
> void qemu_co_sleep_wake(QemuCoSleep *w)
> {
> Coroutine *co;
>
> - co = w->to_wake;
> - w->to_wake = NULL;
> - if (co) {
> - /* Write of schedule protected by barrier write in aio_co_schedule */
> - const char *scheduled = qatomic_cmpxchg(&co->scheduled,
> - qemu_co_sleep_ns__scheduled, NULL);
> -
> - assert(scheduled == qemu_co_sleep_ns__scheduled);
> - aio_co_wake(co);
> + co = qatomic_xchg(&w->to_wake, QEMU_CO_SLEEP_PENDING);
> + if (co == NULL || co == QEMU_CO_SLEEP_PENDING) {
> + /* No sleeper, or a wake is already pending. */
> + return;
> }
> +
> + /* Write of scheduled protected by barrier write in aio_co_schedule */
> + const char *scheduled = qatomic_cmpxchg(&co->scheduled,
> + qemu_co_sleep_ns__scheduled, NULL);
> + assert(scheduled == qemu_co_sleep_ns__scheduled);
> + aio_co_wake(co);
> }
>
> static void co_sleep_cb(void *opaque)
> @@ -43,6 +52,14 @@ static void co_sleep_cb(void *opaque)
> void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
> {
> Coroutine *co = qemu_coroutine_self();
> + Coroutine *prev;
> +
> + /* Consume an already-delivered wake without yielding. */
> + prev = qatomic_cmpxchg(&w->to_wake, QEMU_CO_SLEEP_PENDING, NULL);
> + if (prev == QEMU_CO_SLEEP_PENDING) {
> + return;
> + }
> + assert(prev == NULL);
I'm wondering if this check serves a purpose when...
> const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
> qemu_co_sleep_ns__scheduled);
> @@ -53,11 +70,23 @@ void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
> abort();
> }
>
> - w->to_wake = co;
> + /*
> + * Publish ourselves as the sleeper. If a wake raced in between the
> + * fast path above and now, the cmpxchg observes QEMU_CO_SLEEP_PENDING
> + * and we consume it without yielding.
> + */
> + prev = qatomic_cmpxchg(&w->to_wake, NULL, co);
> + if (prev == QEMU_CO_SLEEP_PENDING) {
...we essentially have to repeat it here anyway. Couldn't we always just
use this second check?
> + qatomic_set(&w->to_wake, NULL);
> + qatomic_set(&co->scheduled, NULL);
> + return;
> + }
> + assert(prev == NULL);
> +
> qemu_coroutine_yield();
>
> - /* w->to_wake is cleared before resuming this coroutine. */
> - assert(w->to_wake == NULL);
> + /* The waker left QEMU_CO_SLEEP_PENDING; clear it for the next sleep. */
> + qatomic_set(&w->to_wake, NULL);
> }
>
> void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
> @@ -70,9 +99,10 @@ void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
> timer_mod(&ts, qemu_clock_get_ns(type) + ns);
>
> /*
> - * The timer will fire in the current AiOContext, so the callback
> - * must happen after qemu_co_sleep yields and there is no race
> - * between timer_mod and qemu_co_sleep.
> + * A wake racing with the arming of the sleep -- including the timer
> + * we just armed firing in another AioContext before qemu_co_sleep()
> + * publishes itself -- is captured by the sticky PENDING state in
> + * qemu_co_sleep_wake() and consumed here without yielding.
> */
> qemu_co_sleep(w);
> timer_del(&ts);
Overall this looks good to me. Some unit tests would be nice, though.
Kevin
next prev parent reply other threads:[~2026-05-29 14:41 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 19:38 [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake() Denis V. Lunev via qemu development
2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
2026-05-29 14:40 ` Kevin Wolf [this message]
2026-05-20 19:57 ` [PATCH v2 0/1] " Michael Tokarev
2026-05-20 20:11 ` Denis V. Lunev
2026-05-21 5:44 ` Michael Tokarev
2026-05-28 9:25 ` Denis V. Lunev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ahmlab5emsQcYcr9@redhat.com \
--to=kwolf@redhat.com \
--cc=den@openvz.org \
--cc=hreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-stable@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.