All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
@ 2026-05-20 19:38 Denis V. Lunev via qemu development
  2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-05-20 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha, Denis V. Lunev

Changes since v1
----------------

v1 was a two-patch series tracking two independent missed-wakeup
races on the qemu shutdown path.

  * Patch 1, "block/graph-lock: fix missed wakeup in
    bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
    and is now in tree. This v2 carries only the remaining race.

  * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
    hang is no longer worked around inside block/qcow2.c. Instead,
    the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
    closing the lost-wakeup window for every caller
    (cache_clean_timer, block_copy_kick, ...) rather than just
    qcow2. Cancellation latency through qemu_co_sleep_wake() drops
    from "next 1 s tick" (v1 workaround) to aio_co_wake().

[1] https://lore.kernel.org/qemu-devel/agcYkmTWud8euTds@redhat.com/

Problem
-------

The qemu shutdown / blockdev-close path can deadlock permanently on
upstream master. The main thread enters ppoll(timeout=-1) holding
BQL, no other thread has a wake source that points back at it, and
qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
iothread path that needs BQL stall with it.

Two independent missed-wakeup races in the block layer contributed
to the symptom on v1. Both shared the same shape: a waiter arms on
one side, the waker reads stale state on its fast path and silently
skips the kick, and nothing else on the AioContext fires to
recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
is now in tree. This patch closes the second one, exposed in
qcow2's cache_clean_timer cancellation path:

  ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close

The race diagram and the exact stale-state read are in the patch's
commit message.

Reproducer
----------

Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
e3082ab3b3 (with the graph-lock fix already applied). On modern
bare-metal the window is narrow enough that the hang rarely
reproduces without a VM -- a VM guest under full CPU saturation is
what makes the timing reliable.

    # reproducer
    stress-ng --cpu "$(nproc)" --timeout 0 &
    for r in $(seq 20); do
        timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
    done
    kill %1

With `stress-ng --cpu $(nproc)` the race surfaces. With
`stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
not reproduce reliably across 20 iterations.

When the race fires, the Python QMP client times out on
vm.run_job() after 5 s, the qemu process keeps running but never
makes forward progress, and the outer `timeout 120` eventually
kills it. Attach gdb before the timeout kills qemu to capture the
stack.

Results
-------

Same guest, 20 iterations of the loop above, master at e3082ab3b3:

  without this patch:  reproduces reliably (qcow2_close in ppoll)
  with this patch:     20/20 PASS

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>

Denis V. Lunev (1):
  coroutine: fix lost wakeup in qemu_co_sleep_wake()

 include/qemu/coroutine.h    | 17 ++++++++---
 util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
 2 files changed, 58 insertions(+), 19 deletions(-)

-- 
2.51.0



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 19:38 [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake() Denis V. Lunev via qemu development
@ 2026-05-20 19:38 ` Denis V. Lunev via qemu development
  2026-05-29 14:40   ` Kevin Wolf
  2026-05-20 19:57 ` [PATCH v2 0/1] " Michael Tokarev
  2026-05-28  9:25 ` Denis V. Lunev
  2 siblings, 1 reply; 7+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-05-20 19:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha, Denis V. Lunev

cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
fire-and-forget: it reads w->to_wake and silently returns when it is
NULL. A sleeper that is between two iterations -- has just released
s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
the wake:

  iothread0 timer coroutine           main thread (qcow2 close)
  -------------------------           -------------------------
  while-body (holding s->lock):
    read interval = 600
    wait_ns = 600 * NS
    release s->lock
                                      take s->lock
                                      interval = 0
                                      qemu_co_sleep_wake(w):
                                        w->to_wake == NULL -> skip
                                        return
                                      qemu_co_queue_wait(exit, s->lock):
                                        release s->lock
                                        yield
  qemu_co_sleep_ns_wakeable:
    aio_timer_init(+600 s)
    qemu_co_sleep:
      cas scheduled NULL -> "qsns"
      w->to_wake = co
      yield  [sleeps 600 s]

cache_clean_timer_del_and_wait() then blocks on cache_clean_timer_exit
until the original 600 s expiry fires, and qcow2_close() holds BQL the
whole time so the VM stalls behind it.

block_copy_kick() has the same shape. Fix the primitive once instead
of working around it in each caller.

Use a tri-state for QemuCoSleep::to_wake:

  NULL     - idle
  co       - sleeper parked
  PENDING  - wake delivered, no sleeper yet (sticky)

qemu_co_sleep_wake() xchgs PENDING into to_wake: a real sleeper is
woken, NULL/PENDING is left untouched so the wake stays sticky.
qemu_co_sleep() consumes a sticky PENDING on entry; otherwise it
cmpxchg-publishes itself, and a wake racing in between is caught
because the cmpxchg observes PENDING and returns without yielding.
On normal resume qemu_co_sleep() clears the PENDING the waker left
behind so the next sleep starts clean.

A double-fire (real wake plus timer callback) is harmless: the first
xchg returns the coroutine and wakes it; the second returns PENDING
and is a no-op. Cancellation latency through qemu_co_sleep_wake() is
now bounded by aio_co_wake() rather than by the sleep duration.

Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Hanna Czenczek <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
---
 include/qemu/coroutine.h    | 17 ++++++++---
 util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
 2 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index e545bbf620..1c31de60f9 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -260,10 +260,19 @@ int coroutine_fn qemu_co_timeout(CoroutineEntry *entry, void *opaque,
                                  uint64_t timeout_ns, CleanupFunc clean);
 
 /**
- * Wake a coroutine if it is sleeping in qemu_co_sleep_ns. The timer will be
- * deleted. @sleep_state must be the variable whose address was given to
- * qemu_co_sleep_ns() and should be checked to be non-NULL before calling
- * qemu_co_sleep_wake().
+ * Wake a coroutine sleeping in qemu_co_sleep() or qemu_co_sleep_ns_wakeable().
+ * The timer set up by the latter is deleted on wakeup.
+ *
+ * The wake is sticky: if no sleeper is parked on @w at the time of the call,
+ * the wake is recorded on @w and consumed by the next qemu_co_sleep() on the
+ * same @w, which then returns without yielding. This closes the lost-wakeup
+ * window between two sleeps and is the documented behavior callers should
+ * rely on -- e.g. a cancellation signal raised between iterations of a
+ * sleep/work loop will shorten the next sleep instead of being dropped.
+ *
+ * The state persists until consumed: if no further qemu_co_sleep() is ever
+ * called on @w, the pending wake is harmlessly discarded when @w goes away.
+ * Multiple wakes coalesce -- the next sleep consumes at most one.
  */
 void qemu_co_sleep_wake(QemuCoSleep *w);
 
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index edef117284..a5734a8eb7 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -18,20 +18,29 @@
 
 static const char *qemu_co_sleep_ns__scheduled = "qemu_co_sleep_ns";
 
+/*
+ * Sentinel stored in QemuCoSleep::to_wake by qemu_co_sleep_wake() when no
+ * sleeper has parked yet. The next qemu_co_sleep() consumes it and returns
+ * without yielding, so a wake that races the arming of a sleep is never
+ * lost.
+ */
+#define QEMU_CO_SLEEP_PENDING ((Coroutine *)(uintptr_t)1)
+
 void qemu_co_sleep_wake(QemuCoSleep *w)
 {
     Coroutine *co;
 
-    co = w->to_wake;
-    w->to_wake = NULL;
-    if (co) {
-        /* Write of schedule protected by barrier write in aio_co_schedule */
-        const char *scheduled = qatomic_cmpxchg(&co->scheduled,
-                                                qemu_co_sleep_ns__scheduled, NULL);
-
-        assert(scheduled == qemu_co_sleep_ns__scheduled);
-        aio_co_wake(co);
+    co = qatomic_xchg(&w->to_wake, QEMU_CO_SLEEP_PENDING);
+    if (co == NULL || co == QEMU_CO_SLEEP_PENDING) {
+        /* No sleeper, or a wake is already pending. */
+        return;
     }
+
+    /* Write of scheduled protected by barrier write in aio_co_schedule */
+    const char *scheduled = qatomic_cmpxchg(&co->scheduled,
+                                            qemu_co_sleep_ns__scheduled, NULL);
+    assert(scheduled == qemu_co_sleep_ns__scheduled);
+    aio_co_wake(co);
 }
 
 static void co_sleep_cb(void *opaque)
@@ -43,6 +52,14 @@ static void co_sleep_cb(void *opaque)
 void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
 {
     Coroutine *co = qemu_coroutine_self();
+    Coroutine *prev;
+
+    /* Consume an already-delivered wake without yielding. */
+    prev = qatomic_cmpxchg(&w->to_wake, QEMU_CO_SLEEP_PENDING, NULL);
+    if (prev == QEMU_CO_SLEEP_PENDING) {
+        return;
+    }
+    assert(prev == NULL);
 
     const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
                                             qemu_co_sleep_ns__scheduled);
@@ -53,11 +70,23 @@ void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
         abort();
     }
 
-    w->to_wake = co;
+    /*
+     * Publish ourselves as the sleeper. If a wake raced in between the
+     * fast path above and now, the cmpxchg observes QEMU_CO_SLEEP_PENDING
+     * and we consume it without yielding.
+     */
+    prev = qatomic_cmpxchg(&w->to_wake, NULL, co);
+    if (prev == QEMU_CO_SLEEP_PENDING) {
+        qatomic_set(&w->to_wake, NULL);
+        qatomic_set(&co->scheduled, NULL);
+        return;
+    }
+    assert(prev == NULL);
+
     qemu_coroutine_yield();
 
-    /* w->to_wake is cleared before resuming this coroutine.  */
-    assert(w->to_wake == NULL);
+    /* The waker left QEMU_CO_SLEEP_PENDING; clear it for the next sleep. */
+    qatomic_set(&w->to_wake, NULL);
 }
 
 void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
@@ -70,9 +99,10 @@ void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
     timer_mod(&ts, qemu_clock_get_ns(type) + ns);
 
     /*
-     * The timer will fire in the current AiOContext, so the callback
-     * must happen after qemu_co_sleep yields and there is no race
-     * between timer_mod and qemu_co_sleep.
+     * A wake racing with the arming of the sleep -- including the timer
+     * we just armed firing in another AioContext before qemu_co_sleep()
+     * publishes itself -- is captured by the sticky PENDING state in
+     * qemu_co_sleep_wake() and consumed here without yielding.
      */
     qemu_co_sleep(w);
     timer_del(&ts);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 19:38 [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake() Denis V. Lunev via qemu development
  2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
@ 2026-05-20 19:57 ` Michael Tokarev
  2026-05-20 20:11   ` Denis V. Lunev
  2026-05-28  9:25 ` Denis V. Lunev
  2 siblings, 1 reply; 7+ messages in thread
From: Michael Tokarev @ 2026-05-20 19:57 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha

On 20.05.2026 22:38, Denis V. Lunev via wrote:
> Changes since v1
> ----------------
> 
> v1 was a two-patch series tracking two independent missed-wakeup
> races on the qemu shutdown path.

Um.  The v1 has already been applied to the master branch, see commit
e3082ab3b3.

And I already picked it up for stable, but not for 10.0.x

So I guess the next step is to send a follow-up on top of v1, or
to revert v1 and have this v2 instead.

Fun stuff ;)

/mjt



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 19:57 ` [PATCH v2 0/1] " Michael Tokarev
@ 2026-05-20 20:11   ` Denis V. Lunev
  2026-05-21  5:44     ` Michael Tokarev
  0 siblings, 1 reply; 7+ messages in thread
From: Denis V. Lunev @ 2026-05-20 20:11 UTC (permalink / raw)
  To: Michael Tokarev, Denis V. Lunev, qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha

On 5/20/26 21:57, Michael Tokarev wrote:
> On 20.05.2026 22:38, Denis V. Lunev via wrote:
>> Changes since v1
>> ----------------
>>
>> v1 was a two-patch series tracking two independent missed-wakeup
>> races on the qemu shutdown path.
> Um.  The v1 has already been applied to the master branch, see commit
> e3082ab3b3.
>
> And I already picked it up for stable, but not for 10.0.x
>
> So I guess the next step is to send a follow-up on top of v1, or
> to revert v1 and have this v2 instead.
>
> Fun stuff ;)
>
> /mjt
>
It seems that I have written things wrong in description.

v1 consists of 2 patches which were fixed very similar
but different patterns - 2 distinct hangs on cleanup:
* one in bdrv_graph_co_rdunlock() [1]
* and one in cache_clean_timer_del_and_wait() [2]

Patch [1] has been applied by Kevin to master.
Patch [2] has been reviewed with a proposal to revise
the approach.

This submission contains Patch [2] in a way proposed
by Kevin. Patch [1] is independent and stays merged.

Sorry one if this causes some problems.

Thank you in advance,
    Den


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 20:11   ` Denis V. Lunev
@ 2026-05-21  5:44     ` Michael Tokarev
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Tokarev @ 2026-05-21  5:44 UTC (permalink / raw)
  To: Denis V. Lunev, Denis V. Lunev, qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha

On 20.05.2026 23:11, Denis V. Lunev wrote:
> On 5/20/26 21:57, Michael Tokarev wrote:
>> On 20.05.2026 22:38, Denis V. Lunev via wrote:
>>> Changes since v1
>>> ----------------
>>>
>>> v1 was a two-patch series tracking two independent missed-wakeup
>>> races on the qemu shutdown path.
>> Um.  The v1 has already been applied to the master branch, see commit
>> e3082ab3b3.
>>
>> And I already picked it up for stable, but not for 10.0.x
>>
>> So I guess the next step is to send a follow-up on top of v1, or
>> to revert v1 and have this v2 instead.
..
> It seems that I have written things wrong in description.
> 
> v1 consists of 2 patches which were fixed very similar
> but different patterns - 2 distinct hangs on cleanup:
> * one in bdrv_graph_co_rdunlock() [1]
> * and one in cache_clean_timer_del_and_wait() [2]
> 
> Patch [1] has been applied by Kevin to master.
> Patch [2] has been reviewed with a proposal to revise
> the approach.
> 
> This submission contains Patch [2] in a way proposed
> by Kevin. Patch [1] is independent and stays merged.

Aha, this makes sense.  Thank you for clarifying things!
And indeed, I missed this part of your description - which
is right in the part of your message which I quoted.  I'm
sorry for the noise.

Best,

/mjt


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 19:38 [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake() Denis V. Lunev via qemu development
  2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
  2026-05-20 19:57 ` [PATCH v2 0/1] " Michael Tokarev
@ 2026-05-28  9:25 ` Denis V. Lunev
  2 siblings, 0 replies; 7+ messages in thread
From: Denis V. Lunev @ 2026-05-28  9:25 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha

On 5/20/26 21:38, Denis V. Lunev wrote:
> Changes since v1
> ----------------
>
> v1 was a two-patch series tracking two independent missed-wakeup
> races on the qemu shutdown path.
>
>   * Patch 1, "block/graph-lock: fix missed wakeup in
>     bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
>     and is now in tree. This v2 carries only the remaining race.
>
>   * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
>     hang is no longer worked around inside block/qcow2.c. Instead,
>     the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
>     closing the lost-wakeup window for every caller
>     (cache_clean_timer, block_copy_kick, ...) rather than just
>     qcow2. Cancellation latency through qemu_co_sleep_wake() drops
>     from "next 1 s tick" (v1 workaround) to aio_co_wake().
>
> [1] https://lore.kernel.org/qemu-devel/agcYkmTWud8euTds@redhat.com/
>
> Problem
> -------
>
> The qemu shutdown / blockdev-close path can deadlock permanently on
> upstream master. The main thread enters ppoll(timeout=-1) holding
> BQL, no other thread has a wake source that points back at it, and
> qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> iothread path that needs BQL stall with it.
>
> Two independent missed-wakeup races in the block layer contributed
> to the symptom on v1. Both shared the same shape: a waiter arms on
> one side, the waker reads stale state on its fast path and silently
> skips the kick, and nothing else on the AioContext fires to
> recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
> is now in tree. This patch closes the second one, exposed in
> qcow2's cache_clean_timer cancellation path:
>
>   ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
>
> The race diagram and the exact stale-state read are in the patch's
> commit message.
>
> Reproducer
> ----------
>
> Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
> e3082ab3b3 (with the graph-lock fix already applied). On modern
> bare-metal the window is narrow enough that the hang rarely
> reproduces without a VM -- a VM guest under full CPU saturation is
> what makes the timing reliable.
>
>     # reproducer
>     stress-ng --cpu "$(nproc)" --timeout 0 &
>     for r in $(seq 20); do
>         timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
>     done
>     kill %1
>
> With `stress-ng --cpu $(nproc)` the race surfaces. With
> `stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
> not reproduce reliably across 20 iterations.
>
> When the race fires, the Python QMP client times out on
> vm.run_job() after 5 s, the qemu process keeps running but never
> makes forward progress, and the outer `timeout 120` eventually
> kills it. Attach gdb before the timeout kills qemu to capture the
> stack.
>
> Results
> -------
>
> Same guest, 20 iterations of the loop above, master at e3082ab3b3:
>
>   without this patch:  reproduces reliably (qcow2_close in ppoll)
>   with this patch:     20/20 PASS
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Hanna Reitz <hreitz@redhat.com>
>
> Denis V. Lunev (1):
>   coroutine: fix lost wakeup in qemu_co_sleep_wake()
>
>  include/qemu/coroutine.h    | 17 ++++++++---
>  util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
>  2 files changed, 58 insertions(+), 19 deletions(-)
>
ping


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] coroutine: fix lost wakeup in qemu_co_sleep_wake()
  2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
@ 2026-05-29 14:40   ` Kevin Wolf
  0 siblings, 0 replies; 7+ messages in thread
From: Kevin Wolf @ 2026-05-29 14:40 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: qemu-devel, qemu-block, qemu-stable, hreitz, stefanha

Am 20.05.2026 um 21:38 hat Denis V. Lunev geschrieben:
> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
> fire-and-forget: it reads w->to_wake and silently returns when it is
> NULL. A sleeper that is between two iterations -- has just released
> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
> the wake:
> 
>   iothread0 timer coroutine           main thread (qcow2 close)
>   -------------------------           -------------------------
>   while-body (holding s->lock):
>     read interval = 600
>     wait_ns = 600 * NS
>     release s->lock
>                                       take s->lock
>                                       interval = 0
>                                       qemu_co_sleep_wake(w):
>                                         w->to_wake == NULL -> skip
>                                         return
>                                       qemu_co_queue_wait(exit, s->lock):
>                                         release s->lock
>                                         yield
>   qemu_co_sleep_ns_wakeable:
>     aio_timer_init(+600 s)
>     qemu_co_sleep:
>       cas scheduled NULL -> "qsns"
>       w->to_wake = co
>       yield  [sleeps 600 s]
> 
> cache_clean_timer_del_and_wait() then blocks on cache_clean_timer_exit
> until the original 600 s expiry fires, and qcow2_close() holds BQL the
> whole time so the VM stalls behind it.
> 
> block_copy_kick() has the same shape. Fix the primitive once instead
> of working around it in each caller.
> 
> Use a tri-state for QemuCoSleep::to_wake:
> 
>   NULL     - idle
>   co       - sleeper parked
>   PENDING  - wake delivered, no sleeper yet (sticky)
> 
> qemu_co_sleep_wake() xchgs PENDING into to_wake: a real sleeper is
> woken, NULL/PENDING is left untouched so the wake stays sticky.
> qemu_co_sleep() consumes a sticky PENDING on entry; otherwise it
> cmpxchg-publishes itself, and a wake racing in between is caught
> because the cmpxchg observes PENDING and returns without yielding.
> On normal resume qemu_co_sleep() clears the PENDING the waker left
> behind so the next sleep starts clean.
> 
> A double-fire (real wake plus timer callback) is harmless: the first
> xchg returns the coroutine and wakes it; the second returns PENDING
> and is a no-op. Cancellation latency through qemu_co_sleep_wake() is
> now bounded by aio_co_wake() rather than by the sleep duration.
> 
> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Cc: Hanna Czenczek <hreitz@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/qemu/coroutine.h    | 17 ++++++++---
>  util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
>  2 files changed, 58 insertions(+), 19 deletions(-)
> 
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index e545bbf620..1c31de60f9 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -260,10 +260,19 @@ int coroutine_fn qemu_co_timeout(CoroutineEntry *entry, void *opaque,
>                                   uint64_t timeout_ns, CleanupFunc clean);
>  
>  /**
> - * Wake a coroutine if it is sleeping in qemu_co_sleep_ns. The timer will be
> - * deleted. @sleep_state must be the variable whose address was given to
> - * qemu_co_sleep_ns() and should be checked to be non-NULL before calling
> - * qemu_co_sleep_wake().
> + * Wake a coroutine sleeping in qemu_co_sleep() or qemu_co_sleep_ns_wakeable().
> + * The timer set up by the latter is deleted on wakeup.
> + *
> + * The wake is sticky: if no sleeper is parked on @w at the time of the call,
> + * the wake is recorded on @w and consumed by the next qemu_co_sleep() on the
> + * same @w, which then returns without yielding. This closes the lost-wakeup
> + * window between two sleeps and is the documented behavior callers should
> + * rely on -- e.g. a cancellation signal raised between iterations of a
> + * sleep/work loop will shorten the next sleep instead of being dropped.
> + *
> + * The state persists until consumed: if no further qemu_co_sleep() is ever
> + * called on @w, the pending wake is harmlessly discarded when @w goes away.
> + * Multiple wakes coalesce -- the next sleep consumes at most one.
>   */
>  void qemu_co_sleep_wake(QemuCoSleep *w);
>  
> diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
> index edef117284..a5734a8eb7 100644
> --- a/util/qemu-coroutine-sleep.c
> +++ b/util/qemu-coroutine-sleep.c
> @@ -18,20 +18,29 @@
>  
>  static const char *qemu_co_sleep_ns__scheduled = "qemu_co_sleep_ns";
>  
> +/*
> + * Sentinel stored in QemuCoSleep::to_wake by qemu_co_sleep_wake() when no
> + * sleeper has parked yet. The next qemu_co_sleep() consumes it and returns
> + * without yielding, so a wake that races the arming of a sleep is never
> + * lost.
> + */
> +#define QEMU_CO_SLEEP_PENDING ((Coroutine *)(uintptr_t)1)
> +
>  void qemu_co_sleep_wake(QemuCoSleep *w)
>  {
>      Coroutine *co;
>  
> -    co = w->to_wake;
> -    w->to_wake = NULL;
> -    if (co) {
> -        /* Write of schedule protected by barrier write in aio_co_schedule */
> -        const char *scheduled = qatomic_cmpxchg(&co->scheduled,
> -                                                qemu_co_sleep_ns__scheduled, NULL);
> -
> -        assert(scheduled == qemu_co_sleep_ns__scheduled);
> -        aio_co_wake(co);
> +    co = qatomic_xchg(&w->to_wake, QEMU_CO_SLEEP_PENDING);
> +    if (co == NULL || co == QEMU_CO_SLEEP_PENDING) {
> +        /* No sleeper, or a wake is already pending. */
> +        return;
>      }
> +
> +    /* Write of scheduled protected by barrier write in aio_co_schedule */
> +    const char *scheduled = qatomic_cmpxchg(&co->scheduled,
> +                                            qemu_co_sleep_ns__scheduled, NULL);
> +    assert(scheduled == qemu_co_sleep_ns__scheduled);
> +    aio_co_wake(co);
>  }
>  
>  static void co_sleep_cb(void *opaque)
> @@ -43,6 +52,14 @@ static void co_sleep_cb(void *opaque)
>  void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
>  {
>      Coroutine *co = qemu_coroutine_self();
> +    Coroutine *prev;
> +
> +    /* Consume an already-delivered wake without yielding. */
> +    prev = qatomic_cmpxchg(&w->to_wake, QEMU_CO_SLEEP_PENDING, NULL);
> +    if (prev == QEMU_CO_SLEEP_PENDING) {
> +        return;
> +    }
> +    assert(prev == NULL);

I'm wondering if this check serves a purpose when...

>      const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
>                                              qemu_co_sleep_ns__scheduled);
> @@ -53,11 +70,23 @@ void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
>          abort();
>      }
>  
> -    w->to_wake = co;
> +    /*
> +     * Publish ourselves as the sleeper. If a wake raced in between the
> +     * fast path above and now, the cmpxchg observes QEMU_CO_SLEEP_PENDING
> +     * and we consume it without yielding.
> +     */
> +    prev = qatomic_cmpxchg(&w->to_wake, NULL, co);
> +    if (prev == QEMU_CO_SLEEP_PENDING) {

...we essentially have to repeat it here anyway. Couldn't we always just
use this second check?

> +        qatomic_set(&w->to_wake, NULL);
> +        qatomic_set(&co->scheduled, NULL);
> +        return;
> +    }
> +    assert(prev == NULL);
> +
>      qemu_coroutine_yield();
>  
> -    /* w->to_wake is cleared before resuming this coroutine.  */
> -    assert(w->to_wake == NULL);
> +    /* The waker left QEMU_CO_SLEEP_PENDING; clear it for the next sleep. */
> +    qatomic_set(&w->to_wake, NULL);
>  }
>  
>  void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
> @@ -70,9 +99,10 @@ void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
>      timer_mod(&ts, qemu_clock_get_ns(type) + ns);
>  
>      /*
> -     * The timer will fire in the current AiOContext, so the callback
> -     * must happen after qemu_co_sleep yields and there is no race
> -     * between timer_mod and qemu_co_sleep.
> +     * A wake racing with the arming of the sleep -- including the timer
> +     * we just armed firing in another AioContext before qemu_co_sleep()
> +     * publishes itself -- is captured by the sticky PENDING state in
> +     * qemu_co_sleep_wake() and consumed here without yielding.
>       */
>      qemu_co_sleep(w);
>      timer_del(&ts);

Overall this looks good to me. Some unit tests would be nice, though.

Kevin



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-05-29 14:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20 19:38 [PATCH v2 0/1] coroutine: fix lost wakeup in qemu_co_sleep_wake() Denis V. Lunev via qemu development
2026-05-20 19:38 ` [PATCH v2 1/1] " Denis V. Lunev via qemu development
2026-05-29 14:40   ` Kevin Wolf
2026-05-20 19:57 ` [PATCH v2 0/1] " Michael Tokarev
2026-05-20 20:11   ` Denis V. Lunev
2026-05-21  5:44     ` Michael Tokarev
2026-05-28  9:25 ` Denis V. Lunev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.