* [PATCH v2 00/19] block: Some multi-threading fixes
@ 2025-11-10 15:48 Hanna Czenczek
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Hi,
See the v1 cover letter for a general overview:
https://lists.nongnu.org/archive/html/qemu-block/2025-10/msg00501.html
Changes in v2:
- Kept `.ret = -EINPROGRESS`-style initializations where they were
already present (curl, nvme)
- Dropped trivial BH waking code (i.e. which can be directly replaced by
aio_co_wake()) in iscsi, nfs, nvme
- curl: Yield in curl_do_preadv() (former curl_setup_preadv()) and
curl_find_buf() instead of returning whether curl_co_preadv() has to
yield or not
- nvme: Added a patch that annotates some functions (primarily BHs and
CBs) with which AioContext they (must) run in
- qcow2 cache-cleaning timer: Run the timer as a coroutine instead of in
a timer CB; use a CoQueue to await it exiting instead of polling
(well, we still need to poll in case we don’t run in a coroutine, but
that’s standard procedure, I believe)
- The need to initialize the CoQueue showed that there is a code path
in qcow2 that doesn’t initialize its CoMutex. Added a patch to do
that.
- Also added a patch to have the timer use realtime instead of virtual
time.
git backport-diff against v1:
Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
001/19:[----] [--] 'block: Note on aio_co_wake use if not yet yielding'
002/19:[----] [--] 'rbd: Run co BH CB in the coroutine’s AioContext'
003/19:[0019] [FC] 'iscsi: Run co BH CB in the coroutine’s AioContext'
004/19:[0022] [FC] 'nfs: Run co BH CB in the coroutine’s AioContext'
005/19:[0060] [FC] 'curl: Fix coroutine waking'
006/19:[----] [--] 'gluster: Do not move coroutine into BDS context'
007/19:[----] [--] 'nvme: Kick and check completions in BDS context'
008/19:[0038] [FC] 'nvme: Fix coroutine waking'
009/19:[down] 'nvme: Note in which AioContext some functions run'
010/19:[----] [--] 'block/io: Take reqs_lock for tracked_requests'
011/19:[down] 'qcow2: Re-initialize lock in invalidate_cache'
012/19:[0145] [FC] 'qcow2: Fix cache_clean_timer'
013/19:[down] 'qcow2: Schedule cache-clean-timer in realtime'
014/19:[----] [--] 'ssh: Run restart_coroutine in current AioContext'
015/19:[----] [--] 'blkreplay: Run BH in coroutine’s AioContext'
016/19:[----] [--] 'block: Note in which AioContext AIO CBs are called'
017/19:[----] [--] 'iscsi: Create AIO BH in original AioContext'
018/19:[----] [--] 'null-aio: Run CB in original AioContext'
019/19:[----] [--] 'win32-aio: Run CB in original context'
Hanna Czenczek (19):
block: Note on aio_co_wake use if not yet yielding
rbd: Run co BH CB in the coroutine’s AioContext
iscsi: Run co BH CB in the coroutine’s AioContext
nfs: Run co BH CB in the coroutine’s AioContext
curl: Fix coroutine waking
gluster: Do not move coroutine into BDS context
nvme: Kick and check completions in BDS context
nvme: Fix coroutine waking
nvme: Note in which AioContext some functions run
block/io: Take reqs_lock for tracked_requests
qcow2: Re-initialize lock in invalidate_cache
qcow2: Fix cache_clean_timer
qcow2: Schedule cache-clean-timer in realtime
ssh: Run restart_coroutine in current AioContext
blkreplay: Run BH in coroutine’s AioContext
block: Note in which AioContext AIO CBs are called
iscsi: Create AIO BH in original AioContext
null-aio: Run CB in original AioContext
win32-aio: Run CB in original context
block/qcow2.h | 5 +-
include/block/aio.h | 15 ++++
include/block/block_int-common.h | 7 +-
block/blkreplay.c | 3 +-
block/curl.c | 45 +++++++---
block/gluster.c | 17 ++--
block/io.c | 3 +
block/iscsi.c | 63 ++++++--------
block/nfs.c | 41 ++++-----
block/null.c | 7 +-
block/nvme.c | 113 ++++++++++++++++--------
block/qcow2.c | 145 ++++++++++++++++++++++++-------
block/rbd.c | 12 +--
block/ssh.c | 22 ++---
block/win32-aio.c | 31 +++++--
15 files changed, 347 insertions(+), 182 deletions(-)
--
2.51.1
* [PATCH v2 01/19] block: Note on aio_co_wake use if not yet yielding
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
aio_co_wake() is generally safe to call regardless of whether the
coroutine is already yielding or not. If it is not yet yielding, it
will be scheduled to run when it does yield.
Caveats:
- The caller must be independent of the coroutine, i.e. must not be the
same coroutine (so that, if both are in the same AioContext, the
coroutine is guaranteed to be yielding)
- The coroutine must yield at some point
Make note of this so callers can reason that their use is safe.
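The rule can be sketched as a tiny standalone model (illustrative only, not
QEMU code; `wake_is_safe()` and its parameters are made up for this sketch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of when an aio_co_wake(co) call is safe, given
 * the caller's AioContext, co's AioContext, and whether co is already
 * yielding. Contexts are represented as plain integers here.
 */
typedef int AioCtx;

static bool wake_is_safe(AioCtx caller_ctx, AioCtx co_ctx, bool co_yielding)
{
    if (caller_ctx != co_ctx) {
        /*
         * Different context: the aio_co_schedule() path is taken, so
         * co need not be yielding yet; it is entered once it yields.
         */
        return true;
    }
    /*
     * Same context: co must already be yielding, otherwise we would
     * be entering a coroutine that is still running.
     */
    return co_yielding;
}
```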
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/block/aio.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/block/aio.h b/include/block/aio.h
index 99ff48420b..6ed97f0a4c 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -650,6 +650,21 @@ void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
* aio_co_wake may be executed either in coroutine or non-coroutine
* context. The coroutine must not be entered by anyone else while
* aio_co_wake() is active.
+ *
+ * If `co`'s AioContext differs from the current AioContext, this will call
+ * aio_co_schedule(), which makes this safe to use even when `co` has not
+ * yielded yet. In such a case, it will be entered once it yields.
+ *
+ * In contrast, if `co`'s AioContext is equal to the current one, it is
+ * required for `co` to currently be yielding. This is generally the case
 * if the caller is not running in `co` (i.e. was not invoked by `co`),
 * because the only
+ * other way for the caller to be running then is for `co` to currently be
+ * yielding.
+ *
+ * Therefore, if there is no way for the caller to be invoked/entered by
+ * `co`, it is generally safe to call this regardless of whether `co` is
+ * known to already be yielding or not -- it only has to yield at some
+ * point.
*/
void aio_co_wake(Coroutine *co);
--
2.51.1
* [PATCH v2 02/19] rbd: Run co BH CB in the coroutine’s AioContext
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
qemu_rbd_completion_cb() schedules the request completion code
(qemu_rbd_finish_bh()) to run in the BDS’s AioContext, assuming that
this is the same thread in which qemu_rbd_start_co() runs.
To explain, this is how the latter two functions interact:
In qemu_rbd_start_co():
while (!task.complete)
qemu_coroutine_yield();
In qemu_rbd_finish_bh():
task->complete = true;
aio_co_wake(task->co); // task->co is qemu_rbd_start_co()
For this interaction to work reliably, both must run in the same thread
so that qemu_rbd_finish_bh() can only run once the coroutine yields.
Otherwise, finish_bh() may run before start_co() checks task.complete,
which will result in the latter seeing .complete as true immediately and
skipping the yield altogether, even though finish_bh() still wakes it.
With multiqueue, the BDS’s AioContext is not necessarily the thread
start_co() runs in, and so finish_bh() may be scheduled to run in a
different thread than start_co(). With the right timing, this will
cause the problems described above; waking a non-yielding coroutine is
not good, as can be reproduced by putting e.g. a usleep(100000) above
the while loop in start_co() (and using multiqueue), giving finish_bh()
a much better chance at exiting before start_co() can yield.
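The problematic interleaving can be sketched as a small standalone model
(illustrative only, not QEMU code; `wake_hits_yielding_coroutine()` and its
parameter are made up for this sketch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the interleaving described above. finish_first means
 * finish_bh() runs before start_co() checks task.complete (possible
 * when both run in different threads). Returns true iff the wake from
 * finish_bh() hits a yielding coroutine, i.e. iff the pattern behaves
 * correctly.
 */
static bool wake_hits_yielding_coroutine(bool finish_first)
{
    bool complete = false;
    bool yielding = false;

    if (finish_first) {
        complete = true;    /* finish_bh(): task->complete = true;  */
    }
    if (!complete) {        /* start_co(): while (!task.complete)   */
        yielding = true;    /*                 qemu_coroutine_yield() */
    }
    /* finish_bh()'s aio_co_wake(task->co) happens (now or earlier): */
    return yielding;
}
```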
So instead of scheduling finish_bh() in the BDS’s AioContext, schedule
finish_bh() in task->co’s AioContext.
In addition, we can get rid of task.complete altogether because we will
get woken exactly once, when the task is indeed complete, no need to
check.
(We could go further and drop the BH, running aio_co_wake() directly in
qemu_rbd_completion_cb() because we are allowed to do that even if the
coroutine isn’t yet yielding and we’re in a different thread – but the
doc comment on qemu_rbd_completion_cb() says to be careful, so I decided
not to go so far here.)
Buglink: https://issues.redhat.com/browse/RHEL-67115
Reported-by: Junyao Zhao <junzhao@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/rbd.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/block/rbd.c b/block/rbd.c
index 3611dc81cf..2a70b5a983 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -110,9 +110,7 @@ typedef struct BDRVRBDState {
} BDRVRBDState;
typedef struct RBDTask {
- BlockDriverState *bs;
Coroutine *co;
- bool complete;
int64_t ret;
} RBDTask;
@@ -1309,7 +1307,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
static void qemu_rbd_finish_bh(void *opaque)
{
RBDTask *task = opaque;
- task->complete = true;
aio_co_wake(task->co);
}
@@ -1326,7 +1323,7 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
{
task->ret = rbd_aio_get_return_value(c);
rbd_aio_release(c);
- aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+ aio_bh_schedule_oneshot(qemu_coroutine_get_aio_context(task->co),
qemu_rbd_finish_bh, task);
}
@@ -1338,7 +1335,7 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
RBDAIOCmd cmd)
{
BDRVRBDState *s = bs->opaque;
- RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
+ RBDTask task = { .co = qemu_coroutine_self() };
rbd_completion_t c;
int r;
@@ -1401,9 +1398,8 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
return r;
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ /* Expect exactly a single wake from qemu_rbd_finish_bh() */
+ qemu_coroutine_yield();
if (task.ret < 0) {
error_report("rbd request failed: cmd %d offset %" PRIu64 " bytes %"
--
2.51.1
* [PATCH v2 03/19] iscsi: Run co BH CB in the coroutine’s AioContext
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
For rbd (and others), as described in “rbd: Run co BH CB in the
coroutine’s AioContext”, the pattern of setting a completion flag and
waking a coroutine that yields while the flag is not set can only work
when both run in the same thread.
iscsi has the same pattern, but the details are a bit different:
iscsi_co_generic_cb() can (as far as I understand) only run through
iscsi_service(), not just from a random thread at a random time.
iscsi_service() in turn can only be run after iscsi_set_events() set up
an FD event handler, which is done in iscsi_co_wait_for_task().
As a result, iscsi_co_wait_for_task() will always yield exactly once,
because iscsi_co_generic_cb() can only run after iscsi_set_events(),
after the completion flag has already been checked, and the yielding
coroutine will then be woken only once the completion flag was set to
true. So as far as I can tell, iscsi has no bug and already works fine.
Still, we don’t need the completion flag because we know we have to
yield exactly once, so we can drop it. This simplifies the code and
makes it more obvious that the “rbd bug” isn’t present here.
This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
bit boring, so at least the former we can drop and call aio_co_wake()
directly from iscsi_co_generic_cb() to the same effect. As for the
latter, the timer needs a CB, so we can’t drop it (I suppose we could
technically use aio_co_wake directly as the CB, but that would be
nasty), but we can put it into the coroutine’s AioContext to make its
aio_co_wake() a simple wrapper around qemu_coroutine_enter() without a
further BH indirection.
Finally, remove the iTask->co != NULL checks: This field is set by
iscsi_co_init_iscsitask(), which all users of IscsiTask run before even
setting up iscsi_co_generic_cb() as the callback, and it is never set or
cleared elsewhere, so it is impossible to not be set in
iscsi_co_generic_cb().
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/iscsi.c | 56 +++++++++++++++++++--------------------------------
1 file changed, 21 insertions(+), 35 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 15b96ee880..852ecccf0d 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -107,7 +107,6 @@ typedef struct IscsiLun {
typedef struct IscsiTask {
int status;
- int complete;
int retries;
int do_retry;
struct scsi_task *task;
@@ -180,21 +179,10 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
#endif
-static void iscsi_co_generic_bh_cb(void *opaque)
-{
- struct IscsiTask *iTask = opaque;
-
- iTask->complete = 1;
- aio_co_wake(iTask->co);
-}
-
static void iscsi_retry_timer_expired(void *opaque)
{
struct IscsiTask *iTask = opaque;
- iTask->complete = 1;
- if (iTask->co) {
- aio_co_wake(iTask->co);
- }
+ aio_co_wake(iTask->co);
}
static inline unsigned exp_random(double mean)
@@ -239,6 +227,8 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
{
struct IscsiTask *iTask = opaque;
struct scsi_task *task = command_data;
+ IscsiLun *iscsilun = iTask->iscsilun;
+ AioContext *itask_ctx = qemu_coroutine_get_aio_context(iTask->co);
iTask->status = status;
iTask->do_retry = 0;
@@ -263,9 +253,9 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
" (retry #%u in %u ms): %s",
iTask->retries, retry_time,
iscsi_get_error(iscsi));
- aio_timer_init(iTask->iscsilun->aio_context,
- &iTask->retry_timer, QEMU_CLOCK_REALTIME,
- SCALE_MS, iscsi_retry_timer_expired, iTask);
+ aio_timer_init(itask_ctx, &iTask->retry_timer,
+ QEMU_CLOCK_REALTIME, SCALE_MS,
+ iscsi_retry_timer_expired, iTask);
timer_mod(&iTask->retry_timer,
qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + retry_time);
iTask->do_retry = 1;
@@ -284,12 +274,17 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
}
}
- if (iTask->co) {
- replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
- iscsi_co_generic_bh_cb, iTask);
- } else {
- iTask->complete = 1;
- }
+ /*
+ * aio_co_wake() is safe to call: iscsi_service(), which called us, is only
+ * run from the event_timer and/or the FD handlers, never from the request
+ * coroutine. The request coroutine in turn will yield unconditionally.
+ * We must release the lock, though, in case we enter the coroutine
* directly. (Note that if we do enter the coroutine, iTask will probably
+ * be dangling once aio_co_wake() returns.)
+ */
+ qemu_mutex_unlock(&iscsilun->mutex);
+ aio_co_wake(iTask->co);
+ qemu_mutex_lock(&iscsilun->mutex);
}
static void coroutine_fn
@@ -592,12 +587,10 @@ static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
static void coroutine_fn iscsi_co_wait_for_task(IscsiTask *iTask,
IscsiLun *iscsilun)
{
- while (!iTask->complete) {
- iscsi_set_events(iscsilun);
- qemu_mutex_unlock(&iscsilun->mutex);
- qemu_coroutine_yield();
- qemu_mutex_lock(&iscsilun->mutex);
- }
+ iscsi_set_events(iscsilun);
+ qemu_mutex_unlock(&iscsilun->mutex);
+ qemu_coroutine_yield();
+ qemu_mutex_lock(&iscsilun->mutex);
}
static int coroutine_fn
@@ -669,7 +662,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -740,7 +732,6 @@ retry:
scsi_free_scsi_task(iTask.task);
iTask.task = NULL;
}
- iTask.complete = 0;
goto retry;
}
@@ -902,7 +893,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -940,7 +930,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -1184,7 +1173,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -1301,7 +1289,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -2390,7 +2377,6 @@ retry:
iscsi_co_wait_for_task(&iscsi_task, dst_lun);
if (iscsi_task.do_retry) {
- iscsi_task.complete = 0;
goto retry;
}
--
2.51.1
* [PATCH v2 04/19] nfs: Run co BH CB in the coroutine’s AioContext
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
Like in “rbd: Run co BH CB in the coroutine’s AioContext”, drop the
completion flag, yield exactly once, and run the BH in the coroutine’s
AioContext.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (!task.complete)` loops.)
Like in “iscsi: Run co BH CB in the coroutine’s AioContext”, this makes
nfs_co_generic_bh_cb() trivial, so we can drop it in favor of just
calling aio_co_wake() directly.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nfs.c | 41 ++++++++++++++++-------------------------
1 file changed, 16 insertions(+), 25 deletions(-)
diff --git a/block/nfs.c b/block/nfs.c
index 0a7d38db09..1d3a34a30c 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -69,7 +69,6 @@ typedef struct NFSClient {
typedef struct NFSRPC {
BlockDriverState *bs;
int ret;
- int complete;
QEMUIOVector *iov;
struct stat *st;
Coroutine *co;
@@ -230,14 +229,6 @@ static void coroutine_fn nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
};
}
-static void nfs_co_generic_bh_cb(void *opaque)
-{
- NFSRPC *task = opaque;
-
- task->complete = 1;
- aio_co_wake(task->co);
-}
-
/* Called (via nfs_service) with QemuMutex held. */
static void
nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
@@ -256,8 +247,16 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
if (task->ret < 0) {
error_report("NFS Error: %s", nfs_get_error(nfs));
}
- replay_bh_schedule_oneshot_event(task->client->aio_context,
- nfs_co_generic_bh_cb, task);
+
+ /*
+ * Safe to call: nfs_service(), which called us, is only run from the FD
+ * handlers, never from the request coroutine. The request coroutine in
+ * turn will yield unconditionally.
+ * No need to release the lock, even if we directly enter the coroutine, as
+ * the lock is never re-taken after yielding. (Note: If we do enter the
+ * coroutine, @task will probably be dangling once aio_co_wake() returns.)
+ */
+ aio_co_wake(task->co);
}
static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
@@ -278,9 +277,7 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
if (task.ret < 0) {
return task.ret;
@@ -328,9 +325,7 @@ static int coroutine_fn nfs_co_pwritev(BlockDriverState *bs, int64_t offset,
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
if (my_buffer) {
g_free(buf);
@@ -358,9 +353,7 @@ static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
return task.ret;
}
@@ -723,8 +716,8 @@ nfs_get_allocated_file_size_cb(int ret, struct nfs_context *nfs, void *data,
if (task->ret < 0) {
error_report("NFS Error: %s", nfs_get_error(nfs));
}
- replay_bh_schedule_oneshot_event(task->client->aio_context,
- nfs_co_generic_bh_cb, task);
+ /* Safe to call, see nfs_co_generic_cb() */
+ aio_co_wake(task->co);
}
static int64_t coroutine_fn nfs_co_get_allocated_file_size(BlockDriverState *bs)
@@ -748,9 +741,7 @@ static int64_t coroutine_fn nfs_co_get_allocated_file_size(BlockDriverState *bs)
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
return (task.ret < 0 ? task.ret : st.st_blocks * 512);
}
--
2.51.1
* [PATCH v2 05/19] curl: Fix coroutine waking
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
If we wake a coroutine from a different context, we must ensure that it
will yield exactly once (now or later), awaiting that wake.
curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
yielding if the request finishes before the loop gets run. To fix it,
we must drop the loop and yield exactly once, if we need to yield.
Determining that latter part ("if we need to yield") is what makes it a bit
complicated: Requests may be served from a cache internal to the curl
block driver, or fail before being submitted. In these cases, we must
not yield. However, if we find a matching but still ongoing request in
the cache, we will have to await that, i.e. still yield.
To address this, move the yield inside of the respective functions:
- Inside of curl_find_buf() when awaiting ongoing concurrent requests,
- Inside of curl_setup_preadv() when having created a new request.
Rename curl_setup_preadv() to curl_do_preadv() to reflect this.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (acb.ret == -EINPROGRESS)` loop.)
Also, add a comment why aio_co_wake() is safe regardless of whether the
coroutine and curl_multi_check_completion() run in the same context.
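The invariant the patch establishes can be sketched as a standalone model
(illustrative only, not QEMU code; the enum and `yields_for()` are made up,
standing in for the code paths named above):

```c
#include <assert.h>

/*
 * Model: curl_co_preadv() must yield exactly once iff a wake is
 * pending, and not at all otherwise. The outcomes below stand in for
 * the paths described in the commit message.
 */
enum ReadOutcome {
    CACHE_HIT,      /* curl_find_buf(): data already present         */
    CACHE_WAIT,     /* curl_find_buf(): matching ongoing request     */
    SETUP_FAILURE,  /* request failed before submission              */
    SUBMITTED,      /* new request handed to curl                    */
};

static int yields_for(enum ReadOutcome outcome)
{
    switch (outcome) {
    case CACHE_WAIT:
    case SUBMITTED:
        return 1;   /* a wake is pending; yield exactly once */
    default:
        return 0;   /* no wake will come; yielding would hang */
    }
}
```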
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/curl.c | 45 +++++++++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/block/curl.c b/block/curl.c
index d7d93d967f..4e77c93b46 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -258,8 +258,8 @@ read_end:
}
/* Called with s->mutex held. */
-static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
- CURLAIOCB *acb)
+static bool coroutine_fn
+curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len, CURLAIOCB *acb)
{
int i;
uint64_t end = start + len;
@@ -307,6 +307,10 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
for (j=0; j<CURL_NUM_ACB; j++) {
if (!state->acb[j]) {
state->acb[j] = acb;
+ /* Await ongoing request */
+ qemu_mutex_unlock(&s->mutex);
+ qemu_coroutine_yield();
+ qemu_mutex_lock(&s->mutex);
return true;
}
}
@@ -378,6 +382,16 @@ static void curl_multi_check_completion(BDRVCURLState *s)
acb->ret = error ? -EIO : 0;
state->acb[i] = NULL;
qemu_mutex_unlock(&s->mutex);
+ /*
+ * Current AioContext is the BDS context, which may or may not
+ * be the request (coroutine) context.
+ * - If it is, the coroutine must have yielded or the FD handler
+ * (curl_multi_do()/curl_multi_timeout_do()) could not have
+ * been called and we would not be here
+ * - If it is not, it doesn't matter whether it has already
+ * yielded or not; it will be scheduled once it does yield
+ * So aio_co_wake() is safe to call.
+ */
aio_co_wake(acb->co);
qemu_mutex_lock(&s->mutex);
}
@@ -868,7 +882,7 @@ out_noclean:
return -EINVAL;
}
-static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
+static void coroutine_fn curl_do_preadv(BlockDriverState *bs, CURLAIOCB *acb)
{
CURLState *state;
int running;
@@ -880,10 +894,13 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
qemu_mutex_lock(&s->mutex);
- // In case we have the requested data already (e.g. read-ahead),
- // we can just call the callback and be done.
+ /*
+ * In case we have the requested data already (e.g. read-ahead),
+ * we can just call the callback and be done. This may have to
+ * await an ongoing request, in which case it itself will yield.
+ */
if (curl_find_buf(s, start, acb->bytes, acb)) {
- goto out;
+ goto dont_yield;
}
// No cache found, so let's start a new request
@@ -898,7 +915,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
if (curl_init_state(s, state) < 0) {
curl_clean_state(state);
acb->ret = -EIO;
- goto out;
+ goto dont_yield;
}
acb->start = 0;
@@ -913,7 +930,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
if (state->buf_len && state->orig_buf == NULL) {
curl_clean_state(state);
acb->ret = -ENOMEM;
- goto out;
+ goto dont_yield;
}
state->acb[0] = acb;
@@ -925,13 +942,16 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
acb->ret = -EIO;
curl_clean_state(state);
- goto out;
+ goto dont_yield;
}
/* Tell curl it needs to kick things off */
curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
+ qemu_mutex_unlock(&s->mutex);
+ qemu_coroutine_yield();
+ return;
-out:
+dont_yield:
qemu_mutex_unlock(&s->mutex);
}
@@ -947,10 +967,7 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
.bytes = bytes
};
- curl_setup_preadv(bs, &acb);
- while (acb.ret == -EINPROGRESS) {
- qemu_coroutine_yield();
- }
+ curl_do_preadv(bs, &acb);
return acb.ret;
}
--
2.51.1
* [PATCH v2 06/19] gluster: Do not move coroutine into BDS context
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
The request coroutine does not necessarily run in the BDS AioContext. We
should wake it in its own context instead of moving it into the BDS one.
With that, we can remove GlusterAIOCB.aio_context.
Also add a comment why aio_co_schedule() is safe to use in this way.
**Note:** Due to a lack of a gluster set-up, I have not tested this
commit. It seemed safe enough to send anyway, just maybe not to
qemu-stable. To be clear, I don’t know of any user-visible bugs that
would arise from the state without this patch; the request coroutine is
moved into the main BDS AioContext, so guest device completion code will
run in a different context than where the request started, which can’t
be good, but I haven’t actually confirmed any bugs (due to not being
able to test it).
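The API choice described above can be sketched as a trivial standalone model
(illustrative only, not QEMU code; `wake_api_for()` is made up for this
sketch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model: aio_co_wake() needs qemu_get_current_aio_context(), which
 * only works from a QEMU thread. gluster_finish_aiocb() runs in a
 * glusterfs thread, so it must use aio_co_schedule() directly,
 * targeting the coroutine's own AioContext.
 */
enum WakeApi { USE_AIO_CO_WAKE, USE_AIO_CO_SCHEDULE };

static enum WakeApi wake_api_for(bool caller_is_qemu_thread)
{
    return caller_is_qemu_thread ? USE_AIO_CO_WAKE : USE_AIO_CO_SCHEDULE;
}
```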
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/gluster.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/block/gluster.c b/block/gluster.c
index 89abd40f31..4fb25b2c6d 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -56,7 +56,6 @@ typedef struct GlusterAIOCB {
int64_t size;
int ret;
Coroutine *coroutine;
- AioContext *aio_context;
} GlusterAIOCB;
typedef struct BDRVGlusterState {
@@ -743,7 +742,17 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret,
acb->ret = -EIO; /* Partial read/write - fail it */
}
- aio_co_schedule(acb->aio_context, acb->coroutine);
+ /*
+ * Safe to call: The coroutine will yield exactly once awaiting this
+ * scheduling, and the context is its own context, so it will be scheduled
+ * once it does yield.
+ *
+ * (aio_co_wake() would call qemu_get_current_aio_context() to check whether
+ * we are in the same context, but we are not in a qemu thread, so we cannot
+ * do that. Use aio_co_schedule() directly.)
+ */
+ aio_co_schedule(qemu_coroutine_get_aio_context(acb->coroutine),
+ acb->coroutine);
}
static void qemu_gluster_parse_flags(int bdrv_flags, int *open_flags)
@@ -1006,7 +1015,6 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
acb.size = bytes;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_zerofill_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
if (ret < 0) {
@@ -1184,7 +1192,6 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs,
acb.size = size;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
if (write) {
ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
@@ -1251,7 +1258,6 @@ static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
acb.size = 0;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
if (ret < 0) {
@@ -1299,7 +1305,6 @@ static coroutine_fn int qemu_gluster_co_pdiscard(BlockDriverState *bs,
acb.size = 0;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_discard_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
if (ret < 0) {
--
2.51.1
* [PATCH v2 07/19] nvme: Kick and check completions in BDS context
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
nvme_process_completion() must run in the main BDS context, so schedule
a BH in that context for requests that are not running there.
The context in which we kick does not matter, but let’s just keep kick
and process_completion together for simplicity’s sake.
(For what it’s worth, a quick fio bandwidth test indicates that on my
test hardware, if anything, this may be a bit better than kicking
immediately before scheduling a pure nvme_process_completion() BH. But
I wouldn’t take more from those results than that it doesn’t really seem
to matter either way.)
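The run-inline-or-defer rule can be sketched in plain C outside of QEMU (a minimal model for illustration only; `Ctx`, `schedule_oneshot()` and `run_pending()` are invented stand-ins for AioContext, aio_bh_schedule_oneshot() and the target context's event loop):

```c
#include <assert.h>
#include <stddef.h>

typedef void (*Handler)(void *opaque);

typedef struct Ctx {
    Handler pending;        /* one-shot deferred handler, NULL if none */
    void *pending_opaque;
} Ctx;

static Ctx *current_ctx;    /* models qemu_get_current_aio_context() */
static int completions;

static void schedule_oneshot(Ctx *ctx, Handler fn, void *opaque)
{
    /* models aio_bh_schedule_oneshot(): remember the handler for @ctx */
    ctx->pending = fn;
    ctx->pending_opaque = opaque;
}

static void run_pending(Ctx *ctx)
{
    /* models the target context's event loop running the deferred BH */
    Ctx *prev = current_ctx;
    current_ctx = ctx;
    if (ctx->pending) {
        Handler fn = ctx->pending;
        ctx->pending = NULL;
        fn(ctx->pending_opaque);
    }
    current_ctx = prev;
}

static void kick_and_check(void *opaque)
{
    (void)opaque;
    completions++;  /* stands in for nvme_kick() + nvme_process_completion() */
}

static void deferred_fn(Ctx *target, void *opaque)
{
    if (current_ctx == target) {
        kick_and_check(opaque);             /* already there: run inline */
    } else {
        schedule_oneshot(target, kick_and_check, opaque);  /* defer */
    }
}
```

The point of the model is only the branch in deferred_fn(): the work runs in the target context either way, immediately when we are already there, via a one-shot BH otherwise.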
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nvme.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/block/nvme.c b/block/nvme.c
index 8df53ee4ca..7ed5f570bc 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -481,7 +481,7 @@ static void nvme_trace_command(const NvmeCmd *cmd)
}
}
-static void nvme_deferred_fn(void *opaque)
+static void nvme_kick_and_check_completions(void *opaque)
{
NVMeQueuePair *q = opaque;
@@ -490,6 +490,18 @@ static void nvme_deferred_fn(void *opaque)
nvme_process_completion(q);
}
+static void nvme_deferred_fn(void *opaque)
+{
+ NVMeQueuePair *q = opaque;
+
+ if (qemu_get_current_aio_context() == q->s->aio_context) {
+ nvme_kick_and_check_completions(q);
+ } else {
+ aio_bh_schedule_oneshot(q->s->aio_context,
+ nvme_kick_and_check_completions, q);
+ }
+}
+
static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
NvmeCmd *cmd, BlockCompletionFunc cb,
void *opaque)
--
2.51.1
* [PATCH v2 08/19] nvme: Fix coroutine waking
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (6 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 07/19] nvme: Kick and check completions in " Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 09/19] nvme: Note in which AioContext some functions run Hanna Czenczek
` (12 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
nvme wakes the request coroutine via qemu_coroutine_enter() from a BH
scheduled in the BDS AioContext. This may not be the same context as
the one in which the request originally ran, which would be wrong:
- It could mean we enter the coroutine before it yields,
- We would move the coroutine into a different context.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (data.ret == -EINPROGRESS)` loop.)
To fix that, use aio_co_wake() to run the coroutine in its home context.
Just like in the preceding iscsi and nfs patches, we can drop the
trivial nvme_rw_cb_bh() and use aio_co_wake() directly.
With this, we can remove NVMeCoData.ctx.
Note the check of data->co == NULL to bypass the BH/yield combination in
case nvme_rw_cb() is called from nvme_submit_command(): We probably want
to keep this fast path for performance reasons, but we have to be quite
careful about it:
- We cannot overload .ret for this, but have to use a dedicated
.skip_yield field. Otherwise, if nvme_rw_cb() runs in a different
thread than the coroutine, it may see .ret set and skip the yield,
while nvme_rw_cb() will still schedule a BH for waking. Therefore,
the signal to skip the yield can only be set in nvme_rw_cb() if waking
too is skipped, which is independent from communicating the return
value.
- We can only skip the yield if nvme_rw_cb() actually runs in the
request coroutine. Otherwise (specifically if they run in different
AioContexts), the order between this function’s execution and the
coroutine yielding (or not yielding) is not reliable.
- There is no point in yielding in a loop; there are no spurious wakes,
so once we yield, we will only be re-entered once the command is done.
Replace the `while` with an `if`.
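The callback-side rule from the bullets above can be modeled in a few lines of plain C (a hedged sketch, not QEMU code; the struct fields and the `wakes` counter are invented for this model):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool in_request_coroutine;  /* models data->co == qemu_coroutine_self() */
    bool skip_yield;            /* dedicated field, NOT overloaded onto ret */
    int ret;
} CoData;

static int wakes;               /* counts calls modeling aio_co_wake() */

static void rw_cb(CoData *d, int ret)
{
    d->ret = ret;               /* communicating the return value ... */

    if (d->in_request_coroutine) {
        /* ... is independent of the signal to skip the yield, which may
         * only be set when waking is skipped too */
        d->skip_yield = true;
    } else {
        wakes++;                /* models aio_co_wake(d->co) */
    }
}

/* What the request side checks after nvme_submit_command() returns */
static bool must_yield(const CoData *d)
{
    return !d->skip_yield;      /* an `if`, not a `while`: no spurious wakes */
}
```

This makes the invariant visible: skip_yield is set exactly in the case where no wake is issued, so the coroutine either yields and is woken once, or does neither.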
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nvme.c | 56 +++++++++++++++++++++++++++-------------------------
1 file changed, 29 insertions(+), 27 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index 7ed5f570bc..b8262ebfd9 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1176,25 +1176,35 @@ fail:
typedef struct {
Coroutine *co;
+ bool skip_yield;
int ret;
- AioContext *ctx;
} NVMeCoData;
-static void nvme_rw_cb_bh(void *opaque)
-{
- NVMeCoData *data = opaque;
- qemu_coroutine_enter(data->co);
-}
-
static void nvme_rw_cb(void *opaque, int ret)
{
NVMeCoData *data = opaque;
+
data->ret = ret;
- if (!data->co) {
- /* The rw coroutine hasn't yielded, don't try to enter. */
- return;
+
+ if (data->co == qemu_coroutine_self()) {
+ /*
+ * Fast path: We are inside of the request coroutine (through
+ * nvme_submit_command, nvme_deferred_fn, nvme_process_completion).
+ * We can set data->skip_yield here to keep the coroutine from
+ * yielding, and then we don't need to schedule a BH to wake it.
+ */
+ data->skip_yield = true;
+ } else {
+ /*
+ * Safe to call: The case where we run in the request coroutine is
+ * handled above, so we must be independent of it; and without
+ * skip_yield set, the coroutine will yield.
+ * No need to release NVMeQueuePair.lock (we are called without it
+ * held). (Note: If we enter the coroutine here, @data will
+ * probably be dangling once aio_co_wake() returns.)
+ */
+ aio_co_wake(data->co);
}
- replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
}
static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
@@ -1218,7 +1228,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
.cdw12 = cpu_to_le32(cdw12),
};
NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
+ .co = qemu_coroutine_self(),
.ret = -EINPROGRESS,
};
@@ -1235,9 +1245,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
return r;
}
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1333,7 +1341,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
.nsid = cpu_to_le32(s->nsid),
};
NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
+ .co = qemu_coroutine_self(),
.ret = -EINPROGRESS,
};
@@ -1341,9 +1349,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
req = nvme_get_free_req(ioq);
assert(req);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- if (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1384,7 +1390,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
};
NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
+ .co = qemu_coroutine_self(),
.ret = -EINPROGRESS,
};
@@ -1404,9 +1410,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
assert(req);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1434,7 +1438,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
};
NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
+ .co = qemu_coroutine_self(),
.ret = -EINPROGRESS,
};
@@ -1479,9 +1483,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
trace_nvme_dsm(s, offset, bytes);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
--
2.51.1
* [PATCH v2 09/19] nvme: Note in which AioContext some functions run
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (7 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 08/19] nvme: Fix coroutine waking Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 10/19] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
` (11 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Sprinkle comments throughout block/nvme.c noting for some functions
(where it may not be obvious) that they require a certain AioContext, or
in which AioContext they do happen to run (for callbacks, BHs, event
notifiers).
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nvme.c | 43 ++++++++++++++++++++++++++++++++++---------
1 file changed, 34 insertions(+), 9 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index b8262ebfd9..919e14cef9 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -65,6 +65,7 @@ typedef struct {
} NVMeQueue;
typedef struct {
+ /* Called from nvme_process_completion() in the BDS's main AioContext */
BlockCompletionFunc *cb;
void *opaque;
int cid;
@@ -84,6 +85,7 @@ typedef struct {
uint8_t *prp_list_pages;
/* Fields protected by @lock */
+ /* Coroutines in this queue are woken in their own context */
CoQueue free_req_queue;
NVMeQueue sq, cq;
int cq_phase;
@@ -92,7 +94,7 @@ typedef struct {
int need_kick;
int inflight;
- /* Thread-safe, no lock necessary */
+ /* Thread-safe, no lock necessary; runs in the BDS's main context */
QEMUBH *completion_bh;
} NVMeQueuePair;
@@ -206,11 +208,13 @@ static void nvme_free_queue_pair(NVMeQueuePair *q)
g_free(q);
}
+/* Runs in the BDS's main AioContext */
static void nvme_free_req_queue_cb(void *opaque)
{
NVMeQueuePair *q = opaque;
qemu_mutex_lock(&q->lock);
+ /* qemu_co_enter_next() wakes the coroutine in its own AioContext */
while (q->free_req_head != -1 &&
qemu_co_enter_next(&q->free_req_queue, &q->lock)) {
/* Retry waiting requests */
@@ -281,7 +285,7 @@ fail:
return NULL;
}
-/* With q->lock */
+/* With q->lock, must be run in the BDS's main AioContext */
static void nvme_kick(NVMeQueuePair *q)
{
BDRVNVMeState *s = q->s;
@@ -308,7 +312,10 @@ static NVMeRequest *nvme_get_free_req_nofail_locked(NVMeQueuePair *q)
return req;
}
-/* Return a free request element if any, otherwise return NULL. */
+/*
+ * Return a free request element if any, otherwise return NULL.
+ * May be run from any AioContext.
+ */
static NVMeRequest *nvme_get_free_req_nowait(NVMeQueuePair *q)
{
QEMU_LOCK_GUARD(&q->lock);
@@ -321,6 +328,7 @@ static NVMeRequest *nvme_get_free_req_nowait(NVMeQueuePair *q)
/*
* Wait for a free request to become available if necessary, then
* return it.
+ * May be called in any AioContext.
*/
static coroutine_fn NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
{
@@ -328,20 +336,21 @@ static coroutine_fn NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
while (q->free_req_head == -1) {
trace_nvme_free_req_queue_wait(q->s, q->index);
+ /* nvme_free_req_queue_cb() wakes us in our own AioContext */
qemu_co_queue_wait(&q->free_req_queue, &q->lock);
}
return nvme_get_free_req_nofail_locked(q);
}
-/* With q->lock */
+/* With q->lock, may be called in any AioContext */
static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
{
req->free_req_next = q->free_req_head;
q->free_req_head = req - q->reqs;
}
-/* With q->lock */
+/* With q->lock, may be called in any AioContext */
static void nvme_wake_free_req_locked(NVMeQueuePair *q)
{
if (!qemu_co_queue_empty(&q->free_req_queue)) {
@@ -350,7 +359,7 @@ static void nvme_wake_free_req_locked(NVMeQueuePair *q)
}
}
-/* Insert a request in the freelist and wake waiters */
+/* Insert a request in the freelist and wake waiters (from any AioContext) */
static void nvme_put_free_req_and_wake(NVMeQueuePair *q, NVMeRequest *req)
{
qemu_mutex_lock(&q->lock);
@@ -381,7 +390,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
}
}
-/* With q->lock */
+/* With q->lock, must be run in the BDS's main AioContext */
static bool nvme_process_completion(NVMeQueuePair *q)
{
BDRVNVMeState *s = q->s;
@@ -451,6 +460,7 @@ static bool nvme_process_completion(NVMeQueuePair *q)
return progress;
}
+/* As q->completion_bh, runs in the BDS's main AioContext */
static void nvme_process_completion_bh(void *opaque)
{
NVMeQueuePair *q = opaque;
@@ -481,6 +491,7 @@ static void nvme_trace_command(const NvmeCmd *cmd)
}
}
+/* Must be run in the BDS's main AioContext */
static void nvme_kick_and_check_completions(void *opaque)
{
NVMeQueuePair *q = opaque;
@@ -490,6 +501,7 @@ static void nvme_kick_and_check_completions(void *opaque)
nvme_process_completion(q);
}
+/* Runs in nvme_submit_command()'s AioContext */
static void nvme_deferred_fn(void *opaque)
{
NVMeQueuePair *q = opaque;
@@ -502,6 +514,7 @@ static void nvme_deferred_fn(void *opaque)
}
}
+/* May be run in any AioContext */
static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
NvmeCmd *cmd, BlockCompletionFunc cb,
void *opaque)
@@ -523,6 +536,7 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
defer_call(nvme_deferred_fn, q);
}
+/* Put into NVMeRequest.cb, so runs in the BDS's main AioContext */
static void nvme_admin_cmd_sync_cb(void *opaque, int ret)
{
int *pret = opaque;
@@ -530,6 +544,7 @@ static void nvme_admin_cmd_sync_cb(void *opaque, int ret)
aio_wait_kick();
}
+/* Must be run in the BDS's or qemu's main AioContext */
static int nvme_admin_cmd_sync(BlockDriverState *bs, NvmeCmd *cmd)
{
BDRVNVMeState *s = bs->opaque;
@@ -638,6 +653,7 @@ out:
return ret;
}
+/* Must be run in the BDS's main AioContext */
static void nvme_poll_queue(NVMeQueuePair *q)
{
const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
@@ -660,6 +676,7 @@ static void nvme_poll_queue(NVMeQueuePair *q)
qemu_mutex_unlock(&q->lock);
}
+/* Must be run in the BDS's main AioContext */
static void nvme_poll_queues(BDRVNVMeState *s)
{
int i;
@@ -669,6 +686,7 @@ static void nvme_poll_queues(BDRVNVMeState *s)
}
}
+/* Run as an event notifier in the BDS's main AioContext */
static void nvme_handle_event(EventNotifier *n)
{
BDRVNVMeState *s = container_of(n, BDRVNVMeState,
@@ -722,6 +740,7 @@ out_error:
return false;
}
+/* Run as an event notifier in the BDS's main AioContext */
static bool nvme_poll_cb(void *opaque)
{
EventNotifier *e = opaque;
@@ -745,6 +764,7 @@ static bool nvme_poll_cb(void *opaque)
return false;
}
+/* Run as an event notifier in the BDS's main AioContext */
static void nvme_poll_ready(EventNotifier *e)
{
BDRVNVMeState *s = container_of(e, BDRVNVMeState,
@@ -1050,7 +1070,7 @@ static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
return 0;
}
-/* Called with s->dma_map_lock */
+/* Called with s->dma_map_lock, may be run in any AioContext */
static coroutine_fn int nvme_cmd_unmap_qiov(BlockDriverState *bs,
QEMUIOVector *qiov)
{
@@ -1061,13 +1081,17 @@ static coroutine_fn int nvme_cmd_unmap_qiov(BlockDriverState *bs,
if (!s->dma_map_count && !qemu_co_queue_empty(&s->dma_flush_queue)) {
r = qemu_vfio_dma_reset_temporary(s->vfio);
if (!r) {
+ /*
+ * Queue access is protected by the dma_map_lock, and all
+ * coroutines are woken in their own AioContext
+ */
qemu_co_queue_restart_all(&s->dma_flush_queue);
}
}
return r;
}
-/* Called with s->dma_map_lock */
+/* Called with s->dma_map_lock, may be run in any AioContext */
static coroutine_fn int nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
NVMeRequest *req, QEMUIOVector *qiov)
{
@@ -1180,6 +1204,7 @@ typedef struct {
int ret;
} NVMeCoData;
+/* Put into NVMeRequest.cb, so runs in the BDS's main AioContext */
static void nvme_rw_cb(void *opaque, int ret)
{
NVMeCoData *data = opaque;
--
2.51.1
* [PATCH v2 10/19] block/io: Take reqs_lock for tracked_requests
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (8 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 09/19] nvme: Note in which AioContext some functions run Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 11/19] qcow2: Re-initialize lock in invalidate_cache Hanna Czenczek
` (10 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
bdrv_co_get_self_request() does not take a lock around iterating through
bs->tracked_requests. With multiqueue, it may thus iterate over a list
that is in the process of being modified, producing an assertion
failure:
../block/file-posix.c:3702: raw_do_pwrite_zeroes: Assertion `req' failed.
[0] abort() at /lib64/libc.so.6
[1] __assert_fail_base.cold() at /lib64/libc.so.6
[2] raw_do_pwrite_zeroes() at ../block/file-posix.c:3702
[3] bdrv_co_do_pwrite_zeroes() at ../block/io.c:1910
[4] bdrv_aligned_pwritev() at ../block/io.c:2109
[5] bdrv_co_do_zero_pwritev() at ../block/io.c:2192
[6] bdrv_co_pwritev_part() at ../block/io.c:2292
[7] bdrv_co_pwritev() at ../block/io.c:2225
[8] handle_alloc_space() at ../block/qcow2.c:2573
[9] qcow2_co_pwritev_task() at ../block/qcow2.c:2625
Fix this by taking reqs_lock.
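The shape of the fix can be sketched with pthreads in place of QEMU's primitives (a minimal illustration; `Req` and the list head are invented stand-ins for BdrvTrackedRequest and bs->tracked_requests):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct Req {
    struct Req *next;
    void *co;                   /* the coroutine owning this request */
} Req;

static pthread_mutex_t reqs_lock = PTHREAD_MUTEX_INITIALIZER;
static Req *tracked_requests;   /* list head, only modified under reqs_lock */

/* Iterate the list only while holding the same lock writers take */
static Req *get_self_request(void *self)
{
    Req *req, *found = NULL;

    pthread_mutex_lock(&reqs_lock);
    for (req = tracked_requests; req; req = req->next) {
        if (req->co == self) {
            found = req;
            break;
        }
    }
    pthread_mutex_unlock(&reqs_lock);
    return found;
}
```

Without the lock, a concurrent insertion or removal can rewrite the next pointers mid-traversal, which is exactly what produces the missed request and the assertion failure above.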
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/io.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/io.c b/block/io.c
index 928c02d1ad..c4a4301321 100644
--- a/block/io.c
+++ b/block/io.c
@@ -718,11 +718,14 @@ BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs)
Coroutine *self = qemu_coroutine_self();
IO_CODE();
+ qemu_mutex_lock(&bs->reqs_lock);
QLIST_FOREACH(req, &bs->tracked_requests, list) {
if (req->co == self) {
+ qemu_mutex_unlock(&bs->reqs_lock);
return req;
}
}
+ qemu_mutex_unlock(&bs->reqs_lock);
return NULL;
}
--
2.51.1
* [PATCH v2 11/19] qcow2: Re-initialize lock in invalidate_cache
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (9 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 10/19] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 12/19] qcow2: Fix cache_clean_timer Hanna Czenczek
` (9 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
After clearing our state (memset()-ing it to 0), we should
re-initialize objects that need it. Specifically, that applies to
s->lock, which is originally initialized in qcow2_open().
Given qemu_co_mutex_init() is just a memset() to 0, this is functionally
a no-op, but still seems like the right thing to do.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/qcow2.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/qcow2.c b/block/qcow2.c
index 4aa9f9e068..d6e38926c8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2873,6 +2873,8 @@ qcow2_co_invalidate_cache(BlockDriverState *bs, Error **errp)
data_file = s->data_file;
memset(s, 0, sizeof(BDRVQcow2State));
s->data_file = data_file;
+ /* Re-initialize objects initialized in qcow2_open() */
+ qemu_co_mutex_init(&s->lock);
options = qdict_clone_shallow(bs->options);
--
2.51.1
* [PATCH v2 12/19] qcow2: Fix cache_clean_timer
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (10 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 11/19] qcow2: Re-initialize lock in invalidate_cache Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-17 14:50 ` Kevin Wolf
2025-11-10 15:48 ` [PATCH v2 13/19] qcow2: Schedule cache-clean-timer in realtime Hanna Czenczek
` (8 subsequent siblings)
20 siblings, 1 reply; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
The cache-cleaner runs as a timer CB in the BDS AioContext. With
multiqueue, it can run concurrently with I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image. While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:
while (i < c->size && can_clean_entry(c, i)) {
for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
usleep(100);
}
c->entries[i].offset = 0;
c->entries[i].lru_counter = 0;
i++;
to_clean++;
}
i.e. making it wait on purpose for the point in time where the cache is
in use by something else.
The solution chosen in this patch is not the best one, I hope, but I
admittedly can’t come up with anything strictly better.
We can protect from concurrent cache accesses either by taking the
existing s->lock, or we introduce a new (non-coroutine) mutex
specifically for cache accesses. I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.
Using s->lock, which is a coroutine mutex, however means that we need to
take it in a coroutine, so the timer must run in a coroutine. We can
transform it from the current timer CB style into a coroutine that
sleeps for the set interval. As a result, however, we can no longer
just deschedule the timer to instantly guarantee it won’t run anymore,
but have to await the coroutine’s exit.
(Note that even before this patch, this guarantee did not always hold:
anything calling cache_clean_timer_del() from the QEMU main AioContext
could have been running concurrently with an existing timer CB
invocation.)
Polling to wait for the timer to actually settle seems very complicated
for something that’s rather a minor problem, but I can’t come up with
any better solution that doesn’t again just overlook potential problems.
(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/qcow2.h | 5 +-
block/qcow2.c | 143 ++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 118 insertions(+), 30 deletions(-)
diff --git a/block/qcow2.h b/block/qcow2.h
index a9e3481c6e..3e38bccd87 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -345,8 +345,11 @@ typedef struct BDRVQcow2State {
Qcow2Cache *l2_table_cache;
Qcow2Cache *refcount_block_cache;
- QEMUTimer *cache_clean_timer;
+ /* Non-NULL while the timer is running */
+ Coroutine *cache_clean_timer_co;
unsigned cache_clean_interval;
+ QemuCoSleep cache_clean_timer_wake;
+ CoQueue cache_clean_timer_exit;
QLIST_HEAD(, QCowL2Meta) cluster_allocs;
diff --git a/block/qcow2.c b/block/qcow2.c
index d6e38926c8..ecff3bed0e 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -835,41 +835,113 @@ static const char *overlap_bool_option_names[QCOW2_OL_MAX_BITNR] = {
[QCOW2_OL_BITMAP_DIRECTORY_BITNR] = QCOW2_OPT_OVERLAP_BITMAP_DIRECTORY,
};
-static void cache_clean_timer_cb(void *opaque)
+static void coroutine_fn cache_clean_timer(void *opaque)
{
- BlockDriverState *bs = opaque;
- BDRVQcow2State *s = bs->opaque;
- qcow2_cache_clean_unused(s->l2_table_cache);
- qcow2_cache_clean_unused(s->refcount_block_cache);
- timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
- (int64_t) s->cache_clean_interval * 1000);
+ BDRVQcow2State *s = opaque;
+ uint64_t wait_ns;
+
+ WITH_QEMU_LOCK_GUARD(&s->lock) {
+ wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
+ }
+
+ while (wait_ns > 0) {
+ qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
+ QEMU_CLOCK_VIRTUAL, wait_ns);
+
+ WITH_QEMU_LOCK_GUARD(&s->lock) {
+ if (s->cache_clean_interval > 0) {
+ qcow2_cache_clean_unused(s->l2_table_cache);
+ qcow2_cache_clean_unused(s->refcount_block_cache);
+ }
+
+ wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
+ }
+ }
+
+ WITH_QEMU_LOCK_GUARD(&s->lock) {
+ s->cache_clean_timer_co = NULL;
+ qemu_co_queue_restart_all(&s->cache_clean_timer_exit);
+ }
}
static void cache_clean_timer_init(BlockDriverState *bs, AioContext *context)
{
BDRVQcow2State *s = bs->opaque;
if (s->cache_clean_interval > 0) {
- s->cache_clean_timer =
- aio_timer_new_with_attrs(context, QEMU_CLOCK_VIRTUAL,
- SCALE_MS, QEMU_TIMER_ATTR_EXTERNAL,
- cache_clean_timer_cb, bs);
- timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
- (int64_t) s->cache_clean_interval * 1000);
+ assert(!s->cache_clean_timer_co);
+ s->cache_clean_timer_co = qemu_coroutine_create(cache_clean_timer, s);
+ aio_co_enter(context, s->cache_clean_timer_co);
+ }
+}
+
+/**
+ * Delete the cache clean timer and await any yet running instance.
+ * Called holding s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_co_locked_del_and_wait(BlockDriverState *bs)
+{
+ BDRVQcow2State *s = bs->opaque;
+
+ if (s->cache_clean_timer_co) {
+ s->cache_clean_interval = 0;
+ qemu_co_sleep_wake(&s->cache_clean_timer_wake);
+ qemu_co_queue_wait(&s->cache_clean_timer_exit, &s->lock);
}
}
-static void cache_clean_timer_del(BlockDriverState *bs)
+/**
+ * Same as cache_clean_timer_co_locked_del_and_wait(), but takes s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_co_del_and_wait(BlockDriverState *bs)
{
BDRVQcow2State *s = bs->opaque;
- if (s->cache_clean_timer) {
- timer_free(s->cache_clean_timer);
- s->cache_clean_timer = NULL;
+
+ WITH_QEMU_LOCK_GUARD(&s->lock) {
+ cache_clean_timer_co_locked_del_and_wait(bs);
+ }
+}
+
+struct CacheCleanTimerDelAndWaitCoParams {
+ BlockDriverState *bs;
+ bool done;
+};
+
+static void coroutine_fn cache_clean_timer_del_and_wait_co_entry(void *opaque)
+{
+ struct CacheCleanTimerDelAndWaitCoParams *p = opaque;
+
+ cache_clean_timer_co_del_and_wait(p->bs);
+ p->done = true;
+ aio_wait_kick();
+}
+
+/**
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext without s->lock held.
+ */
+static void coroutine_mixed_fn
+cache_clean_timer_del_and_wait(BlockDriverState *bs)
+{
+ IO_OR_GS_CODE();
+
+ if (qemu_in_coroutine()) {
+ cache_clean_timer_co_del_and_wait(bs);
+ } else {
+ struct CacheCleanTimerDelAndWaitCoParams p = { .bs = bs };
+ Coroutine *co;
+
+ co = qemu_coroutine_create(cache_clean_timer_del_and_wait_co_entry, &p);
+ qemu_coroutine_enter(co);
+
+ BDRV_POLL_WHILE(bs, !p.done);
}
}
static void qcow2_detach_aio_context(BlockDriverState *bs)
{
- cache_clean_timer_del(bs);
+ cache_clean_timer_del_and_wait(bs);
}
static void qcow2_attach_aio_context(BlockDriverState *bs,
@@ -1214,12 +1286,24 @@ fail:
return ret;
}
+/* s_locked specifies whether s->lock is held or not */
static void qcow2_update_options_commit(BlockDriverState *bs,
- Qcow2ReopenState *r)
+ Qcow2ReopenState *r,
+ bool s_locked)
{
BDRVQcow2State *s = bs->opaque;
int i;
+ /*
+ * We need to stop the cache-clean-timer before destroying the metadata
+ * table caches
+ */
+ if (s_locked) {
+ cache_clean_timer_co_locked_del_and_wait(bs);
+ } else {
+ cache_clean_timer_del_and_wait(bs);
+ }
+
if (s->l2_table_cache) {
qcow2_cache_destroy(s->l2_table_cache);
}
@@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
}
s->l2_table_cache = r->l2_table_cache;
s->refcount_block_cache = r->refcount_block_cache;
+
+ s->cache_clean_interval = r->cache_clean_interval;
+ cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
+
s->l2_slice_size = r->l2_slice_size;
s->overlap_check = r->overlap_check;
@@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
s->discard_no_unref = r->discard_no_unref;
- if (s->cache_clean_interval != r->cache_clean_interval) {
- cache_clean_timer_del(bs);
- s->cache_clean_interval = r->cache_clean_interval;
- cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
- }
-
qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
s->crypto_opts = r->crypto_opts;
}
@@ -1261,6 +1343,7 @@ static void qcow2_update_options_abort(BlockDriverState *bs,
qapi_free_QCryptoBlockOpenOptions(r->crypto_opts);
}
+/* Called with s->lock held */
static int coroutine_fn GRAPH_RDLOCK
qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
Error **errp)
@@ -1270,7 +1353,7 @@ qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
ret = qcow2_update_options_prepare(bs, &r, options, flags, errp);
if (ret >= 0) {
- qcow2_update_options_commit(bs, &r);
+ qcow2_update_options_commit(bs, &r, true);
} else {
qcow2_update_options_abort(bs, &r);
}
@@ -1908,7 +1991,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
qemu_vfree(s->l1_table);
/* else pre-write overlap checks in cache_destroy may crash */
s->l1_table = NULL;
- cache_clean_timer_del(bs);
+ cache_clean_timer_co_locked_del_and_wait(bs);
if (s->l2_table_cache) {
qcow2_cache_destroy(s->l2_table_cache);
}
@@ -1963,6 +2046,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
/* Initialise locks */
qemu_co_mutex_init(&s->lock);
+ qemu_co_queue_init(&s->cache_clean_timer_exit);
assert(!qemu_in_coroutine());
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
@@ -2048,7 +2132,7 @@ static void qcow2_reopen_commit(BDRVReopenState *state)
GRAPH_RDLOCK_GUARD_MAINLOOP();
- qcow2_update_options_commit(state->bs, state->opaque);
+ qcow2_update_options_commit(state->bs, state->opaque, false);
if (!s->data_file) {
/*
* If we don't have an external data file, s->data_file was cleared by
@@ -2805,7 +2889,7 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
qcow2_inactivate(bs);
}
- cache_clean_timer_del(bs);
+ cache_clean_timer_del_and_wait(bs);
qcow2_cache_destroy(s->l2_table_cache);
qcow2_cache_destroy(s->refcount_block_cache);
@@ -2875,6 +2959,7 @@ qcow2_co_invalidate_cache(BlockDriverState *bs, Error **errp)
s->data_file = data_file;
/* Re-initialize objects initialized in qcow2_open() */
qemu_co_mutex_init(&s->lock);
+ qemu_co_queue_init(&s->cache_clean_timer_exit);
options = qdict_clone_shallow(bs->options);
--
2.51.1
* [PATCH v2 13/19] qcow2: Schedule cache-clean-timer in realtime
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (11 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 12/19] qcow2: Fix cache_clean_timer Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 14/19] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
` (7 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
There is no reason why the cache-cleaning timer should run in virtual
time; run it in realtime instead.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/qcow2.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/qcow2.c b/block/qcow2.c
index ecff3bed0e..d13cb9b42a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -846,7 +846,7 @@ static void coroutine_fn cache_clean_timer(void *opaque)
while (wait_ns > 0) {
qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
- QEMU_CLOCK_VIRTUAL, wait_ns);
+ QEMU_CLOCK_REALTIME, wait_ns);
WITH_QEMU_LOCK_GUARD(&s->lock) {
if (s->cache_clean_interval > 0) {
--
2.51.1
* [PATCH v2 14/19] ssh: Run restart_coroutine in current AioContext
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (12 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 13/19] qcow2: Schedule cache-clean-timer in realtime Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 15/19] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
` (6 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
restart_coroutine() is attached as an FD handler just to wake the
current coroutine after yielding. It makes most sense to attach it to
the current (request) AioContext instead of the BDS main context. This
way, the coroutine can be entered directly from the BH instead of having
yet another indirection through AioContext.co_schedule_bh.
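The same idea can be sketched outside QEMU with an asyncio analogy (all names below are illustrative, not QEMU APIs): register the FD handler on whatever event loop is current, rather than on some per-device "home" loop, so the waiter is resumed directly where it runs, with no cross-loop scheduling step:

```python
# Hedged asyncio analogy, not QEMU code: installing the read handler on the
# *current* loop mirrors restart_coroutine now being attached via
# qemu_get_current_aio_context() instead of bdrv_get_aio_context(bs).
import asyncio
import socket

async def co_yield_until_readable(sock):
    loop = asyncio.get_running_loop()  # the current context, not a "main" one

    fut = loop.create_future()

    def on_readable():
        loop.remove_reader(sock)       # one-shot, like restart_coroutine
        fut.set_result(None)           # wake the waiter directly

    loop.add_reader(sock, on_readable)
    await fut                          # the yield point

async def main():
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)

    waiter = asyncio.ensure_future(co_yield_until_readable(a))
    await asyncio.sleep(0)             # let the waiter install its handler
    b.send(b"x")                       # make the fd readable
    await waiter

    data = a.recv(1)
    a.close()
    b.close()
    return data

data = asyncio.run(main())
print(data)
```

(Requires a selector-based event loop, i.e. Unix; `loop.add_reader()` is unavailable on the Windows proactor loop.)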
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/ssh.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/block/ssh.c b/block/ssh.c
index 70fe7cf86e..bdec94e9e9 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -1010,19 +1010,18 @@ static int ssh_has_zero_init(BlockDriverState *bs)
}
typedef struct BDRVSSHRestart {
- BlockDriverState *bs;
+ BDRVSSHState *s;
Coroutine *co;
} BDRVSSHRestart;
static void restart_coroutine(void *opaque)
{
BDRVSSHRestart *restart = opaque;
- BlockDriverState *bs = restart->bs;
- BDRVSSHState *s = bs->opaque;
- AioContext *ctx = bdrv_get_aio_context(bs);
+ BDRVSSHState *s = restart->s;
trace_ssh_restart_coroutine(restart->co);
- aio_set_fd_handler(ctx, s->sock, NULL, NULL, NULL, NULL, NULL);
+ aio_set_fd_handler(qemu_get_current_aio_context(), s->sock,
+ NULL, NULL, NULL, NULL, NULL);
aio_co_wake(restart->co);
}
@@ -1031,12 +1030,13 @@ static void restart_coroutine(void *opaque)
* handlers are set up so that we'll be rescheduled when there is an
* interesting event on the socket.
*/
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
+static coroutine_fn void co_yield(BDRVSSHState *s)
{
int r;
IOHandler *rd_handler = NULL, *wr_handler = NULL;
+ AioContext *ctx = qemu_get_current_aio_context();
BDRVSSHRestart restart = {
- .bs = bs,
+ .s = s,
.co = qemu_coroutine_self()
};
@@ -1051,7 +1051,7 @@ static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
trace_ssh_co_yield(s->sock, rd_handler, wr_handler);
- aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
+ aio_set_fd_handler(ctx, s->sock,
rd_handler, wr_handler, NULL, NULL, &restart);
qemu_coroutine_yield();
trace_ssh_co_yield_back(s->sock);
@@ -1093,7 +1093,7 @@ static coroutine_fn int ssh_read(BDRVSSHState *s, BlockDriverState *bs,
trace_ssh_read_return(r, sftp_get_error(s->sftp));
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r == SSH_EOF || (r == 0 && sftp_get_error(s->sftp) == SSH_FX_EOF)) {
@@ -1168,7 +1168,7 @@ static coroutine_fn int ssh_write(BDRVSSHState *s, BlockDriverState *bs,
trace_ssh_write_return(r, sftp_get_error(s->sftp));
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r < 0) {
@@ -1233,7 +1233,7 @@ static coroutine_fn int ssh_flush(BDRVSSHState *s, BlockDriverState *bs)
again:
r = sftp_fsync(s->sftp_handle);
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r < 0) {
--
2.51.1
* [PATCH v2 15/19] blkreplay: Run BH in coroutine’s AioContext
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (13 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 14/19] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 16/19] block: Note in which AioContext AIO CBs are called Hanna Czenczek
` (5 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
While it does not matter in which AioContext aio_co_wake() is called to
continue a coroutine that yields exactly once (so this commit is not
strictly necessary), there is also no reason why the BH should run in
any context but the request’s AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/blkreplay.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blkreplay.c b/block/blkreplay.c
index 16d8b12dd9..8a6845425e 100644
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -63,9 +63,10 @@ static void block_request_create(uint64_t reqid, BlockDriverState *bs,
Coroutine *co)
{
Request *req = g_new(Request, 1);
+ AioContext *ctx = qemu_coroutine_get_aio_context(co);
*req = (Request) {
.co = co,
- .bh = aio_bh_new(bdrv_get_aio_context(bs), blkreplay_bh_cb, req),
+ .bh = aio_bh_new(ctx, blkreplay_bh_cb, req),
};
replay_block_event(req->bh, reqid);
}
--
2.51.1
* [PATCH v2 16/19] block: Note in which AioContext AIO CBs are called
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (14 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 15/19] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 17/19] iscsi: Create AIO BH in original AioContext Hanna Czenczek
` (4 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
This doesn’t seem to be specified anywhere, but it is something we
probably want to make explicit. I believe it is reasonable to implicitly
assume that callbacks are run in the current thread (unless explicitly
noted otherwise), so codify that assumption.
Some implementations don’t actually fulfill this contract yet. The next
patches should rectify that.
Note: I don’t know of any user-visible bugs produced by not running AIO
callbacks in the original context. AIO functionality is generally
mapped to coroutines through the use of bdrv_co_io_em_complete(), which
can run in any AioContext, and will always wake the yielding coroutine
in its original context. The only benefit here is that running
bdrv_co_io_em_complete() in the original context will most likely make
that aio_co_wake() a simpler qemu_coroutine_enter() instead of
scheduling the wakeup through AioContext.co_schedule_bh.
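The documented contract can be illustrated outside QEMU with a small asyncio analogy (all names below are hypothetical, not QEMU APIs): the submitter's event loop stands in for the current AioContext, and a worker thread posts the completion callback back to it with call_soon_threadsafe(), so the callback always runs where the request was submitted:

```python
# Hedged analogy, not QEMU code: the submitting event loop plays the role
# of the "current AioContext"; call_soon_threadsafe() plays the role of
# scheduling a BH into the submitter's context.
import asyncio
import threading

def aio_submit(cb, opaque):
    # Capture the submitter's loop at call time (the "current AioContext").
    loop = asyncio.get_running_loop()

    def worker():
        ret = 0  # pretend the I/O completed successfully
        # Post the completion CB back to the submitter's loop, even though
        # the request finished on a different thread.
        loop.call_soon_threadsafe(cb, opaque, ret)

    threading.Thread(target=worker).start()

async def main():
    done = asyncio.Event()
    state = {}

    def completion_cb(opaque, ret):
        # Runs in the submitting loop's thread, matching the contract.
        opaque["ret"] = ret
        opaque["thread"] = threading.current_thread()
        done.set()

    aio_submit(completion_cb, state)
    await done.wait()
    assert state["thread"] is threading.main_thread()
    return state["ret"]

ret = asyncio.run(main())
print(ret)
```

The key point mirrored here is that the context is captured at submission time, not looked up from some shared "main" object at completion time.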
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/block/block_int-common.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index f2a4e863fc..cb0143ea77 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -508,7 +508,12 @@ struct BlockDriver {
BlockDriverState *bs, BlockdevAmendOptions *opts, bool force,
Error **errp);
- /* aio */
+ /*
+ * AIO
+ * The given completion callback will be run in the same AioContext as the
+ * one in which the AIO function was called.
+ */
+
BlockAIOCB * GRAPH_RDLOCK_PTR (*bdrv_aio_preadv)(BlockDriverState *bs,
int64_t offset, int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
--
2.51.1
* [PATCH v2 17/19] iscsi: Create AIO BH in original AioContext
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (15 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 16/19] block: Note in which AioContext AIO CBs are called Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 18/19] null-aio: Run CB " Hanna Czenczek
` (3 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the original request’s AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/iscsi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 852ecccf0d..7d6bf185ea 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -119,6 +119,7 @@ typedef struct IscsiTask {
typedef struct IscsiAIOCB {
BlockAIOCB common;
+ AioContext *ctx;
QEMUBH *bh;
IscsiLun *iscsilun;
struct scsi_task *task;
@@ -173,7 +174,7 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
if (acb->bh) {
return;
}
- acb->bh = aio_bh_new(acb->iscsilun->aio_context, iscsi_bh_cb, acb);
+ acb->bh = aio_bh_new(acb->ctx, iscsi_bh_cb, acb);
qemu_bh_schedule(acb->bh);
}
@@ -1012,8 +1013,7 @@ static void iscsi_ioctl_handle_emulated(IscsiAIOCB *acb, int req, void *buf)
ret = -EINVAL;
}
assert(!acb->bh);
- acb->bh = aio_bh_new(bdrv_get_aio_context(bs),
- iscsi_ioctl_bh_completion, acb);
+ acb->bh = aio_bh_new(acb->ctx, iscsi_ioctl_bh_completion, acb);
acb->ret = ret;
qemu_bh_schedule(acb->bh);
}
@@ -1030,6 +1030,7 @@ static BlockAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
acb = qemu_aio_get(&iscsi_aiocb_info, bs, cb, opaque);
acb->iscsilun = iscsilun;
+ acb->ctx = qemu_get_current_aio_context();
acb->bh = NULL;
acb->status = -EINPROGRESS;
acb->ioh = buf;
--
2.51.1
* [PATCH v2 18/19] null-aio: Run CB in original AioContext
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (16 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 17/19] iscsi: Create AIO BH in original AioContext Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 19/19] win32-aio: Run CB in original context Hanna Czenczek
` (2 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the originally calling AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/null.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/block/null.c b/block/null.c
index 4e448d593d..253d20ccbb 100644
--- a/block/null.c
+++ b/block/null.c
@@ -173,18 +173,17 @@ static inline BlockAIOCB *null_aio_common(BlockDriverState *bs,
{
NullAIOCB *acb;
BDRVNullState *s = bs->opaque;
+ AioContext *ctx = qemu_get_current_aio_context();
acb = qemu_aio_get(&null_aiocb_info, bs, cb, opaque);
/* Only emulate latency after vcpu is running. */
if (s->latency_ns) {
- aio_timer_init(bdrv_get_aio_context(bs), &acb->timer,
- QEMU_CLOCK_REALTIME, SCALE_NS,
+ aio_timer_init(ctx, &acb->timer, QEMU_CLOCK_REALTIME, SCALE_NS,
null_timer_cb, acb);
timer_mod_ns(&acb->timer,
qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + s->latency_ns);
} else {
- replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
- null_bh_cb, acb);
+ replay_bh_schedule_oneshot_event(ctx, null_bh_cb, acb);
}
return &acb->common;
}
--
2.51.1
* [PATCH v2 19/19] win32-aio: Run CB in original context
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (17 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 18/19] null-aio: Run CB " Hanna Czenczek
@ 2025-11-10 15:48 ` Hanna Czenczek
2025-11-17 15:11 ` [PATCH v2 00/19] block: Some multi-threading fixes Kevin Wolf
2025-11-17 19:11 ` Stefan Hajnoczi
20 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-10 15:48 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the originally calling AioContext,
regardless of the BDS’s “main” AioContext.
Note: I tried to test this (under wine), but failed. Whenever I tried
to use multiqueue or even just an I/O thread for a virtio-blk (or
virtio-scsi) device, I/O stalled, both with and without this patch.
For what it’s worth, when not using an I/O thread, I/O continued to work
with this patch.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/win32-aio.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)
diff --git a/block/win32-aio.c b/block/win32-aio.c
index 6327861e1d..f0689f3ee9 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -48,48 +48,62 @@ struct QEMUWin32AIOState {
typedef struct QEMUWin32AIOCB {
BlockAIOCB common;
struct QEMUWin32AIOState *ctx;
+ AioContext *req_ctx;
int nbytes;
OVERLAPPED ov;
QEMUIOVector *qiov;
void *buf;
bool is_read;
bool is_linear;
+ int ret;
} QEMUWin32AIOCB;
+static void win32_aio_completion_cb_bh(void *opaque)
+{
+ QEMUWin32AIOCB *waiocb = opaque;
+
+ waiocb->common.cb(waiocb->common.opaque, waiocb->ret);
+ aio_context_unref(waiocb->req_ctx);
+ qemu_aio_unref(waiocb);
+}
+
/*
* Completes an AIO request (calls the callback and frees the ACB).
*/
static void win32_aio_process_completion(QEMUWin32AIOState *s,
QEMUWin32AIOCB *waiocb, DWORD count)
{
- int ret;
s->count--;
if (waiocb->ov.Internal != 0) {
- ret = -EIO;
+ waiocb->ret = -EIO;
} else {
- ret = 0;
+ waiocb->ret = 0;
if (count < waiocb->nbytes) {
/* Short reads mean EOF, pad with zeros. */
if (waiocb->is_read) {
qemu_iovec_memset(waiocb->qiov, count, 0,
waiocb->qiov->size - count);
} else {
- ret = -EINVAL;
+ waiocb->ret = -EINVAL;
}
}
}
if (!waiocb->is_linear) {
- if (ret == 0 && waiocb->is_read) {
+ if (waiocb->ret == 0 && waiocb->is_read) {
QEMUIOVector *qiov = waiocb->qiov;
iov_from_buf(qiov->iov, qiov->niov, 0, waiocb->buf, qiov->size);
}
qemu_vfree(waiocb->buf);
}
- waiocb->common.cb(waiocb->common.opaque, ret);
- qemu_aio_unref(waiocb);
+ if (waiocb->req_ctx == s->aio_ctx) {
+ win32_aio_completion_cb_bh(waiocb);
+ } else {
+ aio_bh_schedule_oneshot(waiocb->req_ctx, win32_aio_completion_cb_bh,
+ waiocb);
+ }
}
static void win32_aio_completion_cb(EventNotifier *e)
@@ -120,10 +134,13 @@ BlockAIOCB *win32_aio_submit(BlockDriverState *bs,
DWORD rc;
waiocb = qemu_aio_get(&win32_aiocb_info, bs, cb, opaque);
+ waiocb->req_ctx = qemu_get_current_aio_context();
waiocb->nbytes = bytes;
waiocb->qiov = qiov;
waiocb->is_read = (type == QEMU_AIO_READ);
+ aio_context_ref(waiocb->req_ctx);
+
if (qiov->niov > 1) {
waiocb->buf = qemu_try_blockalign(bs, qiov->size);
if (waiocb->buf == NULL) {
--
2.51.1
* Re: [PATCH v2 12/19] qcow2: Fix cache_clean_timer
2025-11-10 15:48 ` [PATCH v2 12/19] qcow2: Fix cache_clean_timer Hanna Czenczek
@ 2025-11-17 14:50 ` Kevin Wolf
2025-11-18 11:01 ` Hanna Czenczek
0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2025-11-17 14:50 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Am 10.11.2025 um 16:48 hat Hanna Czenczek geschrieben:
> The cache-cleaner runs as a timer CB in the BDS AioContext. With
> multiqueue, it can run concurrently to I/O requests, and because it does
> not take any lock, this can break concurrent cache accesses, corrupting
> the image. While the chances of this happening are low, it can be
> reproduced e.g. by modifying the code to schedule the timer CB every
> 5 ms (instead of at most once per second) and modifying the last (inner)
> while loop of qcow2_cache_clean_unused() like so:
>
> while (i < c->size && can_clean_entry(c, i)) {
> for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> usleep(100);
> }
> c->entries[i].offset = 0;
> c->entries[i].lru_counter = 0;
> i++;
> to_clean++;
> }
>
> i.e. making it wait on purpose for the point in time where the cache is
> in use by something else.
>
> The solution chosen for this in this patch is not the best solution, I
> hope, but I admittedly can’t come up with anything strictly better.
>
> We can protect from concurrent cache accesses either by taking the
> existing s->lock, or we introduce a new (non-coroutine) mutex
> specifically for cache accesses. I would prefer to avoid the latter so
> as not to introduce additional (very slight) overhead.
>
> Using s->lock, which is a coroutine mutex, however means that we need to
> take it in a coroutine, so the timer must run in a coroutine. We can
> transform it from the current timer CB style into a coroutine that
> sleeps for the set interval. As a result, however, we can no longer
> just deschedule the timer to instantly guarantee it won’t run anymore,
> but have to await the coroutine’s exit.
>
> (Note even before this patch there were places that may not have been so
> guaranteed after all: Anything calling cache_clean_timer_del() from the
> QEMU main AioContext could have been running concurrently to an existing
> timer CB invocation.)
>
> Polling to await the timer to actually settle seems very complicated for
> something that’s rather a minor problem, but I can’t come up with any
> better solution that doesn’t again just overlook potential problems.
>
> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> I’m not too fond of this solution.)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> -static void cache_clean_timer_cb(void *opaque)
> +static void coroutine_fn cache_clean_timer(void *opaque)
> {
> - BlockDriverState *bs = opaque;
> - BDRVQcow2State *s = bs->opaque;
> - qcow2_cache_clean_unused(s->l2_table_cache);
> - qcow2_cache_clean_unused(s->refcount_block_cache);
> - timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
> - (int64_t) s->cache_clean_interval * 1000);
> + BDRVQcow2State *s = opaque;
> + uint64_t wait_ns;
> +
> + WITH_QEMU_LOCK_GUARD(&s->lock) {
> + wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
> + }
> +
> + while (wait_ns > 0) {
> + qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
> + QEMU_CLOCK_VIRTUAL, wait_ns);
> +
> + WITH_QEMU_LOCK_GUARD(&s->lock) {
> + if (s->cache_clean_interval > 0) {
> + qcow2_cache_clean_unused(s->l2_table_cache);
> + qcow2_cache_clean_unused(s->refcount_block_cache);
> + }
> +
> + wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
> + }
> + }
> +
> + WITH_QEMU_LOCK_GUARD(&s->lock) {
> + s->cache_clean_timer_co = NULL;
> + qemu_co_queue_restart_all(&s->cache_clean_timer_exit);
> + }
> }
> +/**
> + * Delete the cache clean timer and await any yet running instance.
> + * Called holding s->lock.
> + */
> +static void coroutine_fn
> +cache_clean_timer_co_locked_del_and_wait(BlockDriverState *bs)
> +{
> + BDRVQcow2State *s = bs->opaque;
> +
> + if (s->cache_clean_timer_co) {
> + s->cache_clean_interval = 0;
> + qemu_co_sleep_wake(&s->cache_clean_timer_wake);
> + qemu_co_queue_wait(&s->cache_clean_timer_exit, &s->lock);
> }
> }
I don't want to count the number of context switches this dance between
cache_clean_timer_co_locked_del_and_wait() and cache_clean_timer()
takes! Good that it's not a hot path. :-)
> +/* s_locked specifies whether s->lock is held or not */
> static void qcow2_update_options_commit(BlockDriverState *bs,
> - Qcow2ReopenState *r)
> + Qcow2ReopenState *r,
> + bool s_locked)
> {
> BDRVQcow2State *s = bs->opaque;
> int i;
>
> + /*
> + * We need to stop the cache-clean-timer before destroying the metadata
> + * table caches
> + */
> + if (s_locked) {
> + cache_clean_timer_co_locked_del_and_wait(bs);
> + } else {
> + cache_clean_timer_del_and_wait(bs);
> + }
> +
> if (s->l2_table_cache) {
> qcow2_cache_destroy(s->l2_table_cache);
> }
> @@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> }
> s->l2_table_cache = r->l2_table_cache;
> s->refcount_block_cache = r->refcount_block_cache;
> +
> + s->cache_clean_interval = r->cache_clean_interval;
> + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> +
> s->l2_slice_size = r->l2_slice_size;
>
> s->overlap_check = r->overlap_check;
> @@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>
> s->discard_no_unref = r->discard_no_unref;
>
> - if (s->cache_clean_interval != r->cache_clean_interval) {
> - cache_clean_timer_del(bs);
> - s->cache_clean_interval = r->cache_clean_interval;
> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> - }
> -
> qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
> s->crypto_opts = r->crypto_opts;
> }
Is there any specific reason why you move cache_clean_timer_init()
earlier? I don't see an actual problem with the code as it is after this
change, but s->l2_slice_size is related to the cache in a way, so it
would feel safer if the cache cleaner were only started once all the
settings are updated and not only those that it actually happens to
access at the moment.
Kevin
* Re: [PATCH v2 00/19] block: Some multi-threading fixes
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (18 preceding siblings ...)
2025-11-10 15:48 ` [PATCH v2 19/19] win32-aio: Run CB in original context Hanna Czenczek
@ 2025-11-17 15:11 ` Kevin Wolf
2025-11-17 19:11 ` Stefan Hajnoczi
20 siblings, 0 replies; 26+ messages in thread
From: Kevin Wolf @ 2025-11-17 15:11 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Am 10.11.2025 um 16:48 hat Hanna Czenczek geschrieben:
> Hi,
>
> See the v1 cover letter for a general overview:
>
> https://lists.nongnu.org/archive/html/qemu-block/2025-10/msg00501.html
>
> Changes in v2:
> - Kept `.ret = EINPROGRESS`-style initializations where they already had
> been (curl, nvme)
> - Dropped trivial BH waking code (i.e. which can be directly replaced by
> aio_co_wake()) in iscsi, nfs, nvme
> - curl: Yield in curl_do_preadv() (former curl_setup_preadv()) and
> curl_find_buf() instead of returning whether curl_co_preadv() has to
> yield or not
> - nvme: Added a patch that annotates some functions (primarily BHs and
> CBs) with which AioContext they (must) run in
> - qcow2 cache-cleaning timer: Run the timer as a coroutine instead of in
> a timer CB; use a CoQueue to await it exiting instead of polling
> (well, we still need to poll in case we don’t run in a coroutine, but
> that’s standard procedure, I believe)
> - The need to initialize the CoQueue showed that there is a code path
> in qcow2 that doesn’t initialize its CoMutex. Added a patch to do
> that.
> - Also added a patch to have the timer use realtime instead of virtual
> time.
I'll still wait for your answer on patch 12 before applying (and
possibly making that last change), but you can already have:
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* Re: [PATCH v2 00/19] block: Some multi-threading fixes
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
` (19 preceding siblings ...)
2025-11-17 15:11 ` [PATCH v2 00/19] block: Some multi-threading fixes Kevin Wolf
@ 2025-11-17 19:11 ` Stefan Hajnoczi
20 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2025-11-17 19:11 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On Mon, Nov 10, 2025 at 04:48:35PM +0100, Hanna Czenczek wrote:
> Hi,
>
> See the v1 cover letter for a general overview:
>
> https://lists.nongnu.org/archive/html/qemu-block/2025-10/msg00501.html
>
> Changes in v2:
> - Kept `.ret = EINPROGRESS`-style initializations where they already had
> been (curl, nvme)
> - Dropped trivial BH waking code (i.e. which can be directly replaced by
> aio_co_wake()) in iscsi, nfs, nvme
> - curl: Yield in curl_do_preadv() (former curl_setup_preadv()) and
> curl_find_buf() instead of returning whether curl_co_preadv() has to
> yield or not
> - nvme: Added a patch that annotates some functions (primarily BHs and
> CBs) with which AioContext they (must) run in
> - qcow2 cache-cleaning timer: Run the timer as a coroutine instead of in
> a timer CB; use a CoQueue to await it exiting instead of polling
> (well, we still need to poll in case we don’t run in a coroutine, but
> that’s standard procedure, I believe)
> - The need to initialize the CoQueue showed that there is a code path
> in qcow2 that doesn’t initialize its CoMutex. Added a patch to do
> that.
> - Also added a patch to have the timer use realtime instead of virtual
> time.
>
>
> git backport-diff against v1:
>
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>
> 001/19:[----] [--] 'block: Note on aio_co_wake use if not yet yielding'
> 002/19:[----] [--] 'rbd: Run co BH CB in the coroutine’s AioContext'
> 003/19:[0019] [FC] 'iscsi: Run co BH CB in the coroutine’s AioContext'
> 004/19:[0022] [FC] 'nfs: Run co BH CB in the coroutine’s AioContext'
> 005/19:[0060] [FC] 'curl: Fix coroutine waking'
> 006/19:[----] [--] 'gluster: Do not move coroutine into BDS context'
> 007/19:[----] [--] 'nvme: Kick and check completions in BDS context'
> 008/19:[0038] [FC] 'nvme: Fix coroutine waking'
> 009/19:[down] 'nvme: Note in which AioContext some functions run'
> 010/19:[----] [--] 'block/io: Take reqs_lock for tracked_requests'
> 011/19:[down] 'qcow2: Re-initialize lock in invalidate_cache'
> 012/19:[0145] [FC] 'qcow2: Fix cache_clean_timer'
> 013/19:[down] 'qcow2: Schedule cache-clean-timer in realtime'
> 014/19:[----] [--] 'ssh: Run restart_coroutine in current AioContext'
> 015/19:[----] [--] 'blkreplay: Run BH in coroutine’s AioContext'
> 016/19:[----] [--] 'block: Note in which AioContext AIO CBs are called'
> 017/19:[----] [--] 'iscsi: Create AIO BH in original AioContext'
> 018/19:[----] [--] 'null-aio: Run CB in original AioContext'
> 019/19:[----] [--] 'win32-aio: Run CB in original context'
>
>
> Hanna Czenczek (19):
> block: Note on aio_co_wake use if not yet yielding
> rbd: Run co BH CB in the coroutine’s AioContext
> iscsi: Run co BH CB in the coroutine’s AioContext
> nfs: Run co BH CB in the coroutine’s AioContext
> curl: Fix coroutine waking
> gluster: Do not move coroutine into BDS context
> nvme: Kick and check completions in BDS context
> nvme: Fix coroutine waking
> nvme: Note in which AioContext some functions run
> block/io: Take reqs_lock for tracked_requests
> qcow2: Re-initialize lock in invalidate_cache
> qcow2: Fix cache_clean_timer
> qcow2: Schedule cache-clean-timer in realtime
> ssh: Run restart_coroutine in current AioContext
> blkreplay: Run BH in coroutine’s AioContext
> block: Note in which AioContext AIO CBs are called
> iscsi: Create AIO BH in original AioContext
> null-aio: Run CB in original AioContext
> win32-aio: Run CB in original context
>
> block/qcow2.h | 5 +-
> include/block/aio.h | 15 ++++
> include/block/block_int-common.h | 7 +-
> block/blkreplay.c | 3 +-
> block/curl.c | 45 +++++++---
> block/gluster.c | 17 ++--
> block/io.c | 3 +
> block/iscsi.c | 63 ++++++--------
> block/nfs.c | 41 ++++-----
> block/null.c | 7 +-
> block/nvme.c | 113 ++++++++++++++++--------
> block/qcow2.c | 145 ++++++++++++++++++++++++-------
> block/rbd.c | 12 +--
> block/ssh.c | 22 ++---
> block/win32-aio.c | 31 +++++--
> 15 files changed, 347 insertions(+), 182 deletions(-)
>
> --
> 2.51.1
>
I reviewed the nvme changes and the core block layer changes:
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v2 12/19] qcow2: Fix cache_clean_timer
2025-11-17 14:50 ` Kevin Wolf
@ 2025-11-18 11:01 ` Hanna Czenczek
2025-11-18 17:06 ` Kevin Wolf
0 siblings, 1 reply; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-18 11:01 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 17.11.25 15:50, Kevin Wolf wrote:
> Am 10.11.2025 um 16:48 hat Hanna Czenczek geschrieben:
>> The cache-cleaner runs as a timer CB in the BDS AioContext. With
>> multiqueue, it can run concurrently to I/O requests, and because it does
>> not take any lock, this can break concurrent cache accesses, corrupting
>> the image. While the chances of this happening are low, it can be
>> reproduced e.g. by modifying the code to schedule the timer CB every
>> 5 ms (instead of at most once per second) and modifying the last (inner)
>> while loop of qcow2_cache_clean_unused() like so:
>>
>> while (i < c->size && can_clean_entry(c, i)) {
>> for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>> usleep(100);
>> }
>> c->entries[i].offset = 0;
>> c->entries[i].lru_counter = 0;
>> i++;
>> to_clean++;
>> }
>>
>> i.e. making it wait on purpose for the point in time where the cache is
>> in use by something else.
>>
>> The solution chosen for this in this patch is not the best solution, I
>> hope, but I admittedly can’t come up with anything strictly better.
>>
>> We can protect from concurrent cache accesses either by taking the
>> existing s->lock, or we introduce a new (non-coroutine) mutex
>> specifically for cache accesses. I would prefer to avoid the latter so
>> as not to introduce additional (very slight) overhead.
>>
>> Using s->lock, which is a coroutine mutex, however means that we need to
>> take it in a coroutine, so the timer must run in a coroutine. We can
>> transform it from the current timer CB style into a coroutine that
>> sleeps for the set interval. As a result, however, we can no longer
>> just deschedule the timer to instantly guarantee it won’t run anymore,
>> but have to await the coroutine’s exit.
>>
>> (Note even before this patch there were places that may not have been so
>> guaranteed after all: Anything calling cache_clean_timer_del() from the
>> QEMU main AioContext could have been running concurrently to an existing
>> timer CB invocation.)
>>
>> Polling to await the timer to actually settle seems very complicated for
>> something that’s rather a minor problem, but I can’t come up with any
>> better solution that doesn’t again just overlook potential problems.
>>
>> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
>> I’m not too fond of this solution.)
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> -static void cache_clean_timer_cb(void *opaque)
>> +static void coroutine_fn cache_clean_timer(void *opaque)
>> {
>> - BlockDriverState *bs = opaque;
>> - BDRVQcow2State *s = bs->opaque;
>> - qcow2_cache_clean_unused(s->l2_table_cache);
>> - qcow2_cache_clean_unused(s->refcount_block_cache);
>> - timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
>> - (int64_t) s->cache_clean_interval * 1000);
>> + BDRVQcow2State *s = opaque;
>> + uint64_t wait_ns;
>> +
>> + WITH_QEMU_LOCK_GUARD(&s->lock) {
>> + wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
>> + }
>> +
>> + while (wait_ns > 0) {
>> + qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
>> + QEMU_CLOCK_VIRTUAL, wait_ns);
>> +
>> + WITH_QEMU_LOCK_GUARD(&s->lock) {
>> + if (s->cache_clean_interval > 0) {
>> + qcow2_cache_clean_unused(s->l2_table_cache);
>> + qcow2_cache_clean_unused(s->refcount_block_cache);
>> + }
>> +
>> + wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
>> + }
>> + }
>> +
>> + WITH_QEMU_LOCK_GUARD(&s->lock) {
>> + s->cache_clean_timer_co = NULL;
>> + qemu_co_queue_restart_all(&s->cache_clean_timer_exit);
>> + }
>> }
>> +/**
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Called holding s->lock.
>> + */
>> +static void coroutine_fn
>> +cache_clean_timer_co_locked_del_and_wait(BlockDriverState *bs)
>> +{
>> + BDRVQcow2State *s = bs->opaque;
>> +
>> + if (s->cache_clean_timer_co) {
>> + s->cache_clean_interval = 0;
>> + qemu_co_sleep_wake(&s->cache_clean_timer_wake);
>> + qemu_co_queue_wait(&s->cache_clean_timer_exit, &s->lock);
>> }
>> }
> I don't want to count the number of context switches this dance between
> cache_clean_timer_co_locked_del_and_wait() and cache_clean_timer()
> takes! Good that it's not a hot path. :-)
>
>> +/* s_locked specifies whether s->lock is held or not */
>> static void qcow2_update_options_commit(BlockDriverState *bs,
>> - Qcow2ReopenState *r)
>> + Qcow2ReopenState *r,
>> + bool s_locked)
>> {
>> BDRVQcow2State *s = bs->opaque;
>> int i;
>>
>> + /*
>> + * We need to stop the cache-clean-timer before destroying the metadata
>> + * table caches
>> + */
>> + if (s_locked) {
>> + cache_clean_timer_co_locked_del_and_wait(bs);
>> + } else {
>> + cache_clean_timer_del_and_wait(bs);
>> + }
>> +
>> if (s->l2_table_cache) {
>> qcow2_cache_destroy(s->l2_table_cache);
>> }
>> @@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>> }
>> s->l2_table_cache = r->l2_table_cache;
>> s->refcount_block_cache = r->refcount_block_cache;
>> +
>> + s->cache_clean_interval = r->cache_clean_interval;
>> + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> +
>> s->l2_slice_size = r->l2_slice_size;
>>
>> s->overlap_check = r->overlap_check;
>> @@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>
>> s->discard_no_unref = r->discard_no_unref;
>>
>> - if (s->cache_clean_interval != r->cache_clean_interval) {
>> - cache_clean_timer_del(bs);
>> - s->cache_clean_interval = r->cache_clean_interval;
>> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> - }
>> -
>> qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
>> s->crypto_opts = r->crypto_opts;
>> }
> Is there any specific reason why you move cache_clean_timer_init()
> earlier? I don't see an actual problem with the code as it is after this
> change, but s->l2_slice_size is related to the cache in a way, so it
> would feel safer if the cache cleaner were only started once all the
> settings are updated and not only those that it actually happens to
> access at the moment.
Oh. I don’t think there’s a good reason. I think it makes sense to
keep the set-up in the old place. Can you do that in your tree?
(I think it’s just because when I split it to stop the timer before
deleting the caches, I decided it would conversely make sense to start
it after they have been set up.)
Hanna
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 12/19] qcow2: Fix cache_clean_timer
2025-11-18 11:01 ` Hanna Czenczek
@ 2025-11-18 17:06 ` Kevin Wolf
2025-11-18 17:19 ` Hanna Czenczek
0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2025-11-18 17:06 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 18.11.2025 at 12:01, Hanna Czenczek wrote:
> On 17.11.25 15:50, Kevin Wolf wrote:
> > On 10.11.2025 at 16:48, Hanna Czenczek wrote:
> > > +/* s_locked specifies whether s->lock is held or not */
> > > static void qcow2_update_options_commit(BlockDriverState *bs,
> > > - Qcow2ReopenState *r)
> > > + Qcow2ReopenState *r,
> > > + bool s_locked)
> > > {
> > > BDRVQcow2State *s = bs->opaque;
> > > int i;
> > > + /*
> > > + * We need to stop the cache-clean-timer before destroying the metadata
> > > + * table caches
> > > + */
> > > + if (s_locked) {
> > > + cache_clean_timer_co_locked_del_and_wait(bs);
> > > + } else {
> > > + cache_clean_timer_del_and_wait(bs);
> > > + }
> > > +
> > > if (s->l2_table_cache) {
> > > qcow2_cache_destroy(s->l2_table_cache);
> > > }
> > > @@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > > }
> > > s->l2_table_cache = r->l2_table_cache;
> > > s->refcount_block_cache = r->refcount_block_cache;
> > > +
> > > + s->cache_clean_interval = r->cache_clean_interval;
> > > + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > +
> > > s->l2_slice_size = r->l2_slice_size;
> > > s->overlap_check = r->overlap_check;
> > > @@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > > s->discard_no_unref = r->discard_no_unref;
> > > - if (s->cache_clean_interval != r->cache_clean_interval) {
> > > - cache_clean_timer_del(bs);
> > > - s->cache_clean_interval = r->cache_clean_interval;
> > > - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > - }
> > > -
> > > qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
> > > s->crypto_opts = r->crypto_opts;
> > > }
> > Is there any specific reason why you move cache_clean_timer_init()
> > earlier? I don't see an actual problem with the code as it is after this
> > change, but s->l2_slice_size is related to the cache in a way, so it
> > would feel safer if the cache cleaner were only started once all the
> > settings are updated and not only those that it actually happens to
> > access at the moment.
>
> Oh. I don’t think there’s a good reason. I think it makes sense to keep
> the set-up in the old place. Can you do that in your tree?
Yes, I changed it back and applied the series to my block branch.
Thanks.
Kevin
* Re: [PATCH v2 12/19] qcow2: Fix cache_clean_timer
2025-11-18 17:06 ` Kevin Wolf
@ 2025-11-18 17:19 ` Hanna Czenczek
0 siblings, 0 replies; 26+ messages in thread
From: Hanna Czenczek @ 2025-11-18 17:19 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 18.11.25 18:06, Kevin Wolf wrote:
> On 18.11.2025 at 12:01, Hanna Czenczek wrote:
>> On 17.11.25 15:50, Kevin Wolf wrote:
>>> On 10.11.2025 at 16:48, Hanna Czenczek wrote:
>>>> +/* s_locked specifies whether s->lock is held or not */
>>>> static void qcow2_update_options_commit(BlockDriverState *bs,
>>>> - Qcow2ReopenState *r)
>>>> + Qcow2ReopenState *r,
>>>> + bool s_locked)
>>>> {
>>>> BDRVQcow2State *s = bs->opaque;
>>>> int i;
>>>> + /*
>>>> + * We need to stop the cache-clean-timer before destroying the metadata
>>>> + * table caches
>>>> + */
>>>> + if (s_locked) {
>>>> + cache_clean_timer_co_locked_del_and_wait(bs);
>>>> + } else {
>>>> + cache_clean_timer_del_and_wait(bs);
>>>> + }
>>>> +
>>>> if (s->l2_table_cache) {
>>>> qcow2_cache_destroy(s->l2_table_cache);
>>>> }
>>>> @@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>>> }
>>>> s->l2_table_cache = r->l2_table_cache;
>>>> s->refcount_block_cache = r->refcount_block_cache;
>>>> +
>>>> + s->cache_clean_interval = r->cache_clean_interval;
>>>> + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>>>> +
>>>> s->l2_slice_size = r->l2_slice_size;
>>>> s->overlap_check = r->overlap_check;
>>>> @@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>>> s->discard_no_unref = r->discard_no_unref;
>>>> - if (s->cache_clean_interval != r->cache_clean_interval) {
>>>> - cache_clean_timer_del(bs);
>>>> - s->cache_clean_interval = r->cache_clean_interval;
>>>> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>>>> - }
>>>> -
>>>> qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
>>>> s->crypto_opts = r->crypto_opts;
>>>> }
>>> Is there any specific reason why you move cache_clean_timer_init()
>>> earlier? I don't see an actual problem with the code as it is after this
>>> change, but s->l2_slice_size is related to the cache in a way, so it
>>> would feel safer if the cache cleaner were only started once all the
>>> settings are updated and not only those that it actually happens to
>>> access at the moment.
>> Oh. I don’t think there’s a good reason. I think it makes sense to keep
>> the set-up in the old place. Can you do that in your tree?
> Yes, I changed it back and applied the series to my block branch.
> Thanks.
Thanks a lot!
Hanna
end of thread [~2025-11-18 17:20 UTC | newest]
Thread overview: 26+ messages
2025-11-10 15:48 [PATCH v2 00/19] block: Some multi-threading fixes Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 01/19] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 02/19] rbd: Run co BH CB in the coroutine’s AioContext Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 03/19] iscsi: " Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 04/19] nfs: " Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 05/19] curl: Fix coroutine waking Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 06/19] gluster: Do not move coroutine into BDS context Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 07/19] nvme: Kick and check completions in " Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 08/19] nvme: Fix coroutine waking Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 09/19] nvme: Note in which AioContext some functions run Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 10/19] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 11/19] qcow2: Re-initialize lock in invalidate_cache Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 12/19] qcow2: Fix cache_clean_timer Hanna Czenczek
2025-11-17 14:50 ` Kevin Wolf
2025-11-18 11:01 ` Hanna Czenczek
2025-11-18 17:06 ` Kevin Wolf
2025-11-18 17:19 ` Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 13/19] qcow2: Schedule cache-clean-timer in realtime Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 14/19] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 15/19] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 16/19] block: Note in which AioContext AIO CBs are called Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 17/19] iscsi: Create AIO BH in original AioContext Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 18/19] null-aio: Run CB " Hanna Czenczek
2025-11-10 15:48 ` [PATCH v2 19/19] win32-aio: Run CB in original context Hanna Czenczek
2025-11-17 15:11 ` [PATCH v2 00/19] block: Some multi-threading fixes Kevin Wolf
2025-11-17 19:11 ` Stefan Hajnoczi