* [PATCH 00/16] block: Some multi-threading fixes
@ 2025-10-28 16:33 Hanna Czenczek
2025-10-28 16:33 ` [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
` (15 more replies)
0 siblings, 16 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Hi,
As noted in “the original patch”[1] (which, in slightly different form,
is still part of this series), with multi-queue, a BDS’s “main”
AioContext should have little meaning. Instead, every request can be in
any AioContext. This series fixes some of the places where that doesn’t
seem to be considered yet, some of which can cause user-visible bugs
(those patches are Cc-ed to stable, I hope).
[1] https://lists.nongnu.org/archive/html/qemu-block/2025-02/msg00123.html
A common problem pattern is that the request coroutine yields, awaiting
a completion function that runs in a different thread. Waking the
coroutine is then often done in the BDS’s main AioContext (e.g. via a BH
scheduled there), which, with multiqueue, may still be a different
thread from the one running the request. This can cause races,
particularly when the coroutine yields in a loop awaiting request
completion: the coroutine may then see the completion flag set before
yielding for the first time and skip the yield entirely, even though the
completion function is still going to wake it. (A wake without a yield
is bad.)
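To illustrate, here is a minimal sketch of that racy pattern (schematic
only, not any specific driver’s code; submit_request() is a stand-in for
the driver-specific submission function):

    /* Request coroutine, running in the request's AioContext: */
    task.complete = false;
    submit_request(&task);      /* completion runs in another thread */
    while (!task.complete) {
        qemu_coroutine_yield();
    }

    /* Completion, scheduled as a BH in the BDS's main AioContext: */
    task->complete = true;
    aio_co_wake(task->co);

If the completion BH runs between submission and the first check of
task.complete, the coroutine never yields, but is still woken.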
In other cases, there is no race (patch 1 tries to clarify when
aio_co_wake() is safe to call), but scheduling the completion wake-up in
the BDS main AioContext still doesn’t make too much sense, and it should
instead run in the request’s context, so it can directly enter the
request coroutine instead of just scheduling it yet again.
Patches 7, 9, and 10 are general concurrency fixes that have nothing to
do with this wake-up pattern; I just found those issues along the way.
As for the last four patches: The block layer currently doesn’t specify
the context in which AIO callbacks run. Callers probably expect this to
be the same context in which they issued the request, but we don’t
explicitly say so. Now, the only caller of these AIO-style methods is
block/io.c, which immediately “maps” them to coroutines in a non-racy
manner, i.e. it doesn’t actually care much about the context.
So while it makes sense to specify the AIOCB context (and then make the
implementations adhere to it), in practice, neither the only caller nor
the block layer as a whole really cares about the AIO context. Maybe we
should therefore just drop the last four patches; or keep patch 13, but
instead of stating that the CB is run in the request context, explicitly
say that it may be run in any AioContext.
Hanna Czenczek (16):
block: Note on aio_co_wake use if not yet yielding
rbd: Run co BH CB in the coroutine’s AioContext
iscsi: Run co BH CB in the coroutine’s AioContext
nfs: Run co BH CB in the coroutine’s AioContext
curl: Fix coroutine waking
gluster: Do not move coroutine into BDS context
nvme: Kick and check completions in BDS context
nvme: Fix coroutine waking
block/io: Take reqs_lock for tracked_requests
qcow2: Fix cache_clean_timer
ssh: Run restart_coroutine in current AioContext
blkreplay: Run BH in coroutine’s AioContext
block: Note in which AioContext AIO CBs are called
iscsi: Create AIO BH in original AioContext
null-aio: Run CB in original AioContext
win32-aio: Run CB in original context
block/qcow2.h | 1 +
include/block/aio.h | 15 ++++++
include/block/block_int-common.h | 7 ++-
block/blkreplay.c | 3 +-
block/curl.c | 55 ++++++++++++-------
block/gluster.c | 17 +++---
block/io.c | 3 ++
block/iscsi.c | 46 +++++-----------
block/nfs.c | 23 +++-----
block/null.c | 7 ++-
block/nvme.c | 70 +++++++++++++------------
block/qcow2.c | 90 +++++++++++++++++++++++++++-----
block/rbd.c | 12 ++---
block/ssh.c | 22 ++++----
block/win32-aio.c | 31 ++++++++---
15 files changed, 251 insertions(+), 151 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 02/16] rbd: Run co BH CB in the coroutine’s AioContext Hanna Czenczek
` (14 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
aio_co_wake() is generally safe to call regardless of whether the
coroutine is already yielding or not. If it is not yet yielding, it
will be scheduled to run when it does yield.
Caveats:
- The caller must be independent of the coroutine (i.e. must not be the
same coroutine, nor be invoked by it), so that the coroutine is
guaranteed to be yielding whenever both run in the same AioContext
- The coroutine must yield at some point
Make note of this so callers can reason that their use is safe.
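For illustration, a schematic sketch of a safe use (request_done() and
RequestState are hypothetical names, not taken from any driver):

    /* Completion callback; may run in any thread, but is never
     * invoked by the request coroutine itself: */
    static void request_done(void *opaque)
    {
        RequestState *req = opaque;
        req->ret = 0;
        aio_co_wake(req->co);   /* safe per the caveats above */
    }

    /* Request coroutine: */
    submit(req);                /* arranges for request_done() to run */
    qemu_coroutine_yield();     /* yields exactly once, awaiting the wake */

By contrast, if request_done() could be invoked synchronously from
within the request coroutine (in the same AioContext), the coroutine
would not be yielding at that point, and the wake would be unsafe.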
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/block/aio.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/block/aio.h b/include/block/aio.h
index 99ff48420b..6ed97f0a4c 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -650,6 +650,21 @@ void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
* aio_co_wake may be executed either in coroutine or non-coroutine
* context. The coroutine must not be entered by anyone else while
* aio_co_wake() is active.
+ *
+ * If `co`'s AioContext differs from the current AioContext, this will call
+ * aio_co_schedule(), which makes this safe to use even when `co` has not
+ * yielded yet. In such a case, it will be entered once it yields.
+ *
+ * In contrast, if `co`'s AioContext is equal to the current one, it is
+ * required for `co` to currently be yielding. This is generally the case
+ * if the caller is not in `co` (i.e. invoked by `co`), because the only
+ * other way for the caller to be running then is for `co` to currently be
+ * yielding.
+ *
+ * Therefore, if there is no way for the caller to be invoked/entered by
+ * `co`, it is generally safe to call this regardless of whether `co` is
+ * known to already be yielding or not -- it only has to yield at some
+ * point.
*/
void aio_co_wake(Coroutine *co);
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 02/16] rbd: Run co BH CB in the coroutine’s AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
2025-10-28 16:33 ` [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 03/16] iscsi: " Hanna Czenczek
` (13 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
qemu_rbd_completion_cb() schedules the request completion code
(qemu_rbd_finish_bh()) to run in the BDS’s AioContext, assuming that
this is the same thread in which qemu_rbd_start_co() runs.
To explain, this is how the latter two functions interact:

In qemu_rbd_start_co():

    while (!task.complete)
        qemu_coroutine_yield();

In qemu_rbd_finish_bh():

    task->complete = true;
    aio_co_wake(task->co); // task->co is qemu_rbd_start_co()
For this interaction to work reliably, both must run in the same thread
so that qemu_rbd_finish_bh() can only run once the coroutine yields.
Otherwise, finish_bh() may run before start_co() checks task.complete,
which will result in the latter seeing .complete as true immediately and
skipping the yield altogether, even though finish_bh() still wakes it.
With multiqueue, the BDS’s AioContext is not necessarily the thread
start_co() runs in, and so finish_bh() may be scheduled to run in a
different thread than start_co(). With the right timing, this will
cause the problems described above; waking a non-yielding coroutine is
not good, as can be reproduced by putting e.g. a usleep(100000) above
the while loop in start_co() (and using multiqueue), giving finish_bh()
a much better chance at exiting before start_co() can yield.
So instead of scheduling finish_bh() in the BDS’s AioContext, schedule
finish_bh() in task->co’s AioContext.
In addition, we can get rid of task.complete altogether: We will be
woken exactly once, when the task is indeed complete, so there is no
need to check.
(We could go further and drop the BH, running aio_co_wake() directly in
qemu_rbd_completion_cb() because we are allowed to do that even if the
coroutine isn’t yet yielding and we’re in a different thread – but the
doc comment on qemu_rbd_completion_cb() says to be careful, so I decided
not to go so far here.)
Buglink: https://issues.redhat.com/browse/RHEL-67115
Reported-by: Junyao Zhao <junzhao@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/rbd.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/block/rbd.c b/block/rbd.c
index 3611dc81cf..2a70b5a983 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -110,9 +110,7 @@ typedef struct BDRVRBDState {
} BDRVRBDState;
typedef struct RBDTask {
- BlockDriverState *bs;
Coroutine *co;
- bool complete;
int64_t ret;
} RBDTask;
@@ -1309,7 +1307,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
static void qemu_rbd_finish_bh(void *opaque)
{
RBDTask *task = opaque;
- task->complete = true;
aio_co_wake(task->co);
}
@@ -1326,7 +1323,7 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
{
task->ret = rbd_aio_get_return_value(c);
rbd_aio_release(c);
- aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+ aio_bh_schedule_oneshot(qemu_coroutine_get_aio_context(task->co),
qemu_rbd_finish_bh, task);
}
@@ -1338,7 +1335,7 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
RBDAIOCmd cmd)
{
BDRVRBDState *s = bs->opaque;
- RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
+ RBDTask task = { .co = qemu_coroutine_self() };
rbd_completion_t c;
int r;
@@ -1401,9 +1398,8 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
return r;
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ /* Expect exactly a single wake from qemu_rbd_finish_bh() */
+ qemu_coroutine_yield();
if (task.ret < 0) {
error_report("rbd request failed: cmd %d offset %" PRIu64 " bytes %"
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 03/16] iscsi: Run co BH CB in the coroutine’s AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
2025-10-28 16:33 ` [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
2025-10-28 16:33 ` [PATCH 02/16] rbd: Run co BH CB in the coroutine’s AioContext Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 14:27 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 04/16] nfs: " Hanna Czenczek
` (12 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
For rbd (and others), as described in “rbd: Run co BH CB in the
coroutine’s AioContext”, the pattern of setting a completion flag and
waking a coroutine that yields while the flag is not set can only work
when both run in the same thread.
iscsi has the same pattern, but the details are a bit different:
iscsi_co_generic_cb() can (as far as I understand) only run through
iscsi_service(), not just from a random thread at a random time.
iscsi_service() in turn can only be run after iscsi_set_events() set up
an FD event handler, which is done in iscsi_co_wait_for_task().
As a result, iscsi_co_wait_for_task() will always yield exactly once,
because iscsi_co_generic_cb() can only run after iscsi_set_events(),
after the completion flag has already been checked, and the yielding
coroutine will then be woken only once the completion flag was set to
true. So as far as I can tell, iscsi has no bug and already works fine.
Still, we don’t need the completion flag because we know we have to
yield exactly once, so we can drop it. This simplifies the code and
makes it more obvious that the “rbd bug” isn’t present here.
This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
bit boring, and actually, for the former, we could drop it and run
aio_co_wake() directly from iscsi_co_generic_cb() to the same effect; but
that would remove the replay_bh_schedule_oneshot_event(), and I assume
we shouldn’t do that. At least schedule both the BH and the timer in
the coroutine’s AioContext to make them simple wrappers around
qemu_coroutine_enter(), without a further BH indirection.
Finally, remove the iTask->co != NULL checks: This field is set by
iscsi_co_init_iscsitask(), which all users of IscsiTask run before even
setting up iscsi_co_generic_cb() as the callback, and it is never set or
cleared elsewhere, so it cannot be NULL in iscsi_co_generic_cb().
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/iscsi.c | 39 ++++++++++-----------------------------
1 file changed, 10 insertions(+), 29 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 15b96ee880..76c15e20ea 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -107,7 +107,6 @@ typedef struct IscsiLun {
typedef struct IscsiTask {
int status;
- int complete;
int retries;
int do_retry;
struct scsi_task *task;
@@ -183,18 +182,13 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
static void iscsi_co_generic_bh_cb(void *opaque)
{
struct IscsiTask *iTask = opaque;
-
- iTask->complete = 1;
aio_co_wake(iTask->co);
}
static void iscsi_retry_timer_expired(void *opaque)
{
struct IscsiTask *iTask = opaque;
- iTask->complete = 1;
- if (iTask->co) {
- aio_co_wake(iTask->co);
- }
+ aio_co_wake(iTask->co);
}
static inline unsigned exp_random(double mean)
@@ -239,6 +233,7 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
{
struct IscsiTask *iTask = opaque;
struct scsi_task *task = command_data;
+ AioContext *itask_ctx = qemu_coroutine_get_aio_context(iTask->co);
iTask->status = status;
iTask->do_retry = 0;
@@ -263,9 +258,9 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
" (retry #%u in %u ms): %s",
iTask->retries, retry_time,
iscsi_get_error(iscsi));
- aio_timer_init(iTask->iscsilun->aio_context,
- &iTask->retry_timer, QEMU_CLOCK_REALTIME,
- SCALE_MS, iscsi_retry_timer_expired, iTask);
+ aio_timer_init(itask_ctx, &iTask->retry_timer,
+ QEMU_CLOCK_REALTIME, SCALE_MS,
+ iscsi_retry_timer_expired, iTask);
timer_mod(&iTask->retry_timer,
qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + retry_time);
iTask->do_retry = 1;
@@ -284,12 +279,7 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
}
}
- if (iTask->co) {
- replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
- iscsi_co_generic_bh_cb, iTask);
- } else {
- iTask->complete = 1;
- }
+ replay_bh_schedule_oneshot_event(itask_ctx, iscsi_co_generic_bh_cb, iTask);
}
static void coroutine_fn
@@ -592,12 +582,10 @@ static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
static void coroutine_fn iscsi_co_wait_for_task(IscsiTask *iTask,
IscsiLun *iscsilun)
{
- while (!iTask->complete) {
- iscsi_set_events(iscsilun);
- qemu_mutex_unlock(&iscsilun->mutex);
- qemu_coroutine_yield();
- qemu_mutex_lock(&iscsilun->mutex);
- }
+ iscsi_set_events(iscsilun);
+ qemu_mutex_unlock(&iscsilun->mutex);
+ qemu_coroutine_yield();
+ qemu_mutex_lock(&iscsilun->mutex);
}
static int coroutine_fn
@@ -669,7 +657,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -740,7 +727,6 @@ retry:
scsi_free_scsi_task(iTask.task);
iTask.task = NULL;
}
- iTask.complete = 0;
goto retry;
}
@@ -902,7 +888,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -940,7 +925,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -1184,7 +1168,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -1301,7 +1284,6 @@ retry:
}
if (iTask.do_retry) {
- iTask.complete = 0;
goto retry;
}
@@ -2390,7 +2372,6 @@ retry:
iscsi_co_wait_for_task(&iscsi_task, dst_lun);
if (iscsi_task.do_retry) {
- iscsi_task.complete = 0;
goto retry;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 04/16] nfs: Run co BH CB in the coroutine’s AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (2 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 03/16] iscsi: " Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 05/16] curl: Fix coroutine waking Hanna Czenczek
` (11 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
Like in “rbd: Run co BH CB in the coroutine’s AioContext”, drop the
completion flag, yield exactly once, and run the BH in the coroutine’s
AioContext.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (!task.complete)` loops.)
Like in “iscsi: Run co BH CB in the coroutine’s AioContext”, this makes
nfs_co_generic_bh_cb() trivial, and we could just run aio_co_wake()
directly from nfs_co_generic_cb(). Like in iscsi, we don’t do that, so
as to keep the replay_bh_schedule_oneshot_event(), but we should at
least run the BH in the coroutine’s context to remove further BH
indirection.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nfs.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)
diff --git a/block/nfs.c b/block/nfs.c
index 0a7d38db09..4667d49416 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -69,7 +69,6 @@ typedef struct NFSClient {
typedef struct NFSRPC {
BlockDriverState *bs;
int ret;
- int complete;
QEMUIOVector *iov;
struct stat *st;
Coroutine *co;
@@ -233,8 +232,6 @@ static void coroutine_fn nfs_co_init_task(BlockDriverState *bs, NFSRPC *task)
static void nfs_co_generic_bh_cb(void *opaque)
{
NFSRPC *task = opaque;
-
- task->complete = 1;
aio_co_wake(task->co);
}
@@ -256,7 +253,7 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
if (task->ret < 0) {
error_report("NFS Error: %s", nfs_get_error(nfs));
}
- replay_bh_schedule_oneshot_event(task->client->aio_context,
+ replay_bh_schedule_oneshot_event(qemu_coroutine_get_aio_context(task->co),
nfs_co_generic_bh_cb, task);
}
@@ -278,9 +275,7 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
if (task.ret < 0) {
return task.ret;
@@ -328,9 +323,7 @@ static int coroutine_fn nfs_co_pwritev(BlockDriverState *bs, int64_t offset,
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
if (my_buffer) {
g_free(buf);
@@ -358,9 +351,7 @@ static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
return task.ret;
}
@@ -723,7 +714,7 @@ nfs_get_allocated_file_size_cb(int ret, struct nfs_context *nfs, void *data,
if (task->ret < 0) {
error_report("NFS Error: %s", nfs_get_error(nfs));
}
- replay_bh_schedule_oneshot_event(task->client->aio_context,
+ replay_bh_schedule_oneshot_event(qemu_coroutine_get_aio_context(task->co),
nfs_co_generic_bh_cb, task);
}
@@ -748,9 +739,7 @@ static int64_t coroutine_fn nfs_co_get_allocated_file_size(BlockDriverState *bs)
nfs_set_events(client);
}
- while (!task.complete) {
- qemu_coroutine_yield();
- }
+ qemu_coroutine_yield();
return (task.ret < 0 ? task.ret : st.st_blocks * 512);
}
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 05/16] curl: Fix coroutine waking
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (3 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 04/16] nfs: " Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 16:57 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 06/16] gluster: Do not move coroutine into BDS context Hanna Czenczek
` (10 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
If we wake a coroutine from a different context, we must ensure that it
will yield exactly once (now or later), awaiting that wake.
curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
yielding if the request finishes before the loop gets run. To fix it,
drop the loop and just yield exactly once, unless the request is served
from the cache or failed before it was submitted – that last part makes
it a bit complicated, as the result of curl_find_buf() now needs to be a
tristate.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (acb.ret == -EINPROGRESS)` loop.)
Also, add a comment why aio_co_wake() is safe regardless of whether the
coroutine and curl_multi_check_completion() run in the same context.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/curl.c | 55 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 37 insertions(+), 18 deletions(-)
diff --git a/block/curl.c b/block/curl.c
index 68cf83ce55..65996a8866 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -124,6 +124,16 @@ typedef struct BDRVCURLState {
char *proxypassword;
} BDRVCURLState;
+/** Possible result states of curl_find_buf() */
+typedef enum {
+ /* No buffer found, need to create new request */
+ CURL_NO_BUF_FOUND,
+ /* Buffer found, request filled and done */
+ CURL_REQUEST_FILLED,
+ /* Ongoing request found, need to yield */
+ CURL_REQUEST_ONGOING,
+} CURLFindBufResult;
+
static void curl_clean_state(CURLState *s);
static void curl_multi_do(void *arg);
@@ -258,8 +268,8 @@ read_end:
}
/* Called with s->mutex held. */
-static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
- CURLAIOCB *acb)
+static CURLFindBufResult curl_find_buf(BDRVCURLState *s, uint64_t start,
+ uint64_t len, CURLAIOCB *acb)
{
int i;
uint64_t end = start + len;
@@ -289,7 +299,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len);
}
acb->ret = 0;
- return true;
+ return CURL_REQUEST_FILLED;
}
// Wait for unfinished chunks
@@ -307,13 +317,13 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
for (j=0; j<CURL_NUM_ACB; j++) {
if (!state->acb[j]) {
state->acb[j] = acb;
- return true;
+ return CURL_REQUEST_ONGOING;
}
}
}
}
- return false;
+ return CURL_NO_BUF_FOUND;
}
/* Called with s->mutex held. */
@@ -378,6 +388,16 @@ static void curl_multi_check_completion(BDRVCURLState *s)
acb->ret = error ? -EIO : 0;
state->acb[i] = NULL;
qemu_mutex_unlock(&s->mutex);
+ /*
+ * Current AioContext is the BDS context, which may or may not
+ * be the request (coroutine) context.
+ * - If it is, the coroutine must have yielded or the FD handler
+ * (curl_multi_do()/curl_multi_timeout_do()) could not have
+ * been called and we would not be here
+ * - If it is not, it doesn't matter whether it has already
+ * yielded or not; it will be scheduled once it does yield
+ * So aio_co_wake() is safe to call.
+ */
aio_co_wake(acb->co);
qemu_mutex_lock(&s->mutex);
}
@@ -868,7 +888,8 @@ out_noclean:
return -EINVAL;
}
-static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
+/* Return whether a request was submitted that requires yielding */
+static bool coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
{
CURLState *state;
int running;
@@ -877,13 +898,15 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
uint64_t start = acb->offset;
uint64_t end;
+ CURLFindBufResult find_buf_res;
- qemu_mutex_lock(&s->mutex);
+ QEMU_LOCK_GUARD(&s->mutex);
// In case we have the requested data already (e.g. read-ahead),
// we can just call the callback and be done.
- if (curl_find_buf(s, start, acb->bytes, acb)) {
- goto out;
+ find_buf_res = curl_find_buf(s, start, acb->bytes, acb);
+ if (find_buf_res != CURL_NO_BUF_FOUND) {
+ return find_buf_res == CURL_REQUEST_ONGOING;
}
// No cache found, so let's start a new request
@@ -898,7 +921,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
if (curl_init_state(s, state) < 0) {
curl_clean_state(state);
acb->ret = -EIO;
- goto out;
+ return false;
}
acb->start = 0;
@@ -913,7 +936,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
if (state->buf_len && state->orig_buf == NULL) {
curl_clean_state(state);
acb->ret = -ENOMEM;
- goto out;
+ return false;
}
state->acb[0] = acb;
@@ -925,14 +948,12 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
acb->ret = -EIO;
curl_clean_state(state);
- goto out;
+ return false;
}
/* Tell curl it needs to kick things off */
curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
-
-out:
- qemu_mutex_unlock(&s->mutex);
+ return true;
}
static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
@@ -941,14 +962,12 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
{
CURLAIOCB acb = {
.co = qemu_coroutine_self(),
- .ret = -EINPROGRESS,
.qiov = qiov,
.offset = offset,
.bytes = bytes
};
- curl_setup_preadv(bs, &acb);
- while (acb.ret == -EINPROGRESS) {
+ if (curl_setup_preadv(bs, &acb)) {
qemu_coroutine_yield();
}
return acb.ret;
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 06/16] gluster: Do not move coroutine into BDS context
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (4 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 05/16] curl: Fix coroutine waking Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 17:10 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 07/16] nvme: Kick and check completions in " Hanna Czenczek
` (9 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
The request coroutine may not run in the BDS AioContext. We should wake
it in its own context, not move it.
With that, we can remove GlusterAIOCB.aio_context.
Also add a comment why aio_co_schedule() is safe to use in this way.
**Note:** Due to a lack of a gluster set-up, I have not tested this
commit. It seemed safe enough to send anyway, just maybe not to
qemu-stable. To be clear, I don’t know of any user-visible bugs that
would arise from the state without this patch; the request coroutine is
moved into the main BDS AioContext, which feels wrong, but I’m not sure
whether it can actually produce hard bugs.
I’ll leave it to others’ opinions whether to keep or drop this patch
under these circumstances.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/gluster.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/block/gluster.c b/block/gluster.c
index 89abd40f31..4fb25b2c6d 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -56,7 +56,6 @@ typedef struct GlusterAIOCB {
int64_t size;
int ret;
Coroutine *coroutine;
- AioContext *aio_context;
} GlusterAIOCB;
typedef struct BDRVGlusterState {
@@ -743,7 +742,17 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret,
acb->ret = -EIO; /* Partial read/write - fail it */
}
- aio_co_schedule(acb->aio_context, acb->coroutine);
+ /*
+ * Safe to call: The coroutine will yield exactly once awaiting this
+ * scheduling, and the context is its own context, so it will be scheduled
+ * once it does yield.
+ *
+ * (aio_co_wake() would call qemu_get_current_aio_context() to check whether
+ * we are in the same context, but we are not in a qemu thread, so we cannot
+ * do that. Use aio_co_schedule() directly.)
+ */
+ aio_co_schedule(qemu_coroutine_get_aio_context(acb->coroutine),
+ acb->coroutine);
}
static void qemu_gluster_parse_flags(int bdrv_flags, int *open_flags)
@@ -1006,7 +1015,6 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
acb.size = bytes;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_zerofill_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
if (ret < 0) {
@@ -1184,7 +1192,6 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs,
acb.size = size;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
if (write) {
ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
@@ -1251,7 +1258,6 @@ static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
acb.size = 0;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
if (ret < 0) {
@@ -1299,7 +1305,6 @@ static coroutine_fn int qemu_gluster_co_pdiscard(BlockDriverState *bs,
acb.size = 0;
acb.ret = 0;
acb.coroutine = qemu_coroutine_self();
- acb.aio_context = bdrv_get_aio_context(bs);
ret = glfs_discard_async(s->fd, offset, bytes, gluster_finish_aiocb, &acb);
if (ret < 0) {
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 07/16] nvme: Kick and check completions in BDS context
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (5 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 06/16] gluster: Do not move coroutine into BDS context Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 17:23 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 08/16] nvme: Fix coroutine waking Hanna Czenczek
` (8 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
nvme_process_completion() must run in the main BDS context, so schedule
a BH there for requests running in a different context.
The context in which we kick does not matter, but let’s just keep kick
and process_completion together for simplicity’s sake.
(For what it’s worth, a quick fio bandwidth test indicates that on my
test hardware, if anything, this may be a bit better than kicking
immediately before scheduling a pure nvme_process_completion() BH. But
I wouldn’t take more from those results than that it doesn’t really seem
to matter either way.)
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nvme.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/block/nvme.c b/block/nvme.c
index 8df53ee4ca..7ed5f570bc 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -481,7 +481,7 @@ static void nvme_trace_command(const NvmeCmd *cmd)
}
}
-static void nvme_deferred_fn(void *opaque)
+static void nvme_kick_and_check_completions(void *opaque)
{
NVMeQueuePair *q = opaque;
@@ -490,6 +490,18 @@ static void nvme_deferred_fn(void *opaque)
nvme_process_completion(q);
}
+static void nvme_deferred_fn(void *opaque)
+{
+ NVMeQueuePair *q = opaque;
+
+ if (qemu_get_current_aio_context() == q->s->aio_context) {
+ nvme_kick_and_check_completions(q);
+ } else {
+ aio_bh_schedule_oneshot(q->s->aio_context,
+ nvme_kick_and_check_completions, q);
+ }
+}
+
static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
NvmeCmd *cmd, BlockCompletionFunc cb,
void *opaque)
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 08/16] nvme: Fix coroutine waking
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (6 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 07/16] nvme: Kick and check completions in " Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 17:43 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 09/16] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
` (7 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
nvme wakes the request coroutine via qemu_coroutine_enter() from a BH
scheduled in the BDS AioContext. This may not be the same context as
the one in which the request originally ran, which would be wrong:
- It could mean we enter the coroutine before it yields,
- We would move the coroutine in to a different context.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (data.ret == -EINPROGRESS)` loop.)
To fix that, schedule nvme_rw_cb_bh() in the coroutine AioContext.
(Just like in the preceding iscsi and nfs patches, we could drop the
trivial nvme_rw_cb_bh() and just use aio_co_wake() directly, but don’t
do that to keep replay_bh_schedule_oneshot_event().)
With this, we can remove NVMeCoData.ctx.
Note the check of data->co == NULL to bypass the BH/yield combination in
case nvme_rw_cb() is called from nvme_submit_command(): We probably want
to keep this fast path for performance reasons, but we have to be quite
careful about it:
- We cannot overload .ret for this, but have to use a dedicated
.skip_yield field. Otherwise, if nvme_rw_cb() runs in a different
thread than the coroutine, it may see .ret set and skip the yield,
while nvme_rw_cb() will still schedule a BH for waking. Therefore,
the signal to skip the yield can only be set in nvme_rw_cb() if waking
too is skipped, which is independent from communicating the return
value.
- We can only skip the yield if nvme_rw_cb() actually runs in the
request coroutine. Otherwise (specifically if they run in different
AioContexts), the order between this function’s execution and the
coroutine yielding (or not yielding) is not reliable.
- There is no point to yielding in a loop; there are no spurious wakes,
so once we yield, we will only be re-entered once the command is done.
Replace `while` by `if`.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/nvme.c | 56 +++++++++++++++++++++-------------------------------
1 file changed, 23 insertions(+), 33 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index 7ed5f570bc..4b1f623e7d 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1176,8 +1176,8 @@ fail:
typedef struct {
Coroutine *co;
+ bool skip_yield;
int ret;
- AioContext *ctx;
} NVMeCoData;
static void nvme_rw_cb_bh(void *opaque)
@@ -1189,12 +1189,22 @@ static void nvme_rw_cb_bh(void *opaque)
static void nvme_rw_cb(void *opaque, int ret)
{
NVMeCoData *data = opaque;
+ AioContext *ctx;
+
data->ret = ret;
- if (!data->co) {
- /* The rw coroutine hasn't yielded, don't try to enter. */
- return;
+
+ if (data->co == qemu_coroutine_self()) {
+ /*
+ * Fast path: We are inside of the request coroutine (through
+ * nvme_submit_command, nvme_deferred_fn, nvme_process_completion).
+ * We can set data->skip_yield here to keep the coroutine from
+ * yielding, and then we don't need to schedule a BH to wake it.
+ */
+ data->skip_yield = true;
+ } else {
+ ctx = qemu_coroutine_get_aio_context(data->co);
+ replay_bh_schedule_oneshot_event(ctx, nvme_rw_cb_bh, data);
}
- replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
}
static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
@@ -1217,10 +1227,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
.cdw12 = cpu_to_le32(cdw12),
};
- NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
- .ret = -EINPROGRESS,
- };
+ NVMeCoData data = { .co = qemu_coroutine_self() };
trace_nvme_prw_aligned(s, is_write, offset, bytes, flags, qiov->niov);
assert(s->queue_count > 1);
@@ -1235,9 +1242,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
return r;
}
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1332,18 +1337,13 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
.opcode = NVME_CMD_FLUSH,
.nsid = cpu_to_le32(s->nsid),
};
- NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
- .ret = -EINPROGRESS,
- };
+ NVMeCoData data = { .co = qemu_coroutine_self() };
assert(s->queue_count > 1);
req = nvme_get_free_req(ioq);
assert(req);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- if (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1383,10 +1383,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
};
- NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
- .ret = -EINPROGRESS,
- };
+ NVMeCoData data = { .co = qemu_coroutine_self() };
if (flags & BDRV_REQ_MAY_UNMAP) {
cdw12 |= (1 << 25);
@@ -1404,9 +1401,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
assert(req);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
@@ -1433,10 +1428,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
};
- NVMeCoData data = {
- .ctx = bdrv_get_aio_context(bs),
- .ret = -EINPROGRESS,
- };
+ NVMeCoData data = { .co = qemu_coroutine_self() };
if (!s->supports_discard) {
return -ENOTSUP;
@@ -1479,9 +1471,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
trace_nvme_dsm(s, offset, bytes);
nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
- data.co = qemu_coroutine_self();
- while (data.ret == -EINPROGRESS) {
+ if (!data.skip_yield) {
qemu_coroutine_yield();
}
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 09/16] block/io: Take reqs_lock for tracked_requests
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (7 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 08/16] nvme: Fix coroutine waking Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 10/16] qcow2: Fix cache_clean_timer Hanna Czenczek
` (6 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
bdrv_co_get_self_request() does not take a lock around iterating through
bs->tracked_requests. With multiqueue, it may thus iterate over a list
that is in the process of being modified, producing an assertion
failure:
../block/file-posix.c:3702: raw_do_pwrite_zeroes: Assertion `req' failed.
[0] abort() at /lib64/libc.so.6
[1] __assert_fail_base.cold() at /lib64/libc.so.6
[2] raw_do_pwrite_zeroes() at ../block/file-posix.c:3702
[3] bdrv_co_do_pwrite_zeroes() at ../block/io.c:1910
[4] bdrv_aligned_pwritev() at ../block/io.c:2109
[5] bdrv_co_do_zero_pwritev() at ../block/io.c:2192
[6] bdrv_co_pwritev_part() at ../block/io.c:2292
[7] bdrv_co_pwritev() at ../block/io.c:2225
[8] handle_alloc_space() at ../block/qcow2.c:2573
[9] qcow2_co_pwritev_task() at ../block/qcow2.c:2625
Fix this by taking reqs_lock.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/io.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/io.c b/block/io.c
index 9bd8ba8431..37df1e0253 100644
--- a/block/io.c
+++ b/block/io.c
@@ -721,11 +721,14 @@ BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs)
Coroutine *self = qemu_coroutine_self();
IO_CODE();
+ qemu_mutex_lock(&bs->reqs_lock);
QLIST_FOREACH(req, &bs->tracked_requests, list) {
if (req->co == self) {
+ qemu_mutex_unlock(&bs->reqs_lock);
return req;
}
}
+ qemu_mutex_unlock(&bs->reqs_lock);
return NULL;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 10/16] qcow2: Fix cache_clean_timer
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (8 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 09/16] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-29 20:23 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 11/16] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
` (5 subsequent siblings)
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
The cache-cleaner runs as a timer CB in the BDS AioContext. With
multiqueue, it can run concurrently with I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image. While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:
    while (i < c->size && can_clean_entry(c, i)) {
        for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
            usleep(100);
        }
        c->entries[i].offset = 0;
        c->entries[i].lru_counter = 0;
        i++;
        to_clean++;
    }
i.e. making it wait on purpose for the point in time where the cache is
in use by something else.
I hope the solution chosen in this patch is not the best one possible,
but I admittedly can’t come up with anything strictly better.
We can protect from concurrent cache accesses either by taking the
existing s->lock, or we introduce a new (non-coroutine) mutex
specifically for cache accesses. I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.
Using s->lock, which is a coroutine mutex, however means that we need to
take it in a coroutine, so the timer CB must enter such a coroutine. As
a result, descheduling the timer is no longer a guarantee that the
cache-cleaner will not run, because it may now be yielding in
qemu_co_mutex_lock().
(Note that even now, this was only guaranteed for cache_clean_timer_del()
callers that run in the BDS (the timer’s) AioContext. For callers
running in the main context, the problem may have already existed,
though maybe the BQL prevents timers from running in other contexts, I’m
not sure.)
Polling until the timer has actually settled seems very complicated for
something that’s rather a minor problem, but I can’t come up with any
better solution that doesn’t again just overlook potential problems.
(One cleaner idea may be to have a generic way to have timers run
coroutines, and to await those when descheduling the timer. But while
cleaner, it would also be more complicated, and I don’t think worth it
at this point.)
(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/qcow2.h | 1 +
block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 79 insertions(+), 12 deletions(-)
diff --git a/block/qcow2.h b/block/qcow2.h
index a9e3481c6e..272d34def1 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -347,6 +347,7 @@ typedef struct BDRVQcow2State {
Qcow2Cache *refcount_block_cache;
QEMUTimer *cache_clean_timer;
unsigned cache_clean_interval;
+ bool cache_clean_running, cache_clean_polling;
QLIST_HEAD(, QCowL2Meta) cluster_allocs;
diff --git a/block/qcow2.c b/block/qcow2.c
index 4aa9f9e068..ef851794c3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -835,12 +835,38 @@ static const char *overlap_bool_option_names[QCOW2_OL_MAX_BITNR] = {
[QCOW2_OL_BITMAP_DIRECTORY_BITNR] = QCOW2_OPT_OVERLAP_BITMAP_DIRECTORY,
};
-static void cache_clean_timer_cb(void *opaque)
+static void coroutine_fn cache_clean_timer_co(void *opaque)
{
BlockDriverState *bs = opaque;
BDRVQcow2State *s = bs->opaque;
+
+ qemu_co_mutex_lock(&s->lock);
+ if (!s->cache_clean_timer) {
+ /* cache_clean_timer_del() has been called, skip doing anything */
+ goto out;
+ }
qcow2_cache_clean_unused(s->l2_table_cache);
qcow2_cache_clean_unused(s->refcount_block_cache);
+
+out:
+ qemu_co_mutex_unlock(&s->lock);
+ qatomic_set(&s->cache_clean_running, false);
+ if (qatomic_xchg(&s->cache_clean_polling, false)) {
+ aio_wait_kick();
+ }
+}
+
+static void cache_clean_timer_cb(void *opaque)
+{
+ BlockDriverState *bs = opaque;
+ BDRVQcow2State *s = bs->opaque;
+ Coroutine *co;
+
+ co = qemu_coroutine_create(cache_clean_timer_co, bs);
+ /* cleared in cache_clean_timer_co() */
+ qatomic_set(&s->cache_clean_running, true);
+ qemu_coroutine_enter(co);
+
timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
(int64_t) s->cache_clean_interval * 1000);
}
@@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
}
}
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext, holding s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
+{
+ BDRVQcow2State *s = bs->opaque;
+ IO_OR_GS_CODE();
+ cache_clean_timer_del(bs);
+ if (qatomic_read(&s->cache_clean_running)) {
+ qemu_co_mutex_unlock(&s->lock);
+ qatomic_set(&s->cache_clean_polling, true);
+ BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
+ qemu_co_mutex_lock(&s->lock);
+ }
+}
+
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext without s->lock held.
+ */
+static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
+{
+ BDRVQcow2State *s = bs->opaque;
+ IO_OR_GS_CODE();
+ cache_clean_timer_del(bs);
+ if (qatomic_read(&s->cache_clean_running)) {
+ qatomic_set(&s->cache_clean_polling, true);
+ BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
+ }
+}
+
static void qcow2_detach_aio_context(BlockDriverState *bs)
{
cache_clean_timer_del(bs);
@@ -1214,12 +1273,20 @@ fail:
return ret;
}
+/* s_locked specifies whether s->lock is held or not */
static void qcow2_update_options_commit(BlockDriverState *bs,
- Qcow2ReopenState *r)
+ Qcow2ReopenState *r,
+ bool s_locked)
{
BDRVQcow2State *s = bs->opaque;
int i;
+ if (s_locked) {
+ cache_clean_timer_locked_co_del_and_wait(bs);
+ } else {
+ cache_clean_timer_del_and_wait(bs);
+ }
+
if (s->l2_table_cache) {
qcow2_cache_destroy(s->l2_table_cache);
}
@@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
}
s->l2_table_cache = r->l2_table_cache;
s->refcount_block_cache = r->refcount_block_cache;
+
+ s->cache_clean_interval = r->cache_clean_interval;
+ cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
+
s->l2_slice_size = r->l2_slice_size;
s->overlap_check = r->overlap_check;
@@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
s->discard_no_unref = r->discard_no_unref;
- if (s->cache_clean_interval != r->cache_clean_interval) {
- cache_clean_timer_del(bs);
- s->cache_clean_interval = r->cache_clean_interval;
- cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
- }
-
qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
s->crypto_opts = r->crypto_opts;
}
@@ -1261,6 +1326,7 @@ static void qcow2_update_options_abort(BlockDriverState *bs,
qapi_free_QCryptoBlockOpenOptions(r->crypto_opts);
}
+/* Called with s->lock held */
static int coroutine_fn GRAPH_RDLOCK
qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
Error **errp)
@@ -1270,7 +1336,7 @@ qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
ret = qcow2_update_options_prepare(bs, &r, options, flags, errp);
if (ret >= 0) {
- qcow2_update_options_commit(bs, &r);
+ qcow2_update_options_commit(bs, &r, true);
} else {
qcow2_update_options_abort(bs, &r);
}
@@ -1908,7 +1974,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
qemu_vfree(s->l1_table);
/* else pre-write overlap checks in cache_destroy may crash */
s->l1_table = NULL;
- cache_clean_timer_del(bs);
+ cache_clean_timer_locked_co_del_and_wait(bs);
if (s->l2_table_cache) {
qcow2_cache_destroy(s->l2_table_cache);
}
@@ -2048,7 +2114,7 @@ static void qcow2_reopen_commit(BDRVReopenState *state)
GRAPH_RDLOCK_GUARD_MAINLOOP();
- qcow2_update_options_commit(state->bs, state->opaque);
+ qcow2_update_options_commit(state->bs, state->opaque, false);
if (!s->data_file) {
/*
* If we don't have an external data file, s->data_file was cleared by
@@ -2805,7 +2871,7 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
qcow2_inactivate(bs);
}
- cache_clean_timer_del(bs);
+ cache_clean_timer_del_and_wait(bs);
qcow2_cache_destroy(s->l2_table_cache);
qcow2_cache_destroy(s->refcount_block_cache);
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 11/16] ssh: Run restart_coroutine in current AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (9 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 10/16] qcow2: Fix cache_clean_timer Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 12/16] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
` (4 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
restart_coroutine() is attached as an FD handler just to wake the
current coroutine after yielding. It makes the most sense to attach it to
the current (request) AioContext instead of the BDS main context. This
way, the coroutine can be entered directly from the BH instead of having
yet another indirection through AioContext.co_schedule_bh.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/ssh.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/block/ssh.c b/block/ssh.c
index 70fe7cf86e..bdec94e9e9 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -1010,19 +1010,18 @@ static int ssh_has_zero_init(BlockDriverState *bs)
}
typedef struct BDRVSSHRestart {
- BlockDriverState *bs;
+ BDRVSSHState *s;
Coroutine *co;
} BDRVSSHRestart;
static void restart_coroutine(void *opaque)
{
BDRVSSHRestart *restart = opaque;
- BlockDriverState *bs = restart->bs;
- BDRVSSHState *s = bs->opaque;
- AioContext *ctx = bdrv_get_aio_context(bs);
+ BDRVSSHState *s = restart->s;
trace_ssh_restart_coroutine(restart->co);
- aio_set_fd_handler(ctx, s->sock, NULL, NULL, NULL, NULL, NULL);
+ aio_set_fd_handler(qemu_get_current_aio_context(), s->sock,
+ NULL, NULL, NULL, NULL, NULL);
aio_co_wake(restart->co);
}
@@ -1031,12 +1030,13 @@ static void restart_coroutine(void *opaque)
* handlers are set up so that we'll be rescheduled when there is an
* interesting event on the socket.
*/
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
+static coroutine_fn void co_yield(BDRVSSHState *s)
{
int r;
IOHandler *rd_handler = NULL, *wr_handler = NULL;
+ AioContext *ctx = qemu_get_current_aio_context();
BDRVSSHRestart restart = {
- .bs = bs,
+ .s = s,
.co = qemu_coroutine_self()
};
@@ -1051,7 +1051,7 @@ static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
trace_ssh_co_yield(s->sock, rd_handler, wr_handler);
- aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
+ aio_set_fd_handler(ctx, s->sock,
rd_handler, wr_handler, NULL, NULL, &restart);
qemu_coroutine_yield();
trace_ssh_co_yield_back(s->sock);
@@ -1093,7 +1093,7 @@ static coroutine_fn int ssh_read(BDRVSSHState *s, BlockDriverState *bs,
trace_ssh_read_return(r, sftp_get_error(s->sftp));
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r == SSH_EOF || (r == 0 && sftp_get_error(s->sftp) == SSH_FX_EOF)) {
@@ -1168,7 +1168,7 @@ static coroutine_fn int ssh_write(BDRVSSHState *s, BlockDriverState *bs,
trace_ssh_write_return(r, sftp_get_error(s->sftp));
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r < 0) {
@@ -1233,7 +1233,7 @@ static coroutine_fn int ssh_flush(BDRVSSHState *s, BlockDriverState *bs)
again:
r = sftp_fsync(s->sftp_handle);
if (r == SSH_AGAIN) {
- co_yield(s, bs);
+ co_yield(s);
goto again;
}
if (r < 0) {
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 12/16] blkreplay: Run BH in coroutine’s AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (10 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 11/16] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 13/16] block: Note in which AioContext AIO CBs are called Hanna Czenczek
` (3 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
While it does not matter in which AioContext we run aio_co_wake() to
continue an exactly-once-yielding coroutine, making this commit not
strictly necessary, there is also no reason why the BH should run in any
context but the request’s AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/blkreplay.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blkreplay.c b/block/blkreplay.c
index 16d8b12dd9..8a6845425e 100644
--- a/block/blkreplay.c
+++ b/block/blkreplay.c
@@ -63,9 +63,10 @@ static void block_request_create(uint64_t reqid, BlockDriverState *bs,
Coroutine *co)
{
Request *req = g_new(Request, 1);
+ AioContext *ctx = qemu_coroutine_get_aio_context(co);
*req = (Request) {
.co = co,
- .bh = aio_bh_new(bdrv_get_aio_context(bs), blkreplay_bh_cb, req),
+ .bh = aio_bh_new(ctx, blkreplay_bh_cb, req),
};
replay_block_event(req->bh, reqid);
}
--
2.51.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 13/16] block: Note in which AioContext AIO CBs are called
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (11 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 12/16] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 14/16] iscsi: Create AIO BH in original AioContext Hanna Czenczek
` (2 subsequent siblings)
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
This doesn’t seem to be specified anywhere, but it is something we
probably want to make explicit. I believe it is reasonable to implicitly
assume that
callbacks are run in the current thread (unless explicitly noted
otherwise), so codify that assumption.
Some implementations don’t actually fulfill this contract yet. The next
patches should rectify that.
Note: I don’t know of any user-visible bugs produced by not running AIO
callbacks in the original context. AIO functionality is generally
mapped to coroutines through the use of bdrv_co_io_em_complete(), which
can run in any AioContext, and will always wake the yielding coroutine
in its original context. The only benefit here is that running
bdrv_co_io_em_complete() in the original context will most likely turn
that aio_co_wake() into a simple qemu_coroutine_enter() instead of
scheduling the wakeup through AioContext.co_schedule_bh.
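For reference, the helper in block/io.c looks roughly like this
(slightly simplified):

    typedef struct CoroutineIOCompletion {
        Coroutine *coroutine;
        int ret;
    } CoroutineIOCompletion;

    /* AIO completion callback; may currently run in any AioContext */
    static void bdrv_co_io_em_complete(void *opaque, int ret)
    {
        CoroutineIOCompletion *co = opaque;

        co->ret = ret;
        /* Wakes the yielding coroutine in its own AioContext */
        aio_co_wake(co->coroutine);
    }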
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/block/block_int-common.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 034c0634c8..a584f20aea 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -508,7 +508,12 @@ struct BlockDriver {
BlockDriverState *bs, BlockdevAmendOptions *opts, bool force,
Error **errp);
- /* aio */
+ /*
+ * AIO
+ * The given completion callback will be run in the same AioContext as the
+ * one in which the AIO function was called.
+ */
+
BlockAIOCB * GRAPH_RDLOCK_PTR (*bdrv_aio_preadv)(BlockDriverState *bs,
int64_t offset, int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
--
2.51.0
* [PATCH 14/16] iscsi: Create AIO BH in original AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (12 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 13/16] block: Note in which AioContext AIO CBs are called Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 15/16] null-aio: Run CB " Hanna Czenczek
2025-10-28 16:33 ` [PATCH 16/16] win32-aio: Run CB in original context Hanna Czenczek
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the original request’s AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/iscsi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 76c15e20ea..fd51aae692 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -119,6 +119,7 @@ typedef struct IscsiTask {
typedef struct IscsiAIOCB {
BlockAIOCB common;
+ AioContext *ctx;
QEMUBH *bh;
IscsiLun *iscsilun;
struct scsi_task *task;
@@ -173,7 +174,7 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
if (acb->bh) {
return;
}
- acb->bh = aio_bh_new(acb->iscsilun->aio_context, iscsi_bh_cb, acb);
+ acb->bh = aio_bh_new(acb->ctx, iscsi_bh_cb, acb);
qemu_bh_schedule(acb->bh);
}
@@ -1007,8 +1008,7 @@ static void iscsi_ioctl_handle_emulated(IscsiAIOCB *acb, int req, void *buf)
ret = -EINVAL;
}
assert(!acb->bh);
- acb->bh = aio_bh_new(bdrv_get_aio_context(bs),
- iscsi_ioctl_bh_completion, acb);
+ acb->bh = aio_bh_new(acb->ctx, iscsi_ioctl_bh_completion, acb);
acb->ret = ret;
qemu_bh_schedule(acb->bh);
}
@@ -1025,6 +1025,7 @@ static BlockAIOCB *iscsi_aio_ioctl(BlockDriverState *bs,
acb = qemu_aio_get(&iscsi_aiocb_info, bs, cb, opaque);
acb->iscsilun = iscsilun;
+ acb->ctx = qemu_get_current_aio_context();
acb->bh = NULL;
acb->status = -EINPROGRESS;
acb->ioh = buf;
--
2.51.0
* [PATCH 15/16] null-aio: Run CB in original AioContext
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (13 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 14/16] iscsi: Create AIO BH in original AioContext Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 16/16] win32-aio: Run CB in original context Hanna Czenczek
15 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the originally calling AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/null.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/block/null.c b/block/null.c
index 4e448d593d..253d20ccbb 100644
--- a/block/null.c
+++ b/block/null.c
@@ -173,18 +173,17 @@ static inline BlockAIOCB *null_aio_common(BlockDriverState *bs,
{
NullAIOCB *acb;
BDRVNullState *s = bs->opaque;
+ AioContext *ctx = qemu_get_current_aio_context();
acb = qemu_aio_get(&null_aiocb_info, bs, cb, opaque);
/* Only emulate latency after vcpu is running. */
if (s->latency_ns) {
- aio_timer_init(bdrv_get_aio_context(bs), &acb->timer,
- QEMU_CLOCK_REALTIME, SCALE_NS,
+ aio_timer_init(ctx, &acb->timer, QEMU_CLOCK_REALTIME, SCALE_NS,
null_timer_cb, acb);
timer_mod_ns(&acb->timer,
qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + s->latency_ns);
} else {
- replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
- null_bh_cb, acb);
+ replay_bh_schedule_oneshot_event(ctx, null_bh_cb, acb);
}
return &acb->common;
}
--
2.51.0
* [PATCH 16/16] win32-aio: Run CB in original context
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
` (14 preceding siblings ...)
2025-10-28 16:33 ` [PATCH 15/16] null-aio: Run CB " Hanna Czenczek
@ 2025-10-28 16:33 ` Hanna Czenczek
2025-10-30 14:12 ` Kevin Wolf
15 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-28 16:33 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Stefan Hajnoczi,
Paolo Bonzini, Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
AIO callbacks must be called in the originally calling AioContext,
regardless of the BDS’s “main” AioContext.
Note: I tried to test this (under wine), but failed. Whenever I tried
to use multiqueue or even just an I/O thread for a virtio-blk (or
virtio-scsi) device, I/O stalled, both with and without this patch.
For what it’s worth, when not using an I/O thread, I/O continued to work
with this patch.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/win32-aio.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)
diff --git a/block/win32-aio.c b/block/win32-aio.c
index 6327861e1d..f0689f3ee9 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -48,48 +48,62 @@ struct QEMUWin32AIOState {
typedef struct QEMUWin32AIOCB {
BlockAIOCB common;
struct QEMUWin32AIOState *ctx;
+ AioContext *req_ctx;
int nbytes;
OVERLAPPED ov;
QEMUIOVector *qiov;
void *buf;
bool is_read;
bool is_linear;
+ int ret;
} QEMUWin32AIOCB;
+static void win32_aio_completion_cb_bh(void *opaque)
+{
+ QEMUWin32AIOCB *waiocb = opaque;
+
+ waiocb->common.cb(waiocb->common.opaque, waiocb->ret);
+ aio_context_unref(waiocb->req_ctx);
+ qemu_aio_unref(waiocb);
+}
+
/*
* Completes an AIO request (calls the callback and frees the ACB).
*/
static void win32_aio_process_completion(QEMUWin32AIOState *s,
QEMUWin32AIOCB *waiocb, DWORD count)
{
- int ret;
s->count--;
if (waiocb->ov.Internal != 0) {
- ret = -EIO;
+ waiocb->ret = -EIO;
} else {
- ret = 0;
+ waiocb->ret = 0;
if (count < waiocb->nbytes) {
/* Short reads mean EOF, pad with zeros. */
if (waiocb->is_read) {
qemu_iovec_memset(waiocb->qiov, count, 0,
waiocb->qiov->size - count);
} else {
- ret = -EINVAL;
+ waiocb->ret = -EINVAL;
}
}
}
if (!waiocb->is_linear) {
- if (ret == 0 && waiocb->is_read) {
+ if (waiocb->ret == 0 && waiocb->is_read) {
QEMUIOVector *qiov = waiocb->qiov;
iov_from_buf(qiov->iov, qiov->niov, 0, waiocb->buf, qiov->size);
}
qemu_vfree(waiocb->buf);
}
- waiocb->common.cb(waiocb->common.opaque, ret);
- qemu_aio_unref(waiocb);
+ if (waiocb->req_ctx == s->aio_ctx) {
+ win32_aio_completion_cb_bh(waiocb);
+ } else {
+ aio_bh_schedule_oneshot(waiocb->req_ctx, win32_aio_completion_cb_bh,
+ waiocb);
+ }
}
static void win32_aio_completion_cb(EventNotifier *e)
@@ -120,10 +134,13 @@ BlockAIOCB *win32_aio_submit(BlockDriverState *bs,
DWORD rc;
waiocb = qemu_aio_get(&win32_aiocb_info, bs, cb, opaque);
+ waiocb->req_ctx = qemu_get_current_aio_context();
waiocb->nbytes = bytes;
waiocb->qiov = qiov;
waiocb->is_read = (type == QEMU_AIO_READ);
+ aio_context_ref(waiocb->req_ctx);
+
if (qiov->niov > 1) {
waiocb->buf = qemu_try_blockalign(bs, qiov->size);
if (waiocb->buf == NULL) {
--
2.51.0
* Re: [PATCH 03/16] iscsi: Run co BH CB in the coroutine’s AioContext
2025-10-28 16:33 ` [PATCH 03/16] iscsi: " Hanna Czenczek
@ 2025-10-29 14:27 ` Kevin Wolf
2025-10-31 9:07 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 14:27 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> For rbd (and others), as described in “rbd: Run co BH CB in the
> coroutine’s AioContext”, the pattern of setting a completion flag and
> waking a coroutine that yields while the flag is not set can only work
> when both run in the same thread.
>
> iscsi has the same pattern, but the details are a bit different:
> iscsi_co_generic_cb() can (as far as I understand) only run through
> iscsi_service(), not just from a random thread at a random time.
> iscsi_service() in turn can only be run after iscsi_set_events() set up
> an FD event handler, which is done in iscsi_co_wait_for_task().
>
> As a result, iscsi_co_wait_for_task() will always yield exactly once,
> because iscsi_co_generic_cb() can only run after iscsi_set_events(),
> after the completion flag has already been checked, and the yielding
> coroutine will then be woken only once the completion flag was set to
> true. So as far as I can tell, iscsi has no bug and already works fine.
>
> Still, we don’t need the completion flag because we know we have to
> yield exactly once, so we can drop it. This simplifies the code and
> makes it more obvious that the “rbd bug” isn’t present here.
>
> This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
> bit boring, and actually, for the former, we could drop it and run
> aio_co_wake() directly from iscsi_co_generic_cb() to the same effect; but
> that would remove the replay_bh_schedule_oneshot_event(), and I assume
> we shouldn’t do that. At least schedule both the BH and the timer in
> the coroutine’s AioContext to make them simple wrappers around
> qemu_coroutine_enter(), without a further BH indirection.
I don't think we have to keep the BH. Is your concern about replay? I
doubt that this works across different QEMU versions anyway, and if it
does, it's pure luck.
> Finally, remove the iTask->co != NULL checks: This field is set by
> iscsi_co_init_iscsitask(), which all users of IscsiTask run before even
> setting up iscsi_co_generic_cb() as the callback, and it is never set or
> cleared elsewhere, so it is impossible for it not to be set in
> iscsi_co_generic_cb().
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Kevin
* Re: [PATCH 05/16] curl: Fix coroutine waking
2025-10-28 16:33 ` [PATCH 05/16] curl: Fix coroutine waking Hanna Czenczek
@ 2025-10-29 16:57 ` Kevin Wolf
2025-10-31 9:15 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 16:57 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> If we wake a coroutine from a different context, we must ensure that it
> will yield exactly once (now or later), awaiting that wake.
>
> curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
> yielding if the request finishes before the loop gets run. To fix it,
> drop the loop and just yield exactly once, unless the request is served
> from the cache or failed before it was submitted – that last part makes
> it a bit complicated, as the result of curl_find_buf() now needs to be a
> tristate.
>
> (Can be reproduced with multiqueue by adding a usleep(100000) before the
> `while (acb.ret == -EINPROGRESS)` loop.)
>
> Also, add a comment why aio_co_wake() is safe regardless of whether the
> coroutine and curl_multi_check_completion() run in the same context.
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/curl.c | 55 +++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 37 insertions(+), 18 deletions(-)
>
> diff --git a/block/curl.c b/block/curl.c
> index 68cf83ce55..65996a8866 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -124,6 +124,16 @@ typedef struct BDRVCURLState {
> char *proxypassword;
> } BDRVCURLState;
>
> +/** Possible result states of curl_find_buf() */
> +typedef enum {
> + /* No buffer found, need to create new request */
> + CURL_NO_BUF_FOUND,
> + /* Buffer found, request filled and done */
> + CURL_REQUEST_FILLED,
> + /* Ongoing request found, need to yield */
> + CURL_REQUEST_ONGOING,
> +} CURLFindBufResult;
> +
> static void curl_clean_state(CURLState *s);
> static void curl_multi_do(void *arg);
>
> @@ -258,8 +268,8 @@ read_end:
> }
>
> /* Called with s->mutex held. */
> -static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> - CURLAIOCB *acb)
> +static CURLFindBufResult curl_find_buf(BDRVCURLState *s, uint64_t start,
> + uint64_t len, CURLAIOCB *acb)
> {
> int i;
> uint64_t end = start + len;
> @@ -289,7 +299,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len);
> }
> acb->ret = 0;
> - return true;
> + return CURL_REQUEST_FILLED;
> }
>
> // Wait for unfinished chunks
> @@ -307,13 +317,13 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> for (j=0; j<CURL_NUM_ACB; j++) {
> if (!state->acb[j]) {
> state->acb[j] = acb;
> - return true;
> + return CURL_REQUEST_ONGOING;
> }
> }
> }
> }
>
> - return false;
> + return CURL_NO_BUF_FOUND;
> }
>
> /* Called with s->mutex held. */
> @@ -378,6 +388,16 @@ static void curl_multi_check_completion(BDRVCURLState *s)
> acb->ret = error ? -EIO : 0;
> state->acb[i] = NULL;
> qemu_mutex_unlock(&s->mutex);
> + /*
> + * Current AioContext is the BDS context, which may or may not
> + * be the request (coroutine) context.
> + * - If it is, the coroutine must have yielded or the FD handler
> + * (curl_multi_do()/curl_multi_timeout_do()) could not have
> + * been called and we would not be here
> + * - If it is not, it doesn't matter whether it has already
> + * yielded or not; it will be scheduled once it does yield
> + * So aio_co_wake() is safe to call.
> + */
> aio_co_wake(acb->co);
> qemu_mutex_lock(&s->mutex);
> }
> @@ -868,7 +888,8 @@ out_noclean:
> return -EINVAL;
> }
>
> -static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> +/* Return whether a request was submitted that requires yielding */
> +static bool coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> {
> CURLState *state;
> int running;
> @@ -877,13 +898,15 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>
> uint64_t start = acb->offset;
> uint64_t end;
> + CURLFindBufResult find_buf_res;
>
> - qemu_mutex_lock(&s->mutex);
> + QEMU_LOCK_GUARD(&s->mutex);
>
> // In case we have the requested data already (e.g. read-ahead),
> // we can just call the callback and be done.
> - if (curl_find_buf(s, start, acb->bytes, acb)) {
> - goto out;
> + find_buf_res = curl_find_buf(s, start, acb->bytes, acb);
> + if (find_buf_res != CURL_NO_BUF_FOUND) {
> + return find_buf_res == CURL_REQUEST_ONGOING;
> }
>
> // No cache found, so let's start a new request
> @@ -898,7 +921,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> if (curl_init_state(s, state) < 0) {
> curl_clean_state(state);
> acb->ret = -EIO;
> - goto out;
> + return false;
> }
>
> acb->start = 0;
> @@ -913,7 +936,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> if (state->buf_len && state->orig_buf == NULL) {
> curl_clean_state(state);
> acb->ret = -ENOMEM;
> - goto out;
> + return false;
> }
> state->acb[0] = acb;
>
> @@ -925,14 +948,12 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> acb->ret = -EIO;
>
> curl_clean_state(state);
> - goto out;
> + return false;
> }
>
> /* Tell curl it needs to kick things off */
> curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
> -
> -out:
> - qemu_mutex_unlock(&s->mutex);
> + return true;
> }
>
> static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> @@ -941,14 +962,12 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> {
> CURLAIOCB acb = {
> .co = qemu_coroutine_self(),
> - .ret = -EINPROGRESS,
> .qiov = qiov,
> .offset = offset,
> .bytes = bytes
> };
Let's leave -EINPROGRESS here even if no other code checks for this
value any more. It can be helpful for debugging when you can distinguish
"completed successfully" from "still running".
>
> - curl_setup_preadv(bs, &acb);
> - while (acb.ret == -EINPROGRESS) {
> + if (curl_setup_preadv(bs, &acb)) {
> qemu_coroutine_yield();
> }
> return acb.ret;
That whole pattern of returning true and false or even a new enum
everywhere to tell if we are waiting for something felt strange to me.
Took me a while, but I think now I know what I expected instead: Why
don't these places just yield immediately instead of requiring the outer
layer to understand what happened in the functions it called?
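Something like this (untested sketch, assuming plain
qemu_mutex_lock()/unlock() rather than a lock guard):

    /* In curl_setup_preadv(), instead of returning a tristate: */
    if (find_buf_res == CURL_REQUEST_ONGOING) {
        /* Drop s->mutex before yielding so the FD handlers can run
         * and curl_multi_check_completion() can wake us */
        qemu_mutex_unlock(&s->mutex);
        qemu_coroutine_yield();
        qemu_mutex_lock(&s->mutex);
        return;
    }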
Kevin
* Re: [PATCH 06/16] gluster: Do not move coroutine into BDS context
2025-10-28 16:33 ` [PATCH 06/16] gluster: Do not move coroutine into BDS context Hanna Czenczek
@ 2025-10-29 17:10 ` Kevin Wolf
2025-10-31 9:16 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 17:10 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> The request coroutine may not run in the BDS AioContext. We should wake
> it in its own context, not move it.
>
> With that, we can remove GlusterAIOCB.aio_context.
>
> Also add a comment why aio_co_schedule() is safe to use in this way.
>
> **Note:** Due to a lack of a gluster set-up, I have not tested this
> commit. It seemed safe enough to send anyway, just maybe not to
> qemu-stable. To be clear, I don’t know of any user-visible bugs that
> would arise from the state without this patch; the request coroutine is
> moved into the main BDS AioContext, which feels wrong, but I’m not sure
> whether it can actually produce hard bugs.
Doesn't moving into a different AioContext mean that everything down to
the AIO callback in the guest device runs in a different thread? That's
definitely a big no-no for virtio-scsi, and I think also for virtio-blk.
> I’ll leave it to others’ opinions whether to keep or drop this patch
> under these circumstances.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
It looks trivially correct (famous last words...) and the bugs not
unlikely to be hit, so I'd say keep it.
I have no idea if the gluster library is actually thread safe, but if it
isn't, that breaks before gluster_finish_aiocb(). After reentering the
coroutine, the library isn't called any more.
Kevin
* Re: [PATCH 07/16] nvme: Kick and check completions in BDS context
2025-10-28 16:33 ` [PATCH 07/16] nvme: Kick and check completions in " Hanna Czenczek
@ 2025-10-29 17:23 ` Kevin Wolf
2025-10-29 17:39 ` Kevin Wolf
2025-10-31 9:19 ` Hanna Czenczek
0 siblings, 2 replies; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 17:23 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> nvme_process_completion() must run in the main BDS context, so schedule
> a BH for requests that aren’t there.
>
> The context in which we kick does not matter, but let’s just keep kick
> and process_completion together for simplicity’s sake.
Ok, fair, move to the main BDS context for calling these functions. But
doesn't that mean that we need to move back to the request context for
calling the callback?
In particular, I see this:
static void nvme_rw_cb_bh(void *opaque)
{
    NVMeCoData *data = opaque;
    qemu_coroutine_enter(data->co);
}
The next patch changes some things about coroutine wakeup, but it
doesn't touch this qemu_coroutine_enter(). So I think the coroutine is
now running in the wrong thread.
I also feel that it gets a bit confusing what is running in which
context, so maybe we can add comments to each of the callbacks saying
whether they run in the main BDS context or the request coroutine context.
> (For what it’s worth, a quick fio bandwidth test indicates that on my
> test hardware, if anything, this may be a bit better than kicking
> immediately before scheduling a pure nvme_process_completion() BH. But
> I wouldn’t take more from those results than that it doesn’t really seem
> to matter either way.)
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Kevin
* Re: [PATCH 07/16] nvme: Kick and check completions in BDS context
2025-10-29 17:23 ` Kevin Wolf
@ 2025-10-29 17:39 ` Kevin Wolf
2025-10-31 9:18 ` Hanna Czenczek
2025-10-31 9:19 ` Hanna Czenczek
1 sibling, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 17:39 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.2025 at 18:23, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> > nvme_process_completion() must run in the main BDS context, so schedule
> > a BH for requests that aren’t there.
> >
> > The context in which we kick does not matter, but let’s just keep kick
> > and process_completion together for simplicity’s sake.
>
> Ok, fair, move to the main BDS context for calling these functions. But
> doesn't that mean that we need to move back to the request context for
> calling the callback?
>
> In particular, I see this:
>
> static void nvme_rw_cb_bh(void *opaque)
> {
>     NVMeCoData *data = opaque;
>     qemu_coroutine_enter(data->co);
> }
>
> The next patch changes some things about coroutine wakeup, but it
> doesn't touch this qemu_coroutine_enter(). So I think the coroutine is
> now running in the wrong thread.
It actually isn't because the patch changes in which AioContext the BH
is called. Quite confusing with all the indirections. Let's get rid of
the BH with qemu_coroutine_enter() and just call aio_co_wake() directly.
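I.e., roughly (untested sketch, ignoring the fast path for the case
where the coroutine has not yielded yet):

    static void nvme_rw_cb(void *opaque, int ret)
    {
        NVMeCoData *data = opaque;

        data->ret = ret;
        /* Safe from any context: enters the coroutine directly if we
         * are already in its AioContext, otherwise schedules it there */
        aio_co_wake(data->co);
    }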
Kevin
> I also feel that it gets a bit confusing what is running in which
> context, so maybe we can add comments to each of the callbacks saying
> whether they run in the main BDS context or the request coroutine context.
>
> > (For what it’s worth, a quick fio bandwidth test indicates that on my
> > test hardware, if anything, this may be a bit better than kicking
> > immediately before scheduling a pure nvme_process_completion() BH. But
> > I wouldn’t take more from those results than that it doesn’t really seem
> > to matter either way.)
> >
> > Cc: qemu-stable@nongnu.org
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>
> Kevin
* Re: [PATCH 08/16] nvme: Fix coroutine waking
2025-10-28 16:33 ` [PATCH 08/16] nvme: Fix coroutine waking Hanna Czenczek
@ 2025-10-29 17:43 ` Kevin Wolf
0 siblings, 0 replies; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 17:43 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> nvme wakes the request coroutine via qemu_coroutine_enter() from a BH
> scheduled in the BDS AioContext. This may not be the same context as
> the one in which the request originally ran, which would be wrong:
> - It could mean we enter the coroutine before it yields,
>> - We would move the coroutine into a different context.
>
> (Can be reproduced with multiqueue by adding a usleep(100000) before the
> `while (data.ret == -EINPROGRESS)` loop.)
>
> To fix that, schedule nvme_rw_cb_bh() in the coroutine AioContext.
> (Just like in the preceding iscsi and nfs patches, we could drop the
> trivial nvme_rw_cb_bh() and just use aio_co_wake() directly, but don’t
> do that to keep replay_bh_schedule_oneshot_event().)
>
> With this, we can remove NVMeCoData.ctx.
>
> Note the check of data->co == NULL to bypass the BH/yield combination in
> case nvme_rw_cb() is called from nvme_submit_command(): We probably want
> to keep this fast path for performance reasons, but we have to be quite
> careful about it:
> - We cannot overload .ret for this, but have to use a dedicated
> .skip_yield field. Otherwise, if nvme_rw_cb() runs in a different
> thread than the coroutine, it may see .ret set and skip the yield,
> while nvme_rw_cb() will still schedule a BH for waking. Therefore,
> the signal to skip the yield can only be set in nvme_rw_cb() if waking
> too is skipped, which is independent from communicating the return
> value.
> - We can only skip the yield if nvme_rw_cb() actually runs in the
> request coroutine. Otherwise (specifically if they run in different
> AioContexts), the order between this function’s execution and the
> coroutine yielding (or not yielding) is not reliable.
> - There is no point to yielding in a loop; there are no spurious wakes,
> so once we yield, we will only be re-entered once the command is done.
> Replace `while` by `if`.
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
As suggested in an earlier patch, I prefer to keep setting -EINPROGRESS,
even if we don't check for it elsewhere, for better debugability.
Kevin
* Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
2025-10-28 16:33 ` [PATCH 10/16] qcow2: Fix cache_clean_timer Hanna Czenczek
@ 2025-10-29 20:23 ` Kevin Wolf
2025-10-31 9:29 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-29 20:23 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> The cache-cleaner runs as a timer CB in the BDS AioContext. With
> multiqueue, it can run concurrently to I/O requests, and because it does
> not take any lock, this can break concurrent cache accesses, corrupting
> the image. While the chances of this happening are low, it can be
> reproduced e.g. by modifying the code to schedule the timer CB every
> 5 ms (instead of at most once per second) and modifying the last (inner)
> while loop of qcow2_cache_clean_unused() like so:
>
> while (i < c->size && can_clean_entry(c, i)) {
> for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> usleep(100);
> }
> c->entries[i].offset = 0;
> c->entries[i].lru_counter = 0;
> i++;
> to_clean++;
> }
>
> i.e. making it wait on purpose for the point in time where the cache is
> in use by something else.
>
> The solution chosen for this in this patch is not the best solution, I
> hope, but I admittedly can’t come up with anything strictly better.
>
> We can protect from concurrent cache accesses either by taking the
> existing s->lock, or we introduce a new (non-coroutine) mutex
> specifically for cache accesses. I would prefer to avoid the latter so
> as not to introduce additional (very slight) overhead.
In theory, the old plan was that eventually qcow2 would use fine-grained
locks instead of the single s->lock, and having a separate cache lock
would be a step towards it. But if we never actually make use of it to
hold s->lock for a shorter time, that's not really a good argument. I'm
not sure that's ever going to happen, except maybe for a rewrite in Rust
or something.
I never tried to measure specifically whether lock contention is a problem
with high queue depth and random I/O on a huge disk. Intuitively,
holding s->lock while doing I/O for loading entries into the cache can't
really be good.
Anyway, I went a bit on a tangent there...
> Using s->lock, which is a coroutine mutex, however means that we need to
> take it in a coroutine, so the timer CB must enter such a coroutine. As
> a result, descheduling the timer is no longer a guarantee that the
> cache-cleaner will not run, because it may now be yielding in
> qemu_co_mutex_lock().
I think creating a coroutine in cache_clean_timer_cb() is the wrong
approach. Instead, cache_clean_timer_init() could create a coroutine
and its implementation could be something like this:
    while (!s->cache_clean_timer_stopping) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval * NANOSECONDS_PER_SECOND);

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }
    s->cache_clean_timer_stopping = false;
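cache_clean_timer_init() would then just create and schedule that
coroutine, something like (sketch; the entry point name is made up):

    static void cache_clean_timer_init(BlockDriverState *bs, AioContext *ctx)
    {
        BDRVQcow2State *s = bs->opaque;

        if (s->cache_clean_interval > 0) {
            Coroutine *co = qemu_coroutine_create(cache_clean_co_entry, bs);
            aio_co_schedule(ctx, co);
        }
    }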
> (Note even now this was only guaranteed for cache_clean_timer_del()
> callers that run in the BDS (the timer’s) AioContext. For callers
> running in the main context, the problem may have already existed,
> though maybe the BQL prevents timers from running in other contexts, I’m
> not sure.)
>
> Polling to await the timer to actually settle seems very complicated for
> something that’s rather a minor problem, but I can’t come up with any
> better solution that doesn’t again just overlook potential problems.
>
> (One cleaner idea may be to have a generic way to have timers run
> coroutines, and to await those when descheduling the timer. But while
> cleaner, it would also be more complicated, and I don’t think worth it
> at this point.)
>
> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> I’m not too fond of this solution.)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/qcow2.h | 1 +
> block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 79 insertions(+), 12 deletions(-)
> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
> }
> }
>
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext, holding s->lock.
> + */
> +static void coroutine_fn
> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + IO_OR_GS_CODE();
> + cache_clean_timer_del(bs);
> + if (qatomic_read(&s->cache_clean_running)) {
> + qemu_co_mutex_unlock(&s->lock);
> + qatomic_set(&s->cache_clean_polling, true);
> + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
Polling in a coroutine_fn is verboten.
If we do need this function, I think it would be a yield here and a wake
on the other side. I think we might be able to get around it if we move
the call from qcow2_do_open() into qcow2_open() (i.e. outside the
coroutine). A bit ugly, so your choice.
> + qemu_co_mutex_lock(&s->lock);
> + }
> +}
> +
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext without s->lock held.
> + */
> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + IO_OR_GS_CODE();
> + cache_clean_timer_del(bs);
> + if (qatomic_read(&s->cache_clean_running)) {
> + qatomic_set(&s->cache_clean_polling, true);
> + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> + }
> +}
> +
> static void qcow2_detach_aio_context(BlockDriverState *bs)
> {
> cache_clean_timer_del(bs);
> @@ -1214,12 +1273,20 @@ fail:
> return ret;
> }
>
> +/* s_locked specifies whether s->lock is held or not */
> static void qcow2_update_options_commit(BlockDriverState *bs,
> - Qcow2ReopenState *r)
> + Qcow2ReopenState *r,
> + bool s_locked)
> {
> BDRVQcow2State *s = bs->opaque;
> int i;
>
> + if (s_locked) {
> + cache_clean_timer_locked_co_del_and_wait(bs);
> + } else {
> + cache_clean_timer_del_and_wait(bs);
> + }
> +
> if (s->l2_table_cache) {
> qcow2_cache_destroy(s->l2_table_cache);
> }
> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> }
> s->l2_table_cache = r->l2_table_cache;
> s->refcount_block_cache = r->refcount_block_cache;
> +
> + s->cache_clean_interval = r->cache_clean_interval;
> + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> +
> s->l2_slice_size = r->l2_slice_size;
>
> s->overlap_check = r->overlap_check;
> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>
> s->discard_no_unref = r->discard_no_unref;
>
> - if (s->cache_clean_interval != r->cache_clean_interval) {
> - cache_clean_timer_del(bs);
> - s->cache_clean_interval = r->cache_clean_interval;
> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> - }
> -
I think the del/init pair here won't be necessary any more after
switching to the background coroutine. It will just start using the new
value of s->cache_clean_interval the next time it sleeps.
Kevin
* Re: [PATCH 16/16] win32-aio: Run CB in original context
2025-10-28 16:33 ` [PATCH 16/16] win32-aio: Run CB in original context Hanna Czenczek
@ 2025-10-30 14:12 ` Kevin Wolf
2025-10-31 9:31 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-30 14:12 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
> AIO callbacks must be called in the originally calling AioContext,
> regardless of the BDS’s “main” AioContext.
>
> Note: I tried to test this (under wine), but failed. Whenever I tried
> to use multiqueue or even just an I/O thread for a virtio-blk (or
> virtio-scsi) device, I/O stalled, both with and without this patch.
>
> For what it’s worth, when not using an I/O thread, I/O continued to work
> with this patch.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Should we then do the opposite thing and just move every request into
the main thread and only move back before returning to the caller?
But I'm also not against making the theoretical fix so that maybe
someone can fix the other problem later. It just seems to be of somewhat
limited use on its own.
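For a coroutine-based driver, that could look something like this
(untested sketch; do_the_io() is a made-up placeholder):

    AioContext *req_ctx = qemu_get_current_aio_context();

    /* Hop into the main BDS context for the actual I/O... */
    aio_co_reschedule_self(bdrv_get_aio_context(bs));
    ret = do_the_io(bs, offset, bytes, qiov);
    /* ...and hop back into the request context before returning */
    aio_co_reschedule_self(req_ctx);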
Kevin
* Re: [PATCH 03/16] iscsi: Run co BH CB in the coroutine’s AioContext
2025-10-29 14:27 ` Kevin Wolf
@ 2025-10-31 9:07 ` Hanna Czenczek
2025-10-31 13:27 ` Kevin Wolf
0 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:07 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 15:27, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> For rbd (and others), as described in “rbd: Run co BH CB in the
>> coroutine’s AioContext”, the pattern of setting a completion flag and
>> waking a coroutine that yields while the flag is not set can only work
>> when both run in the same thread.
>>
>> iscsi has the same pattern, but the details are a bit different:
>> iscsi_co_generic_cb() can (as far as I understand) only run through
>> iscsi_service(), not just from a random thread at a random time.
>> iscsi_service() in turn can only be run after iscsi_set_events() set up
>> an FD event handler, which is done in iscsi_co_wait_for_task().
>>
>> As a result, iscsi_co_wait_for_task() will always yield exactly once,
>> because iscsi_co_generic_cb() can only run after iscsi_set_events(),
>> after the completion flag has already been checked, and the yielding
>> coroutine will then be woken only once the completion flag was set to
>> true. So as far as I can tell, iscsi has no bug and already works fine.
>>
>> Still, we don’t need the completion flag because we know we have to
>> yield exactly once, so we can drop it. This simplifies the code and
>> makes it more obvious that the “rbd bug” isn’t present here.
>>
>> This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
>> bit boring, and actually, for the former, we could drop it and run
>> aio_co_wake() directly from iscsi_co_generic_cb() to the same effect; but
>> that would remove the replay_bh_schedule_oneshot_event(), and I assume
>> we shouldn’t do that. At least schedule both the BH and the timer in
>> the coroutine’s AioContext to make them simple wrappers around
>> qemu_coroutine_enter(), without a further BH indirection.
> I don't think we have to keep the BH. Is your concern about replay? I
> doubt that this works across different QEMU versions anyway, and if it
> does, it's pure luck.
It is solely about replay, yes. I assumed the
replay_bh_schedule_oneshot_event() would be a replay point, so removing
it would, well, remove a replay point. I suppose we’re going to have
one replay point per request anyway (when going through the blkreplay
driver), so maybe it doesn’t matter much?
Anyway, it seemed safer to keep it. But apart from replay, I don’t have
any concern about dropping the BH.
>> Finally, remove the iTask->co != NULL checks: This field is set by
>> iscsi_co_init_iscsitask(), which all users of IscsiTask run before even
>> setting up iscsi_co_generic_cb() as the callback, and it is never set or
>> cleared elsewhere, so it is impossible for it not to be set in
>> iscsi_co_generic_cb().
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> Kevin
>
* Re: [PATCH 05/16] curl: Fix coroutine waking
2025-10-29 16:57 ` Kevin Wolf
@ 2025-10-31 9:15 ` Hanna Czenczek
2025-10-31 13:17 ` Kevin Wolf
0 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:15 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 17:57, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> If we wake a coroutine from a different context, we must ensure that it
>> will yield exactly once (now or later), awaiting that wake.
>>
>> curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
>> yielding if the request finishes before the loop gets run. To fix it,
>> drop the loop and just yield exactly once, unless the request is served
>> from the cache or failed before it was submitted – that last part makes
>> it a bit complicated, as the result of curl_find_buf() now needs to be a
>> tristate.
>>
>> (Can be reproduced with multiqueue by adding a usleep(100000) before the
>> `while (acb.ret == -EINPROGRESS)` loop.)
>>
>> Also, add a comment why aio_co_wake() is safe regardless of whether the
>> coroutine and curl_multi_check_completion() run in the same context.
>>
>> Cc: qemu-stable@nongnu.org
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/curl.c | 55 +++++++++++++++++++++++++++++++++++-----------------
>> 1 file changed, 37 insertions(+), 18 deletions(-)
>>
>> diff --git a/block/curl.c b/block/curl.c
>> index 68cf83ce55..65996a8866 100644
>> --- a/block/curl.c
>> +++ b/block/curl.c
>> @@ -124,6 +124,16 @@ typedef struct BDRVCURLState {
>> char *proxypassword;
>> } BDRVCURLState;
>>
>> +/** Possible result states of curl_find_buf() */
>> +typedef enum {
>> + /* No buffer found, need to create new request */
>> + CURL_NO_BUF_FOUND,
>> + /* Buffer found, request filled and done */
>> + CURL_REQUEST_FILLED,
>> + /* Ongoing request found, need to yield */
>> + CURL_REQUEST_ONGOING,
>> +} CURLFindBufResult;
>> +
>> static void curl_clean_state(CURLState *s);
>> static void curl_multi_do(void *arg);
>>
>> @@ -258,8 +268,8 @@ read_end:
>> }
>>
>> /* Called with s->mutex held. */
>> -static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
>> - CURLAIOCB *acb)
>> +static CURLFindBufResult curl_find_buf(BDRVCURLState *s, uint64_t start,
>> + uint64_t len, CURLAIOCB *acb)
>> {
>> int i;
>> uint64_t end = start + len;
>> @@ -289,7 +299,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
>> qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len);
>> }
>> acb->ret = 0;
>> - return true;
>> + return CURL_REQUEST_FILLED;
>> }
>>
>> // Wait for unfinished chunks
>> @@ -307,13 +317,13 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
>> for (j=0; j<CURL_NUM_ACB; j++) {
>> if (!state->acb[j]) {
>> state->acb[j] = acb;
>> - return true;
>> + return CURL_REQUEST_ONGOING;
>> }
>> }
>> }
>> }
>>
>> - return false;
>> + return CURL_NO_BUF_FOUND;
>> }
>>
>> /* Called with s->mutex held. */
>> @@ -378,6 +388,16 @@ static void curl_multi_check_completion(BDRVCURLState *s)
>> acb->ret = error ? -EIO : 0;
>> state->acb[i] = NULL;
>> qemu_mutex_unlock(&s->mutex);
>> + /*
>> + * Current AioContext is the BDS context, which may or may not
>> + * be the request (coroutine) context.
>> + * - If it is, the coroutine must have yielded or the FD handler
>> + * (curl_multi_do()/curl_multi_timeout_do()) could not have
>> + * been called and we would not be here
>> + * - If it is not, it doesn't matter whether it has already
>> + * yielded or not; it will be scheduled once it does yield
>> + * So aio_co_wake() is safe to call.
>> + */
>> aio_co_wake(acb->co);
>> qemu_mutex_lock(&s->mutex);
>> }
>> @@ -868,7 +888,8 @@ out_noclean:
>> return -EINVAL;
>> }
>>
>> -static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>> +/* Return whether a request was submitted that requires yielding */
>> +static bool coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>> {
>> CURLState *state;
>> int running;
>> @@ -877,13 +898,15 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>>
>> uint64_t start = acb->offset;
>> uint64_t end;
>> + CURLFindBufResult find_buf_res;
>>
>> - qemu_mutex_lock(&s->mutex);
>> + QEMU_LOCK_GUARD(&s->mutex);
>>
>> // In case we have the requested data already (e.g. read-ahead),
>> // we can just call the callback and be done.
>> - if (curl_find_buf(s, start, acb->bytes, acb)) {
>> - goto out;
>> + find_buf_res = curl_find_buf(s, start, acb->bytes, acb);
>> + if (find_buf_res != CURL_NO_BUF_FOUND) {
>> + return find_buf_res == CURL_REQUEST_ONGOING;
>> }
>>
>> // No cache found, so let's start a new request
>> @@ -898,7 +921,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>> if (curl_init_state(s, state) < 0) {
>> curl_clean_state(state);
>> acb->ret = -EIO;
>> - goto out;
>> + return false;
>> }
>>
>> acb->start = 0;
>> @@ -913,7 +936,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>> if (state->buf_len && state->orig_buf == NULL) {
>> curl_clean_state(state);
>> acb->ret = -ENOMEM;
>> - goto out;
>> + return false;
>> }
>> state->acb[0] = acb;
>>
>> @@ -925,14 +948,12 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
>> acb->ret = -EIO;
>>
>> curl_clean_state(state);
>> - goto out;
>> + return false;
>> }
>>
>> /* Tell curl it needs to kick things off */
>> curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
>> -
>> -out:
>> - qemu_mutex_unlock(&s->mutex);
>> + return true;
>> }
>>
>> static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>> @@ -941,14 +962,12 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>> {
>> CURLAIOCB acb = {
>> .co = qemu_coroutine_self(),
>> - .ret = -EINPROGRESS,
>> .qiov = qiov,
>> .offset = offset,
>> .bytes = bytes
>> };
> Let's leave -EINPROGRESS here even if no other code checks for this
> value any more. It can be helpful for debugging when you can distinguish
> "completed successfully" from "still running".
Does that mean you want me to keep the `complete` field on rbd and nfs, too?
>>
>> - curl_setup_preadv(bs, &acb);
>> - while (acb.ret == -EINPROGRESS) {
>> + if (curl_setup_preadv(bs, &acb)) {
>> qemu_coroutine_yield();
>> }
>> return acb.ret;
> That whole pattern of returning true and false or even a new enum
> everywhere to tell if we are waiting for something felt strange to me.
> Took me a while, but I think now I know what I expected instead: Why
> don't these places just yield immediately instead of requiring the outer
> layer to understand what happened in the functions it called?
I was considering the same. My conclusion was that if they yielded
immediately, we might as well fully inline curl_setup_preadv() into this
function. I didn’t want to do that at the time, but if you prefer, no
problem.
Hanna
* Re: [PATCH 06/16] gluster: Do not move coroutine into BDS context
2025-10-29 17:10 ` Kevin Wolf
@ 2025-10-31 9:16 ` Hanna Czenczek
0 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:16 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 18:10, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> The request coroutine may not run in the BDS AioContext. We should wake
>> it in its own context, not move it.
>>
>> With that, we can remove GlusterAIOCB.aio_context.
>>
>> Also add a comment why aio_co_schedule() is safe to use in this way.
>>
>> **Note:** Due to a lack of a gluster set-up, I have not tested this
>> commit. It seemed safe enough to send anyway, just maybe not to
>> qemu-stable. To be clear, I don’t know of any user-visible bugs that
>> would arise from the state without this patch; the request coroutine is
>> moved into the main BDS AioContext, which feels wrong, but I’m not sure
>> whether it can actually produce hard bugs.
> Doesn't moving into a different AioContext mean that everything down to
> the AIO callback in the guest device runs in a different thread? That's
> definitely a big no-no for virtio-scsi, and I think also for virtio-blk.
It does, yes.
>> I’ll leave it to others’ opinions whether to keep or drop this patch
>> under these circumstances.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> It looks trivially correct (famous last words...) and the bugs not
> unlikely to be hit, so I'd say keep it.
>
> I have no idea if the gluster library is actually thread safe, but if it
> isn't, that breaks before gluster_finish_aiocb(). After reentering the
> coroutine, the library isn't called any more.
OK.
Hanna
* Re: [PATCH 07/16] nvme: Kick and check completions in BDS context
2025-10-29 17:39 ` Kevin Wolf
@ 2025-10-31 9:18 ` Hanna Czenczek
0 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:18 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 18:39, Kevin Wolf wrote:
> On 29.10.2025 at 18:23, Kevin Wolf wrote:
>> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>>> nvme_process_completion() must run in the main BDS context, so schedule
>>> a BH for requests that aren’t there.
>>>
>>> The context in which we kick does not matter, but let’s just keep kick
>>> and process_completion together for simplicity’s sake.
>> Ok, fair, move the main BDS context for calling these functions. But
>> doesn't that mean that we need to move back to the request context for
>> calling the callback?
>>
>> In particular, I see this:
>>
>> static void nvme_rw_cb_bh(void *opaque)
>> {
>>     NVMeCoData *data = opaque;
>>     qemu_coroutine_enter(data->co);
>> }
>>
>> The next patch changes some things about coroutine wakeup, but it
>> doesn't touch this qemu_coroutine_enter(). So I think the coroutine is
>> now running in the wrong thread.
> It actually isn't because the patch changes in which AioContext the BH
> is called. Quite confusing with all the indirections. Let's get rid of
> the BH with qemu_coroutine_enter() and just call aio_co_wake() directly.
Sure!
Hanna
* Re: [PATCH 07/16] nvme: Kick and check completions in BDS context
2025-10-29 17:23 ` Kevin Wolf
2025-10-29 17:39 ` Kevin Wolf
@ 2025-10-31 9:19 ` Hanna Czenczek
1 sibling, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:19 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 18:23, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> nvme_process_completion() must run in the main BDS context, so schedule
>> a BH for requests that aren’t there.
>>
>> The context in which we kick does not matter, but let’s just keep kick
>> and process_completion together for simplicity’s sake.
> Ok, fair, move to the main BDS context for calling these functions. But
> doesn't that mean that we need to move back to the request context for
> calling the callback?
>
> In particular, I see this:
>
> static void nvme_rw_cb_bh(void *opaque)
> {
>     NVMeCoData *data = opaque;
>     qemu_coroutine_enter(data->co);
> }
>
> The next patch changes some things about coroutine wakeup, but it
> doesn't touch this qemu_coroutine_enter(). So I think the coroutine is
> now running in the wrong thread.
>
> I also feel that it gets a bit confusing what is running in which
> context, so maybe we can add comments to each of the callbacks saying
> whether they run in the main BDS context or the request coroutine context.
Makes sense, I’ll try my best.
Hanna
* Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
2025-10-29 20:23 ` Kevin Wolf
@ 2025-10-31 9:29 ` Hanna Czenczek
2025-10-31 13:03 ` Kevin Wolf
0 siblings, 1 reply; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:29 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 29.10.25 21:23, Kevin Wolf wrote:
> On 28.10.2025 at 17:33, Hanna Czenczek wrote:
>> The cache-cleaner runs as a timer CB in the BDS AioContext. With
>> multiqueue, it can run concurrently to I/O requests, and because it does
>> not take any lock, this can break concurrent cache accesses, corrupting
>> the image. While the chances of this happening are low, it can be
>> reproduced e.g. by modifying the code to schedule the timer CB every
>> 5 ms (instead of at most once per second) and modifying the last (inner)
>> while loop of qcow2_cache_clean_unused() like so:
>>
>>     while (i < c->size && can_clean_entry(c, i)) {
>>         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>>             usleep(100);
>>         }
>>         c->entries[i].offset = 0;
>>         c->entries[i].lru_counter = 0;
>>         i++;
>>         to_clean++;
>>     }
>>
>> i.e. making it wait on purpose for the point in time where the cache is
>> in use by something else.
>>
>> The solution chosen for this in this patch is not the best solution, I
>> hope, but I admittedly can’t come up with anything strictly better.
>>
>> We can protect from concurrent cache accesses either by taking the
>> existing s->lock, or we introduce a new (non-coroutine) mutex
>> specifically for cache accesses. I would prefer to avoid the latter so
>> as not to introduce additional (very slight) overhead.
> In theory, the old plan was that eventually qcow2 would use fine grained
> locks instead of the single s->lock, and having a separate cache lock
> would be a step towards it. But if we never actually make use of it to
> hold s->lock for a shorter time, that's not really a good argument. I'm
> not sure if that's ever going to happen unless for a rewrite in Rust or
> something.
>
> I never tried to measure specifically if lock contention is a problem
> with high queue depth and random I/O on a huge disk. Intuitively,
> holding s->lock while doing I/O for loading entries into the cache can't
> be really good.
>
> Anyway, I went a bit on a tangent there...
>
>> Using s->lock, which is a coroutine mutex, however means that we need to
>> take it in a coroutine, so the timer CB must enter such a coroutine. As
>> a result, descheduling the timer is no longer a guarantee that the
>> cache-cleaner will not run, because it may now be yielding in
>> qemu_co_mutex_lock().
> I think creating a coroutine in cache_clean_timer_cb() is the wrong
> approach. Instead, cache_clean_timer_init() could create a coroutine
> and its implementation could be something like this:
>
>     while (!s->cache_clean_timer_stopping) {
>         qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
>                                   QEMU_CLOCK_VIRTUAL,
>                                   s->cache_clean_interval * NANOSECONDS_PER_SECOND);
>
>         qemu_co_mutex_lock(&s->lock);
>         qcow2_cache_clean_unused(s->l2_table_cache);
>         qcow2_cache_clean_unused(s->refcount_block_cache);
>         qemu_co_mutex_unlock(&s->lock);
>     }
>     s->cache_clean_timer_stopping = false;
Ah, that’s nicer. I think we can replace the flag by checking
s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
coroutines.
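Roughly (untested sketch; the CoQueue field name is made up):

    /* Cache-cleaner coroutine: */
    while (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);
        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }
    qemu_co_queue_restart_all(&s->cache_clean_waiters);

    /* Stopping it (from a coroutine, with s->lock held): */
    s->cache_clean_interval = 0;
    qemu_co_sleep_wake(&s->cache_clean_timer_wake);
    qemu_co_queue_wait(&s->cache_clean_waiters, &s->lock);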
>> (Note even now this was only guaranteed for cache_clean_timer_del()
>> callers that run in the BDS (the timer’s) AioContext. For callers
>> running in the main context, the problem may have already existed,
>> though maybe the BQL prevents timers from running in other contexts, I’m
>> not sure.)
>>
>> Polling to await the timer to actually settle seems very complicated for
>> something that’s rather a minor problem, but I can’t come up with any
>> better solution that doesn’t again just overlook potential problems.
>>
>> (One cleaner idea may be to have a generic way to have timers run
>> coroutines, and to await those when descheduling the timer. But while
>> cleaner, it would also be more complicated, and I don’t think worth it
>> at this point.)
>>
>> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
>> I’m not too fond of this solution.)
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/qcow2.h | 1 +
>> block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
>> 2 files changed, 79 insertions(+), 12 deletions(-)
>> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
>> }
>> }
>>
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext, holding s->lock.
>> + */
>> +static void coroutine_fn
>> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
>> +{
>> + BDRVQcow2State *s = bs->opaque;
>> + IO_OR_GS_CODE();
>> + cache_clean_timer_del(bs);
>> + if (qatomic_read(&s->cache_clean_running)) {
>> + qemu_co_mutex_unlock(&s->lock);
>> + qatomic_set(&s->cache_clean_polling, true);
>> + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> Polling in a coroutine_fn is verboten.
>
> If we do need this function, I think it would be a yield here and a wake
> on the other side. I think we might be able to get around it if we move
> the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> coroutine). A bit ugly, so your choice.
We can let a CoQueue do the waking, no?
>> + qemu_co_mutex_lock(&s->lock);
>> + }
>> +}
>> +
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext without s->lock held.
>> + */
>> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
>> +{
>> + BDRVQcow2State *s = bs->opaque;
>> + IO_OR_GS_CODE();
>> + cache_clean_timer_del(bs);
>> + if (qatomic_read(&s->cache_clean_running)) {
>> + qatomic_set(&s->cache_clean_polling, true);
>> + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
>> + }
>> +}
>> +
>> static void qcow2_detach_aio_context(BlockDriverState *bs)
>> {
>> cache_clean_timer_del(bs);
>> @@ -1214,12 +1273,20 @@ fail:
>> return ret;
>> }
>>
>> +/* s_locked specifies whether s->lock is held or not */
>> static void qcow2_update_options_commit(BlockDriverState *bs,
>> - Qcow2ReopenState *r)
>> + Qcow2ReopenState *r,
>> + bool s_locked)
>> {
>> BDRVQcow2State *s = bs->opaque;
>> int i;
>>
>> + if (s_locked) {
>> + cache_clean_timer_locked_co_del_and_wait(bs);
>> + } else {
>> + cache_clean_timer_del_and_wait(bs);
>> + }
>> +
>> if (s->l2_table_cache) {
>> qcow2_cache_destroy(s->l2_table_cache);
>> }
>> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>> }
>> s->l2_table_cache = r->l2_table_cache;
>> s->refcount_block_cache = r->refcount_block_cache;
>> +
>> + s->cache_clean_interval = r->cache_clean_interval;
>> + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> +
>> s->l2_slice_size = r->l2_slice_size;
>>
>> s->overlap_check = r->overlap_check;
>> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>
>> s->discard_no_unref = r->discard_no_unref;
>>
>> - if (s->cache_clean_interval != r->cache_clean_interval) {
>> - cache_clean_timer_del(bs);
>> - s->cache_clean_interval = r->cache_clean_interval;
>> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> - }
>> -
> I think the del/init pair here won't be necessary any more after
> switching to the background coroutine. It will just start using the new
> value of s->cache_clean_interval the next time it sleeps.
One problem is that if we don’t lock s->lock, the coroutine can read
s->l2_table_cache and s->refcount_block_cache while they’re invalid,
which is why I moved the deletion above. We also still need to delete
if the interval is set to 0 (or special-case that in the coroutine to
wait forever).
We could run all of this in a coroutine so we can lock s->lock, or we
have to force-stop the timer/coroutine at the start. Maybe running it
in a coroutine is better…
Hanna
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 16/16] win32-aio: Run CB in original context
2025-10-30 14:12 ` Kevin Wolf
@ 2025-10-31 9:31 ` Hanna Czenczek
0 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-10-31 9:31 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 30.10.25 15:12, Kevin Wolf wrote:
> On 28.10.2025 17:33, Hanna Czenczek wrote:
>> AIO callbacks must be called in the originally calling AioContext,
>> regardless of the BDS’s “main” AioContext.
>>
>> Note: I tried to test this (under wine), but failed. Whenever I tried
>> to use multiqueue or even just an I/O thread for a virtio-blk (or
>> virtio-scsi) device, I/O stalled, both with and without this patch.
>>
>> For what it’s worth, when not using an I/O thread, I/O continued to work
>> with this patch.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> Should we then do the opposite thing and just move every request into
> the main thread and only move back before returning to the caller?
Maybe it would be fairer to fail requests when they’re not in BDS context?
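For reference, “moving” would presumably be the usual self-scheduling
hop, roughly like this (a sketch, assuming the caller’s context has
been saved in caller_ctx on entry):

    /* Hop this coroutine into the BDS context */
    aio_co_schedule(bdrv_get_aio_context(bs), qemu_coroutine_self());
    qemu_coroutine_yield();

    /* ...submit the request and await completion here... */

    /* Hop back into the caller's context before returning */
    aio_co_schedule(caller_ctx, qemu_coroutine_self());
    qemu_coroutine_yield();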
> But I'm also not against making the theoretical fix so that maybe
> someone can fix the other problem later. It just seems to be of somewhat
> limited use on its own.
True. But as explained in patch 13, these last few patches are not
really that useful in practice anyway…
Hanna
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
2025-10-31 9:29 ` Hanna Czenczek
@ 2025-10-31 13:03 ` Kevin Wolf
2025-11-06 16:08 ` Hanna Czenczek
0 siblings, 1 reply; 36+ messages in thread
From: Kevin Wolf @ 2025-10-31 13:03 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 31.10.2025 10:29, Hanna Czenczek wrote:
> On 29.10.25 21:23, Kevin Wolf wrote:
> > On 28.10.2025 17:33, Hanna Czenczek wrote:
> > > The cache-cleaner runs as a timer CB in the BDS AioContext. With
> > > multiqueue, it can run concurrently to I/O requests, and because it does
> > > not take any lock, this can break concurrent cache accesses, corrupting
> > > the image. While the chances of this happening are low, it can be
> > > reproduced e.g. by modifying the code to schedule the timer CB every
> > > 5 ms (instead of at most once per second) and modifying the last (inner)
> > > while loop of qcow2_cache_clean_unused() like so:
> > >
> > > while (i < c->size && can_clean_entry(c, i)) {
> > > for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> > > usleep(100);
> > > }
> > > c->entries[i].offset = 0;
> > > c->entries[i].lru_counter = 0;
> > > i++;
> > > to_clean++;
> > > }
> > >
> > > i.e. making it wait on purpose for the point in time where the cache is
> > > in use by something else.
> > >
> > > The solution chosen for this in this patch is not the best solution, I
> > > hope, but I admittedly can’t come up with anything strictly better.
> > >
> > > We can protect from concurrent cache accesses either by taking the
> > > existing s->lock, or we introduce a new (non-coroutine) mutex
> > > specifically for cache accesses. I would prefer to avoid the latter so
> > > as not to introduce additional (very slight) overhead.
> > In theory, the old plan was that eventually qcow2 would use fine-grained
> > locks instead of the single s->lock, and having a separate cache lock
> > would be a step towards it. But if we never actually make use of it to
> > hold s->lock for a shorter time, that's not really a good argument. I'm
> > not sure if that's ever going to happen, except maybe for a rewrite in Rust or
> > something.
> >
> > I never tried to measure specifically if lock contention is a problem
> > with high queue depth and random I/O on a huge disk. Intuitively,
> > holding s->lock while doing I/O for loading entries into the cache can't
> > really be good.
> >
> > Anyway, I went off on a bit of a tangent there...
> >
> > > Using s->lock, which is a coroutine mutex, however, means that we need to
> > > take it in a coroutine, so the timer CB must enter such a coroutine. As
> > > a result, descheduling the timer is no longer a guarantee that the
> > > cache-cleaner will not run, because it may now be yielding in
> > > qemu_co_mutex_lock().
> > I think creating a coroutine in cache_clean_timer_cb() is the wrong
> > approach. Instead, cache_clean_timer_init() could create a coroutine
> > and its implementation could be something like this:
> >
> > while (!s->cache_clean_timer_stopping) {
> > qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
> > QEMU_CLOCK_VIRTUAL,
Oh, I wanted to comment on this one, too, but apparently forgot...
I have absolutely no idea why we're using QEMU_CLOCK_VIRTUAL in the
existing code. I don't think the cache cleaning should stop just because
the guest is paused. You can still have block jobs and exports that work
on the image and fill the cache.
> > s->cache_clean_interval * NANOSECONDS_PER_SECOND);
> >
> > qemu_co_mutex_lock(&s->lock);
> > qcow2_cache_clean_unused(s->l2_table_cache);
> > qcow2_cache_clean_unused(s->refcount_block_cache);
> > qemu_co_mutex_unlock(&s->lock);
> > }
> > s->cache_clean_timer_stopping = false;
>
> Ah, that’s nicer. I think we can replace the flag by checking
> s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
> coroutines.
Ah, yes, stopping on s->cache_clean_interval == 0 is good.
> > > (Note that even now, this was only guaranteed for cache_clean_timer_del()
> > > callers that run in the BDS (the timer’s) AioContext. For callers
> > > running in the main context, the problem may have already existed,
> > > though maybe the BQL prevents timers from running in other contexts, I’m
> > > not sure.)
> > >
> > > Polling to wait for the timer to actually settle seems very complicated for
> > > something that’s rather a minor problem, but I can’t come up with any
> > > better solution that doesn’t again just overlook potential problems.
> > >
> > > (One cleaner idea may be to have a generic way to have timers run
> > > coroutines, and to await those when descheduling the timer. But while
> > > cleaner, it would also be more complicated, and I don't think it's worth it
> > > at this point.)
> > >
> > > (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> > > I’m not too fond of this solution.)
> > >
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > > block/qcow2.h | 1 +
> > > block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
> > > 2 files changed, 79 insertions(+), 12 deletions(-)
> > > @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
> > > }
> > > }
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext, holding s->lock.
> > > + */
> > > +static void coroutine_fn
> > > +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> > > +{
> > > + BDRVQcow2State *s = bs->opaque;
> > > + IO_OR_GS_CODE();
> > > + cache_clean_timer_del(bs);
> > > + if (qatomic_read(&s->cache_clean_running)) {
> > > + qemu_co_mutex_unlock(&s->lock);
> > > + qatomic_set(&s->cache_clean_polling, true);
> > > + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > Polling in a coroutine_fn is verboten.
> >
> > If we do need this function, I think it would be a yield here and a wake
> > on the other side. I think we might be able to get around it if we move
> > the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> > coroutine). A bit ugly, so your choice.
>
> We can let a CoQueue do the waking, no?
Yes, that's probably nicer. I don't expect that you'll ever have more
than a single waiter, but a CoQueue would be safe in either case.
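The waiting side could then be as simple as this (sketch, reusing the
cache_clean_running/cache_clean_waiters names from earlier in the
thread):

    /* Await the cleaner's termination; called with s->lock held */
    static void coroutine_fn cache_clean_co_wait_stopped(BDRVQcow2State *s)
    {
        while (qatomic_read(&s->cache_clean_running)) {
            /* Drops s->lock while waiting, reacquires it on wakeup */
            qemu_co_queue_wait(&s->cache_clean_waiters, &s->lock);
        }
    }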
> > > + qemu_co_mutex_lock(&s->lock);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext without s->lock held.
> > > + */
> > > +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> > > +{
> > > + BDRVQcow2State *s = bs->opaque;
> > > + IO_OR_GS_CODE();
> > > + cache_clean_timer_del(bs);
> > > + if (qatomic_read(&s->cache_clean_running)) {
> > > + qatomic_set(&s->cache_clean_polling, true);
> > > + BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > > + }
> > > +}
> > > +
> > > static void qcow2_detach_aio_context(BlockDriverState *bs)
> > > {
> > > cache_clean_timer_del(bs);
> > > @@ -1214,12 +1273,20 @@ fail:
> > > return ret;
> > > }
> > > +/* s_locked specifies whether s->lock is held or not */
> > > static void qcow2_update_options_commit(BlockDriverState *bs,
> > > - Qcow2ReopenState *r)
> > > + Qcow2ReopenState *r,
> > > + bool s_locked)
> > > {
> > > BDRVQcow2State *s = bs->opaque;
> > > int i;
> > > + if (s_locked) {
> > > + cache_clean_timer_locked_co_del_and_wait(bs);
> > > + } else {
> > > + cache_clean_timer_del_and_wait(bs);
> > > + }
> > > +
> > > if (s->l2_table_cache) {
> > > qcow2_cache_destroy(s->l2_table_cache);
> > > }
> > > @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > > }
> > > s->l2_table_cache = r->l2_table_cache;
> > > s->refcount_block_cache = r->refcount_block_cache;
> > > +
> > > + s->cache_clean_interval = r->cache_clean_interval;
> > > + cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > +
> > > s->l2_slice_size = r->l2_slice_size;
> > > s->overlap_check = r->overlap_check;
> > > @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > > s->discard_no_unref = r->discard_no_unref;
> > > - if (s->cache_clean_interval != r->cache_clean_interval) {
> > > - cache_clean_timer_del(bs);
> > > - s->cache_clean_interval = r->cache_clean_interval;
> > > - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > - }
> > > -
> > I think the del/init pair here won't be necessary any more after
> > switching to the background coroutine. It will just start using the new
> > value of s->cache_clean_interval the next time it sleeps.
>
> One problem is that if we don’t lock s->lock, the coroutine can read
> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
> is why I moved the deletion above.
I see. This is a preexisting problem, right? The timer runs in the BDS
main context, while qcow2_update_options_commit() runs in the main loop
or... essentially anywhere else?
Should it be a separate patch then? A comment in the code wouldn't hurt
either.
> We also still need to delete if the interval is set to 0 (or
> special-case that in the coroutine to wait forever).
If the coroutine terminates on 0 as you suggested above, that would
automatically be solved.
> We could run all of this in a coroutine so we can lock s->lock, or we
> have to force-stop the timer/coroutine at the start. Maybe running it
> in a coroutine is better…
So qcow2_reopen_commit() would immediately enter a coroutine and
BDRV_POLL_WHILE() for its completion?
It's not exactly pretty (and I hope polling in reopen callbacks is even
allowed), but maybe more local ugliness than having additional state
(like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait for the
cache clean coroutine to stop.
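I.e., roughly this pattern (only a sketch; CommitCo and the coroutine
name are invented):

    typedef struct CommitCo {
        BlockDriverState *bs;
        Qcow2ReopenState *r;
        bool done;
    } CommitCo;

    static void coroutine_fn qcow2_update_options_commit_co(void *opaque)
    {
        CommitCo *cc = opaque;
        BDRVQcow2State *s = cc->bs->opaque;

        qemu_co_mutex_lock(&s->lock);
        qcow2_update_options_commit(cc->bs, cc->r, true);
        qemu_co_mutex_unlock(&s->lock);

        cc->done = true;
        aio_wait_kick(); /* have BDRV_POLL_WHILE() re-check its condition */
    }

    /* ...and in qcow2_reopen_commit(): */
    CommitCo cc = { .bs = bs, .r = r };
    Coroutine *co = qemu_coroutine_create(qcow2_update_options_commit_co, &cc);
    qemu_coroutine_enter(co);
    BDRV_POLL_WHILE(bs, !cc.done);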
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 05/16] curl: Fix coroutine waking
2025-10-31 9:15 ` Hanna Czenczek
@ 2025-10-31 13:17 ` Kevin Wolf
0 siblings, 0 replies; 36+ messages in thread
From: Kevin Wolf @ 2025-10-31 13:17 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 31.10.2025 10:15, Hanna Czenczek wrote:
> On 29.10.25 17:57, Kevin Wolf wrote:
> > On 28.10.2025 17:33, Hanna Czenczek wrote:
> > > If we wake a coroutine from a different context, we must ensure that it
> > > will yield exactly once (now or later), awaiting that wake.
> > >
> > > curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
> > > yielding if the request finishes before the loop gets run. To fix it,
> > > drop the loop and just yield exactly once, unless the request is served
> > > from the cache or failed before it was submitted – that last part makes
> > > it a bit complicated, as the result of curl_find_buf() now needs to be a
> > > tristate.
> > >
> > > (Can be reproduced with multiqueue by adding a usleep(100000) before the
> > > `while (acb.ret == -EINPROGRESS)` loop.)
> > >
> > > Also, add a comment why aio_co_wake() is safe regardless of whether the
> > > coroutine and curl_multi_check_completion() run in the same context.
> > >
> > > Cc: qemu-stable@nongnu.org
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > > block/curl.c | 55 +++++++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 37 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/block/curl.c b/block/curl.c
> > > index 68cf83ce55..65996a8866 100644
> > > --- a/block/curl.c
> > > +++ b/block/curl.c
> > > @@ -124,6 +124,16 @@ typedef struct BDRVCURLState {
> > > char *proxypassword;
> > > } BDRVCURLState;
> > > +/** Possible result states of curl_find_buf() */
> > > +typedef enum {
> > > + /* No buffer found, need to create new request */
> > > + CURL_NO_BUF_FOUND,
> > > + /* Buffer found, request filled and done */
> > > + CURL_REQUEST_FILLED,
> > > + /* Ongoing request found, need to yield */
> > > + CURL_REQUEST_ONGOING,
> > > +} CURLFindBufResult;
> > > +
> > > static void curl_clean_state(CURLState *s);
> > > static void curl_multi_do(void *arg);
> > > @@ -258,8 +268,8 @@ read_end:
> > > }
> > > /* Called with s->mutex held. */
> > > -static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> > > - CURLAIOCB *acb)
> > > +static CURLFindBufResult curl_find_buf(BDRVCURLState *s, uint64_t start,
> > > + uint64_t len, CURLAIOCB *acb)
> > > {
> > > int i;
> > > uint64_t end = start + len;
> > > @@ -289,7 +299,7 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> > > qemu_iovec_memset(acb->qiov, clamped_len, 0, len - clamped_len);
> > > }
> > > acb->ret = 0;
> > > - return true;
> > > + return CURL_REQUEST_FILLED;
> > > }
> > > // Wait for unfinished chunks
> > > @@ -307,13 +317,13 @@ static bool curl_find_buf(BDRVCURLState *s, uint64_t start, uint64_t len,
> > > for (j=0; j<CURL_NUM_ACB; j++) {
> > > if (!state->acb[j]) {
> > > state->acb[j] = acb;
> > > - return true;
> > > + return CURL_REQUEST_ONGOING;
> > > }
> > > }
> > > }
> > > }
> > > - return false;
> > > + return CURL_NO_BUF_FOUND;
> > > }
> > > /* Called with s->mutex held. */
> > > @@ -378,6 +388,16 @@ static void curl_multi_check_completion(BDRVCURLState *s)
> > > acb->ret = error ? -EIO : 0;
> > > state->acb[i] = NULL;
> > > qemu_mutex_unlock(&s->mutex);
> > > + /*
> > > + * Current AioContext is the BDS context, which may or may not
> > > + * be the request (coroutine) context.
> > > + * - If it is, the coroutine must have yielded or the FD handler
> > > + * (curl_multi_do()/curl_multi_timeout_do()) could not have
> > > + * been called and we would not be here
> > > + * - If it is not, it doesn't matter whether it has already
> > > + * yielded or not; it will be scheduled once it does yield
> > > + * So aio_co_wake() is safe to call.
> > > + */
> > > aio_co_wake(acb->co);
> > > qemu_mutex_lock(&s->mutex);
> > > }
> > > @@ -868,7 +888,8 @@ out_noclean:
> > > return -EINVAL;
> > > }
> > > -static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > +/* Return whether a request was submitted that requires yielding */
> > > +static bool coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > {
> > > CURLState *state;
> > > int running;
> > > @@ -877,13 +898,15 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > uint64_t start = acb->offset;
> > > uint64_t end;
> > > + CURLFindBufResult find_buf_res;
> > > - qemu_mutex_lock(&s->mutex);
> > > + QEMU_LOCK_GUARD(&s->mutex);
> > > // In case we have the requested data already (e.g. read-ahead),
> > > // we can just call the callback and be done.
> > > - if (curl_find_buf(s, start, acb->bytes, acb)) {
> > > - goto out;
> > > + find_buf_res = curl_find_buf(s, start, acb->bytes, acb);
> > > + if (find_buf_res != CURL_NO_BUF_FOUND) {
> > > + return find_buf_res == CURL_REQUEST_ONGOING;
> > > }
> > > // No cache found, so let's start a new request
> > > @@ -898,7 +921,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > if (curl_init_state(s, state) < 0) {
> > > curl_clean_state(state);
> > > acb->ret = -EIO;
> > > - goto out;
> > > + return false;
> > > }
> > > acb->start = 0;
> > > @@ -913,7 +936,7 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > if (state->buf_len && state->orig_buf == NULL) {
> > > curl_clean_state(state);
> > > acb->ret = -ENOMEM;
> > > - goto out;
> > > + return false;
> > > }
> > > state->acb[0] = acb;
> > > @@ -925,14 +948,12 @@ static void coroutine_fn curl_setup_preadv(BlockDriverState *bs, CURLAIOCB *acb)
> > > acb->ret = -EIO;
> > > curl_clean_state(state);
> > > - goto out;
> > > + return false;
> > > }
> > > /* Tell curl it needs to kick things off */
> > > curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
> > > -
> > > -out:
> > > - qemu_mutex_unlock(&s->mutex);
> > > + return true;
> > > }
> > > static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> > > @@ -941,14 +962,12 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> > > {
> > > CURLAIOCB acb = {
> > > .co = qemu_coroutine_self(),
> > > - .ret = -EINPROGRESS,
> > > .qiov = qiov,
> > > .offset = offset,
> > > .bytes = bytes
> > > };
> > Let's leave -EINPROGRESS here even if no other code checks for this
> > value any more. It can be helpful for debugging when you can distinguish
> > "completed successfully" from "still running".
>
> Does that mean you want me to keep the `complete` field on rbd and nfs, too?
Hm, I think I wouldn't necessarily keep an additional field around in
the struct just for this. But you could really argue either way.
I just thought that if it exists anyway, like ret here, we can as well
initialise it with some identifiable value.
> > > - curl_setup_preadv(bs, &acb);
> > > - while (acb.ret == -EINPROGRESS) {
> > > + if (curl_setup_preadv(bs, &acb)) {
> > > qemu_coroutine_yield();
> > > }
> > > return acb.ret;
> > That whole pattern of returning true and false or even a new enum
> > everywhere to tell if we are waiting for something felt strange to me.
> > Took me a while, but I think now I know what I expected instead: Why
> > don't these places just yield immediately instead of requiring the outer
> > layer to understand what happened in the functions it called?
>
> I was considering the same.  My conclusion was that if they yielded immediately, we
> might as well fully inline curl_setup_preadv() into this function. I didn’t
> want to do that at the time, but if you prefer, no problem.
Not sure how both are related (the three additional lines of the
yielding loop certainly wouldn't stop us from inlining), but sure, that
could be done. curl_co_preadv() has never been more than a trivial
wrapper that resulted from a direct AIO to coroutine conversion.
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 03/16] iscsi: Run co BH CB in the coroutine’s AioContext
2025-10-31 9:07 ` Hanna Czenczek
@ 2025-10-31 13:27 ` Kevin Wolf
0 siblings, 0 replies; 36+ messages in thread
From: Kevin Wolf @ 2025-10-31 13:27 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 31.10.2025 10:07, Hanna Czenczek wrote:
> On 29.10.25 15:27, Kevin Wolf wrote:
> > On 28.10.2025 17:33, Hanna Czenczek wrote:
> > > For rbd (and others), as described in “rbd: Run co BH CB in the
> > > coroutine’s AioContext”, the pattern of setting a completion flag and
> > > waking a coroutine that yields while the flag is not set can only work
> > > when both run in the same thread.
> > >
> > > iscsi has the same pattern, but the details are a bit different:
> > > iscsi_co_generic_cb() can (as far as I understand) only run through
> > > iscsi_service(), not just from a random thread at a random time.
> > > iscsi_service() in turn can only be run after iscsi_set_events() set up
> > > an FD event handler, which is done in iscsi_co_wait_for_task().
> > >
> > > As a result, iscsi_co_wait_for_task() will always yield exactly once,
> > > because iscsi_co_generic_cb() can only run after iscsi_set_events(),
> > > after the completion flag has already been checked, and the yielding
> > > coroutine will then be woken only once the completion flag was set to
> > > true. So as far as I can tell, iscsi has no bug and already works fine.
> > >
> > > Still, we don’t need the completion flag because we know we have to
> > > yield exactly once, so we can drop it. This simplifies the code and
> > > makes it more obvious that the “rbd bug” isn’t present here.
> > >
> > > This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
> > > bit boring, and actually, for the former, we could drop it and run
> > > aio_co_wake() directly from scsi_co_generic_cb() to the same effect; but
> > > that would remove the replay_bh_schedule_oneshot_event(), and I assume
> > > we shouldn’t do that. At least schedule both the BH and the timer in
> > > the coroutine’s AioContext to make them simple wrappers around
> > > qemu_coroutine_enter(), without a further BH indirection.
> > I don't think we have to keep the BH. Is your concern about replay? I
> > doubt that this works across different QEMU versions anyway, and if it
> > does, it's pure luck.
>
> It is solely about replay, yes. I assumed the
> replay_bh_schedule_oneshot_event() would be a replay point, so removing it
> would, well, remove a replay point. I suppose we’re going to have one
> replay point per request anyway (when going through the blkreplay driver),
> so maybe it doesn’t matter much?
Yes, I think it is a replay point. And I don't really know what replay
does when the log has an event that doesn't appear in the code (or the
other way around).
I just don't expect that compatibility of replay logs across QEMU
versions is important (and even if it were important, I doubt it is
actually achieved, because we probably change the control flow leading to replay
points all the time without even noticing). As far as I understand the
idea is that you want to debug a guest, and during that single debug
session you record a guest and then replay it multiple times. I don't
think it's expected that you keep the replay logs for a long time and
across QEMU updates.
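(For what it's worth, the direct variant would reduce the wake path in
iscsi_co_generic_cb() to just:

    /* Sketch: wake the request coroutine directly, no BH indirection */
    aio_co_wake(iTask->co);

without the replay_bh_schedule_oneshot_event() step.)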
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
2025-10-31 13:03 ` Kevin Wolf
@ 2025-11-06 16:08 ` Hanna Czenczek
0 siblings, 0 replies; 36+ messages in thread
From: Hanna Czenczek @ 2025-11-06 16:08 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini,
Richard W . M . Jones, Ilya Dryomov, Peter Lieven,
Philippe Mathieu-Daudé, Alex Bennée, Fam Zheng,
Ronnie Sahlberg
On 31.10.25 14:03, Kevin Wolf wrote:
> On 31.10.2025 10:29, Hanna Czenczek wrote:
>> On 29.10.25 21:23, Kevin Wolf wrote:
> > > On 28.10.2025 17:33, Hanna Czenczek wrote:
[...]
>>>> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>>> s->discard_no_unref = r->discard_no_unref;
>>>> - if (s->cache_clean_interval != r->cache_clean_interval) {
>>>> - cache_clean_timer_del(bs);
>>>> - s->cache_clean_interval = r->cache_clean_interval;
>>>> - cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>>>> - }
>>>> -
>>> I think the del/init pair here won't be necessary any more after
>>> switching to the background coroutine. It will just start using the new
>>> value of s->cache_clean_interval the next time it sleeps.
>> One problem is that if we don’t lock s->lock, the coroutine can read
>> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
>> is why I moved the deletion above.
> I see. This is a preexisting problem, right? The timer runs in the BDS
> main context, while qcow2_update_options_commit() runs in the main loop
> or... essentially anywhere else?
Sorry for the late reply, yes. There may be the additional problem
noted in the commit message: the timer may fire and run the CB in the
BDS’s context while it is concurrently being deleted from the main
context.  The CB will then still run, so just moving the delete up is
not a 100 % solution, I think.  We also need to make sure the timer code really
isn’t running.
> Should it be a separate patch then? A comment in the code wouldn't hurt
> either.
I’ll add a comment.
>> We also still need to delete if the interval is set to 0 (or
>> special-case that in the coroutine to wait forever).
> If the coroutine terminates on 0 as you suggested above, that would
> automatically be solved.
I don’t think so, because we still need to be able to restart it (when
transitioning from a value of 0 to a different value).
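Restarting could then look roughly like this (sketch, reusing the
made-up names from earlier in the thread):

    /* (Re)start the background cleaner if there is a non-zero interval */
    static void cache_clean_start(BlockDriverState *bs)
    {
        BDRVQcow2State *s = bs->opaque;

        if (s->cache_clean_interval > 0 &&
            !qatomic_read(&s->cache_clean_running)) {
            Coroutine *co = qemu_coroutine_create(cache_clean_co_entry, bs);
            qatomic_set(&s->cache_clean_running, true);
            aio_co_enter(bdrv_get_aio_context(bs), co);
        }
    }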
>> We could run all of this in a coroutine so we can lock s->lock, or we
>> have to force-stop the timer/coroutine at the start. Maybe running it
>> in a coroutine is better…
> So qcow2_reopen_commit() would immediately enter a coroutine and
> BDRV_POLL_WHILE() for its completion?
>
> It's not exactly pretty (and I hope polling in reopen callbacks is even
> allowed), but maybe more local ugliness than having additional state
> (like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait for the
> cache clean coroutine to stop.
I think I’ll keep it as-is. We’ll probably actually want to wait in
detach_aio_context, too.
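I.e., roughly (sketch):

    static void qcow2_detach_aio_context(BlockDriverState *bs)
    {
        /* Also await a cleaner instance that may still be running */
        cache_clean_timer_del_and_wait(bs);
    }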
Hanna
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2025-11-06 16:10 UTC | newest]
Thread overview: 36+ messages
2025-10-28 16:33 [PATCH 00/16] block: Some multi-threading fixes Hanna Czenczek
2025-10-28 16:33 ` [PATCH 01/16] block: Note on aio_co_wake use if not yet yielding Hanna Czenczek
2025-10-28 16:33 ` [PATCH 02/16] rbd: Run co BH CB in the coroutine’s AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 03/16] iscsi: " Hanna Czenczek
2025-10-29 14:27 ` Kevin Wolf
2025-10-31 9:07 ` Hanna Czenczek
2025-10-31 13:27 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 04/16] nfs: " Hanna Czenczek
2025-10-28 16:33 ` [PATCH 05/16] curl: Fix coroutine waking Hanna Czenczek
2025-10-29 16:57 ` Kevin Wolf
2025-10-31 9:15 ` Hanna Czenczek
2025-10-31 13:17 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 06/16] gluster: Do not move coroutine into BDS context Hanna Czenczek
2025-10-29 17:10 ` Kevin Wolf
2025-10-31 9:16 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 07/16] nvme: Kick and check completions in " Hanna Czenczek
2025-10-29 17:23 ` Kevin Wolf
2025-10-29 17:39 ` Kevin Wolf
2025-10-31 9:18 ` Hanna Czenczek
2025-10-31 9:19 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 08/16] nvme: Fix coroutine waking Hanna Czenczek
2025-10-29 17:43 ` Kevin Wolf
2025-10-28 16:33 ` [PATCH 09/16] block/io: Take reqs_lock for tracked_requests Hanna Czenczek
2025-10-28 16:33 ` [PATCH 10/16] qcow2: Fix cache_clean_timer Hanna Czenczek
2025-10-29 20:23 ` Kevin Wolf
2025-10-31 9:29 ` Hanna Czenczek
2025-10-31 13:03 ` Kevin Wolf
2025-11-06 16:08 ` Hanna Czenczek
2025-10-28 16:33 ` [PATCH 11/16] ssh: Run restart_coroutine in current AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 12/16] blkreplay: Run BH in coroutine’s AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 13/16] block: Note in which AioContext AIO CBs are called Hanna Czenczek
2025-10-28 16:33 ` [PATCH 14/16] iscsi: Create AIO BH in original AioContext Hanna Czenczek
2025-10-28 16:33 ` [PATCH 15/16] null-aio: Run CB " Hanna Czenczek
2025-10-28 16:33 ` [PATCH 16/16] win32-aio: Run CB in original context Hanna Czenczek
2025-10-30 14:12 ` Kevin Wolf
2025-10-31 9:31 ` Hanna Czenczek