[RFC 00/11] aio: add the aio_add_sqe() io

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [RFC 00/11] aio: add the aio_add_sqe() io_uring API
@ 2025-05-28 19:09 Stefan Hajnoczi
  2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

This patch series contains io_uring improvements:

1. Support the glib event loop in fdmon-io_uring.
   - aio-posix: fix polling mode with fdmon-io_uring
   - aio-posix: keep polling enabled with fdmon-io_uring.c
   - tests/unit: skip test-nested-aio-poll with io_uring
   - aio-posix: integrate fdmon into glib event loop

2. Enable fdmon-io_uring on hosts where io_uring is available at runtime.
   Otherwise continue using ppoll(2) or epoll(7).
   - aio: remove aio_context_use_g_source()

3. Add the new aio_add_sqe() API for submitting io_uring requests in the QEMU
   event loop.
   - aio: free AioContext when aio_context_new() fails
   - aio: add errp argument to aio_context_setup()
   - aio-posix: gracefully handle io_uring_queue_init() failure
   - aio-posix: add aio_add_sqe() API for user-defined io_uring requests
   - aio-posix: avoid EventNotifier for cqe_handler_bh

4. Use aio_add_sqe() in block/io_uring.c instead of creating a dedicated
   io_uring context for --blockdev aio=io_uring. This simplifies the code,
   reduces the number of file descriptors, and demonstrates the aio_add_sqe()
   API.
   - block/io_uring: use aio_add_sqe()

The highlight is aio_add_sqe(), which is needed for the FUSE-over-io_uring
Google Summer of Code project and other future QEMU features that natively use
Linux io_uring functionality.

I'm not happy with performance yet. This is why I've marked the series as
Request For Comments:
rw        bs iodepth aio    iothread before after  diff
randread  4k       1 native        0  76281 79707  +4.5%
randread  4k      64 native        0 255078 247293 -3.1%
randwrite 4k       1 native        0 132706 123337 -7.1%
randwrite 4k      64 native        0 275589 245192 -11%
randread  4k       1 io_uring      0  75284 78023  +3.5%
randread  4k      64 io_uring      0 254637 248222 -2.5%
randwrite 4k       1 io_uring      0 126519 128641 +1.7%
randwrite 4k      64 io_uring      0 258967 249266 -3.7%
randread  4k       1 native        1  90557 88436  -2.3%
randread  4k      64 native        1 290673 280456 -3.5%
randwrite 4k       1 native        1 183015 169106 -7.6%
randwrite 4k      64 native        1 281316 280078 -0.4%
randread  4k       1 io_uring      1  92479 86983  -5.9%
randread  4k      64 io_uring      1 304229 257730 -15.3%
randwrite 4k       1 io_uring      1 183983 157425 -14.4%
randwrite 4k      64 io_uring      1 299979 264156 -11.9%

Overall the performance decreases, so I need to continue profiling the
iodepth=64 cases with aio=native and aio=io_uring.

This series replaces the following older series that were held off from merging
until the QEMU 10.1 development window opened and the performance results were
collected:
- "[PATCH 0/3] [RESEND] block: unify block and fdmon io_uring"
- "[PATCH 0/4] aio-posix: integrate fdmon into glib event loop"

Stefan Hajnoczi (11):
  aio-posix: fix polling mode with fdmon-io_uring
  aio-posix: keep polling enabled with fdmon-io_uring.c
  tests/unit: skip test-nested-aio-poll with io_uring
  aio-posix: integrate fdmon into glib event loop
  aio: remove aio_context_use_g_source()
  aio: free AioContext when aio_context_new() fails
  aio: add errp argument to aio_context_setup()
  aio-posix: gracefully handle io_uring_queue_init() failure
  aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  aio-posix: avoid EventNotifier for cqe_handler_bh
  block/io_uring: use aio_add_sqe()

 meson.build                       |   2 +-
 include/block/aio.h               | 134 +++++++-
 include/block/raw-aio.h           |   5 -
 util/aio-posix.h                  |  18 +-
 block/file-posix.c                |  38 +--
 block/io_uring.c                  | 489 +++++++-----------------------
 stubs/io_uring.c                  |  32 --
 tests/unit/test-aio.c             |   7 +-
 tests/unit/test-nested-aio-poll.c |  13 +-
 util/aio-posix.c                  | 134 ++++----
 util/aio-win32.c                  |   6 +-
 util/async.c                      |  53 +---
 util/fdmon-epoll.c                |  52 +++-
 util/fdmon-io_uring.c             | 218 ++++++++++---
 util/fdmon-poll.c                 |  88 +++++-
 block/trace-events                |  12 +-
 stubs/meson.build                 |   3 -
 util/trace-events                 |   4 +
 18 files changed, 668 insertions(+), 640 deletions(-)
 delete mode 100644 stubs/io_uring.c

-- 
2.49.0



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 20:29   ` Eric Blake
  2025-05-28 19:09 ` [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c Stefan Hajnoczi
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

The io_uring(7) file descriptor monitor cannot enter polling mode
because it needs to submit a POLL_ADD SQE every time a file descriptor
becomes active. Submitting SQEs only happens in FDMonOps->wait() outside
of polling mode.

Fix this using the multi-shot mechanism introduced in Linux 5.13 and
liburing 2.1. Stable and enterprise Linux distros ship 5.14+ as of March
2025, so it is safe to require this. Note that fdmon-io_uring is
currently not enabled at runtime and is not essential, so QEMU can still
be built without it on older hosts.

In multi-shot mode, a POLL_ADD SQE remains active until canceled with
POLL_REMOVE. This avoids the need to submit a new SQE every time a file
descriptor becomes active.

When POLL_REMOVE is processed by the host kernel, the multi-shot
POLL_ADD operation completes with -ECANCELED. Adjust the code slightly
to take this into account.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 meson.build           |  2 +-
 util/fdmon-io_uring.c | 34 +++++++++++++++++++++-------------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/meson.build b/meson.build
index fdad3fb528..6a362b9209 100644
--- a/meson.build
+++ b/meson.build
@@ -1157,7 +1157,7 @@ linux_io_uring_test = '''
 
 linux_io_uring = not_found
 if not get_option('linux_io_uring').auto() or have_block
-  linux_io_uring = dependency('liburing', version: '>=0.3',
+  linux_io_uring = dependency('liburing', version: '>=2.1',
                               required: get_option('linux_io_uring'),
                               method: 'pkg-config')
   if not cc.links(linux_io_uring_test)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index b0d68bdc44..6cd665e565 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -124,8 +124,7 @@ static AioHandler *dequeue(AioHandlerSList *head, unsigned *flags)
     /*
      * Don't clear FDMON_IO_URING_REMOVE.  It's sticky so it can serve two
      * purposes: telling fill_sq_ring() to submit IORING_OP_POLL_REMOVE and
-     * telling process_cqe() to delete the AioHandler when its
-     * IORING_OP_POLL_ADD completes.
+     * telling process_cqe() to ignore IORING_OP_POLL_ADD completions.
      */
     *flags = qatomic_fetch_and(&node->flags, ~(FDMON_IO_URING_PENDING |
                                               FDMON_IO_URING_ADD));
@@ -166,12 +165,12 @@ static void fdmon_io_uring_update(AioContext *ctx,
     }
 }
 
-static void add_poll_add_sqe(AioContext *ctx, AioHandler *node)
+static void add_poll_multishot_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
     int events = poll_events_from_pfd(node->pfd.events);
 
-    io_uring_prep_poll_add(sqe, node->pfd.fd, events);
+    io_uring_prep_poll_multishot(sqe, node->pfd.fd, events);
     io_uring_sqe_set_data(sqe, node);
 }
 
@@ -213,7 +212,7 @@ static void fill_sq_ring(AioContext *ctx)
     while ((node = dequeue(&submit_list, &flags))) {
         /* Order matters, just in case both flags were set */
         if (flags & FDMON_IO_URING_ADD) {
-            add_poll_add_sqe(ctx, node);
+            add_poll_multishot_sqe(ctx, node);
         }
         if (flags & FDMON_IO_URING_REMOVE) {
             add_poll_remove_sqe(ctx, node);
@@ -234,21 +233,30 @@ static bool process_cqe(AioContext *ctx,
         return false;
     }
 
+    flags = qatomic_read(&node->flags);
+
     /*
-     * Deletion can only happen when IORING_OP_POLL_ADD completes.  If we race
-     * with enqueue() here then we can safely clear the FDMON_IO_URING_REMOVE
-     * bit before IORING_OP_POLL_REMOVE is submitted.
+     * poll_multishot cancelled by poll_remove? Or completed early because fd
+     * was closed before poll_remove finished?
      */
-    flags = qatomic_fetch_and(&node->flags, ~FDMON_IO_URING_REMOVE);
-    if (flags & FDMON_IO_URING_REMOVE) {
+    if (cqe->res == -ECANCELED || cqe->res == -EBADF) {
+        assert(!(cqe->flags & IORING_CQE_F_MORE));
+        assert(flags & FDMON_IO_URING_REMOVE);
         QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
         return false;
     }
 
-    aio_add_ready_handler(ready_list, node, pfd_events_from_poll(cqe->res));
+    /* Ignore if it becomes ready during removal */
+    if (flags & FDMON_IO_URING_REMOVE) {
+        return false;
+    }
 
-    /* IORING_OP_POLL_ADD is one-shot so we must re-arm it */
-    add_poll_add_sqe(ctx, node);
+    /* Multi-shot can stop at any time, so re-arm if necessary */
+    if (!(cqe->flags & IORING_CQE_F_MORE)) {
+        add_poll_multishot_sqe(ctx, node);
+    }
+
+    aio_add_ready_handler(ready_list, node, pfd_events_from_poll(cqe->res));
     return true;
 }
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring
  2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
@ 2025-05-28 20:29   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 20:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:06PM -0400, Stefan Hajnoczi wrote:
> The io_uring(7) file descriptor monitor cannot enter polling mode
> because it needs to submit a POLL_ADD SQE every time a file descriptor
> becomes active. Submitting SQEs only happens in FDMonOps->wait() outside
> of polling mode.
> 
> Fix this using the multi-shot mechanism introduced in Linux 5.13 and
> liburing 2.1. Stable and enterprise Linux distros ship 5.14+ as of March
> 2025, so it is safe to require this. Note that fdmon-io_uring is
> currently not enabled at runtime and is not essential, so QEMU can still
> be built without it on older hosts.
> 
> In multi-shot mode, a POLL_ADD SQE remains active until canceled with
> POLL_REMOVE. This avoids the need to submit a new SQE every time a file
> descriptor becomes active.
> 
> When POLL_REMOVE is processed by the host kernel, the multi-shot
> POLL_ADD operation completes with -ECANCELED. Adjust the code slightly
> to take this into account.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
  2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 20:34   ` Eric Blake
  2025-05-28 19:09 ` [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring Stefan Hajnoczi
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek, Chao Gao

Commit 816a430c517e ("util/aio: Defer disabling poll mode as long as
possible") kept polling enabled when the event loop timeout is 0. Since
there is no timeout the event loop will continue immediately and the
overhead of disabling and re-enabling polling can be avoided.

fdmon-io_uring.c is unable to take advantage of this optimization
because its ->need_wait() function returns true whenever there are new
io_uring SQEs to submit:

  if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Polling will be disabled even when timeout == 0.

Extend the optimization to handle the case when need_wait() returns true
and timeout == 0.

Cc: Chao Gao <chao.gao@intel.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 2e0a5dadc4..2439cf0feb 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -722,7 +722,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
          * up IO threads when some work becomes pending. It is essential to
          * avoid hangs or unnecessary latency.
          */
-        if (poll_set_started(ctx, &ready_list, false)) {
+        if (timeout && poll_set_started(ctx, &ready_list, false)) {
             timeout = 0;
             progress = true;
         }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c
  2025-05-28 19:09 ` [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c Stefan Hajnoczi
@ 2025-05-28 20:34   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 20:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek,
	Chao Gao

On Wed, May 28, 2025 at 03:09:07PM -0400, Stefan Hajnoczi wrote:
> Commit 816a430c517e ("util/aio: Defer disabling poll mode as long as
> possible") kept polling enabled when the event loop timeout is 0. Since
> there is no timeout the event loop will continue immediately and the
> overhead of disabling and re-enabling polling can be avoided.
> 
> fdmon-io_uring.c is unable to take advantage of this optimization
> because its ->need_wait() function returns true whenever there are new
> io_uring SQEs to submit:
> 
>   if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Polling will be disabled even when timeout == 0.
> 
> Extend the optimization to handle the case when need_wait() returns true
> and timeout == 0.
> 
> Cc: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
  2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
  2025-05-28 19:09 ` [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 20:40   ` Eric Blake
  2025-05-28 19:09 ` [RFC 04/11] aio-posix: integrate fdmon into glib event loop Stefan Hajnoczi
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

test-nested-aio-poll relies on internal details of how fdmon-poll.c
handles AioContext polling. Skip it when other fdmon implementations are
in use.

Note that this test is only built on POSIX systems so it is safe to
include "util/aio-posix.h".

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/unit/test-nested-aio-poll.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
index d8fd92c43b..45484e745b 100644
--- a/tests/unit/test-nested-aio-poll.c
+++ b/tests/unit/test-nested-aio-poll.c
@@ -15,6 +15,7 @@
 #include "qemu/osdep.h"
 #include "block/aio.h"
 #include "qapi/error.h"
+#include "util/aio-posix.h"
 
 typedef struct {
     AioContext *ctx;
@@ -71,6 +72,12 @@ static void test(void)
         .ctx = aio_context_new(&error_abort),
     };
 
+    if (td.ctx->fdmon_ops != &fdmon_poll_ops) {
+        /* This test is tied to fdmon-poll.c */
+        g_test_skip("fdmon_poll_ops not in use");
+        return;
+    }
+
     qemu_set_current_aio_context(td.ctx);
 
     /* Enable polling */
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring
  2025-05-28 19:09 ` [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring Stefan Hajnoczi
@ 2025-05-28 20:40   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 20:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:08PM -0400, Stefan Hajnoczi wrote:
> test-nested-aio-poll relies on internal details of how fdmon-poll.c
> handles AioContext polling. Skip it when other fdmon implementations are
> in use.
> 
> Note that this test is only built on POSIX systems so it is safe to
> include "util/aio-posix.h".
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tests/unit/test-nested-aio-poll.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 04/11] aio-posix: integrate fdmon into glib event loop
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 21:01   ` Eric Blake
  2025-05-28 19:09 ` [RFC 05/11] aio: remove aio_context_use_g_source() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.

Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.

In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.

A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
  descriptors and timers.
- check() is called to determine whether events are ready to be
  dispatched after waiting.
- dispatch() is called to process events.

More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html

Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.

Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.

Splitting this into multiple commits without temporarily breaking
AioContext proved difficult so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h   | 36 ++++++++++++++++++
 util/aio-posix.h      |  5 +++
 tests/unit/test-aio.c |  7 +++-
 util/aio-posix.c      | 69 ++++++++-------------------------
 util/fdmon-epoll.c    | 52 ++++++++++++++++++++++---
 util/fdmon-io_uring.c | 44 +++++++++++++++++++++-
 util/fdmon-poll.c     | 88 ++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 239 insertions(+), 62 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 99ff48420b..39ed86d14d 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -106,6 +106,38 @@ typedef struct {
      * Returns: true if ->wait() should be called, false otherwise.
      */
     bool (*need_wait)(AioContext *ctx);
+
+    /*
+     * gsource_prepare:
+     * @ctx: the AioContext
+     *
+     * Prepare for the glib event loop to wait for events instead of the usual
+     * ->wait() call. See glib's GSourceFuncs->prepare().
+     */
+    void (*gsource_prepare)(AioContext *ctx);
+
+    /*
+     * gsource_check:
+     * @ctx: the AioContext
+     *
+     * Called by the glib event loop from glib's GSourceFuncs->check() after
+     * waiting for events.
+     *
+     * Returns: true when ready to be dispatched.
+     */
+    bool (*gsource_check)(AioContext *ctx);
+
+    /*
+     * gsource_dispatch:
+     * @ctx: the AioContext
+     * @ready_list: list for handlers that become ready
+     *
+     * Place ready AioHandlers on ready_list. Called as part of the glib event
+     * loop from glib's GSourceFuncs->dispatch().
+     *
+     * Called with list_lock incremented.
+     */
+    void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
 } FDMonOps;
 
 /*
@@ -222,6 +254,7 @@ struct AioContext {
     /* State for file descriptor monitoring using Linux io_uring */
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
+    gpointer io_uring_fd_tag;
 #endif
 
     /* TimerLists for calling timers - one per clock type.  Has its own
@@ -254,6 +287,9 @@ struct AioContext {
     /* epoll(7) state used when built with CONFIG_EPOLL */
     int epollfd;
 
+    /* The GSource unix fd tag for epollfd */
+    gpointer epollfd_tag;
+
     const FDMonOps *fdmon_ops;
 };
 
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 82a0201ea4..f9994ed79e 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -47,9 +47,14 @@ void aio_add_ready_handler(AioHandlerList *ready_list, AioHandler *node,
 
 extern const FDMonOps fdmon_poll_ops;
 
+/* Switch back to poll(2). list_lock must be held. */
+void fdmon_poll_downgrade(AioContext *ctx);
+
 #ifdef CONFIG_EPOLL_CREATE1
 bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd);
 void fdmon_epoll_setup(AioContext *ctx);
+
+/* list_lock must be held */
 void fdmon_epoll_disable(AioContext *ctx);
 #else
 static inline bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
diff --git a/tests/unit/test-aio.c b/tests/unit/test-aio.c
index e77d86be87..010d65b79a 100644
--- a/tests/unit/test-aio.c
+++ b/tests/unit/test-aio.c
@@ -527,7 +527,12 @@ static void test_source_bh_delete_from_cb(void)
     g_assert_cmpint(data1.n, ==, data1.max);
     g_assert(data1.bh == NULL);
 
-    assert(g_main_context_iteration(NULL, false));
+    /*
+     * There may be up to one more iteration due to the aio_notify
+     * EventNotifier.
+     */
+    g_main_context_iteration(NULL, false);
+
     assert(!g_main_context_iteration(NULL, false));
 }
 
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 2439cf0feb..dd135c7baa 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -70,15 +70,6 @@ static AioHandler *find_aio_handler(AioContext *ctx, int fd)
 
 static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
 {
-    /* If the GSource is in the process of being destroyed then
-     * g_source_remove_poll() causes an assertion failure.  Skip
-     * removal in that case, because glib cleans up its state during
-     * destruction anyway.
-     */
-    if (!g_source_is_destroyed(&ctx->source)) {
-        g_source_remove_poll(&ctx->source, &node->pfd);
-    }
-
     node->pfd.revents = 0;
     node->poll_ready = false;
 
@@ -153,7 +144,6 @@ void aio_set_fd_handler(AioContext *ctx,
         } else {
             new_node->pfd = node->pfd;
         }
-        g_source_add_poll(&ctx->source, &new_node->pfd);
 
         new_node->pfd.events = (io_read ? G_IO_IN | G_IO_HUP | G_IO_ERR : 0);
         new_node->pfd.events |= (io_write ? G_IO_OUT | G_IO_ERR : 0);
@@ -267,37 +257,13 @@ bool aio_prepare(AioContext *ctx)
     poll_set_started(ctx, &ready_list, false);
     /* TODO what to do with this list? */
 
+    ctx->fdmon_ops->gsource_prepare(ctx);
     return false;
 }
 
 bool aio_pending(AioContext *ctx)
 {
-    AioHandler *node;
-    bool result = false;
-
-    /*
-     * We have to walk very carefully in case aio_set_fd_handler is
-     * called while we're walking.
-     */
-    qemu_lockcnt_inc(&ctx->list_lock);
-
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-        int revents;
-
-        /* TODO should this check poll ready? */
-        revents = node->pfd.revents & node->pfd.events;
-        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
-            result = true;
-            break;
-        }
-        if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
-            result = true;
-            break;
-        }
-    }
-    qemu_lockcnt_dec(&ctx->list_lock);
-
-    return result;
+    return ctx->fdmon_ops->gsource_check(ctx);
 }
 
 static void aio_free_deleted_handlers(AioContext *ctx)
@@ -390,10 +356,6 @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
     return progress;
 }
 
-/*
- * If we have a list of ready handlers then this is more efficient than
- * scanning all handlers with aio_dispatch_handlers().
- */
 static bool aio_dispatch_ready_handlers(AioContext *ctx,
                                         AioHandlerList *ready_list,
                                         int64_t block_ns)
@@ -417,24 +379,18 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
     return progress;
 }
 
-/* Slower than aio_dispatch_ready_handlers() but only used via glib */
-static bool aio_dispatch_handlers(AioContext *ctx)
-{
-    AioHandler *node, *tmp;
-    bool progress = false;
-
-    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
-        progress = aio_dispatch_handler(ctx, node) || progress;
-    }
-
-    return progress;
-}
-
 void aio_dispatch(AioContext *ctx)
 {
+    AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
+
     qemu_lockcnt_inc(&ctx->list_lock);
     aio_bh_poll(ctx);
-    aio_dispatch_handlers(ctx);
+
+    ctx->fdmon_ops->gsource_dispatch(ctx, &ready_list);
+
+    /* block_ns is 0 because polling is disabled in the glib event loop */
+    aio_dispatch_ready_handlers(ctx, &ready_list, 0);
+
     aio_free_deleted_handlers(ctx);
     qemu_lockcnt_dec(&ctx->list_lock);
 
@@ -759,6 +715,7 @@ void aio_context_setup(AioContext *ctx)
 {
     ctx->fdmon_ops = &fdmon_poll_ops;
     ctx->epollfd = -1;
+    ctx->epollfd_tag = NULL;
 
     /* Use the fastest fd monitoring implementation if available */
     if (fdmon_io_uring_setup(ctx)) {
@@ -771,7 +728,11 @@ void aio_context_setup(AioContext *ctx)
 void aio_context_destroy(AioContext *ctx)
 {
     fdmon_io_uring_destroy(ctx);
+
+    qemu_lockcnt_lock(&ctx->list_lock);
     fdmon_epoll_disable(ctx);
+    qemu_lockcnt_unlock(&ctx->list_lock);
+
     aio_free_deleted_handlers(ctx);
 }
 
diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
index 9fb8800dde..0e78baa634 100644
--- a/util/fdmon-epoll.c
+++ b/util/fdmon-epoll.c
@@ -19,8 +19,12 @@ void fdmon_epoll_disable(AioContext *ctx)
         ctx->epollfd = -1;
     }
 
-    /* Switch back */
-    ctx->fdmon_ops = &fdmon_poll_ops;
+    if (ctx->epollfd_tag) {
+        g_source_remove_unix_fd(&ctx->source, ctx->epollfd_tag);
+        ctx->epollfd_tag = NULL;
+    }
+
+    fdmon_poll_downgrade(ctx);
 }
 
 static inline int epoll_events_from_pfd(int pfd_events)
@@ -93,10 +97,47 @@ out:
     return ret;
 }
 
+static void fdmon_epoll_gsource_prepare(AioContext *ctx)
+{
+    /* Do nothing */
+}
+
+static bool fdmon_epoll_gsource_check(AioContext *ctx)
+{
+    return g_source_query_unix_fd(&ctx->source, ctx->epollfd_tag) & G_IO_IN;
+}
+
+static void fdmon_epoll_gsource_dispatch(AioContext *ctx,
+                                         AioHandlerList *ready_list)
+{
+    AioHandler *node;
+    int ret;
+    struct epoll_event events[128];
+
+    /* Collect events and process them */
+    ret = epoll_wait(ctx->epollfd, events, ARRAY_SIZE(events), 0);
+    if (ret <= 0) {
+        return;
+    }
+    for (int i = 0; i < ret; i++) {
+        int ev = events[i].events;
+        int revents = (ev & EPOLLIN ? G_IO_IN : 0) |
+                      (ev & EPOLLOUT ? G_IO_OUT : 0) |
+                      (ev & EPOLLHUP ? G_IO_HUP : 0) |
+                      (ev & EPOLLERR ? G_IO_ERR : 0);
+
+        node = events[i].data.ptr;
+        aio_add_ready_handler(ready_list, node, revents);
+    }
+}
+
 static const FDMonOps fdmon_epoll_ops = {
     .update = fdmon_epoll_update,
     .wait = fdmon_epoll_wait,
     .need_wait = aio_poll_disabled,
+    .gsource_prepare = fdmon_epoll_gsource_prepare,
+    .gsource_check = fdmon_epoll_gsource_check,
+    .gsource_dispatch = fdmon_epoll_gsource_dispatch,
 };
 
 static bool fdmon_epoll_try_enable(AioContext *ctx)
@@ -118,6 +159,8 @@ static bool fdmon_epoll_try_enable(AioContext *ctx)
     }
 
     ctx->fdmon_ops = &fdmon_epoll_ops;
+    ctx->epollfd_tag = g_source_add_unix_fd(&ctx->source, ctx->epollfd,
+                                            G_IO_IN);
     return true;
 }
 
@@ -139,12 +182,11 @@ bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
     }
 
     ok = fdmon_epoll_try_enable(ctx);
-
-    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
-
     if (!ok) {
         fdmon_epoll_disable(ctx);
     }
+
+    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
     return ok;
 }
 
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 6cd665e565..2092d08d24 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -268,6 +268,11 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
     unsigned num_ready = 0;
     unsigned head;
 
+    /* If the CQ overflowed then fetch CQEs with a syscall */
+    if (io_uring_cq_has_overflow(ring)) {
+        io_uring_get_events(ring);
+    }
+
     io_uring_for_each_cqe(ring, head, cqe) {
         if (process_cqe(ctx, ready_list, cqe)) {
             num_ready++;
@@ -280,6 +285,30 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
     return num_ready;
 }
 
+/* This is where SQEs are submitted in the glib event loop */
+static void fdmon_io_uring_gsource_prepare(AioContext *ctx)
+{
+    fill_sq_ring(ctx);
+    if (io_uring_sq_ready(&ctx->fdmon_io_uring)) {
+        while (io_uring_submit(&ctx->fdmon_io_uring) == -EINTR) {
+            /* Keep trying if syscall was interrupted */
+        }
+    }
+}
+
+static bool fdmon_io_uring_gsource_check(AioContext *ctx)
+{
+    gpointer tag = ctx->io_uring_fd_tag;
+    return g_source_query_unix_fd(&ctx->source, tag) & G_IO_IN;
+}
+
+/* This is where CQEs are processed in the glib event loop */
+static void fdmon_io_uring_gsource_dispatch(AioContext *ctx,
+                                            AioHandlerList *ready_list)
+{
+    process_cq_ring(ctx, ready_list);
+}
+
 static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
                                int64_t timeout)
 {
@@ -327,12 +356,17 @@ static const FDMonOps fdmon_io_uring_ops = {
     .update = fdmon_io_uring_update,
     .wait = fdmon_io_uring_wait,
     .need_wait = fdmon_io_uring_need_wait,
+    .gsource_prepare = fdmon_io_uring_gsource_prepare,
+    .gsource_check = fdmon_io_uring_gsource_check,
+    .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
 };
 
 bool fdmon_io_uring_setup(AioContext *ctx)
 {
     int ret;
 
+    ctx->io_uring_fd_tag = NULL;
+
     ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
     if (ret != 0) {
         return false;
@@ -340,6 +374,9 @@ bool fdmon_io_uring_setup(AioContext *ctx)
 
     QSLIST_INIT(&ctx->submit_list);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
+    ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
+            ctx->fdmon_io_uring.ring_fd, G_IO_IN);
+
     return true;
 }
 
@@ -364,6 +401,11 @@ void fdmon_io_uring_destroy(AioContext *ctx)
             QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
         }
 
-        ctx->fdmon_ops = &fdmon_poll_ops;
+        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+        ctx->io_uring_fd_tag = NULL;
+
+        qemu_lockcnt_lock(&ctx->list_lock);
+        fdmon_poll_downgrade(ctx);
+        qemu_lockcnt_unlock(&ctx->list_lock);
     }
 }
diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
index 17df917cf9..f91dc54944 100644
--- a/util/fdmon-poll.c
+++ b/util/fdmon-poll.c
@@ -72,6 +72,11 @@ static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
 
     /* epoll(7) is faster above a certain number of fds */
     if (fdmon_epoll_try_upgrade(ctx, npfd)) {
+        QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+            if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events) {
+                g_source_remove_poll(&ctx->source, &node->pfd);
+            }
+        }
         npfd = 0; /* we won't need pollfds[], reset npfd */
         return ctx->fdmon_ops->wait(ctx, ready_list, timeout);
     }
@@ -97,11 +102,92 @@ static void fdmon_poll_update(AioContext *ctx,
                               AioHandler *old_node,
                               AioHandler *new_node)
 {
-    /* Do nothing, AioHandler already contains the state we'll need */
+    if (old_node && !new_node) {
+        /*
+         * If the GSource is in the process of being destroyed then
+         * g_source_remove_poll() causes an assertion failure.  Skip removal in
+         * that case, because glib cleans up its state during destruction
+         * anyway.
+         */
+        if (!g_source_is_destroyed(&ctx->source)) {
+            g_source_remove_poll(&ctx->source, &old_node->pfd);
+        }
+    }
+
+    if (!old_node && new_node) {
+        g_source_add_poll(&ctx->source, &new_node->pfd);
+    }
+}
+
+static void fdmon_poll_gsource_prepare(AioContext *ctx)
+{
+    /* Do nothing */
+}
+
+static bool fdmon_poll_gsource_check(AioContext *ctx)
+{
+    AioHandler *node;
+    bool result = false;
+
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        int revents = node->pfd.revents & node->pfd.events;
+
+        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
+            result = true;
+            break;
+        }
+        if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
+            result = true;
+            break;
+        }
+    }
+
+    qemu_lockcnt_dec(&ctx->list_lock);
+
+    return result;
+}
+
+static void fdmon_poll_gsource_dispatch(AioContext *ctx,
+                                        AioHandlerList *ready_list)
+{
+    AioHandler *node;
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        int revents;
+
+        revents = node->pfd.revents & node->pfd.events;
+        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
+            aio_add_ready_handler(ready_list, node, revents);
+        } else if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
+            aio_add_ready_handler(ready_list, node, revents);
+        }
+    }
 }
 
 const FDMonOps fdmon_poll_ops = {
     .update = fdmon_poll_update,
     .wait = fdmon_poll_wait,
     .need_wait = aio_poll_disabled,
+    .gsource_prepare = fdmon_poll_gsource_prepare,
+    .gsource_check = fdmon_poll_gsource_check,
+    .gsource_dispatch = fdmon_poll_gsource_dispatch,
 };
+
+void fdmon_poll_downgrade(AioContext *ctx)
+{
+    AioHandler *node;
+
+    ctx->fdmon_ops = &fdmon_poll_ops;
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events) {
+            g_source_add_poll(&ctx->source, &node->pfd);
+        }
+    }
+}
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 04/11] aio-posix: integrate fdmon into glib event loop
  2025-05-28 19:09 ` [RFC 04/11] aio-posix: integrate fdmon into glib event loop Stefan Hajnoczi
@ 2025-05-28 21:01   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 21:01 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:09PM -0400, Stefan Hajnoczi wrote:
> AioContext's glib integration only supports ppoll(2) file descriptor
> monitoring. epoll(7) and io_uring(7) disable themselves and switch back
> to ppoll(2) when the glib event loop is used. The main loop thread
> cannot use epoll(7) or io_uring(7) because it always uses the glib event
> loop.
> 
> Future QEMU features may require io_uring(7). One example is uring_cmd
> support in FUSE exports. Each feature could create its own io_uring(7)
> context and integrate it into the event loop, but this is inefficient
> due to extra syscalls. It would be more efficient to reuse the
> AioContext's existing fdmon-io_uring.c io_uring(7) context because
> fdmon-io_uring.c will already be active on systems where Linux io_uring
> is available.
> 
> In order to keep fdmon-io_uring.c's AioContext operational even when the
> glib event loop is used, extend FDMonOps with an API similar to
> GSourceFuncs so that file descriptor monitoring can integrate into the
> glib event loop.
> 
> A quick summary of the GSourceFuncs API:
> - prepare() is called each event loop iteration before waiting for file
>   descriptors and timers.
> - check() is called to determine whether events are ready to be
>   dispatched after waiting.
> - dispatch() is called to process events.
> 
> More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html
> 
> Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
> also implement epoll(7)- and io_uring(7)-specific file descriptor
> monitoring code for glib event loops.
> 
> Note that it's still faster to use aio_poll() rather than the glib event
> loop since glib waits for file descriptor activity with ppoll(2) and
> does not support adaptive polling. But at least epoll(7) and io_uring(7)
> now work in glib event loops.

Nice.

> 
> Splitting this into multiple commits without temporarily breaking
> AioContext proved difficult so this commit makes all the changes. The
> next commit will remove the aio_context_use_g_source() API because it is
> no longer needed.

Resulting in a big patch, but I agree that there's no easier way to do
it more incrementally.

> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/block/aio.h   | 36 ++++++++++++++++++
>  util/aio-posix.h      |  5 +++
>  tests/unit/test-aio.c |  7 +++-
>  util/aio-posix.c      | 69 ++++++++-------------------------
>  util/fdmon-epoll.c    | 52 ++++++++++++++++++++++---
>  util/fdmon-io_uring.c | 44 +++++++++++++++++++++-
>  util/fdmon-poll.c     | 88 ++++++++++++++++++++++++++++++++++++++++++-
>  7 files changed, 239 insertions(+), 62 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 05/11] aio: remove aio_context_use_g_source()
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 04/11] aio-posix: integrate fdmon into glib event loop Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 21:02   ` Eric Blake
  2025-05-28 19:09 ` [RFC 06/11] aio: free AioContext when aio_context_new() fails Stefan Hajnoczi
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h               |  3 ---
 tests/unit/test-nested-aio-poll.c |  6 ------
 util/aio-posix.c                  | 12 ------------
 util/aio-win32.c                  |  4 ----
 util/async.c                      |  1 -
 5 files changed, 26 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 39ed86d14d..1657740a0e 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -728,9 +728,6 @@ void aio_context_setup(AioContext *ctx);
  */
 void aio_context_destroy(AioContext *ctx);
 
-/* Used internally, do not call outside AioContext code */
-void aio_context_use_g_source(AioContext *ctx);
-
 /**
  * aio_context_set_poll_params:
  * @ctx: the aio context
diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
index 45484e745b..d13ecccd8c 100644
--- a/tests/unit/test-nested-aio-poll.c
+++ b/tests/unit/test-nested-aio-poll.c
@@ -83,12 +83,6 @@ static void test(void)
     /* Enable polling */
     aio_context_set_poll_params(td.ctx, 1000000, 2, 2, &error_abort);
 
-    /*
-     * The GSource is unused but this has the side-effect of changing the fdmon
-     * that AioContext uses.
-     */
-    aio_get_g_source(td.ctx);
-
     /* Make the event notifier active (set) right away */
     event_notifier_init(&td.poll_notifier, 1);
     aio_set_event_notifier(td.ctx, &td.poll_notifier,
diff --git a/util/aio-posix.c b/util/aio-posix.c
index dd135c7baa..5634da7d27 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -736,18 +736,6 @@ void aio_context_destroy(AioContext *ctx)
     aio_free_deleted_handlers(ctx);
 }
 
-void aio_context_use_g_source(AioContext *ctx)
-{
-    /*
-     * Disable io_uring when the glib main loop is used because it doesn't
-     * support mixed glib/aio_poll() usage. It relies on aio_poll() being
-     * called regularly so that changes to the monitored file descriptors are
-     * submitted, otherwise a list of pending fd handlers builds up.
-     */
-    fdmon_io_uring_destroy(ctx);
-    aio_free_deleted_handlers(ctx);
-}
-
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
                                  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/aio-win32.c b/util/aio-win32.c
index 6583d5c5f3..34c4074133 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -427,10 +427,6 @@ void aio_context_destroy(AioContext *ctx)
 {
 }
 
-void aio_context_use_g_source(AioContext *ctx)
-{
-}
-
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
                                  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/async.c b/util/async.c
index 2719c629ae..a39410d675 100644
--- a/util/async.c
+++ b/util/async.c
@@ -430,7 +430,6 @@ static GSourceFuncs aio_source_funcs = {
 
 GSource *aio_get_g_source(AioContext *ctx)
 {
-    aio_context_use_g_source(ctx);
     g_source_ref(&ctx->source);
     return &ctx->source;
 }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 05/11] aio: remove aio_context_use_g_source()
  2025-05-28 19:09 ` [RFC 05/11] aio: remove aio_context_use_g_source() Stefan Hajnoczi
@ 2025-05-28 21:02   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 21:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:10PM -0400, Stefan Hajnoczi wrote:
> There is no need for aio_context_use_g_source() now that epoll(7) and
> io_uring(7) file descriptor monitoring works with the glib event loop.
> AioContext doesn't need to be notified that GSource is being used.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 06/11] aio: free AioContext when aio_context_new() fails
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 05/11] aio: remove aio_context_use_g_source() Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 21:06   ` Eric Blake
  2025-05-28 19:09 ` [RFC 07/11] aio: add errp argument to aio_context_setup() Stefan Hajnoczi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.

Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.

aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h |  3 +++
 util/async.c        | 12 ++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 1657740a0e..2760f308f5 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,6 +291,9 @@ struct AioContext {
     gpointer epollfd_tag;
 
     const FDMonOps *fdmon_ops;
+
+    /* Was aio_context_new() successful? */
+    bool initialized;
 };
 
 /**
diff --git a/util/async.c b/util/async.c
index a39410d675..bc841eeb4f 100644
--- a/util/async.c
+++ b/util/async.c
@@ -369,6 +369,10 @@ aio_ctx_finalize(GSource     *source)
     QEMUBH *bh;
     unsigned flags;
 
+    if (!ctx->initialized) {
+        return;
+    }
+
     thread_pool_free_aio(ctx->thread_pool);
 
 #ifdef CONFIG_LINUX_AIO
@@ -579,13 +583,15 @@ AioContext *aio_context_new(Error **errp)
     ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
     QSLIST_INIT(&ctx->bh_list);
     QSIMPLEQ_INIT(&ctx->bh_slice_list);
-    aio_context_setup(ctx);
 
     ret = event_notifier_init(&ctx->notifier, false);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Failed to initialize event notifier");
         goto fail;
     }
+
+    aio_context_setup(ctx);
+
     g_source_set_can_recurse(&ctx->source, true);
     qemu_lockcnt_init(&ctx->list_lock);
 
@@ -619,9 +625,11 @@ AioContext *aio_context_new(Error **errp)
 
     register_aiocontext(ctx);
 
+    ctx->initialized = true;
+
     return ctx;
 fail:
-    g_source_destroy(&ctx->source);
+    g_source_unref(&ctx->source);
     return NULL;
 }
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 06/11] aio: free AioContext when aio_context_new() fails
  2025-05-28 19:09 ` [RFC 06/11] aio: free AioContext when aio_context_new() fails Stefan Hajnoczi
@ 2025-05-28 21:06   ` Eric Blake
  2025-06-05 17:49     ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Blake @ 2025-05-28 21:06 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:11PM -0400, Stefan Hajnoczi wrote:
> g_source_destroy() only removes the GSource from the GMainContext it's
> attached to, if any. It does not free it.
> 
> Use g_source_unref() instead so that the AioContext (which embeds a
> GSource) is freed. There is no need to call g_source_destroy() in
> aio_context_new() because the GSource isn't attached to a GMainContext
> yet.
> 
> aio_ctx_finalize() expects everything to be set up already, so introduce
> the new ctx->initialized boolean and do nothing when called with
> !initialized. This also requires moving aio_context_setup() down after
> event_notifier_init() since aio_ctx_finalize() won't release any
> resources that aio_context_setup() acquired.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/block/aio.h |  3 +++
>  util/async.c        | 12 ++++++++++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 

> +++ b/util/async.c
> @@ -369,6 +369,10 @@ aio_ctx_finalize(GSource     *source)

While you're here, is it worth cleaning up that odd spacing in the
function parameter list?

>      QEMUBH *bh;
>      unsigned flags;
>  
> +    if (!ctx->initialized) {
> +        return;
> +    }
> +
>      thread_pool_free_aio(ctx->thread_pool);

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 06/11] aio: free AioContext when aio_context_new() fails
  2025-05-28 21:06   ` Eric Blake
@ 2025-06-05 17:49     ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-06-05 17:49 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]

On Wed, May 28, 2025 at 04:06:13PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:11PM -0400, Stefan Hajnoczi wrote:
> > g_source_destroy() only removes the GSource from the GMainContext it's
> > attached to, if any. It does not free it.
> > 
> > Use g_source_unref() instead so that the AioContext (which embeds a
> > GSource) is freed. There is no need to call g_source_destroy() in
> > aio_context_new() because the GSource isn't attached to a GMainContext
> > yet.
> > 
> > aio_ctx_finalize() expects everything to be set up already, so introduce
> > the new ctx->initialized boolean and do nothing when called with
> > !initialized. This also requires moving aio_context_setup() down after
> > event_notifier_init() since aio_ctx_finalize() won't release any
> > resources that aio_context_setup() acquired.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  include/block/aio.h |  3 +++
> >  util/async.c        | 12 ++++++++++--
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> > 
> 
> > +++ b/util/async.c
> > @@ -369,6 +369,10 @@ aio_ctx_finalize(GSource     *source)
> 
> While you're here, is it worth cleaning up that odd spacing in the
> function parameter list?

Yes.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 07/11] aio: add errp argument to aio_context_setup()
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 06/11] aio: free AioContext when aio_context_new() fails Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 21:07   ` Eric Blake
  2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.

However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h | 3 ++-
 util/aio-posix.c    | 2 +-
 util/aio-win32.c    | 2 +-
 util/async.c        | 7 ++++++-
 4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 2760f308f5..d919d7c8f4 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -718,10 +718,11 @@ void qemu_set_current_aio_context(AioContext *ctx);
 /**
  * aio_context_setup:
  * @ctx: the aio context
+ * @errp: error pointer
  *
  * Initialize the aio context.
  */
-void aio_context_setup(AioContext *ctx);
+void aio_context_setup(AioContext *ctx, Error **errp);
 
 /**
  * aio_context_destroy:
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 5634da7d27..fa047fc7ad 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -711,7 +711,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
     return progress;
 }
 
-void aio_context_setup(AioContext *ctx)
+void aio_context_setup(AioContext *ctx, Error **errp)
 {
     ctx->fdmon_ops = &fdmon_poll_ops;
     ctx->epollfd = -1;
diff --git a/util/aio-win32.c b/util/aio-win32.c
index 34c4074133..4ba401ff92 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -419,7 +419,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
     return progress;
 }
 
-void aio_context_setup(AioContext *ctx)
+void aio_context_setup(AioContext *ctx, Error **errp)
 {
 }
 
diff --git a/util/async.c b/util/async.c
index bc841eeb4f..bba9622e97 100644
--- a/util/async.c
+++ b/util/async.c
@@ -577,6 +577,7 @@ static void co_schedule_bh_cb(void *opaque)
 
 AioContext *aio_context_new(Error **errp)
 {
+    ERRP_GUARD();
     int ret;
     AioContext *ctx;
 
@@ -590,7 +591,11 @@ AioContext *aio_context_new(Error **errp)
         goto fail;
     }
 
-    aio_context_setup(ctx);
+    aio_context_setup(ctx, errp);
+    if (*errp) {
+        event_notifier_cleanup(&ctx->notifier);
+        goto fail;
+    }
 
     g_source_set_can_recurse(&ctx->source, true);
     qemu_lockcnt_init(&ctx->list_lock);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 07/11] aio: add errp argument to aio_context_setup()
  2025-05-28 19:09 ` [RFC 07/11] aio: add errp argument to aio_context_setup() Stefan Hajnoczi
@ 2025-05-28 21:07   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 21:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:12PM -0400, Stefan Hajnoczi wrote:
> When aio_context_new() -> aio_context_setup() fails at startup it
> doesn't really matter whether errors are returned to the caller or the
> process terminates immediately.
> 
> However, it is not acceptable to terminate when hotplugging --object
> iothread at runtime. Refactor aio_context_setup() so that errors can be
> propagated. The next commit will set errp when fdmon_io_uring_setup()
> fails.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/block/aio.h | 3 ++-
>  util/aio-posix.c    | 2 +-
>  util/aio-win32.c    | 2 +-
>  util/async.c        | 7 ++++++-
>  4 files changed, 10 insertions(+), 4 deletions(-)
>

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 07/11] aio: add errp argument to aio_context_setup() Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 22:12   ` Eric Blake
  2025-06-02 12:26   ` Brian
  2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

io_uring may not be available at runtime due to system policies (e.g.
the io_uring_disabled sysctl) or creation could fail due to file
descriptor resource limits.

Handle failure scenarios as follows:

If another AioContext already has io_uring, then fail AioContext
creation so that the aio_add_sqe() API is available uniformly from all
QEMU threads. Otherwise fall back to epoll(7) if io_uring is
unavailable.

Notes:
- Update the comment about selecting the fastest fdmon implementation.
  At this point it's not about speed anymore, it's about aio_add_sqe()
  API availability.
- Uppercase the error message when converting from error_report() to
  error_setg_errno() for consistency (but there are instances of
  lowercase in the codebase).
- It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.h      | 12 ++----------
 util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
 util/fdmon-io_uring.c |  8 ++++----
 3 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/util/aio-posix.h b/util/aio-posix.h
index f9994ed79e..6f9d97d866 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -18,6 +18,7 @@
 #define AIO_POSIX_H
 
 #include "block/aio.h"
+#include "qapi/error.h"
 
 struct AioHandler {
     GPollFD pfd;
@@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
 #endif /* !CONFIG_EPOLL_CREATE1 */
 
 #ifdef CONFIG_LINUX_IO_URING
-bool fdmon_io_uring_setup(AioContext *ctx);
+void fdmon_io_uring_setup(AioContext *ctx, Error **errp);
 void fdmon_io_uring_destroy(AioContext *ctx);
-#else
-static inline bool fdmon_io_uring_setup(AioContext *ctx)
-{
-    return false;
-}
-
-static inline void fdmon_io_uring_destroy(AioContext *ctx)
-{
-}
 #endif /* !CONFIG_LINUX_IO_URING */
 
 #endif /* AIO_POSIX_H */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index fa047fc7ad..44b3df61f9 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -16,6 +16,7 @@
 #include "qemu/osdep.h"
 #include "block/block.h"
 #include "block/thread-pool.h"
+#include "qapi/error.h"
 #include "qemu/main-loop.h"
 #include "qemu/lockcnt.h"
 #include "qemu/rcu.h"
@@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
     ctx->epollfd = -1;
     ctx->epollfd_tag = NULL;
 
-    /* Use the fastest fd monitoring implementation if available */
-    if (fdmon_io_uring_setup(ctx)) {
-        return;
+#ifdef CONFIG_LINUX_IO_URING
+    {
+        static bool need_io_uring;
+        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
+
+        /* io_uring takes precedence because it provides aio_add_sqe() support */
+        fdmon_io_uring_setup(ctx, &local_err);
+        if (!local_err) {
+            /*
+             * If one AioContext gets io_uring, then all AioContexts need io_uring
+             * so that aio_add_sqe() support is available across all threads.
+             */
+            need_io_uring = true;
+            return;
+        }
+        if (need_io_uring) {
+            error_propagate(errp, local_err);
+            return;
+        }
+
+        warn_report_err_once(local_err); /* frees local_err */
+        local_err = NULL;
     }
+#endif /* CONFIG_LINUX_IO_URING */
 
     fdmon_epoll_setup(ctx);
 }
 
 void aio_context_destroy(AioContext *ctx)
 {
+#ifdef CONFIG_LINUX_IO_URING
     fdmon_io_uring_destroy(ctx);
+#endif
 
     qemu_lockcnt_lock(&ctx->list_lock);
     fdmon_epoll_disable(ctx);
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 2092d08d24..ef1a866a03 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -45,6 +45,7 @@
 
 #include "qemu/osdep.h"
 #include <poll.h>
+#include "qapi/error.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
 
@@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
     .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
 };
 
-bool fdmon_io_uring_setup(AioContext *ctx)
+void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
 
@@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
 
     ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
     if (ret != 0) {
-        return false;
+        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
+        return;
     }
 
     QSLIST_INIT(&ctx->submit_list);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
     ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
             ctx->fdmon_io_uring.ring_fd, G_IO_IN);
-
-    return true;
 }
 
 void fdmon_io_uring_destroy(AioContext *ctx)
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
@ 2025-05-28 22:12   ` Eric Blake
  2025-05-29 15:38     ` Stefan Hajnoczi
  2025-06-02 12:26   ` Brian
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Blake @ 2025-05-28 22:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:13PM -0400, Stefan Hajnoczi wrote:
> io_uring may not be available at runtime due to system policies (e.g.
> the io_uring_disabled sysctl) or creation could fail due to file
> descriptor resource limits.
> 
> Handle failure scenarios as follows:
> 
> If another AioContext already has io_uring, then fail AioContext
> creation so that the aio_add_sqe() API is available uniformly from all
> QEMU threads. Otherwise fall back to epoll(7) if io_uring is
> unavailable.
> 
> Notes:
> - Update the comment about selecting the fastest fdmon implementation.
>   At this point it's not about speed anymore, it's about aio_add_sqe()
>   API availability.
> - Uppercase the error message when converting from error_report() to
>   error_setg_errno() for consistency (but there are instances of
>   lowercase in the codebase).
> - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  util/aio-posix.h      | 12 ++----------
>  util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
>  util/fdmon-io_uring.c |  8 ++++----
>  3 files changed, 32 insertions(+), 17 deletions(-)
> 
> diff --git a/util/aio-posix.h b/util/aio-posix.h
> index f9994ed79e..6f9d97d866 100644
> --- a/util/aio-posix.h
> +++ b/util/aio-posix.h
> @@ -18,6 +18,7 @@
>  #define AIO_POSIX_H
>  
>  #include "block/aio.h"
> +#include "qapi/error.h"
>  
>  struct AioHandler {
>      GPollFD pfd;
> @@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
>  #endif /* !CONFIG_EPOLL_CREATE1 */
>  
>  #ifdef CONFIG_LINUX_IO_URING
> -bool fdmon_io_uring_setup(AioContext *ctx);
> +void fdmon_io_uring_setup(AioContext *ctx, Error **errp);

Why change the return type to void?  That forces you to have to check
errp.  If you still return bool (true for errp unchanged, false when
errp set), callers might have a simpler interface.

>  void fdmon_io_uring_destroy(AioContext *ctx);
> -#else
> -static inline bool fdmon_io_uring_setup(AioContext *ctx)
> -{
> -    return false;
> -}
> -
> -static inline void fdmon_io_uring_destroy(AioContext *ctx)
> -{
> -}
>  #endif /* !CONFIG_LINUX_IO_URING */
>  
>  #endif /* AIO_POSIX_H */
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index fa047fc7ad..44b3df61f9 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -16,6 +16,7 @@
>  #include "qemu/osdep.h"
>  #include "block/block.h"
>  #include "block/thread-pool.h"
> +#include "qapi/error.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/lockcnt.h"
>  #include "qemu/rcu.h"
> @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
>      ctx->epollfd = -1;
>      ctx->epollfd_tag = NULL;
>  
> -    /* Use the fastest fd monitoring implementation if available */
> -    if (fdmon_io_uring_setup(ctx)) {
> -        return;
> +#ifdef CONFIG_LINUX_IO_URING
> +    {
> +        static bool need_io_uring;
> +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */

Really? I thought that was part of why we added ERRP_GUARD, so that
error_abort is pinned closer to the spot where the error is triggered
rather than where it was chained.  But your use of errp is a bit
different than usual here, so you may be correct that it doesn't
handle error_abort in the way that you want (allowing a graceful
downgrade to epoll if it is the first failure, but aborting if it is
the second AioContext that fails).

> +
> +        /* io_uring takes precedence because it provides aio_add_sqe() support */
> +        fdmon_io_uring_setup(ctx, &local_err);
> +        if (!local_err) {
> +            /*
> +             * If one AioContext gets io_uring, then all AioContexts need io_uring
> +             * so that aio_add_sqe() support is available across all threads.
> +             */
> +            need_io_uring = true;
> +            return;
> +        }
> +        if (need_io_uring) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +
> +        warn_report_err_once(local_err); /* frees local_err */
> +        local_err = NULL;
>      }
> +#endif /* CONFIG_LINUX_IO_URING */
>  
>      fdmon_epoll_setup(ctx);
>  }
>  
>  void aio_context_destroy(AioContext *ctx)
>  {
> +#ifdef CONFIG_LINUX_IO_URING
>      fdmon_io_uring_destroy(ctx);
> +#endif
>  
>      qemu_lockcnt_lock(&ctx->list_lock);
>      fdmon_epoll_disable(ctx);
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 2092d08d24..ef1a866a03 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -45,6 +45,7 @@
>  
>  #include "qemu/osdep.h"
>  #include <poll.h>
> +#include "qapi/error.h"
>  #include "qemu/rcu_queue.h"
>  #include "aio-posix.h"
>  
> @@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
>      .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
>  };
>  
> -bool fdmon_io_uring_setup(AioContext *ctx)
> +void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>  {
>      int ret;
>  
> @@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
>  
>      ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
>      if (ret != 0) {
> -        return false;
> +        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
> +        return;

This is where I question whether you should still return bool.

>      }
>  
>      QSLIST_INIT(&ctx->submit_list);
>      ctx->fdmon_ops = &fdmon_io_uring_ops;
>      ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
>              ctx->fdmon_io_uring.ring_fd, G_IO_IN);
> -
> -    return true;
>  }
>  
>  void fdmon_io_uring_destroy(AioContext *ctx)
> -- 
> 2.49.0
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-05-28 22:12   ` Eric Blake
@ 2025-05-29 15:38     ` Stefan Hajnoczi
  2025-06-03  6:05       ` Markus Armbruster
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-29 15:38 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek,
	armbru

[-- Attachment #1: Type: text/plain, Size: 6993 bytes --]

On Wed, May 28, 2025 at 05:12:14PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:13PM -0400, Stefan Hajnoczi wrote:
> > io_uring may not be available at runtime due to system policies (e.g.
> > the io_uring_disabled sysctl) or creation could fail due to file
> > descriptor resource limits.
> > 
> > Handle failure scenarios as follows:
> > 
> > If another AioContext already has io_uring, then fail AioContext
> > creation so that the aio_add_sqe() API is available uniformly from all
> > QEMU threads. Otherwise fall back to epoll(7) if io_uring is
> > unavailable.
> > 
> > Notes:
> > - Update the comment about selecting the fastest fdmon implementation.
> >   At this point it's not about speed anymore, it's about aio_add_sqe()
> >   API availability.
> > - Uppercase the error message when converting from error_report() to
> >   error_setg_errno() for consistency (but there are instances of
> >   lowercase in the codebase).
> > - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  util/aio-posix.h      | 12 ++----------
> >  util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
> >  util/fdmon-io_uring.c |  8 ++++----
> >  3 files changed, 32 insertions(+), 17 deletions(-)
> > 
> > diff --git a/util/aio-posix.h b/util/aio-posix.h
> > index f9994ed79e..6f9d97d866 100644
> > --- a/util/aio-posix.h
> > +++ b/util/aio-posix.h
> > @@ -18,6 +18,7 @@
> >  #define AIO_POSIX_H
> >  
> >  #include "block/aio.h"
> > +#include "qapi/error.h"
> >  
> >  struct AioHandler {
> >      GPollFD pfd;
> > @@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
> >  #endif /* !CONFIG_EPOLL_CREATE1 */
> >  
> >  #ifdef CONFIG_LINUX_IO_URING
> > -bool fdmon_io_uring_setup(AioContext *ctx);
> > +void fdmon_io_uring_setup(AioContext *ctx, Error **errp);
> 
> Why change the return type to void?  That forces you to have to check
> errp.  If you still return bool (true for errp unchanged, false when
> errp set), callers might have a simpler interface.

QEMU has both forms. I prefer void foo(Error **errp) because it
eliminates these awkward states:
1. Return true with errp set. There is a risk of leaking errp here.
2. Return false with errp NULL. This results in no error message.

Sometimes it is handy to have a return value but I think that void is a
good default return type.

I have CCed Markus in case he has suggestions.

> 
> >  void fdmon_io_uring_destroy(AioContext *ctx);
> > -#else
> > -static inline bool fdmon_io_uring_setup(AioContext *ctx)
> > -{
> > -    return false;
> > -}
> > -
> > -static inline void fdmon_io_uring_destroy(AioContext *ctx)
> > -{
> > -}
> >  #endif /* !CONFIG_LINUX_IO_URING */
> >  
> >  #endif /* AIO_POSIX_H */
> > diff --git a/util/aio-posix.c b/util/aio-posix.c
> > index fa047fc7ad..44b3df61f9 100644
> > --- a/util/aio-posix.c
> > +++ b/util/aio-posix.c
> > @@ -16,6 +16,7 @@
> >  #include "qemu/osdep.h"
> >  #include "block/block.h"
> >  #include "block/thread-pool.h"
> > +#include "qapi/error.h"
> >  #include "qemu/main-loop.h"
> >  #include "qemu/lockcnt.h"
> >  #include "qemu/rcu.h"
> > @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
> >      ctx->epollfd = -1;
> >      ctx->epollfd_tag = NULL;
> >  
> > -    /* Use the fastest fd monitoring implementation if available */
> > -    if (fdmon_io_uring_setup(ctx)) {
> > -        return;
> > +#ifdef CONFIG_LINUX_IO_URING
> > +    {
> > +        static bool need_io_uring;
> > +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
> 
> Really? I thought that was part of why we added ERRP_GUARD, so that
> error_abort is pinned closer to the spot where the error is triggered
> rather than where it was chained.  But your use of errp is a bit
> different than usual here, so you may be correct that it doesn't
> handle error_abort in the way that you want (allowing a graceful
> downgrade to epoll if it is the first failure, but aborting if it is
> the second AioContext that fails).

The ERRP_GUARD() doc comment explains why it behaves this way:

 * Note: &error_abort is not rewritten, because that would move the
 * abort from the place where the error is created to the place where
 * it's propagated.

> 
> > +
> > +        /* io_uring takes precedence because it provides aio_add_sqe() support */
> > +        fdmon_io_uring_setup(ctx, &local_err);
> > +        if (!local_err) {
> > +            /*
> > +             * If one AioContext gets io_uring, then all AioContexts need io_uring
> > +             * so that aio_add_sqe() support is available across all threads.
> > +             */
> > +            need_io_uring = true;
> > +            return;
> > +        }
> > +        if (need_io_uring) {
> > +            error_propagate(errp, local_err);
> > +            return;
> > +        }
> > +
> > +        warn_report_err_once(local_err); /* frees local_err */
> > +        local_err = NULL;
> >      }
> > +#endif /* CONFIG_LINUX_IO_URING */
> >  
> >      fdmon_epoll_setup(ctx);
> >  }
> >  
> >  void aio_context_destroy(AioContext *ctx)
> >  {
> > +#ifdef CONFIG_LINUX_IO_URING
> >      fdmon_io_uring_destroy(ctx);
> > +#endif
> >  
> >      qemu_lockcnt_lock(&ctx->list_lock);
> >      fdmon_epoll_disable(ctx);
> > diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> > index 2092d08d24..ef1a866a03 100644
> > --- a/util/fdmon-io_uring.c
> > +++ b/util/fdmon-io_uring.c
> > @@ -45,6 +45,7 @@
> >  
> >  #include "qemu/osdep.h"
> >  #include <poll.h>
> > +#include "qapi/error.h"
> >  #include "qemu/rcu_queue.h"
> >  #include "aio-posix.h"
> >  
> > @@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
> >      .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
> >  };
> >  
> > -bool fdmon_io_uring_setup(AioContext *ctx)
> > +void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
> >  {
> >      int ret;
> >  
> > @@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
> >  
> >      ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
> >      if (ret != 0) {
> > -        return false;
> > +        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
> > +        return;
> 
> This is where I question whether you should still return bool.
> 
> >      }
> >  
> >      QSLIST_INIT(&ctx->submit_list);
> >      ctx->fdmon_ops = &fdmon_io_uring_ops;
> >      ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
> >              ctx->fdmon_io_uring.ring_fd, G_IO_IN);
> > -
> > -    return true;
> >  }
> >  
> >  void fdmon_io_uring_destroy(AioContext *ctx)
> > -- 
> > 2.49.0
> > 
> > 
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization:  qemu.org | libguestfs.org
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-05-29 15:38     ` Stefan Hajnoczi
@ 2025-06-03  6:05       ` Markus Armbruster
  2025-06-03 18:48         ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Markus Armbruster @ 2025-06-03  6:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eric Blake, qemu-devel, qemu-block, hibriansong, Kevin Wolf,
	Hanna Czenczek

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Wed, May 28, 2025 at 05:12:14PM -0500, Eric Blake wrote:
>> On Wed, May 28, 2025 at 03:09:13PM -0400, Stefan Hajnoczi wrote:
>> > io_uring may not be available at runtime due to system policies (e.g.
>> > the io_uring_disabled sysctl) or creation could fail due to file
>> > descriptor resource limits.
>> > 
>> > Handle failure scenarios as follows:
>> > 
>> > If another AioContext already has io_uring, then fail AioContext
>> > creation so that the aio_add_sqe() API is available uniformly from all
>> > QEMU threads. Otherwise fall back to epoll(7) if io_uring is
>> > unavailable.
>> > 
>> > Notes:
>> > - Update the comment about selecting the fastest fdmon implementation.
>> >   At this point it's not about speed anymore, it's about aio_add_sqe()
>> >   API availability.
>> > - Uppercase the error message when converting from error_report() to
>> >   error_setg_errno() for consistency (but there are instances of
>> >   lowercase in the codebase).
>> > - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
>> > 
>> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> > ---
>> >  util/aio-posix.h      | 12 ++----------
>> >  util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
>> >  util/fdmon-io_uring.c |  8 ++++----
>> >  3 files changed, 32 insertions(+), 17 deletions(-)
>> > 
>> > diff --git a/util/aio-posix.h b/util/aio-posix.h
>> > index f9994ed79e..6f9d97d866 100644
>> > --- a/util/aio-posix.h
>> > +++ b/util/aio-posix.h
>> > @@ -18,6 +18,7 @@
>> >  #define AIO_POSIX_H
>> >  
>> >  #include "block/aio.h"
>> > +#include "qapi/error.h"
>> >  
>> >  struct AioHandler {
>> >      GPollFD pfd;
>> > @@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
>> >  #endif /* !CONFIG_EPOLL_CREATE1 */
>> >  
>> >  #ifdef CONFIG_LINUX_IO_URING
>> > -bool fdmon_io_uring_setup(AioContext *ctx);
>> > +void fdmon_io_uring_setup(AioContext *ctx, Error **errp);
>> 
>> Why change the return type to void?  That forces you to have to check
>> errp.  If you still return bool (true for errp unchanged, false when
>> errp set), callers might have a simpler interface.
>
> QEMU has both forms. I prefer void foo(Error **errp) because it
> eliminates these awkward states:
> 1. Return true with errp set. There is a risk of leaking errp here.
> 2. Return false with errp NULL. This results in no error message.
>
> Sometimes it is handy to have a return value but I think that void is a
> good default return type.

I used to think this way, too.  Experience changed my mind.

The "awkward states" are bugs.

The price to avoid them is more verbose error handling.  Because such
bugs have been rare in practice, the vebosity has turned out to be too
much pain for too little gain.  qapi/error.h's big comment:

 * = Rules =
 [...]
 * - Whenever practical, also return a value that indicates success /
 *   failure.  This can make the error checking more concise, and can
 *   avoid useless error object creation and destruction.  Note that
 *   we still have many functions returning void.  We recommend
 *   • bool-valued functions return true on success / false on failure,
 *   • pointer-valued functions return non-null / null pointer, and
 *   • integer-valued functions return non-negative / negative.

For what it's worth, this is exactly how GError wants to be used.  We
deviated from it because we thought we were smarter, and came to regret
it.

Further down, the big comment shows example code:

 * Call a function, receive an error from it, and pass it to the caller
 * - when the function returns a value that indicates failure, say
 *   false:
 *     if (!foo(arg, errp)) {
 *         handle the error...
 *     }
 * - when it does not, say because it is a void function:
 *     ERRP_GUARD();
 *     foo(arg, errp);
 *     if (*errp) {
 *         handle the error...
 *     }
 * More on ERRP_GUARD() below.
 *
 * Code predating ERRP_GUARD() still exists, and looks like this:
 *     Error *err = NULL;
 *     foo(arg, &err);
 *     if (err) {
 *         handle the error...
 *         error_propagate(errp, err); // deprecated
 *     }
 * Avoid in new code.  Do *not* "optimize" it to
 *     foo(arg, errp);
 *     if (*errp) { // WRONG!
 *         handle the error...
 *     }
 * because errp may be NULL without the ERRP_GUARD() guard.

In case you think the difference in readability isn't all that big:
error handling is *everywhere*, and any clutter adds up quickly,
obscuring the logic.  Every line counts.

> I have CCed Markus in case he has suggestions.

Thanks for that!

>> >  void fdmon_io_uring_destroy(AioContext *ctx);
>> > -#else
>> > -static inline bool fdmon_io_uring_setup(AioContext *ctx)
>> > -{
>> > -    return false;
>> > -}
>> > -
>> > -static inline void fdmon_io_uring_destroy(AioContext *ctx)
>> > -{
>> > -}
>> >  #endif /* !CONFIG_LINUX_IO_URING */
>> >  
>> >  #endif /* AIO_POSIX_H */
>> > diff --git a/util/aio-posix.c b/util/aio-posix.c
>> > index fa047fc7ad..44b3df61f9 100644
>> > --- a/util/aio-posix.c
>> > +++ b/util/aio-posix.c
>> > @@ -16,6 +16,7 @@
>> >  #include "qemu/osdep.h"
>> >  #include "block/block.h"
>> >  #include "block/thread-pool.h"
>> > +#include "qapi/error.h"
>> >  #include "qemu/main-loop.h"
>> >  #include "qemu/lockcnt.h"
>> >  #include "qemu/rcu.h"
>> > @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
>> >      ctx->epollfd = -1;
>> >      ctx->epollfd_tag = NULL;
>> >  
>> > -    /* Use the fastest fd monitoring implementation if available */
>> > -    if (fdmon_io_uring_setup(ctx)) {
>> > -        return;
>> > +#ifdef CONFIG_LINUX_IO_URING
>> > +    {
>> > +        static bool need_io_uring;
>> > +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
>> 
>> Really? I thought that was part of why we added ERRP_GUARD, so that
>> error_abort is pinned closer to the spot where the error is triggered
>> rather than where it was chained.  But your use of errp is a bit
>> different than usual here, so you may be correct that it doesn't
>> handle error_abort in the way that you want (allowing a graceful
>> downgrade to epoll if it is the first failure, but aborting if it is
>> the second AioContext that fails).
>
> The ERRP_GUARD() doc comment explains why it behaves this way:
>
>  * Note: &error_abort is not rewritten, because that would move the
>  * abort from the place where the error is created to the place where
>  * it's propagated.

Error propagation should be avoided when possible.  It's not possible
here; more on that below.

Why avoid?  Two reasons.  One, it degrades stack backtraces, as Eric
pointed out.  Two, passing @errp directly is more obvious and less
verbose, as we've seen above.

>> > +
>> > +        /* io_uring takes precedence because it provides aio_add_sqe() support */
>> > +        fdmon_io_uring_setup(ctx, &local_err);
>> > +        if (!local_err) {
>> > +            /*
>> > +             * If one AioContext gets io_uring, then all AioContexts need io_uring
>> > +             * so that aio_add_sqe() support is available across all threads.
>> > +             */
>> > +            need_io_uring = true;
>> > +            return;
>> > +        }
>> > +        if (need_io_uring) {
>> > +            error_propagate(errp, local_err);
>> > +            return;
>> > +        }
>> > +
>> > +        warn_report_err_once(local_err); /* frees local_err */
>> > +        local_err = NULL;

This is why we can't avoid error_propagate() here: when
fdmon_io_uring_setup() fails, we either consume the error and succeed,
or pass it to the caller and fail.

Because of the former, passing @errp to fdmon_io_uring_setup() would be
wrong; we need to pass a &local_err.  Which we then need to propagate to
do the latter.

Similar code exists elsewhere.  It's fairly rare, though.

ERRP_GUARD() is not designed for this pattern.  We have to propragate
manually.

I'd drop the /* ERRP_GUARD() doesn't handle error_abort */ comment.  To
make sense of it, I believe you need to understand a lot more.  And if
you do, you don't really need the comment.

>> >      }
>> > +#endif /* CONFIG_LINUX_IO_URING */
>> >  
>> >      fdmon_epoll_setup(ctx);
>> >  }
>> >  
>> >  void aio_context_destroy(AioContext *ctx)
>> >  {
>> > +#ifdef CONFIG_LINUX_IO_URING
>> >      fdmon_io_uring_destroy(ctx);
>> > +#endif
>> >  
>> >      qemu_lockcnt_lock(&ctx->list_lock);
>> >      fdmon_epoll_disable(ctx);

[...]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-06-03  6:05       ` Markus Armbruster
@ 2025-06-03 18:48         ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-06-03 18:48 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Eric Blake, qemu-devel, qemu-block, hibriansong, Kevin Wolf,
	Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 9375 bytes --]

On Tue, Jun 03, 2025 at 08:05:58AM +0200, Markus Armbruster wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Wed, May 28, 2025 at 05:12:14PM -0500, Eric Blake wrote:
> >> On Wed, May 28, 2025 at 03:09:13PM -0400, Stefan Hajnoczi wrote:
> >> > io_uring may not be available at runtime due to system policies (e.g.
> >> > the io_uring_disabled sysctl) or creation could fail due to file
> >> > descriptor resource limits.
> >> > 
> >> > Handle failure scenarios as follows:
> >> > 
> >> > If another AioContext already has io_uring, then fail AioContext
> >> > creation so that the aio_add_sqe() API is available uniformly from all
> >> > QEMU threads. Otherwise fall back to epoll(7) if io_uring is
> >> > unavailable.
> >> > 
> >> > Notes:
> >> > - Update the comment about selecting the fastest fdmon implementation.
> >> >   At this point it's not about speed anymore, it's about aio_add_sqe()
> >> >   API availability.
> >> > - Uppercase the error message when converting from error_report() to
> >> >   error_setg_errno() for consistency (but there are instances of
> >> >   lowercase in the codebase).
> >> > - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
> >> > 
> >> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> > ---
> >> >  util/aio-posix.h      | 12 ++----------
> >> >  util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
> >> >  util/fdmon-io_uring.c |  8 ++++----
> >> >  3 files changed, 32 insertions(+), 17 deletions(-)
> >> > 
> >> > diff --git a/util/aio-posix.h b/util/aio-posix.h
> >> > index f9994ed79e..6f9d97d866 100644
> >> > --- a/util/aio-posix.h
> >> > +++ b/util/aio-posix.h
> >> > @@ -18,6 +18,7 @@
> >> >  #define AIO_POSIX_H
> >> >  
> >> >  #include "block/aio.h"
> >> > +#include "qapi/error.h"
> >> >  
> >> >  struct AioHandler {
> >> >      GPollFD pfd;
> >> > @@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
> >> >  #endif /* !CONFIG_EPOLL_CREATE1 */
> >> >  
> >> >  #ifdef CONFIG_LINUX_IO_URING
> >> > -bool fdmon_io_uring_setup(AioContext *ctx);
> >> > +void fdmon_io_uring_setup(AioContext *ctx, Error **errp);
> >> 
> >> Why change the return type to void?  That forces you to have to check
> >> errp.  If you still return bool (true for errp unchanged, false when
> >> errp set), callers might have a simpler interface.
> >
> > QEMU has both forms. I prefer void foo(Error **errp) because it
> > eliminates these awkward states:
> > 1. Return true with errp set. There is a risk of leaking errp here.
> > 2. Return false with errp NULL. This results in no error message.
> >
> > Sometimes it is handy to have a return value but I think that void is a
> > good default return type.
> 
> I used to think this way, too.  Experience changed my mind.
> 
> The "awkward states" are bugs.
> 
> The price to avoid them is more verbose error handling.  Because such
> bugs have been rare in practice, the vebosity has turned out to be too
> much pain for too little gain.  qapi/error.h's big comment:
> 
>  * = Rules =
>  [...]
>  * - Whenever practical, also return a value that indicates success /
>  *   failure.  This can make the error checking more concise, and can
>  *   avoid useless error object creation and destruction.  Note that
>  *   we still have many functions returning void.  We recommend
>  *   • bool-valued functions return true on success / false on failure,
>  *   • pointer-valued functions return non-null / null pointer, and
>  *   • integer-valued functions return non-negative / negative.
> 
> For what it's worth, this is exactly how GError wants to be used.  We
> deviated from it because we thought we were smarter, and came to regret
> it.
> 
> Further down, the big comment shows example code:
> 
>  * Call a function, receive an error from it, and pass it to the caller
>  * - when the function returns a value that indicates failure, say
>  *   false:
>  *     if (!foo(arg, errp)) {
>  *         handle the error...
>  *     }
>  * - when it does not, say because it is a void function:
>  *     ERRP_GUARD();
>  *     foo(arg, errp);
>  *     if (*errp) {
>  *         handle the error...
>  *     }
>  * More on ERRP_GUARD() below.
>  *
>  * Code predating ERRP_GUARD() still exists, and looks like this:
>  *     Error *err = NULL;
>  *     foo(arg, &err);
>  *     if (err) {
>  *         handle the error...
>  *         error_propagate(errp, err); // deprecated
>  *     }
>  * Avoid in new code.  Do *not* "optimize" it to
>  *     foo(arg, errp);
>  *     if (*errp) { // WRONG!
>  *         handle the error...
>  *     }
>  * because errp may be NULL without the ERRP_GUARD() guard.
> 
> In case you think the difference in readability isn't all that big:
> error handling is *everywhere*, and any clutter adds up quickly,
> obscuring the logic.  Every line counts.

Thank you for the pointers! I missed that this was already documented.

I will switch to a bool return type.

> 
> > I have CCed Markus in case he has suggestions.
> 
> Thanks for that!
> 
> >> >  void fdmon_io_uring_destroy(AioContext *ctx);
> >> > -#else
> >> > -static inline bool fdmon_io_uring_setup(AioContext *ctx)
> >> > -{
> >> > -    return false;
> >> > -}
> >> > -
> >> > -static inline void fdmon_io_uring_destroy(AioContext *ctx)
> >> > -{
> >> > -}
> >> >  #endif /* !CONFIG_LINUX_IO_URING */
> >> >  
> >> >  #endif /* AIO_POSIX_H */
> >> > diff --git a/util/aio-posix.c b/util/aio-posix.c
> >> > index fa047fc7ad..44b3df61f9 100644
> >> > --- a/util/aio-posix.c
> >> > +++ b/util/aio-posix.c
> >> > @@ -16,6 +16,7 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "block/block.h"
> >> >  #include "block/thread-pool.h"
> >> > +#include "qapi/error.h"
> >> >  #include "qemu/main-loop.h"
> >> >  #include "qemu/lockcnt.h"
> >> >  #include "qemu/rcu.h"
> >> > @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
> >> >      ctx->epollfd = -1;
> >> >      ctx->epollfd_tag = NULL;
> >> >  
> >> > -    /* Use the fastest fd monitoring implementation if available */
> >> > -    if (fdmon_io_uring_setup(ctx)) {
> >> > -        return;
> >> > +#ifdef CONFIG_LINUX_IO_URING
> >> > +    {
> >> > +        static bool need_io_uring;
> >> > +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
> >> 
> >> Really? I thought that was part of why we added ERRP_GUARD, so that
> >> error_abort is pinned closer to the spot where the error is triggered
> >> rather than where it was chained.  But your use of errp is a bit
> >> different than usual here, so you may be correct that it doesn't
> >> handle error_abort in the way that you want (allowing a graceful
> >> downgrade to epoll if it is the first failure, but aborting if it is
> >> the second AioContext that fails).
> >
> > The ERRP_GUARD() doc comment explains why it behaves this way:
> >
> >  * Note: &error_abort is not rewritten, because that would move the
> >  * abort from the place where the error is created to the place where
> >  * it's propagated.
> 
> Error propagation should be avoided when possible.  It's not possible
> here; more on that below.
> 
> Why avoid?  Two reasons.  One, it degrades stack backtraces, as Eric
> pointed out.  Two, passing @errp directly is more obvious and less
> verbose, as we've seen above.
> 
> >> > +
> >> > +        /* io_uring takes precedence because it provides aio_add_sqe() support */
> >> > +        fdmon_io_uring_setup(ctx, &local_err);
> >> > +        if (!local_err) {
> >> > +            /*
> >> > +             * If one AioContext gets io_uring, then all AioContexts need io_uring
> >> > +             * so that aio_add_sqe() support is available across all threads.
> >> > +             */
> >> > +            need_io_uring = true;
> >> > +            return;
> >> > +        }
> >> > +        if (need_io_uring) {
> >> > +            error_propagate(errp, local_err);
> >> > +            return;
> >> > +        }
> >> > +
> >> > +        warn_report_err_once(local_err); /* frees local_err */
> >> > +        local_err = NULL;
> 
> This is why we can't avoid error_propagate() here: when
> fdmon_io_uring_setup() fails, we either consume the error and succeed,
> or pass it to the caller and fail.
> 
> Because of the former, passing @errp to fdmon_io_uring_setup() would be
> wrong; we need to pass a &local_err.  Which we then need to propagate to
> do the latter.
> 
> Similar code exists elsewhere.  It's fairly rare, though.
> 
> ERRP_GUARD() is not designed for this pattern.  We have to propragate
> manually.
> 
> I'd drop the /* ERRP_GUARD() doesn't handle error_abort */ comment.  To
> make sense of it, I believe you need to understand a lot more.  And if
> you do, you don't really need the comment.

Will fix.

> 
> >> >      }
> >> > +#endif /* CONFIG_LINUX_IO_URING */
> >> >  
> >> >      fdmon_epoll_setup(ctx);
> >> >  }
> >> >  
> >> >  void aio_context_destroy(AioContext *ctx)
> >> >  {
> >> > +#ifdef CONFIG_LINUX_IO_URING
> >> >      fdmon_io_uring_destroy(ctx);
> >> > +#endif
> >> >  
> >> >      qemu_lockcnt_lock(&ctx->list_lock);
> >> >      fdmon_epoll_disable(ctx);
> 
> [...]
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
  2025-05-28 22:12   ` Eric Blake
@ 2025-06-02 12:26   ` Brian
  2025-06-02 20:20     ` Stefan Hajnoczi
  1 sibling, 1 reply; 33+ messages in thread
From: Brian @ 2025-06-02 12:26 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel; +Cc: qemu-block, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 5284 bytes --]

On 5/28/25 3:09 PM, Stefan Hajnoczi wrote:
> io_uring may not be available at runtime due to system policies (e.g.
> the io_uring_disabled sysctl) or creation could fail due to file
> descriptor resource limits.
>
> Handle failure scenarios as follows:
>
> If another AioContext already has io_uring, then fail AioContext
> creation so that the aio_add_sqe() API is available uniformly from all
> QEMU threads. Otherwise fall back to epoll(7) if io_uring is
> unavailable.
>
> Notes:
> - Update the comment about selecting the fastest fdmon implementation.
>    At this point it's not about speed anymore, it's about aio_add_sqe()
>    API availability.
> - Uppercase the error message when converting from error_report() to
>    error_setg_errno() for consistency (but there are instances of
>    lowercase in the codebase).
> - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
>
> Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
> ---
>   util/aio-posix.h      | 12 ++----------
>   util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
>   util/fdmon-io_uring.c |  8 ++++----
>   3 files changed, 32 insertions(+), 17 deletions(-)
>
> diff --git a/util/aio-posix.h b/util/aio-posix.h
> index f9994ed79e..6f9d97d866 100644
> --- a/util/aio-posix.h
> +++ b/util/aio-posix.h
> @@ -18,6 +18,7 @@
>   #define AIO_POSIX_H
>   
>   #include "block/aio.h"
> +#include "qapi/error.h" struct AioHandler { GPollFD pfd; @@ -72,17 +73,8 @@ static inline 
> void fdmon_epoll_disable(AioContext *ctx) #endif /* 
> !CONFIG_EPOLL_CREATE1 */ #ifdef CONFIG_LINUX_IO_URING -bool 
> fdmon_io_uring_setup(AioContext *ctx); +void 
> fdmon_io_uring_setup(AioContext *ctx, Error **errp); void 
> fdmon_io_uring_destroy(AioContext *ctx); -#else -static inline bool 
> fdmon_io_uring_setup(AioContext *ctx) -{ - return false; -} - -static 
> inline void fdmon_io_uring_destroy(AioContext *ctx) -{ -} #endif /* 
> !CONFIG_LINUX_IO_URING */ #endif /* AIO_POSIX_H */ diff --git 
> a/util/aio-posix.c b/util/aio-posix.c index fa047fc7ad..44b3df61f9 
> 100644 --- a/util/aio-posix.c +++ b/util/aio-posix.c @@ -16,6 +16,7 @@ 
> #include "qemu/osdep.h"
>   #include "block/block.h"
>   #include "block/thread-pool.h"
> +#include "qapi/error.h"
>   #include "qemu/main-loop.h"
>   #include "qemu/lockcnt.h"
>   #include "qemu/rcu.h"
> @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
>       ctx->epollfd = -1;
>       ctx->epollfd_tag = NULL;
>   
> -    /* Use the fastest fd monitoring implementation if available */
> -    if (fdmon_io_uring_setup(ctx)) {
> -        return;
> +#ifdef CONFIG_LINUX_IO_URING
> +    {
> +        static bool need_io_uring;
> +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
> +
> +        /* io_uring takes precedence because it provides aio_add_sqe() support */
> +        fdmon_io_uring_setup(ctx, &local_err);
> +        if (!local_err) {
> +            /*
> +             * If one AioContext gets io_uring, then all AioContexts need io_uring
> +             * so that aio_add_sqe() support is available across all threads.
> +             */
> +            need_io_uring = true;
> +            return;
> +        }
> +        if (need_io_uring) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +
> +        warn_report_err_once(local_err); /* frees local_err */
> +        local_err = NULL;
>       }
> +#endif /* CONFIG_LINUX_IO_URING */
>   
Is there a problem with the logic of this code snippet?

If we fail at fdmon_io_uring_setup, specifically at io_uring_queue_init,
local_err (or errp) will be set to a non-NULL error value. In that case,
need_io_uring will be set to true, but the function will return 
immediately.
As a result, the later if (need_io_uring) block will never be executed

>       fdmon_epoll_setup(ctx);
>   }
>   
>   void aio_context_destroy(AioContext *ctx)
>   {
> +#ifdef CONFIG_LINUX_IO_URING
>       fdmon_io_uring_destroy(ctx);
> +#endif
>   
>       qemu_lockcnt_lock(&ctx->list_lock);
>       fdmon_epoll_disable(ctx);
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 2092d08d24..ef1a866a03 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -45,6 +45,7 @@
>   
>   #include "qemu/osdep.h"
>   #include <poll.h>
> +#include "qapi/error.h"
>   #include "qemu/rcu_queue.h"
>   #include "aio-posix.h"
>   
> @@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
>       .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
>   };
>   
> -bool fdmon_io_uring_setup(AioContext *ctx)
> +void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>   {
>       int ret;
>   
> @@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
>   
>       ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
>       if (ret != 0) {
> -        return false;
> +        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
> +        return;
>       }
>   
>       QSLIST_INIT(&ctx->submit_list);
>       ctx->fdmon_ops = &fdmon_io_uring_ops;
>       ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
>               ctx->fdmon_io_uring.ring_fd, G_IO_IN);
> -
> -    return true;
>   }
>   
>   void fdmon_io_uring_destroy(AioContext *ctx)

[-- Attachment #2: Type: text/html, Size: 6457 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-06-02 12:26   ` Brian
@ 2025-06-02 20:20     ` Stefan Hajnoczi
  2025-06-02 22:37       ` Brian
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-06-02 20:20 UTC (permalink / raw)
  To: Brian; +Cc: qemu-devel, qemu-block, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 6088 bytes --]

On Mon, Jun 02, 2025 at 08:26:39AM -0400, Brian wrote:
> On 5/28/25 3:09 PM, Stefan Hajnoczi wrote:
> > io_uring may not be available at runtime due to system policies (e.g.
> > the io_uring_disabled sysctl) or creation could fail due to file
> > descriptor resource limits.
> > 
> > Handle failure scenarios as follows:
> > 
> > If another AioContext already has io_uring, then fail AioContext
> > creation so that the aio_add_sqe() API is available uniformly from all
> > QEMU threads. Otherwise fall back to epoll(7) if io_uring is
> > unavailable.
> > 
> > Notes:
> > - Update the comment about selecting the fastest fdmon implementation.
> >    At this point it's not about speed anymore, it's about aio_add_sqe()
> >    API availability.
> > - Uppercase the error message when converting from error_report() to
> >    error_setg_errno() for consistency (but there are instances of
> >    lowercase in the codebase).
> > - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
> > 
> > Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
> > ---
> >   util/aio-posix.h      | 12 ++----------
> >   util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
> >   util/fdmon-io_uring.c |  8 ++++----
> >   3 files changed, 32 insertions(+), 17 deletions(-)
> > 
> > diff --git a/util/aio-posix.h b/util/aio-posix.h
> > index f9994ed79e..6f9d97d866 100644
> > --- a/util/aio-posix.h
> > +++ b/util/aio-posix.h
> > @@ -18,6 +18,7 @@
> >   #define AIO_POSIX_H
> >   #include "block/aio.h"
> > +#include "qapi/error.h" struct AioHandler { GPollFD pfd; @@ -72,17
> > +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx) #endif
> > /* !CONFIG_EPOLL_CREATE1 */ #ifdef CONFIG_LINUX_IO_URING -bool
> > fdmon_io_uring_setup(AioContext *ctx); +void
> > fdmon_io_uring_setup(AioContext *ctx, Error **errp); void
> > fdmon_io_uring_destroy(AioContext *ctx); -#else -static inline bool
> > fdmon_io_uring_setup(AioContext *ctx) -{ - return false; -} - -static
> > inline void fdmon_io_uring_destroy(AioContext *ctx) -{ -} #endif /*
> > !CONFIG_LINUX_IO_URING */ #endif /* AIO_POSIX_H */ diff --git
> > a/util/aio-posix.c b/util/aio-posix.c index fa047fc7ad..44b3df61f9
> > 100644 --- a/util/aio-posix.c +++ b/util/aio-posix.c @@ -16,6 +16,7 @@
> > #include "qemu/osdep.h"
> >   #include "block/block.h"
> >   #include "block/thread-pool.h"
> > +#include "qapi/error.h"
> >   #include "qemu/main-loop.h"
> >   #include "qemu/lockcnt.h"
> >   #include "qemu/rcu.h"
> > @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
> >       ctx->epollfd = -1;
> >       ctx->epollfd_tag = NULL;
> > -    /* Use the fastest fd monitoring implementation if available */
> > -    if (fdmon_io_uring_setup(ctx)) {
> > -        return;
> > +#ifdef CONFIG_LINUX_IO_URING
> > +    {
> > +        static bool need_io_uring;
> > +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
> > +
> > +        /* io_uring takes precedence because it provides aio_add_sqe() support */
> > +        fdmon_io_uring_setup(ctx, &local_err);
> > +        if (!local_err) {
> > +            /*
> > +             * If one AioContext gets io_uring, then all AioContexts need io_uring
> > +             * so that aio_add_sqe() support is available across all threads.
> > +             */
> > +            need_io_uring = true;
> > +            return;
> > +        }
> > +        if (need_io_uring) {
> > +            error_propagate(errp, local_err);
> > +            return;
> > +        }
> > +
> > +        warn_report_err_once(local_err); /* frees local_err */
> > +        local_err = NULL;
> >       }
> > +#endif /* CONFIG_LINUX_IO_URING */
> Is there a problem with the logic of this code snippet?
> 
> If we fail at fdmon_io_uring_setup, specifically at io_uring_queue_init,
> local_err (or errp) will be set to a non-NULL error value. In that case,
> need_io_uring will be set to true, but the function will return immediately.

If local_err is non-NULL then this conditional is not taken:

  if (!local_err) {
      /*
       * If one AioContext gets io_uring, then all AioContexts need io_uring
       * so that aio_add_sqe() support is available across all threads.
       */
      need_io_uring = true;
      return;
  }

If the logic you described is correct, please rephrase it. I don't see
how what you've written can happen.

> As a result, the later if (need_io_uring) block will never be executed
> 
> >       fdmon_epoll_setup(ctx);
> >   }
> >   void aio_context_destroy(AioContext *ctx)
> >   {
> > +#ifdef CONFIG_LINUX_IO_URING
> >       fdmon_io_uring_destroy(ctx);
> > +#endif
> >       qemu_lockcnt_lock(&ctx->list_lock);
> >       fdmon_epoll_disable(ctx);
> > diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> > index 2092d08d24..ef1a866a03 100644
> > --- a/util/fdmon-io_uring.c
> > +++ b/util/fdmon-io_uring.c
> > @@ -45,6 +45,7 @@
> >   #include "qemu/osdep.h"
> >   #include <poll.h>
> > +#include "qapi/error.h"
> >   #include "qemu/rcu_queue.h"
> >   #include "aio-posix.h"
> > @@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
> >       .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
> >   };
> > -bool fdmon_io_uring_setup(AioContext *ctx)
> > +void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
> >   {
> >       int ret;
> > @@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
> >       ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
> >       if (ret != 0) {
> > -        return false;
> > +        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
> > +        return;
> >       }
> >       QSLIST_INIT(&ctx->submit_list);
> >       ctx->fdmon_ops = &fdmon_io_uring_ops;
> >       ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
> >               ctx->fdmon_io_uring.ring_fd, G_IO_IN);
> > -
> > -    return true;
> >   }
> >   void fdmon_io_uring_destroy(AioContext *ctx)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-06-02 20:20     ` Stefan Hajnoczi
@ 2025-06-02 22:37       ` Brian
  0 siblings, 0 replies; 33+ messages in thread
From: Brian @ 2025-06-02 22:37 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 6179 bytes --]


On 6/2/25 4:20 PM, Stefan Hajnoczi wrote:
> On Mon, Jun 02, 2025 at 08:26:39AM -0400, Brian wrote:
>> On 5/28/25 3:09 PM, Stefan Hajnoczi wrote:
>>> io_uring may not be available at runtime due to system policies (e.g.
>>> the io_uring_disabled sysctl) or creation could fail due to file
>>> descriptor resource limits.
>>>
>>> Handle failure scenarios as follows:
>>>
>>> If another AioContext already has io_uring, then fail AioContext
>>> creation so that the aio_add_sqe() API is available uniformly from all
>>> QEMU threads. Otherwise fall back to epoll(7) if io_uring is
>>> unavailable.
>>>
>>> Notes:
>>> - Update the comment about selecting the fastest fdmon implementation.
>>>     At this point it's not about speed anymore, it's about aio_add_sqe()
>>>     API availability.
>>> - Uppercase the error message when converting from error_report() to
>>>     error_setg_errno() for consistency (but there are instances of
>>>     lowercase in the codebase).
>>> - It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
>>>
>>> Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
>>> ---
>>>    util/aio-posix.h      | 12 ++----------
>>>    util/aio-posix.c      | 29 ++++++++++++++++++++++++++---
>>>    util/fdmon-io_uring.c |  8 ++++----
>>>    3 files changed, 32 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/util/aio-posix.h b/util/aio-posix.h
>>> index f9994ed79e..6f9d97d866 100644
>>> --- a/util/aio-posix.h
>>> +++ b/util/aio-posix.h
>>> @@ -18,6 +18,7 @@
>>>    #define AIO_POSIX_H
>>>    #include "block/aio.h"
>>> +#include "qapi/error.h" struct AioHandler { GPollFD pfd; @@ -72,17
>>> +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx) #endif
>>> /* !CONFIG_EPOLL_CREATE1 */ #ifdef CONFIG_LINUX_IO_URING -bool
>>> fdmon_io_uring_setup(AioContext *ctx); +void
>>> fdmon_io_uring_setup(AioContext *ctx, Error **errp); void
>>> fdmon_io_uring_destroy(AioContext *ctx); -#else -static inline bool
>>> fdmon_io_uring_setup(AioContext *ctx) -{ - return false; -} - -static
>>> inline void fdmon_io_uring_destroy(AioContext *ctx) -{ -} #endif /*
>>> !CONFIG_LINUX_IO_URING */ #endif /* AIO_POSIX_H */ diff --git
>>> a/util/aio-posix.c b/util/aio-posix.c index fa047fc7ad..44b3df61f9
>>> 100644 --- a/util/aio-posix.c +++ b/util/aio-posix.c @@ -16,6 +16,7 @@
>>> #include "qemu/osdep.h"
>>>    #include "block/block.h"
>>>    #include "block/thread-pool.h"
>>> +#include "qapi/error.h"
>>>    #include "qemu/main-loop.h"
>>>    #include "qemu/lockcnt.h"
>>>    #include "qemu/rcu.h"
>>> @@ -717,17 +718,39 @@ void aio_context_setup(AioContext *ctx, Error **errp)
>>>        ctx->epollfd = -1;
>>>        ctx->epollfd_tag = NULL;
>>> -    /* Use the fastest fd monitoring implementation if available */
>>> -    if (fdmon_io_uring_setup(ctx)) {
>>> -        return;
>>> +#ifdef CONFIG_LINUX_IO_URING
>>> +    {
>>> +        static bool need_io_uring;
>>> +        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
>>> +
>>> +        /* io_uring takes precedence because it provides aio_add_sqe() support */
>>> +        fdmon_io_uring_setup(ctx, &local_err);
>>> +        if (!local_err) {
>>> +            /*
>>> +             * If one AioContext gets io_uring, then all AioContexts need io_uring
>>> +             * so that aio_add_sqe() support is available across all threads.
>>> +             */
>>> +            need_io_uring = true;
>>> +            return;
>>> +        }
>>> +        if (need_io_uring) {
>>> +            error_propagate(errp, local_err);
>>> +            return;
>>> +        }
>>> +
>>> +        warn_report_err_once(local_err); /* frees local_err */
>>> +        local_err = NULL;
>>>        }
>>> +#endif /* CONFIG_LINUX_IO_URING */
>> Is there a problem with the logic of this code snippet?
>>
>> If we fail at fdmon_io_uring_setup, specifically at io_uring_queue_init,
>> local_err (or errp) will be set to a non-NULL error value. In that case,
>> need_io_uring will be set to true, but the function will return immediately.
> If local_err is non-NULL then this conditional is not taken:
>
>    if (!local_err) {
>        /*
>         * If one AioContext gets io_uring, then all AioContexts need io_uring
>         * so that aio_add_sqe() support is available across all threads.
>         */
>        need_io_uring = true;
>        return;
>    }
>
> If the logic you described is correct, please rephrase it. I don't see
> how what you've written can happen.

Sorry, I didn’t notice that need_io_uring is a static var. Was confused 
about when if (need_io_block) gets executed

>> As a result, the later if (need_io_uring) block will never be executed
>>
>>>        fdmon_epoll_setup(ctx);
>>>    }
>>>    void aio_context_destroy(AioContext *ctx)
>>>    {
>>> +#ifdef CONFIG_LINUX_IO_URING
>>>        fdmon_io_uring_destroy(ctx);
>>> +#endif
>>>        qemu_lockcnt_lock(&ctx->list_lock);
>>>        fdmon_epoll_disable(ctx);
>>> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
>>> index 2092d08d24..ef1a866a03 100644
>>> --- a/util/fdmon-io_uring.c
>>> +++ b/util/fdmon-io_uring.c
>>> @@ -45,6 +45,7 @@
>>>    #include "qemu/osdep.h"
>>>    #include <poll.h>
>>> +#include "qapi/error.h"
>>>    #include "qemu/rcu_queue.h"
>>>    #include "aio-posix.h"
>>> @@ -361,7 +362,7 @@ static const FDMonOps fdmon_io_uring_ops = {
>>>        .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
>>>    };
>>> -bool fdmon_io_uring_setup(AioContext *ctx)
>>> +void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>>>    {
>>>        int ret;
>>> @@ -369,15 +370,14 @@ bool fdmon_io_uring_setup(AioContext *ctx)
>>>        ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
>>>        if (ret != 0) {
>>> -        return false;
>>> +        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
>>> +        return;
>>>        }
>>>        QSLIST_INIT(&ctx->submit_list);
>>>        ctx->fdmon_ops = &fdmon_io_uring_ops;
>>>        ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
>>>                ctx->fdmon_io_uring.ring_fd, G_IO_IN);
>>> -
>>> -    return true;
>>>    }
>>>    void fdmon_io_uring_destroy(AioContext *ctx)

[-- Attachment #2: Type: text/html, Size: 6862 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (7 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-28 22:15   ` Eric Blake
  2025-05-29 20:02   ` Eric Blake
  2025-05-28 19:09 ` [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh Stefan Hajnoczi
  2025-05-28 19:09 ` [RFC 11/11] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
  10 siblings, 2 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.

This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: a BH is required
to dispatch CQE callbacks and they are placed on a list so that a nested
event loop can invoke its parent's pending CQE callbacks. If you're
wondering why CqeHandler exists instead of just a callback function
pointer, this is why.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h   |  82 ++++++++++++++++++++++++
 util/aio-posix.h      |   1 +
 util/aio-posix.c      |   9 +++
 util/fdmon-io_uring.c | 145 +++++++++++++++++++++++++++++++-----------
 4 files changed, 200 insertions(+), 37 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index d919d7c8f4..95beef28c3 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
 /* Is polling disabled? */
 bool aio_poll_disabled(AioContext *ctx);
 
+#ifdef CONFIG_LINUX_IO_URING
+/*
+ * Each io_uring request must have a unique CqeHandler that processes the cqe.
+ * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
+ * ->cb() invocation.
+ */
+typedef struct CqeHandler CqeHandler;
+struct CqeHandler {
+    /* Called by the AioContext when the request has completed */
+    void (*cb)(CqeHandler *handler);
+
+    /* Used internally, do not access this */
+    QSIMPLEQ_ENTRY(CqeHandler) next;
+
+    /* This field is filled in before ->cb() is called */
+    struct io_uring_cqe cqe;
+};
+
+typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
+#endif /* CONFIG_LINUX_IO_URING */
+
 /* Callbacks for file descriptor monitoring implementations */
 typedef struct {
     /*
@@ -138,6 +159,27 @@ typedef struct {
      * Called with list_lock incremented.
      */
     void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
+
+#ifdef CONFIG_LINUX_IO_URING
+    /**
+     * aio_add_sqe: Add an io_uring sqe for submission.
+     * @prep_sqe: invoked with an sqe that should be prepared for submission
+     * @opaque: user-defined argument to @prep_sqe()
+     * @cqe_handler: the unique cqe handler associated with this request
+     *
+     * The caller's @prep_sqe() function is invoked to fill in the details of
+     * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
+     *
+     * The kernel may see the sqe as soon as @pre_sqe() returns or it may take
+     * until the next event loop iteration.
+     *
+     * This function is called from the current AioContext and is not
+     * thread-safe.
+     */
+    void (*add_sqe)(AioContext *ctx,
+                    void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                    void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
 } FDMonOps;
 
 /*
@@ -255,6 +297,10 @@ struct AioContext {
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
     gpointer io_uring_fd_tag;
+
+    /* Pending callback state for cqe handlers */
+    CqeHandlerSimpleQ cqe_handler_ready_list;
+    QEMUBH *cqe_handler_bh;
 #endif
 
     /* TimerLists for calling timers - one per clock type.  Has its own
@@ -761,4 +807,40 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch);
  */
 void aio_context_set_thread_pool_params(AioContext *ctx, int64_t min,
                                         int64_t max, Error **errp);
+
+#ifdef CONFIG_LINUX_IO_URING
+/**
+ * aio_has_io_uring: Return whether io_uring is available.
+ *
+ * io_uring is either available in all AioContexts or in none, so this only
+ * needs to be called once from within any thread's AioContext.
+ */
+static inline bool aio_has_io_uring(void)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    return ctx->fdmon_ops->add_sqe;
+}
+
+/**
+ * aio_add_sqe: Add an io_uring sqe for submission.
+ * @prep_sqe: invoked with an sqe that should be prepared for submission
+ * @opaque: user-defined argument to @prep_sqe()
+ * @cqe_handler: the unique cqe handler associated with this request
+ *
+ * The caller's @prep_sqe() function is invoked to fill in the details of the
+ * sqe. Do not call io_uring_sqe_set_data() on this sqe.
+ *
+ * The sqe is submitted by the current AioContext. The kernel may see the sqe
+ * as soon as @pre_sqe() returns or it may take until the next event loop
+ * iteration.
+ *
+ * When the AioContext is destroyed, pending sqes are ignored and their
+ * CqeHandlers are not invoked.
+ *
+ * This function must be called only when aio_has_io_uring() returns true.
+ */
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
+
 #endif
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 6f9d97d866..57ef801a5f 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -36,6 +36,7 @@ struct AioHandler {
 #ifdef CONFIG_LINUX_IO_URING
     QSLIST_ENTRY(AioHandler) node_submitted;
     unsigned flags; /* see fdmon-io_uring.c */
+    CqeHandler cqe_handler;
 #endif
     int64_t poll_idle_timeout; /* when to stop userspace polling */
     bool poll_ready; /* has polling detected an event? */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 44b3df61f9..89bb215a2f 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -790,3 +790,12 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch)
 
     aio_notify(ctx);
 }
+
+#ifdef CONFIG_LINUX_IO_URING
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
+}
+#endif /* CONFIG_LINUX_IO_URING */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index ef1a866a03..3a49d6a20a 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -75,8 +75,8 @@ static inline int pfd_events_from_poll(int poll_events)
 }
 
 /*
- * Returns an sqe for submitting a request.  Only be called within
- * fdmon_io_uring_wait().
+ * Returns an sqe for submitting a request. Only called from the AioContext
+ * thread.
  */
 static struct io_uring_sqe *get_sqe(AioContext *ctx)
 {
@@ -166,23 +166,43 @@ static void fdmon_io_uring_update(AioContext *ctx,
     }
 }
 
+static void fdmon_io_uring_add_sqe(AioContext *ctx,
+        void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+        void *opaque, CqeHandler *cqe_handler)
+{
+    struct io_uring_sqe *sqe = get_sqe(ctx);
+
+    prep_sqe(sqe, opaque);
+    io_uring_sqe_set_data(sqe, cqe_handler);
+}
+
+static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
+{
+    /*
+     * This is an empty function that is never called. It is used as a function
+     * pointer to distinguish it from ordinary cqe handlers.
+     */
+}
+
 static void add_poll_multishot_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
     int events = poll_events_from_pfd(node->pfd.events);
 
     io_uring_prep_poll_multishot(sqe, node->pfd.fd, events);
-    io_uring_sqe_set_data(sqe, node);
+    node->cqe_handler.cb = fdmon_special_cqe_handler;
+    io_uring_sqe_set_data(sqe, &node->cqe_handler);
 }
 
 static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
+    CqeHandler *cqe_handler = &node->cqe_handler;
 
 #ifdef LIBURING_HAVE_DATA64
-    io_uring_prep_poll_remove(sqe, (uintptr_t)node);
+    io_uring_prep_poll_remove(sqe, (uintptr_t)cqe_handler);
 #else
-    io_uring_prep_poll_remove(sqe, node);
+    io_uring_prep_poll_remove(sqe, cqe_handler);
 #endif
     io_uring_sqe_set_data(sqe, NULL);
 }
@@ -221,20 +241,12 @@ static void fill_sq_ring(AioContext *ctx)
     }
 }
 
-/* Returns true if a handler became ready */
-static bool process_cqe(AioContext *ctx,
-                        AioHandlerList *ready_list,
-                        struct io_uring_cqe *cqe)
+static bool process_cqe_aio_handler(AioContext *ctx,
+                                    AioHandlerList *ready_list,
+                                    AioHandler *node,
+                                    struct io_uring_cqe *cqe)
 {
-    AioHandler *node = io_uring_cqe_get_data(cqe);
-    unsigned flags;
-
-    /* poll_timeout and poll_remove have a zero user_data field */
-    if (!node) {
-        return false;
-    }
-
-    flags = qatomic_read(&node->flags);
+    unsigned flags = qatomic_read(&node->flags);
 
     /*
      * poll_multishot cancelled by poll_remove? Or completed early because fd
@@ -261,6 +273,56 @@ static bool process_cqe(AioContext *ctx,
     return true;
 }
 
+/* Process CqeHandlers from the ready list */
+static void cqe_handler_bh(void *opaque)
+{
+    AioContext *ctx = opaque;
+    CqeHandlerSimpleQ *ready_list = &ctx->cqe_handler_ready_list;
+
+    /*
+     * If cqe_handler->cb() calls aio_poll() it must continue processing
+     * ready_list. Schedule a BH so the inner event loop calls us again.
+     */
+    qemu_bh_schedule(ctx->cqe_handler_bh);
+
+    while (!QSIMPLEQ_EMPTY(ready_list)) {
+        CqeHandler *cqe_handler = QSIMPLEQ_FIRST(ready_list);
+
+        QSIMPLEQ_REMOVE_HEAD(ready_list, next);
+
+        cqe_handler->cb(cqe_handler);
+    }
+
+    qemu_bh_cancel(ctx->cqe_handler_bh);
+}
+
+/* Returns true if a handler became ready */
+static bool process_cqe(AioContext *ctx,
+                        AioHandlerList *ready_list,
+                        struct io_uring_cqe *cqe)
+{
+    CqeHandler *cqe_handler = io_uring_cqe_get_data(cqe);
+
+    /* poll_timeout and poll_remove have a zero user_data field */
+    if (!cqe_handler) {
+        return false;
+    }
+
+    /*
+     * Special handling for AioHandler cqes. They need ready_list and have a
+     * return value.
+     */
+    if (cqe_handler->cb == fdmon_special_cqe_handler) {
+        AioHandler *node = container_of(cqe_handler, AioHandler, cqe_handler);
+        return process_cqe_aio_handler(ctx, ready_list, node, cqe);
+    }
+
+    cqe_handler->cqe = *cqe;
+    QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
+    qemu_bh_schedule(ctx->cqe_handler_bh);
+    return false;
+}
+
 static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
 {
     struct io_uring *ring = &ctx->fdmon_io_uring;
@@ -360,6 +422,7 @@ static const FDMonOps fdmon_io_uring_ops = {
     .gsource_prepare = fdmon_io_uring_gsource_prepare,
     .gsource_check = fdmon_io_uring_gsource_check,
     .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
+    .add_sqe = fdmon_io_uring_add_sqe,
 };
 
 void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
@@ -375,6 +438,8 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
     }
 
     QSLIST_INIT(&ctx->submit_list);
+    QSIMPLEQ_INIT(&ctx->cqe_handler_ready_list);
+    ctx->cqe_handler_bh = aio_bh_new(ctx, cqe_handler_bh, ctx);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
     ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
             ctx->fdmon_io_uring.ring_fd, G_IO_IN);
@@ -382,30 +447,36 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 
 void fdmon_io_uring_destroy(AioContext *ctx)
 {
-    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
-        AioHandler *node;
+    AioHandler *node;
 
-        io_uring_queue_exit(&ctx->fdmon_io_uring);
+    if (ctx->fdmon_ops != &fdmon_io_uring_ops) {
+        return;
+    }
 
-        /* Move handlers due to be removed onto the deleted list */
-        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
-            unsigned flags = qatomic_fetch_and(&node->flags,
-                    ~(FDMON_IO_URING_PENDING |
-                      FDMON_IO_URING_ADD |
-                      FDMON_IO_URING_REMOVE));
+    io_uring_queue_exit(&ctx->fdmon_io_uring);
 
-            if (flags & FDMON_IO_URING_REMOVE) {
-                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
-            }
+    /* Move handlers due to be removed onto the deleted list */
+    while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
+        unsigned flags = qatomic_fetch_and(&node->flags,
+                ~(FDMON_IO_URING_PENDING |
+                  FDMON_IO_URING_ADD |
+                  FDMON_IO_URING_REMOVE));
 
-            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
+        if (flags & FDMON_IO_URING_REMOVE) {
+            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
+                                  node, node_deleted);
         }
 
-        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
-        ctx->io_uring_fd_tag = NULL;
-
-        qemu_lockcnt_lock(&ctx->list_lock);
-        fdmon_poll_downgrade(ctx);
-        qemu_lockcnt_unlock(&ctx->list_lock);
+        QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
     }
+
+    g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+    ctx->io_uring_fd_tag = NULL;
+
+    assert(QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list));
+    qemu_bh_delete(ctx->cqe_handler_bh);
+
+    qemu_lockcnt_lock(&ctx->list_lock);
+    fdmon_poll_downgrade(ctx);
+    qemu_lockcnt_unlock(&ctx->list_lock);
 }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
@ 2025-05-28 22:15   ` Eric Blake
  2025-05-29 20:02   ` Eric Blake
  1 sibling, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-28 22:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:14PM -0400, Stefan Hajnoczi wrote:
> Introduce the aio_add_sqe() API for submitting io_uring requests in the
> current AioContext. This allows other components in QEMU, like the block
> layer, to take advantage of io_uring features without creating their own
> io_uring context.

Interesting that 8/11 refers to the function that we are just adding
now; but since the mention is only in the comments, I'm okay with
introducing the comment early and avoiding churn.

> 
> This API supports nested event loops just like file descriptor
> monitoring and BHs do. This comes at a complexity cost: a BH is required
> to dispatch CQE callbacks and they are placed on a list so that a nested
> event loop can invoke its parent's pending CQE callbacks. If you're
> wondering why CqeHandler exists instead of just a callback function
> pointer, this is why.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/block/aio.h   |  82 ++++++++++++++++++++++++
>  util/aio-posix.h      |   1 +
>  util/aio-posix.c      |   9 +++
>  util/fdmon-io_uring.c | 145 +++++++++++++++++++++++++++++++-----------
>  4 files changed, 200 insertions(+), 37 deletions(-)

I'll have to resume my review tomorrow.  (I'm having fun learning
about io_uring while reviewing this)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
  2025-05-28 22:15   ` Eric Blake
@ 2025-05-29 20:02   ` Eric Blake
  2025-06-05 17:52     ` Stefan Hajnoczi
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Blake @ 2025-05-29 20:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:14PM -0400, Stefan Hajnoczi wrote:
> Introduce the aio_add_sqe() API for submitting io_uring requests in the
> current AioContext. This allows other components in QEMU, like the block
> layer, to take advantage of io_uring features without creating their own
> io_uring context.
> 
> This API supports nested event loops just like file descriptor
> monitoring and BHs do. This comes at a complexity cost: a BH is required
> to dispatch CQE callbacks and they are placed on a list so that a nested
> event loop can invoke its parent's pending CQE callbacks. If you're
> wondering why CqeHandler exists instead of just a callback function
> pointer, this is why.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Large patch.  I found a couple of nits, but the overall design looks
sound.

Reviewed-by: Eric Blake <eblake@redhat.com>

>  include/block/aio.h   |  82 ++++++++++++++++++++++++
>  util/aio-posix.h      |   1 +
>  util/aio-posix.c      |   9 +++
>  util/fdmon-io_uring.c | 145 +++++++++++++++++++++++++++++++-----------
>  4 files changed, 200 insertions(+), 37 deletions(-)
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index d919d7c8f4..95beef28c3 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
>  /* Is polling disabled? */
>  bool aio_poll_disabled(AioContext *ctx);
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +/*
> + * Each io_uring request must have a unique CqeHandler that processes the cqe.
> + * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
> + * ->cb() invocation.
> + */
> +typedef struct CqeHandler CqeHandler;
> +struct CqeHandler {
> +    /* Called by the AioContext when the request has completed */
> +    void (*cb)(CqeHandler *handler);

I see an opaque callback pointer in prep_cqe below, but not one here.
Is that because callers can write their own struct that includes a
CqeHandler as its first member, if more state is needed?

> +
> +    /* Used internally, do not access this */
> +    QSIMPLEQ_ENTRY(CqeHandler) next;
> +
> +    /* This field is filled in before ->cb() is called */
> +    struct io_uring_cqe cqe;
> +};
> +
> +typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
> +#endif /* CONFIG_LINUX_IO_URING */
> +
>  /* Callbacks for file descriptor monitoring implementations */
>  typedef struct {
>      /*
> @@ -138,6 +159,27 @@ typedef struct {
>       * Called with list_lock incremented.
>       */
>      void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +    /**
> +     * aio_add_sqe: Add an io_uring sqe for submission.
> +     * @prep_sqe: invoked with an sqe that should be prepared for submission
> +     * @opaque: user-defined argument to @prep_sqe()
> +     * @cqe_handler: the unique cqe handler associated with this request
> +     *
> +     * The caller's @prep_sqe() function is invoked to fill in the details of
> +     * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
> +     *
> +     * The kernel may see the sqe as soon as @pre_sqe() returns or it may take

prep_sqe

> +     * until the next event loop iteration.
> +     *
> +     * This function is called from the current AioContext and is not
> +     * thread-safe.
> +     */
> +    void (*add_sqe)(AioContext *ctx,
> +                    void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
> +                    void *opaque, CqeHandler *cqe_handler);
> +#endif /* CONFIG_LINUX_IO_URING */
>  } FDMonOps;
>  
>  /*
> @@ -255,6 +297,10 @@ struct AioContext {
>      struct io_uring fdmon_io_uring;
>      AioHandlerSList submit_list;
>      gpointer io_uring_fd_tag;
> +
> +    /* Pending callback state for cqe handlers */
> +    CqeHandlerSimpleQ cqe_handler_ready_list;
> +    QEMUBH *cqe_handler_bh;
>  #endif

While here, is it worth adding a comment to state which matching #if
it ends (similar to what you did above in FDMonOps add_sqe)?

>  
>      /* TimerLists for calling timers - one per clock type.  Has its own
> @@ -761,4 +807,40 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch);
>   */
>  void aio_context_set_thread_pool_params(AioContext *ctx, int64_t min,
>                                          int64_t max, Error **errp);
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +/**
> + * aio_has_io_uring: Return whether io_uring is available.
> + *
> + * io_uring is either available in all AioContexts or in none, so this only
> + * needs to be called once from within any thread's AioContext.
> + */
> +static inline bool aio_has_io_uring(void)
> +{
> +    AioContext *ctx = qemu_get_current_aio_context();
> +    return ctx->fdmon_ops->add_sqe;
> +}
> +
> +/**
> + * aio_add_sqe: Add an io_uring sqe for submission.
> + * @prep_sqe: invoked with an sqe that should be prepared for submission
> + * @opaque: user-defined argument to @prep_sqe()
> + * @cqe_handler: the unique cqe handler associated with this request
> + *
> + * The caller's @prep_sqe() function is invoked to fill in the details of the
> + * sqe. Do not call io_uring_sqe_set_data() on this sqe.
> + *
> + * The sqe is submitted by the current AioContext. The kernel may see the sqe
> + * as soon as @pre_sqe() returns or it may take until the next event loop

prep_sqe

> + * iteration.
> + *
> + * When the AioContext is destroyed, pending sqes are ignored and their
> + * CqeHandlers are not invoked.
> + *
> + * This function must be called only when aio_has_io_uring() returns true.
> + */
> +void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
> +                 void *opaque, CqeHandler *cqe_handler);
> +#endif /* CONFIG_LINUX_IO_URING */
> +
>  #endif
> diff --git a/util/aio-posix.h b/util/aio-posix.h
> index 6f9d97d866..57ef801a5f 100644
> --- a/util/aio-posix.h
> +++ b/util/aio-posix.h
> @@ -36,6 +36,7 @@ struct AioHandler {
>  #ifdef CONFIG_LINUX_IO_URING
>      QSLIST_ENTRY(AioHandler) node_submitted;
>      unsigned flags; /* see fdmon-io_uring.c */
> +    CqeHandler cqe_handler;
>  #endif
>      int64_t poll_idle_timeout; /* when to stop userspace polling */
>      bool poll_ready; /* has polling detected an event? */
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 44b3df61f9..89bb215a2f 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -790,3 +790,12 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch)
>  
>      aio_notify(ctx);
>  }
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
> +                 void *opaque, CqeHandler *cqe_handler)
> +{
> +    AioContext *ctx = qemu_get_current_aio_context();
> +    ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
> +}
> +#endif /* CONFIG_LINUX_IO_URING */
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index ef1a866a03..3a49d6a20a 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -75,8 +75,8 @@ static inline int pfd_events_from_poll(int poll_events)
>  }
>  
>  /*
> - * Returns an sqe for submitting a request.  Only be called within
> - * fdmon_io_uring_wait().
> + * Returns an sqe for submitting a request. Only called from the AioContext
> + * thread.
>   */
>  static struct io_uring_sqe *get_sqe(AioContext *ctx)
>  {
> @@ -166,23 +166,43 @@ static void fdmon_io_uring_update(AioContext *ctx,
>      }
>  }
>  
> +static void fdmon_io_uring_add_sqe(AioContext *ctx,
> +        void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
> +        void *opaque, CqeHandler *cqe_handler)
> +{
> +    struct io_uring_sqe *sqe = get_sqe(ctx);
> +
> +    prep_sqe(sqe, opaque);
> +    io_uring_sqe_set_data(sqe, cqe_handler);
> +}
> +
> +static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
> +{
> +    /*
> +     * This is an empty function that is never called. It is used as a function
> +     * pointer to distinguish it from ordinary cqe handlers.
> +     */
> +}
> +
>  static void add_poll_multishot_sqe(AioContext *ctx, AioHandler *node)
>  {
>      struct io_uring_sqe *sqe = get_sqe(ctx);
>      int events = poll_events_from_pfd(node->pfd.events);
>  
>      io_uring_prep_poll_multishot(sqe, node->pfd.fd, events);
> -    io_uring_sqe_set_data(sqe, node);
> +    node->cqe_handler.cb = fdmon_special_cqe_handler;
> +    io_uring_sqe_set_data(sqe, &node->cqe_handler);
>  }
>  
>  static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
>  {
>      struct io_uring_sqe *sqe = get_sqe(ctx);
> +    CqeHandler *cqe_handler = &node->cqe_handler;
>  
>  #ifdef LIBURING_HAVE_DATA64
> -    io_uring_prep_poll_remove(sqe, (uintptr_t)node);
> +    io_uring_prep_poll_remove(sqe, (uintptr_t)cqe_handler);
>  #else
> -    io_uring_prep_poll_remove(sqe, node);
> +    io_uring_prep_poll_remove(sqe, cqe_handler);
>  #endif
>      io_uring_sqe_set_data(sqe, NULL);
>  }
> @@ -221,20 +241,12 @@ static void fill_sq_ring(AioContext *ctx)
>      }
>  }
>  
> -/* Returns true if a handler became ready */
> -static bool process_cqe(AioContext *ctx,
> -                        AioHandlerList *ready_list,
> -                        struct io_uring_cqe *cqe)
> +static bool process_cqe_aio_handler(AioContext *ctx,
> +                                    AioHandlerList *ready_list,
> +                                    AioHandler *node,
> +                                    struct io_uring_cqe *cqe)
>  {
> -    AioHandler *node = io_uring_cqe_get_data(cqe);
> -    unsigned flags;
> -
> -    /* poll_timeout and poll_remove have a zero user_data field */
> -    if (!node) {
> -        return false;
> -    }
> -
> -    flags = qatomic_read(&node->flags);
> +    unsigned flags = qatomic_read(&node->flags);
>  
>      /*
>       * poll_multishot cancelled by poll_remove? Or completed early because fd
> @@ -261,6 +273,56 @@ static bool process_cqe(AioContext *ctx,
>      return true;
>  }
>  
> +/* Process CqeHandlers from the ready list */
> +static void cqe_handler_bh(void *opaque)
> +{
> +    AioContext *ctx = opaque;
> +    CqeHandlerSimpleQ *ready_list = &ctx->cqe_handler_ready_list;
> +
> +    /*
> +     * If cqe_handler->cb() calls aio_poll() it must continue processing
> +     * ready_list. Schedule a BH so the inner event loop calls us again.
> +     */
> +    qemu_bh_schedule(ctx->cqe_handler_bh);
> +
> +    while (!QSIMPLEQ_EMPTY(ready_list)) {
> +        CqeHandler *cqe_handler = QSIMPLEQ_FIRST(ready_list);
> +
> +        QSIMPLEQ_REMOVE_HEAD(ready_list, next);
> +
> +        cqe_handler->cb(cqe_handler);
> +    }
> +
> +    qemu_bh_cancel(ctx->cqe_handler_bh);
> +}
> +
> +/* Returns true if a handler became ready */
> +static bool process_cqe(AioContext *ctx,
> +                        AioHandlerList *ready_list,
> +                        struct io_uring_cqe *cqe)
> +{
> +    CqeHandler *cqe_handler = io_uring_cqe_get_data(cqe);
> +
> +    /* poll_timeout and poll_remove have a zero user_data field */
> +    if (!cqe_handler) {
> +        return false;
> +    }
> +
> +    /*
> +     * Special handling for AioHandler cqes. They need ready_list and have a
> +     * return value.
> +     */
> +    if (cqe_handler->cb == fdmon_special_cqe_handler) {
> +        AioHandler *node = container_of(cqe_handler, AioHandler, cqe_handler);
> +        return process_cqe_aio_handler(ctx, ready_list, node, cqe);
> +    }
> +
> +    cqe_handler->cqe = *cqe;
> +    QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
> +    qemu_bh_schedule(ctx->cqe_handler_bh);
> +    return false;
> +}
> +
>  static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
>  {
>      struct io_uring *ring = &ctx->fdmon_io_uring;
> @@ -360,6 +422,7 @@ static const FDMonOps fdmon_io_uring_ops = {
>      .gsource_prepare = fdmon_io_uring_gsource_prepare,
>      .gsource_check = fdmon_io_uring_gsource_check,
>      .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
> +    .add_sqe = fdmon_io_uring_add_sqe,
>  };
>  
>  void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
> @@ -375,6 +438,8 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>      }
>  
>      QSLIST_INIT(&ctx->submit_list);
> +    QSIMPLEQ_INIT(&ctx->cqe_handler_ready_list);
> +    ctx->cqe_handler_bh = aio_bh_new(ctx, cqe_handler_bh, ctx);
>      ctx->fdmon_ops = &fdmon_io_uring_ops;
>      ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
>              ctx->fdmon_io_uring.ring_fd, G_IO_IN);
> @@ -382,30 +447,36 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>  
>  void fdmon_io_uring_destroy(AioContext *ctx)
>  {
> -    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
> -        AioHandler *node;
> +    AioHandler *node;
>  
> -        io_uring_queue_exit(&ctx->fdmon_io_uring);
> +    if (ctx->fdmon_ops != &fdmon_io_uring_ops) {
> +        return;
> +    }
>  
> -        /* Move handlers due to be removed onto the deleted list */
> -        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
> -            unsigned flags = qatomic_fetch_and(&node->flags,
> -                    ~(FDMON_IO_URING_PENDING |
> -                      FDMON_IO_URING_ADD |
> -                      FDMON_IO_URING_REMOVE));
> +    io_uring_queue_exit(&ctx->fdmon_io_uring);
>  
> -            if (flags & FDMON_IO_URING_REMOVE) {
> -                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
> -            }
> +    /* Move handlers due to be removed onto the deleted list */
> +    while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
> +        unsigned flags = qatomic_fetch_and(&node->flags,
> +                ~(FDMON_IO_URING_PENDING |
> +                  FDMON_IO_URING_ADD |
> +                  FDMON_IO_URING_REMOVE));
>  
> -            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
> +        if (flags & FDMON_IO_URING_REMOVE) {
> +            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
> +                                  node, node_deleted);
>          }
>  
> -        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
> -        ctx->io_uring_fd_tag = NULL;
> -
> -        qemu_lockcnt_lock(&ctx->list_lock);
> -        fdmon_poll_downgrade(ctx);
> -        qemu_lockcnt_unlock(&ctx->list_lock);
> +        QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
>      }
> +
> +    g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
> +    ctx->io_uring_fd_tag = NULL;
> +
> +    assert(QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list));
> +    qemu_bh_delete(ctx->cqe_handler_bh);
> +
> +    qemu_lockcnt_lock(&ctx->list_lock);
> +    fdmon_poll_downgrade(ctx);
> +    qemu_lockcnt_unlock(&ctx->list_lock);
>  }
> -- 
> 2.49.0
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  2025-05-29 20:02   ` Eric Blake
@ 2025-06-05 17:52     ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-06-05 17:52 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 5883 bytes --]

On Thu, May 29, 2025 at 03:02:16PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:14PM -0400, Stefan Hajnoczi wrote:
> > Introduce the aio_add_sqe() API for submitting io_uring requests in the
> > current AioContext. This allows other components in QEMU, like the block
> > layer, to take advantage of io_uring features without creating their own
> > io_uring context.
> > 
> > This API supports nested event loops just like file descriptor
> > monitoring and BHs do. This comes at a complexity cost: a BH is required
> > to dispatch CQE callbacks and they are placed on a list so that a nested
> > event loop can invoke its parent's pending CQE callbacks. If you're
> > wondering why CqeHandler exists instead of just a callback function
> > pointer, this is why.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> 
> Large patch.  I found a couple of nits, but the overall design looks
> sound.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> >  include/block/aio.h   |  82 ++++++++++++++++++++++++
> >  util/aio-posix.h      |   1 +
> >  util/aio-posix.c      |   9 +++
> >  util/fdmon-io_uring.c | 145 +++++++++++++++++++++++++++++++-----------
> >  4 files changed, 200 insertions(+), 37 deletions(-)
> > 
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index d919d7c8f4..95beef28c3 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
> >  /* Is polling disabled? */
> >  bool aio_poll_disabled(AioContext *ctx);
> >  
> > +#ifdef CONFIG_LINUX_IO_URING
> > +/*
> > + * Each io_uring request must have a unique CqeHandler that processes the cqe.
> > + * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
> > + * ->cb() invocation.
> > + */
> > +typedef struct CqeHandler CqeHandler;
> > +struct CqeHandler {
> > +    /* Called by the AioContext when the request has completed */
> > +    void (*cb)(CqeHandler *handler);
> 
> I see an opaque callback pointer in prep_cqe below, but not one here.
> Is that because callers can write their own struct that includes a
> CqeHandler as its first member, if more state is needed?

Yes.

> 
> > +
> > +    /* Used internally, do not access this */
> > +    QSIMPLEQ_ENTRY(CqeHandler) next;
> > +
> > +    /* This field is filled in before ->cb() is called */
> > +    struct io_uring_cqe cqe;
> > +};
> > +
> > +typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
> > +#endif /* CONFIG_LINUX_IO_URING */
> > +
> >  /* Callbacks for file descriptor monitoring implementations */
> >  typedef struct {
> >      /*
> > @@ -138,6 +159,27 @@ typedef struct {
> >       * Called with list_lock incremented.
> >       */
> >      void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
> > +
> > +#ifdef CONFIG_LINUX_IO_URING
> > +    /**
> > +     * aio_add_sqe: Add an io_uring sqe for submission.
> > +     * @prep_sqe: invoked with an sqe that should be prepared for submission
> > +     * @opaque: user-defined argument to @prep_sqe()
> > +     * @cqe_handler: the unique cqe handler associated with this request
> > +     *
> > +     * The caller's @prep_sqe() function is invoked to fill in the details of
> > +     * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
> > +     *
> > +     * The kernel may see the sqe as soon as @pre_sqe() returns or it may take
> 
> prep_sqe

Oops, will fix.

> 
> > +     * until the next event loop iteration.
> > +     *
> > +     * This function is called from the current AioContext and is not
> > +     * thread-safe.
> > +     */
> > +    void (*add_sqe)(AioContext *ctx,
> > +                    void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
> > +                    void *opaque, CqeHandler *cqe_handler);
> > +#endif /* CONFIG_LINUX_IO_URING */
> >  } FDMonOps;
> >  
> >  /*
> > @@ -255,6 +297,10 @@ struct AioContext {
> >      struct io_uring fdmon_io_uring;
> >      AioHandlerSList submit_list;
> >      gpointer io_uring_fd_tag;
> > +
> > +    /* Pending callback state for cqe handlers */
> > +    CqeHandlerSimpleQ cqe_handler_ready_list;
> > +    QEMUBH *cqe_handler_bh;
> >  #endif
> 
> While here, is it worth adding a comment to state which matching #if
> it ends (similar to what you did above in FDMonOps add_sqe)?

Sounds good.

> 
> >  
> >      /* TimerLists for calling timers - one per clock type.  Has its own
> > @@ -761,4 +807,40 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch);
> >   */
> >  void aio_context_set_thread_pool_params(AioContext *ctx, int64_t min,
> >                                          int64_t max, Error **errp);
> > +
> > +#ifdef CONFIG_LINUX_IO_URING
> > +/**
> > + * aio_has_io_uring: Return whether io_uring is available.
> > + *
> > + * io_uring is either available in all AioContexts or in none, so this only
> > + * needs to be called once from within any thread's AioContext.
> > + */
> > +static inline bool aio_has_io_uring(void)
> > +{
> > +    AioContext *ctx = qemu_get_current_aio_context();
> > +    return ctx->fdmon_ops->add_sqe;
> > +}
> > +
> > +/**
> > + * aio_add_sqe: Add an io_uring sqe for submission.
> > + * @prep_sqe: invoked with an sqe that should be prepared for submission
> > + * @opaque: user-defined argument to @prep_sqe()
> > + * @cqe_handler: the unique cqe handler associated with this request
> > + *
> > + * The caller's @prep_sqe() function is invoked to fill in the details of the
> > + * sqe. Do not call io_uring_sqe_set_data() on this sqe.
> > + *
> > + * The sqe is submitted by the current AioContext. The kernel may see the sqe
> > + * as soon as @pre_sqe() returns or it may take until the next event loop
> 
> prep_sqe

Will fix.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (8 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-29 20:09   ` Eric Blake
  2025-05-28 19:09 ` [RFC 11/11] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

fdmon_ops->wait() is called with notify_me enabled. This makes it an
expensive place to call qemu_bh_schedule() because aio_notify() invokes
write(2) on the EventNotifier.

Moving qemu_bh_schedule() after notify_me is reset improves IOPS from
270k to 300k IOPS with --blockdev file,aio=io_uring.

I considered alternatives:
1. Introducing a variant of qemu_bh_schedule() that skips aio_notify().
   This only makes sense within the AioContext and fdmon implementation
   itself and is therefore a specialized internal API. I don't like
   that.
2. Changing fdmon_ops->wait() so implementors can reset notify_me
   themselves. This makes things complex and the other fdmon
   implementations don't need it, so it doesn't seem like a good
   solution.

So in the end I moved the qemu_bh_schedule() call from fdmon-io_uring.c
to aio-posix.c. It's ugly but straightforward.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c      | 11 +++++++++++
 util/fdmon-io_uring.c | 11 ++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 89bb215a2f..01428b141c 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -693,6 +693,17 @@ bool aio_poll(AioContext *ctx, bool blocking)
                              qatomic_read(&ctx->notify_me) - 2);
     }
 
+#ifdef CONFIG_LINUX_IO_URING
+    /*
+     * This is part of fdmon-io_uring.c but it's more efficient to do it here
+     * after notify_me has been reset. That way qemu_bh_schedule() ->
+     * aio_notify() does not write the EventNotifier.
+     */
+    if (!QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list)) {
+        qemu_bh_schedule(ctx->cqe_handler_bh);
+    }
+#endif
+
     aio_notify_accept(ctx);
 
     /* Calculate blocked time for adaptive polling */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 3a49d6a20a..03a07a4caf 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -318,8 +318,12 @@ static bool process_cqe(AioContext *ctx,
     }
 
     cqe_handler->cqe = *cqe;
+
+    /*
+     * aio_poll() and fdmon_io_uring_gsource_dispatch() schedule cqe_handler_bh
+     * when the list is non-empty.
+     */
     QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
-    qemu_bh_schedule(ctx->cqe_handler_bh);
     return false;
 }
 
@@ -370,6 +374,11 @@ static void fdmon_io_uring_gsource_dispatch(AioContext *ctx,
                                             AioHandlerList *ready_list)
 {
     process_cq_ring(ctx, ready_list);
+
+    /* Ensure CqeHandlers enqueued by process_cq_ring() will run */
+    if (!QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list)) {
+        qemu_bh_schedule(ctx->cqe_handler_bh);
+    }
 }
 
 static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh
  2025-05-28 19:09 ` [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh Stefan Hajnoczi
@ 2025-05-29 20:09   ` Eric Blake
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Blake @ 2025-05-29 20:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:15PM -0400, Stefan Hajnoczi wrote:
> fdmon_ops->wait() is called with notify_me enabled. This makes it an
> expensive place to call qemu_bh_schedule() because aio_notify() invokes
> write(2) on the EventNotifier.
> 
> Moving qemu_bh_schedule() after notify_me is reset improves IOPS from
> 270k to 300k IOPS with --blockdev file,aio=io_uring.
> 
> I considered alternatives:
> 1. Introducing a variant of qemu_bh_schedule() that skips aio_notify().
>    This only makes sense within the AioContext and fdmon implementation
>    itself and is therefore a specialized internal API. I don't like
>    that.
> 2. Changing fdmon_ops->wait() so implementors can reset notify_me
>    themselves. This makes things complex and the other fdmon
>    implementations don't need it, so it doesn't seem like a good
>    solution.
> 
> So in the end I moved the qemu_bh_schedule() call from fdmon-io_uring.c
> to aio-posix.c. It's ugly but straightforward.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  util/aio-posix.c      | 11 +++++++++++
>  util/fdmon-io_uring.c | 11 ++++++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

> 
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 89bb215a2f..01428b141c 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -693,6 +693,17 @@ bool aio_poll(AioContext *ctx, bool blocking)
>                               qatomic_read(&ctx->notify_me) - 2);
>      }
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    /*
> +     * This is part of fdmon-io_uring.c but it's more efficient to do it here
> +     * after notify_me has been reset. That way qemu_bh_schedule() ->
> +     * aio_notify() does not write the EventNotifier.
> +     */
> +    if (!QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list)) {
> +        qemu_bh_schedule(ctx->cqe_handler_bh);
> +    }
> +#endif
> +
>      aio_notify_accept(ctx);
>  
>      /* Calculate blocked time for adaptive polling */
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 3a49d6a20a..03a07a4caf 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -318,8 +318,12 @@ static bool process_cqe(AioContext *ctx,
>      }
>  
>      cqe_handler->cqe = *cqe;
> +
> +    /*
> +     * aio_poll() and fdmon_io_uring_gsource_dispatch() schedule cqe_handler_bh
> +     * when the list is non-empty.
> +     */
>      QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
> -    qemu_bh_schedule(ctx->cqe_handler_bh);
>      return false;
>  }
>  
> @@ -370,6 +374,11 @@ static void fdmon_io_uring_gsource_dispatch(AioContext *ctx,
>                                              AioHandlerList *ready_list)
>  {
>      process_cq_ring(ctx, ready_list);
> +
> +    /* Ensure CqeHandlers enqueued by process_cq_ring() will run */
> +    if (!QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list)) {
> +        qemu_bh_schedule(ctx->cqe_handler_bh);
> +    }
>  }
>  
>  static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
> -- 
> 2.49.0
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 11/11] block/io_uring: use aio_add_sqe()
  2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
                   ` (9 preceding siblings ...)
  2025-05-28 19:09 ` [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh Stefan Hajnoczi
@ 2025-05-28 19:09 ` Stefan Hajnoczi
  2025-05-29 21:11   ` Eric Blake
  10 siblings, 1 reply; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 19:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, hibriansong, Kevin Wolf,
	Hanna Czenczek

AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.

Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.

Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.

There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
   ring fd from the file descriptor monitoring io_uring instance. Disk
   I/O completions are now picked up directly. Also, sqes are
   accumulated in the sq ring until the end of the event loop iteration
   and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/aio.h     |   7 -
 include/block/raw-aio.h |   5 -
 block/file-posix.c      |  38 ++--
 block/io_uring.c        | 489 ++++++++++------------------------------
 stubs/io_uring.c        |  32 ---
 util/async.c            |  35 ---
 util/fdmon-io_uring.c   |   6 +
 block/trace-events      |  12 +-
 stubs/meson.build       |   3 -
 util/trace-events       |   4 +
 10 files changed, 139 insertions(+), 492 deletions(-)
 delete mode 100644 stubs/io_uring.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 95beef28c3..fbb45cca74 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,8 +291,6 @@ struct AioContext {
     struct LinuxAioState *linux_aio;
 #endif
 #ifdef CONFIG_LINUX_IO_URING
-    LuringState *linux_io_uring;
-
     /* State for file descriptor monitoring using Linux io_uring */
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
@@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
 /* Return the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
 
-/* Setup the LuringState bound to this AioContext */
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
-
-/* Return the LuringState bound to this AioContext */
-LuringState *aio_get_linux_io_uring(AioContext *ctx);
 /**
  * aio_timer_new_with_attrs:
  * @ctx: the aio context
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 6570244496..30e5fc9a9f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
-LuringState *luring_init(Error **errp);
-void luring_cleanup(LuringState *s);
-
 /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
 int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
                                   QEMUIOVector *qiov, int type,
                                   BdrvRequestFlags flags);
-void luring_detach_aio_context(LuringState *s, AioContext *old_context);
-void luring_attach_aio_context(LuringState *s, AioContext *new_context);
 bool luring_has_fua(void);
 #else
 static inline bool luring_has_fua(void)
diff --git a/block/file-posix.c b/block/file-posix.c
index 9b5f08ccb2..d1f1fc3a77 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif /* !defined(CONFIG_LINUX_AIO) */
 
-#ifndef CONFIG_LINUX_IO_URING
     if (s->use_linux_io_uring) {
+#ifdef CONFIG_LINUX_IO_URING
+        if (!aio_has_io_uring()) {
+            error_setg(errp, "aio=io_uring was specified, but is not "
+                             "available (disabled via io_uring_disabled "
+                             "sysctl or blocked by container runtime "
+                             "seccomp policy?)");
+            ret = -EINVAL;
+            goto fail;
+        }
+#else
         error_setg(errp, "aio=io_uring was specified, but is not supported "
                          "in this build.");
         ret = -EINVAL;
         goto fail;
-    }
 #endif /* !defined(CONFIG_LINUX_IO_URING) */
+    }
 
     s->has_discard = true;
     s->has_write_zeroes = true;
@@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
     return true;
 }
 
-#ifdef CONFIG_LINUX_IO_URING
-static inline bool raw_check_linux_io_uring(BDRVRawState *s)
-{
-    Error *local_err = NULL;
-    AioContext *ctx;
-
-    if (!s->use_linux_io_uring) {
-        return false;
-    }
-
-    ctx = qemu_get_current_aio_context();
-    if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
-        error_reportf_err(local_err, "Unable to use linux io_uring, "
-                                     "falling back to thread pool: ");
-        s->use_linux_io_uring = false;
-        return false;
-    }
-    return true;
-}
-#endif
-
 #ifdef CONFIG_LINUX_AIO
 static inline bool raw_check_linux_aio(BDRVRawState *s)
 {
@@ -2595,7 +2583,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
     if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
         type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_IO_URING
-    } else if (raw_check_linux_io_uring(s)) {
+    } else if (s->use_linux_io_uring) {
         assert(qiov->size == bytes);
         ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
         goto out;
@@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
     };
 
 #ifdef CONFIG_LINUX_IO_URING
-    if (raw_check_linux_io_uring(s)) {
+    if (s->use_linux_io_uring) {
         return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
     }
 #endif
diff --git a/block/io_uring.c b/block/io_uring.c
index dd4f304910..dd930ee57e 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -11,28 +11,20 @@
 #include "qemu/osdep.h"
 #include <liburing.h>
 #include "block/aio.h"
-#include "qemu/queue.h"
 #include "block/block.h"
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
-#include "qemu/defer-call.h"
-#include "qapi/error.h"
 #include "system/block-backend.h"
 #include "trace.h"
 
-/* Only used for assertions.  */
-#include "qemu/coroutine_int.h"
-
-/* io_uring ring size */
-#define MAX_ENTRIES 128
-
-typedef struct LuringAIOCB {
+typedef struct {
     Coroutine *co;
-    struct io_uring_sqe sqeq;
-    ssize_t ret;
     QEMUIOVector *qiov;
-    bool is_read;
-    QSIMPLEQ_ENTRY(LuringAIOCB) next;
+    uint64_t offset;
+    ssize_t ret;
+    int type;
+    int fd;
+    BdrvRequestFlags flags;
 
     /*
      * Buffered reads may require resubmission, see
@@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
      */
     int total_read;
     QEMUIOVector resubmit_qiov;
-} LuringAIOCB;
 
-typedef struct LuringQueue {
-    unsigned int in_queue;
-    unsigned int in_flight;
-    bool blocked;
-    QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
-} LuringQueue;
+    CqeHandler cqe_handler;
+} LuringRequest;
 
-struct LuringState {
-    AioContext *aio_context;
-
-    struct io_uring ring;
-
-    /* No locking required, only accessed from AioContext home thread */
-    LuringQueue io_q;
-
-    QEMUBH *completion_bh;
-};
-
-/**
- * luring_resubmit:
- *
- * Resubmit a request by appending it to submit_queue.  The caller must ensure
- * that ioq_submit() is called later so that submit_queue requests are started.
- */
-static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
+static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
 {
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
+    LuringRequest *req = opaque;
+    QEMUIOVector *qiov = req->qiov;
+    uint64_t offset = req->offset;
+    int fd = req->fd;
+    BdrvRequestFlags flags = req->flags;
+
+    switch (req->type) {
+    case QEMU_AIO_WRITE:
+#ifdef HAVE_IO_URING_PREP_WRITEV2
+    {
+        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+        io_uring_prep_writev2(sqe, fd, qiov->iov,
+                              qiov->niov, offset, luring_flags);
+    }
+#else
+        assert(flags == 0);
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+#endif
+        break;
+    case QEMU_AIO_ZONE_APPEND:
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+        break;
+    case QEMU_AIO_READ:
+    {
+        if (req->resubmit_qiov.iov != NULL) {
+            qiov = &req->resubmit_qiov;
+        }
+        io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
+                            offset + req->total_read);
+        break;
+    }
+    case QEMU_AIO_FLUSH:
+        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
+        break;
+    default:
+        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
+                        __func__, req->type);
+        abort();
+    }
 }
 
 /**
@@ -78,385 +85,115 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
  * Short reads are rare but may occur. The remaining read request needs to be
  * resubmitted.
  */
-static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
-                                       int nread)
+static void luring_resubmit_short_read(LuringRequest *req, int nread)
 {
     QEMUIOVector *resubmit_qiov;
     size_t remaining;
 
-    trace_luring_resubmit_short_read(s, luringcb, nread);
+    trace_luring_resubmit_short_read(req, nread);
 
     /* Update read position */
-    luringcb->total_read += nread;
-    remaining = luringcb->qiov->size - luringcb->total_read;
+    req->total_read += nread;
+    remaining = req->qiov->size - req->total_read;
 
     /* Shorten qiov */
-    resubmit_qiov = &luringcb->resubmit_qiov;
+    resubmit_qiov = &req->resubmit_qiov;
     if (resubmit_qiov->iov == NULL) {
-        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
+        qemu_iovec_init(resubmit_qiov, req->qiov->niov);
     } else {
         qemu_iovec_reset(resubmit_qiov);
     }
-    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
-                      remaining);
+    qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
 
-    /* Update sqe */
-    luringcb->sqeq.off += nread;
-    luringcb->sqeq.addr = (uintptr_t)luringcb->resubmit_qiov.iov;
-    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-
-    luring_resubmit(s, luringcb);
+    aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
 }
 
-/**
- * luring_process_completions:
- * @s: AIO state
- *
- * Fetches completed I/O requests, consumes cqes and invokes their callbacks
- * The function is somewhat tricky because it supports nested event loops, for
- * example when a request callback invokes aio_poll().
- *
- * Function schedules BH completion so it  can be called again in a nested
- * event loop.  When there are no events left  to complete the BH is being
- * canceled.
- *
- */
-static void luring_process_completions(LuringState *s)
+static void luring_cqe_handler(CqeHandler *cqe_handler)
 {
-    struct io_uring_cqe *cqes;
-    int total_bytes;
+    LuringRequest *req = container_of(cqe_handler, LuringRequest, cqe_handler);
+    int ret = cqe_handler->cqe.res;
 
-    defer_call_begin();
+    trace_luring_cqe_handler(req, ret);
 
-    /*
-     * Request completion callbacks can run the nested event loop.
-     * Schedule ourselves so the nested event loop will "see" remaining
-     * completed requests and process them.  Without this, completion
-     * callbacks that wait for other requests using a nested event loop
-     * would hang forever.
-     *
-     * This workaround is needed because io_uring uses poll_wait, which
-     * is woken up when new events are added to the uring, thus polling on
-     * the same uring fd will block unless more events are received.
-     *
-     * Other leaf block drivers (drivers that access the data themselves)
-     * are networking based, so they poll sockets for data and run the
-     * correct coroutine.
-     */
-    qemu_bh_schedule(s->completion_bh);
-
-    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
-        LuringAIOCB *luringcb;
-        int ret;
-
-        if (!cqes) {
-            break;
+    if (ret < 0) {
+        /*
+         * Only writev/readv/fsync requests on regular files or host block
+         * devices are submitted. Therefore -EAGAIN is not expected but it's
+         * known to happen sometimes with Linux SCSI. Submit again and hope
+         * the request completes successfully.
+         *
+         * For more information, see:
+         * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
+         *
+         * If the code is changed to submit other types of requests in the
+         * future, then this workaround may need to be extended to deal with
+         * genuine -EAGAIN results that should not be resubmitted
+         * immediately.
+         */
+        if (ret == -EINTR || ret == -EAGAIN) {
+            aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
+            return;
         }
-
-        luringcb = io_uring_cqe_get_data(cqes);
-        ret = cqes->res;
-        io_uring_cqe_seen(&s->ring, cqes);
-        cqes = NULL;
-
-        /* Change counters one-by-one because we can be nested. */
-        s->io_q.in_flight--;
-        trace_luring_process_completion(s, luringcb, ret);
-
+    } else if (req->qiov) {
         /* total_read is non-zero only for resubmitted read requests */
-        total_bytes = ret + luringcb->total_read;
+        int total_bytes = ret + req->total_read;
 
-        if (ret < 0) {
-            /*
-             * Only writev/readv/fsync requests on regular files or host block
-             * devices are submitted. Therefore -EAGAIN is not expected but it's
-             * known to happen sometimes with Linux SCSI. Submit again and hope
-             * the request completes successfully.
-             *
-             * For more information, see:
-             * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
-             *
-             * If the code is changed to submit other types of requests in the
-             * future, then this workaround may need to be extended to deal with
-             * genuine -EAGAIN results that should not be resubmitted
-             * immediately.
-             */
-            if (ret == -EINTR || ret == -EAGAIN) {
-                luring_resubmit(s, luringcb);
-                continue;
-            }
-        } else if (!luringcb->qiov) {
-            goto end;
-        } else if (total_bytes == luringcb->qiov->size) {
+        if (total_bytes == req->qiov->size) {
             ret = 0;
-        /* Only read/write */
         } else {
             /* Short Read/Write */
-            if (luringcb->is_read) {
+            if (req->type == QEMU_AIO_READ) {
                 if (ret > 0) {
-                    luring_resubmit_short_read(s, luringcb, ret);
-                    continue;
-                } else {
-                    /* Pad with zeroes */
-                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
-                                      luringcb->qiov->size - total_bytes);
-                    ret = 0;
+                    luring_resubmit_short_read(req, ret);
+                    return;
                 }
+
+                /* Pad with zeroes */
+                qemu_iovec_memset(req->qiov, total_bytes, 0,
+                                  req->qiov->size - total_bytes);
+                ret = 0;
             } else {
                 ret = -ENOSPC;
             }
         }
-end:
-        luringcb->ret = ret;
-        qemu_iovec_destroy(&luringcb->resubmit_qiov);
-
-        /*
-         * If the coroutine is already entered it must be in ioq_submit()
-         * and will notice luringcb->ret has been filled in when it
-         * eventually runs later. Coroutines cannot be entered recursively
-         * so avoid doing that!
-         */
-        assert(luringcb->co->ctx == s->aio_context);
-        if (!qemu_coroutine_entered(luringcb->co)) {
-            aio_co_wake(luringcb->co);
-        }
     }
 
-    qemu_bh_cancel(s->completion_bh);
+    req->ret = ret;
+    qemu_iovec_destroy(&req->resubmit_qiov);
 
-    defer_call_end();
-}
-
-static int ioq_submit(LuringState *s)
-{
-    int ret = 0;
-    LuringAIOCB *luringcb, *luringcb_next;
-
-    while (s->io_q.in_queue > 0) {
-        /*
-         * Try to fetch sqes from the ring for requests waiting in
-         * the overflow queue
-         */
-        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.submit_queue, next,
-                              luringcb_next) {
-            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
-            if (!sqes) {
-                break;
-            }
-            /* Prep sqe for submission */
-            *sqes = luringcb->sqeq;
-            QSIMPLEQ_REMOVE_HEAD(&s->io_q.submit_queue, next);
-        }
-        ret = io_uring_submit(&s->ring);
-        trace_luring_io_uring_submit(s, ret);
-        /* Prevent infinite loop if submission is refused */
-        if (ret <= 0) {
-            if (ret == -EAGAIN || ret == -EINTR) {
-                continue;
-            }
-            break;
-        }
-        s->io_q.in_flight += ret;
-        s->io_q.in_queue  -= ret;
-    }
-    s->io_q.blocked = (s->io_q.in_queue > 0);
-
-    if (s->io_q.in_flight) {
-        /*
-         * We can try to complete something just right away if there are
-         * still requests in-flight.
-         */
-        luring_process_completions(s);
-    }
-    return ret;
-}
-
-static void luring_process_completions_and_submit(LuringState *s)
-{
-    luring_process_completions(s);
-
-    if (s->io_q.in_queue > 0) {
-        ioq_submit(s);
+    /*
+     * If the coroutine is already entered it must be in luring_co_submit() and
+     * will notice req->ret has been filled in when it eventually runs later.
+     * Coroutines cannot be entered recursively so avoid doing that!
+     */
+    if (!qemu_coroutine_entered(req->co)) {
+        aio_co_wake(req->co);
     }
 }
 
-static void qemu_luring_completion_bh(void *opaque)
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd,
+                                  uint64_t offset, QEMUIOVector *qiov,
+                                  int type, BdrvRequestFlags flags)
 {
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static void qemu_luring_completion_cb(void *opaque)
-{
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static bool qemu_luring_poll_cb(void *opaque)
-{
-    LuringState *s = opaque;
-
-    return io_uring_cq_ready(&s->ring);
-}
-
-static void qemu_luring_poll_ready(void *opaque)
-{
-    LuringState *s = opaque;
-
-    luring_process_completions_and_submit(s);
-}
-
-static void ioq_init(LuringQueue *io_q)
-{
-    QSIMPLEQ_INIT(&io_q->submit_queue);
-    io_q->in_queue = 0;
-    io_q->in_flight = 0;
-    io_q->blocked = false;
-}
-
-static void luring_deferred_fn(void *opaque)
-{
-    LuringState *s = opaque;
-    trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked && s->io_q.in_queue > 0) {
-        ioq_submit(s);
-    }
-}
-
-/**
- * luring_do_submit:
- * @fd: file descriptor for I/O
- * @luringcb: AIO control block
- * @s: AIO state
- * @offset: offset for request
- * @type: type of request
- *
- * Fetches sqes from ring, adds to pending queue and preps them
- *
- */
-static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
-                            uint64_t offset, int type, BdrvRequestFlags flags)
-{
-    int ret;
-    struct io_uring_sqe *sqes = &luringcb->sqeq;
-
-    switch (type) {
-    case QEMU_AIO_WRITE:
-#ifdef HAVE_IO_URING_PREP_WRITEV2
-    {
-        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
-        io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
-                              luringcb->qiov->niov, offset, luring_flags);
-    }
-#else
-        assert(flags == 0);
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-#endif
-        break;
-    case QEMU_AIO_ZONE_APPEND:
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_READ:
-        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
-                            luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_FLUSH:
-        io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
-        break;
-    default:
-        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
-                        __func__, type);
-        abort();
-    }
-    io_uring_sqe_set_data(sqes, luringcb);
-
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
-    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked) {
-        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
-            ret = ioq_submit(s);
-            trace_luring_do_submit_done(s, ret);
-            return ret;
-        }
-
-        defer_call(luring_deferred_fn, s);
-    }
-    return 0;
-}
-
-int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
-                                  QEMUIOVector *qiov, int type,
-                                  BdrvRequestFlags flags)
-{
-    int ret;
-    AioContext *ctx = qemu_get_current_aio_context();
-    LuringState *s = aio_get_linux_io_uring(ctx);
-    LuringAIOCB luringcb = {
+    LuringRequest req = {
         .co         = qemu_coroutine_self(),
-        .ret        = -EINPROGRESS,
         .qiov       = qiov,
-        .is_read    = (type == QEMU_AIO_READ),
+        .ret        = -EINPROGRESS,
+        .type       = type,
+        .fd         = fd,
+        .offset     = offset,
+        .flags      = flags,
     };
-    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
-                           type);
-    ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
 
-    if (ret < 0) {
-        return ret;
-    }
+    req.cqe_handler.cb = luring_cqe_handler;
 
-    if (luringcb.ret == -EINPROGRESS) {
+    trace_luring_co_submit(bs, &req, fd, offset, qiov ? qiov->size : 0, type);
+    aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);
+
+    if (req.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
-    return luringcb.ret;
-}
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    aio_set_fd_handler(old_context, s->ring.ring_fd,
-                       NULL, NULL, NULL, NULL, s);
-    qemu_bh_delete(s->completion_bh);
-    s->aio_context = NULL;
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    s->aio_context = new_context;
-    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
-    aio_set_fd_handler(s->aio_context, s->ring.ring_fd,
-                       qemu_luring_completion_cb, NULL,
-                       qemu_luring_poll_cb, qemu_luring_poll_ready, s);
-}
-
-LuringState *luring_init(Error **errp)
-{
-    int rc;
-    LuringState *s = g_new0(LuringState, 1);
-    struct io_uring *ring = &s->ring;
-
-    trace_luring_init_state(s, sizeof(*s));
-
-    rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
-    if (rc < 0) {
-        error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
-        g_free(s);
-        return NULL;
-    }
-
-    ioq_init(&s->io_q);
-    return s;
-
-}
-
-void luring_cleanup(LuringState *s)
-{
-    io_uring_queue_exit(&s->ring);
-    trace_luring_cleanup_state(s);
-    g_free(s);
+    return req.ret;
 }
 
 bool luring_has_fua(void)
diff --git a/stubs/io_uring.c b/stubs/io_uring.c
deleted file mode 100644
index 622d1e4648..0000000000
--- a/stubs/io_uring.c
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Linux io_uring support.
- *
- * Copyright (C) 2009 IBM, Corp.
- * Copyright (C) 2009 Red Hat, Inc.
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-#include "qemu/osdep.h"
-#include "block/aio.h"
-#include "block/raw-aio.h"
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    abort();
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    abort();
-}
-
-LuringState *luring_init(Error **errp)
-{
-    abort();
-}
-
-void luring_cleanup(LuringState *s)
-{
-    abort();
-}
diff --git a/util/async.c b/util/async.c
index bba9622e97..d66575acd2 100644
--- a/util/async.c
+++ b/util/async.c
@@ -383,14 +383,6 @@ aio_ctx_finalize(GSource     *source)
     }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    if (ctx->linux_io_uring) {
-        luring_detach_aio_context(ctx->linux_io_uring, ctx);
-        luring_cleanup(ctx->linux_io_uring);
-        ctx->linux_io_uring = NULL;
-    }
-#endif
-
     assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
     qemu_bh_delete(ctx->co_schedule_bh);
 
@@ -465,29 +457,6 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
-{
-    if (ctx->linux_io_uring) {
-        return ctx->linux_io_uring;
-    }
-
-    ctx->linux_io_uring = luring_init(errp);
-    if (!ctx->linux_io_uring) {
-        return NULL;
-    }
-
-    luring_attach_aio_context(ctx->linux_io_uring, ctx);
-    return ctx->linux_io_uring;
-}
-
-LuringState *aio_get_linux_io_uring(AioContext *ctx)
-{
-    assert(ctx->linux_io_uring);
-    return ctx->linux_io_uring;
-}
-#endif
-
 void aio_notify(AioContext *ctx)
 {
     /*
@@ -611,10 +580,6 @@ AioContext *aio_context_new(Error **errp)
     ctx->linux_aio = NULL;
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    ctx->linux_io_uring = NULL;
-#endif
-
     ctx->thread_pool = NULL;
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 03a07a4caf..2c64f80e5f 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -48,6 +48,7 @@
 #include "qapi/error.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
+#include "trace.h"
 
 enum {
     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
@@ -174,6 +175,9 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
 
     prep_sqe(sqe, opaque);
     io_uring_sqe_set_data(sqe, cqe_handler);
+
+    trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
+                                 cqe_handler);
 }
 
 static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
@@ -290,6 +294,8 @@ static void cqe_handler_bh(void *opaque)
 
         QSIMPLEQ_REMOVE_HEAD(ready_list, next);
 
+        trace_fdmon_io_uring_cqe_handler(ctx, cqe_handler,
+                                         cqe_handler->cqe.res);
         cqe_handler->cb(cqe_handler);
     }
 
diff --git a/block/trace-events b/block/trace-events
index 8e789e1f12..c9b4736ff8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -62,15 +62,9 @@ qmp_block_stream(void *bs) "bs %p"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 
 # io_uring.c
-luring_init_state(void *s, size_t size) "s %p size %zu"
-luring_cleanup_state(void *s) "%p freed"
-luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
-luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
-luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
-luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
+luring_cqe_handler(void *req, int ret) "req %p ret %d"
+luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
+luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
 
 # qcow2.c
 qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
diff --git a/stubs/meson.build b/stubs/meson.build
index 63392f5e78..d157b06273 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -32,9 +32,6 @@ if have_block or have_ga
   stub_ss.add(files('cpus-virtual-clock.c'))
   stub_ss.add(files('icount.c'))
   stub_ss.add(files('graph-lock.c'))
-  if linux_io_uring.found()
-    stub_ss.add(files('io_uring.c'))
-  endif
   if libaio.found()
     stub_ss.add(files('linux-aio.c'))
   endif
diff --git a/util/trace-events b/util/trace-events
index bd8f25fb59..540d662507 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -24,6 +24,10 @@ buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes
 buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
 buffer_free(const char *buf, size_t len) "%s: capacity %zd"
 
+# fdmon-io_uring.c
+fdmon_io_uring_add_sqe(void *ctx, void *opaque, int opcode, int fd, uint64_t off, void *cqe_handler) "ctx %p opaque %p opcode %d fd %d off %"PRId64" cqe_handler %p"
+fdmon_io_uring_cqe_handler(void *ctx, void *cqe_handler, int cqe_res) "ctx %p cqe_handler %p cqe_res %d"
+
 # filemonitor-inotify.c
 qemu_file_monitor_add_watch(void *mon, const char *dirpath, const char *filename, void *cb, void *opaque, int64_t id) "File monitor %p add watch dir='%s' file='%s' cb=%p opaque=%p id=%" PRId64
 qemu_file_monitor_remove_watch(void *mon, const char *dirpath, int64_t id) "File monitor %p remove watch dir='%s' id=%" PRId64
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 11/11] block/io_uring: use aio_add_sqe()
  2025-05-28 19:09 ` [RFC 11/11] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
@ 2025-05-29 21:11   ` Eric Blake
  2025-06-05 18:40     ` Stefan Hajnoczi
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Blake @ 2025-05-29 21:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

On Wed, May 28, 2025 at 03:09:16PM -0400, Stefan Hajnoczi wrote:
> AioContext has its own io_uring instance for file descriptor monitoring.
> The disk I/O io_uring code was developed separately. Originally I
> thought the characteristics of file descriptor monitoring and disk I/O
> were too different, requiring separate io_uring instances.
> 
> Now it has become clear to me that it's feasible to share a single
> io_uring instance for file descriptor monitoring and disk I/O. We're not
> using io_uring's IOPOLL feature or anything else that would require a
> separate instance.
> 
> Unify block/io_uring.c and util/fdmon-io_uring.c using the new
> aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
> block/io_uring.c just needs to submit readv/writev/fsync and most of the
> io_uring-specific logic is handled by fdmon-io_uring.c.
> 
> There are two immediate advantages:
> 1. Fewer system calls. There is no need to monitor the disk I/O io_uring
>    ring fd from the file descriptor monitoring io_uring instance. Disk
>    I/O completions are now picked up directly. Also, sqes are
>    accumulated in the sq ring until the end of the event loop iteration
>    and there are fewer io_uring_enter(2) syscalls.
> 2. Less code duplication.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Comments below, but looks sane to me.

Reviewed-by: Eric Blake <eblake@redhat.com>

>  include/block/aio.h     |   7 -
>  include/block/raw-aio.h |   5 -
>  block/file-posix.c      |  38 ++--
>  block/io_uring.c        | 489 ++++++++++------------------------------
>  stubs/io_uring.c        |  32 ---
>  util/async.c            |  35 ---
>  util/fdmon-io_uring.c   |   6 +
>  block/trace-events      |  12 +-
>  stubs/meson.build       |   3 -
>  util/trace-events       |   4 +
>  10 files changed, 139 insertions(+), 492 deletions(-)
>  delete mode 100644 stubs/io_uring.c
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 95beef28c3..fbb45cca74 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -291,8 +291,6 @@ struct AioContext {
>      struct LinuxAioState *linux_aio;
>  #endif
>  #ifdef CONFIG_LINUX_IO_URING
> -    LuringState *linux_io_uring;
> -
>      /* State for file descriptor monitoring using Linux io_uring */
>      struct io_uring fdmon_io_uring;
>      AioHandlerSList submit_list;
> @@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
>  /* Return the LinuxAioState bound to this AioContext */
>  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
>  
> -/* Setup the LuringState bound to this AioContext */
> -LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> -
> -/* Return the LuringState bound to this AioContext */
> -LuringState *aio_get_linux_io_uring(AioContext *ctx);
>  /**
>   * aio_timer_new_with_attrs:
>   * @ctx: the aio context
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 6570244496..30e5fc9a9f 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
>  #endif
>  /* io_uring.c - Linux io_uring implementation */
>  #ifdef CONFIG_LINUX_IO_URING
> -LuringState *luring_init(Error **errp);
> -void luring_cleanup(LuringState *s);
> -
>  /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
>  int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
>                                    QEMUIOVector *qiov, int type,
>                                    BdrvRequestFlags flags);
> -void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> -void luring_attach_aio_context(LuringState *s, AioContext *new_context);
>  bool luring_has_fua(void);
>  #else
>  static inline bool luring_has_fua(void)
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 9b5f08ccb2..d1f1fc3a77 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif /* !defined(CONFIG_LINUX_AIO) */
>  
> -#ifndef CONFIG_LINUX_IO_URING
>      if (s->use_linux_io_uring) {
> +#ifdef CONFIG_LINUX_IO_URING
> +        if (!aio_has_io_uring()) {

Compared to the old code... [1]

> +            error_setg(errp, "aio=io_uring was specified, but is not "
> +                             "available (disabled via io_uring_disabled "
> +                             "sysctl or blocked by container runtime "
> +                             "seccomp policy?)");
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +#else
>          error_setg(errp, "aio=io_uring was specified, but is not supported "
>                           "in this build.");

While here, let's get rid of the trailing '.' in the error_setg call.


>          ret = -EINVAL;
>          goto fail;
> -    }
>  #endif /* !defined(CONFIG_LINUX_IO_URING) */
> +    }
>  
>      s->has_discard = true;
>      s->has_write_zeroes = true;
> @@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
>      return true;
>  }
>  
> -#ifdef CONFIG_LINUX_IO_URING
> -static inline bool raw_check_linux_io_uring(BDRVRawState *s)
> -{
> -    Error *local_err = NULL;
> -    AioContext *ctx;
> -
> -    if (!s->use_linux_io_uring) {
> -        return false;
> -    }
> -
> -    ctx = qemu_get_current_aio_context();
> -    if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {

[1]... is there a reason you dropped the unlikely() wrapper?

> -        error_reportf_err(local_err, "Unable to use linux io_uring, "
> -                                     "falling back to thread pool: ");
> -        s->use_linux_io_uring = false;
> -        return false;
> -    }
> -    return true;
> -}
> -#endif
> -
>  #ifdef CONFIG_LINUX_AIO
>  static inline bool raw_check_linux_aio(BDRVRawState *s)
>  {
> @@ -2595,7 +2583,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
>      if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
>          type |= QEMU_AIO_MISALIGNED;
>  #ifdef CONFIG_LINUX_IO_URING
> -    } else if (raw_check_linux_io_uring(s)) {
> +    } else if (s->use_linux_io_uring) {
>          assert(qiov->size == bytes);
>          ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
>          goto out;
> @@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
>      };
>  
>  #ifdef CONFIG_LINUX_IO_URING
> -    if (raw_check_linux_io_uring(s)) {
> +    if (s->use_linux_io_uring) {
>          return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
>      }
>  #endif
> diff --git a/block/io_uring.c b/block/io_uring.c
> index dd4f304910..dd930ee57e 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -11,28 +11,20 @@
>  #include "qemu/osdep.h"
>  #include <liburing.h>
>  #include "block/aio.h"
> -#include "qemu/queue.h"
>  #include "block/block.h"
>  #include "block/raw-aio.h"
>  #include "qemu/coroutine.h"
> -#include "qemu/defer-call.h"
> -#include "qapi/error.h"
>  #include "system/block-backend.h"
>  #include "trace.h"
>  
> -/* Only used for assertions.  */
> -#include "qemu/coroutine_int.h"
> -
> -/* io_uring ring size */
> -#define MAX_ENTRIES 128
> -
> -typedef struct LuringAIOCB {
> +typedef struct {
>      Coroutine *co;
> -    struct io_uring_sqe sqeq;
> -    ssize_t ret;
>      QEMUIOVector *qiov;
> -    bool is_read;
> -    QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +    uint64_t offset;
> +    ssize_t ret;
> +    int type;
> +    int fd;
> +    BdrvRequestFlags flags;
>  
>      /*
>       * Buffered reads may require resubmission, see
> @@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
>       */
>      int total_read;
>      QEMUIOVector resubmit_qiov;
> -} LuringAIOCB;
>  
> -typedef struct LuringQueue {
> -    unsigned int in_queue;
> -    unsigned int in_flight;
> -    bool blocked;
> -    QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
> -} LuringQueue;
> +    CqeHandler cqe_handler;
> +} LuringRequest;
>  
> -struct LuringState {
> -    AioContext *aio_context;
> -
> -    struct io_uring ring;
> -
> -    /* No locking required, only accessed from AioContext home thread */
> -    LuringQueue io_q;
> -
> -    QEMUBH *completion_bh;
> -};
> -
> -/**
> - * luring_resubmit:
> - *
> - * Resubmit a request by appending it to submit_queue.  The caller must ensure
> - * that ioq_submit() is called later so that submit_queue requests are started.
> - */
> -static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
> +static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
>  {
> -    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
> -    s->io_q.in_queue++;
> +    LuringRequest *req = opaque;
> +    QEMUIOVector *qiov = req->qiov;
> +    uint64_t offset = req->offset;
> +    int fd = req->fd;
> +    BdrvRequestFlags flags = req->flags;
> +
> +    switch (req->type) {
> +    case QEMU_AIO_WRITE:
> +#ifdef HAVE_IO_URING_PREP_WRITEV2
> +    {
> +        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> +        io_uring_prep_writev2(sqe, fd, qiov->iov,
> +                              qiov->niov, offset, luring_flags);
> +    }
> +#else
> +        assert(flags == 0);
> +        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);

Hmm.  'man io_uring_prep_writev2' states:

       Unless an application explicitly needs to pass in more than  one  iovec,
       it  is  more  efficient  to  use io_uring_prep_write(3) rather than this
       function, as no state has to be maintained for  a  non-vectored  IO  re‐
       quest.

Obviously, if we want luring_flags of RWF_DSYNC to be set, we have to
use the newer interface; but if that flag is not present, should we be
conditionally falling back to the simpler interface when qiov->niov ==
1?

In fact, even when qiov->niov > 1, can we unvector it ourselves into
multiple io_uring_prep_write() calls, since the whole point of the
uring is that we aren't making syscalls, so more ops on the uring
should still be cheaper?  But that's a question for a followup patch
(still, given that your series is RFC for performance reasons, it may
be worth investigating).

[side note - I originally read the interface as "write version-2" and
not "write-vectored 2" - but the man page quickly got me on track on
how to parse the 'v'.  Quite a few Linux interfaces where a flags
argument has been added after the fact...]

> +#endif
> +        break;
> +    case QEMU_AIO_ZONE_APPEND:
> +        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
> +        break;
> +    case QEMU_AIO_READ:
> +    {
> +        if (req->resubmit_qiov.iov != NULL) {
> +            qiov = &req->resubmit_qiov;
> +        }
> +        io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
> +                            offset + req->total_read);

Another case where the man page suggests io_uring_prep_read() is
faster when qiov->niov == 1, or where we may explore with unvectoring.

> +        break;
> +    }
> +    case QEMU_AIO_FLUSH:
> +        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
> +        break;
> +    default:
> +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> +                        __func__, req->type);
> +        abort();
> +    }
>  }
>  
>  /**
> @@ -78,385 +85,115 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
>   * Short reads are rare but may occur. The remaining read request needs to be
>   * resubmitted.
>   */
> -static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> -                                       int nread)
> +static void luring_resubmit_short_read(LuringRequest *req, int nread)
>  {
>      QEMUIOVector *resubmit_qiov;
>      size_t remaining;
>  
> -    trace_luring_resubmit_short_read(s, luringcb, nread);
> +    trace_luring_resubmit_short_read(req, nread);
>  
>      /* Update read position */
> -    luringcb->total_read += nread;
> -    remaining = luringcb->qiov->size - luringcb->total_read;
> +    req->total_read += nread;
> +    remaining = req->qiov->size - req->total_read;
>  
>      /* Shorten qiov */
> -    resubmit_qiov = &luringcb->resubmit_qiov;
> +    resubmit_qiov = &req->resubmit_qiov;
>      if (resubmit_qiov->iov == NULL) {
> -        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
> +        qemu_iovec_init(resubmit_qiov, req->qiov->niov);
>      } else {
>          qemu_iovec_reset(resubmit_qiov);
>      }
> -    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
> -                      remaining);
> +    qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
>  
> -    /* Update sqe */
> -    luringcb->sqeq.off += nread;
> -    luringcb->sqeq.addr = (uintptr_t)luringcb->resubmit_qiov.iov;
> -    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
> -
> -    luring_resubmit(s, luringcb);
> +    aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
>  }
>  
> -/**
> - * luring_process_completions:
> - * @s: AIO state
> - *
> - * Fetches completed I/O requests, consumes cqes and invokes their callbacks
> - * The function is somewhat tricky because it supports nested event loops, for
> - * example when a request callback invokes aio_poll().
> - *
> - * Function schedules BH completion so it  can be called again in a nested
> - * event loop.  When there are no events left  to complete the BH is being
> - * canceled.
> - *
> - */
> -static void luring_process_completions(LuringState *s)
> +static void luring_cqe_handler(CqeHandler *cqe_handler)
>  {
> -    struct io_uring_cqe *cqes;
> -    int total_bytes;
> +    LuringRequest *req = container_of(cqe_handler, LuringRequest, cqe_handler);
> +    int ret = cqe_handler->cqe.res;
>  
> -    defer_call_begin();
> +    trace_luring_cqe_handler(req, ret);
>  
> -    /*
> -     * Request completion callbacks can run the nested event loop.
> -     * Schedule ourselves so the nested event loop will "see" remaining
> -     * completed requests and process them.  Without this, completion
> -     * callbacks that wait for other requests using a nested event loop
> -     * would hang forever.
> -     *
> -     * This workaround is needed because io_uring uses poll_wait, which
> -     * is woken up when new events are added to the uring, thus polling on
> -     * the same uring fd will block unless more events are received.
> -     *
> -     * Other leaf block drivers (drivers that access the data themselves)
> -     * are networking based, so they poll sockets for data and run the
> -     * correct coroutine.
> -     */
> -    qemu_bh_schedule(s->completion_bh);
> -
> -    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> -        LuringAIOCB *luringcb;
> -        int ret;
> -
> -        if (!cqes) {
> -            break;
> +    if (ret < 0) {
> +        /*
> +         * Only writev/readv/fsync requests on regular files or host block
> +         * devices are submitted. Therefore -EAGAIN is not expected but it's
> +         * known to happen sometimes with Linux SCSI. Submit again and hope
> +         * the request completes successfully.
> +         *
> +         * For more information, see:
> +         * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
> +         *
> +         * If the code is changed to submit other types of requests in the
> +         * future, then this workaround may need to be extended to deal with
> +         * genuine -EAGAIN results that should not be resubmitted
> +         * immediately.
> +         */
> +        if (ret == -EINTR || ret == -EAGAIN) {
> +            aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
> +            return;
>          }
> -
> -        luringcb = io_uring_cqe_get_data(cqes);
> -        ret = cqes->res;
> -        io_uring_cqe_seen(&s->ring, cqes);
> -        cqes = NULL;
> -
> -        /* Change counters one-by-one because we can be nested. */
> -        s->io_q.in_flight--;
> -        trace_luring_process_completion(s, luringcb, ret);
> -
> +    } else if (req->qiov) {
>          /* total_read is non-zero only for resubmitted read requests */
> -        total_bytes = ret + luringcb->total_read;
> +        int total_bytes = ret + req->total_read;
>  
> -        if (ret < 0) {
> -            /*
> -             * Only writev/readv/fsync requests on regular files or host block
> -             * devices are submitted. Therefore -EAGAIN is not expected but it's
> -             * known to happen sometimes with Linux SCSI. Submit again and hope
> -             * the request completes successfully.
> -             *
> -             * For more information, see:
> -             * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
> -             *
> -             * If the code is changed to submit other types of requests in the
> -             * future, then this workaround may need to be extended to deal with
> -             * genuine -EAGAIN results that should not be resubmitted
> -             * immediately.
> -             */
> -            if (ret == -EINTR || ret == -EAGAIN) {
> -                luring_resubmit(s, luringcb);
> -                continue;
> -            }
> -        } else if (!luringcb->qiov) {
> -            goto end;
> -        } else if (total_bytes == luringcb->qiov->size) {
> +        if (total_bytes == req->qiov->size) {
>              ret = 0;
> -        /* Only read/write */
>          } else {
>              /* Short Read/Write */
> -            if (luringcb->is_read) {
> +            if (req->type == QEMU_AIO_READ) {
>                  if (ret > 0) {
> -                    luring_resubmit_short_read(s, luringcb, ret);
> -                    continue;
> -                } else {
> -                    /* Pad with zeroes */
> -                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
> -                                      luringcb->qiov->size - total_bytes);
> -                    ret = 0;
> +                    luring_resubmit_short_read(req, ret);
> +                    return;
>                  }
> +
> +                /* Pad with zeroes */
> +                qemu_iovec_memset(req->qiov, total_bytes, 0,
> +                                  req->qiov->size - total_bytes);
> +                ret = 0;
>              } else {
>                  ret = -ENOSPC;
>              }
>          }
> -end:
> -        luringcb->ret = ret;
> -        qemu_iovec_destroy(&luringcb->resubmit_qiov);
> -
> -        /*
> -         * If the coroutine is already entered it must be in ioq_submit()
> -         * and will notice luringcb->ret has been filled in when it
> -         * eventually runs later. Coroutines cannot be entered recursively
> -         * so avoid doing that!
> -         */
> -        assert(luringcb->co->ctx == s->aio_context);
> -        if (!qemu_coroutine_entered(luringcb->co)) {
> -            aio_co_wake(luringcb->co);
> -        }
>      }
>  
> -    qemu_bh_cancel(s->completion_bh);
> +    req->ret = ret;
> +    qemu_iovec_destroy(&req->resubmit_qiov);
>  
> -    defer_call_end();
> -}
> -
> -static int ioq_submit(LuringState *s)
> -{
> -    int ret = 0;
> -    LuringAIOCB *luringcb, *luringcb_next;
> -
> -    while (s->io_q.in_queue > 0) {
> -        /*
> -         * Try to fetch sqes from the ring for requests waiting in
> -         * the overflow queue
> -         */
> -        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.submit_queue, next,
> -                              luringcb_next) {
> -            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> -            if (!sqes) {
> -                break;
> -            }
> -            /* Prep sqe for submission */
> -            *sqes = luringcb->sqeq;
> -            QSIMPLEQ_REMOVE_HEAD(&s->io_q.submit_queue, next);
> -        }
> -        ret = io_uring_submit(&s->ring);
> -        trace_luring_io_uring_submit(s, ret);
> -        /* Prevent infinite loop if submission is refused */
> -        if (ret <= 0) {
> -            if (ret == -EAGAIN || ret == -EINTR) {
> -                continue;
> -            }
> -            break;
> -        }
> -        s->io_q.in_flight += ret;
> -        s->io_q.in_queue  -= ret;
> -    }
> -    s->io_q.blocked = (s->io_q.in_queue > 0);
> -
> -    if (s->io_q.in_flight) {
> -        /*
> -         * We can try to complete something just right away if there are
> -         * still requests in-flight.
> -         */
> -        luring_process_completions(s);
> -    }
> -    return ret;
> -}
> -
> -static void luring_process_completions_and_submit(LuringState *s)
> -{
> -    luring_process_completions(s);
> -
> -    if (s->io_q.in_queue > 0) {
> -        ioq_submit(s);
> +    /*
> +     * If the coroutine is already entered it must be in luring_co_submit() and
> +     * will notice req->ret has been filled in when it eventually runs later.
> +     * Coroutines cannot be entered recursively so avoid doing that!
> +     */
> +    if (!qemu_coroutine_entered(req->co)) {
> +        aio_co_wake(req->co);
>      }
>  }
>  
> -static void qemu_luring_completion_bh(void *opaque)
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd,
> +                                  uint64_t offset, QEMUIOVector *qiov,
> +                                  int type, BdrvRequestFlags flags)
>  {
> -    LuringState *s = opaque;
> -    luring_process_completions_and_submit(s);
> -}
> -
> -static void qemu_luring_completion_cb(void *opaque)
> -{
> -    LuringState *s = opaque;
> -    luring_process_completions_and_submit(s);
> -}
> -
> -static bool qemu_luring_poll_cb(void *opaque)
> -{
> -    LuringState *s = opaque;
> -
> -    return io_uring_cq_ready(&s->ring);
> -}
> -
> -static void qemu_luring_poll_ready(void *opaque)
> -{
> -    LuringState *s = opaque;
> -
> -    luring_process_completions_and_submit(s);
> -}
> -
> -static void ioq_init(LuringQueue *io_q)
> -{
> -    QSIMPLEQ_INIT(&io_q->submit_queue);
> -    io_q->in_queue = 0;
> -    io_q->in_flight = 0;
> -    io_q->blocked = false;
> -}
> -
> -static void luring_deferred_fn(void *opaque)
> -{
> -    LuringState *s = opaque;
> -    trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
> -                           s->io_q.in_flight);
> -    if (!s->io_q.blocked && s->io_q.in_queue > 0) {
> -        ioq_submit(s);
> -    }
> -}
> -
> -/**
> - * luring_do_submit:
> - * @fd: file descriptor for I/O
> - * @luringcb: AIO control block
> - * @s: AIO state
> - * @offset: offset for request
> - * @type: type of request
> - *
> - * Fetches sqes from ring, adds to pending queue and preps them
> - *
> - */
> -static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> -                            uint64_t offset, int type, BdrvRequestFlags flags)
> -{
> -    int ret;
> -    struct io_uring_sqe *sqes = &luringcb->sqeq;
> -
> -    switch (type) {
> -    case QEMU_AIO_WRITE:
> -#ifdef HAVE_IO_URING_PREP_WRITEV2
> -    {
> -        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> -        io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
> -                              luringcb->qiov->niov, offset, luring_flags);
> -    }
> -#else
> -        assert(flags == 0);
> -        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> -                             luringcb->qiov->niov, offset);
> -#endif
> -        break;
> -    case QEMU_AIO_ZONE_APPEND:
> -        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> -                             luringcb->qiov->niov, offset);
> -        break;
> -    case QEMU_AIO_READ:
> -        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> -                            luringcb->qiov->niov, offset);
> -        break;
> -    case QEMU_AIO_FLUSH:
> -        io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
> -        break;
> -    default:
> -        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> -                        __func__, type);
> -        abort();
> -    }
> -    io_uring_sqe_set_data(sqes, luringcb);
> -
> -    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
> -    s->io_q.in_queue++;
> -    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
> -                           s->io_q.in_flight);
> -    if (!s->io_q.blocked) {
> -        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
> -            ret = ioq_submit(s);
> -            trace_luring_do_submit_done(s, ret);
> -            return ret;
> -        }
> -
> -        defer_call(luring_deferred_fn, s);
> -    }
> -    return 0;
> -}
> -
> -int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
> -                                  QEMUIOVector *qiov, int type,
> -                                  BdrvRequestFlags flags)
> -{
> -    int ret;
> -    AioContext *ctx = qemu_get_current_aio_context();
> -    LuringState *s = aio_get_linux_io_uring(ctx);
> -    LuringAIOCB luringcb = {
> +    LuringRequest req = {
>          .co         = qemu_coroutine_self(),
> -        .ret        = -EINPROGRESS,
>          .qiov       = qiov,
> -        .is_read    = (type == QEMU_AIO_READ),
> +        .ret        = -EINPROGRESS,
> +        .type       = type,
> +        .fd         = fd,
> +        .offset     = offset,
> +        .flags      = flags,
>      };
> -    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
> -                           type);
> -    ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
>  
> -    if (ret < 0) {
> -        return ret;
> -    }
> +    req.cqe_handler.cb = luring_cqe_handler;
>  
> -    if (luringcb.ret == -EINPROGRESS) {
> +    trace_luring_co_submit(bs, &req, fd, offset, qiov ? qiov->size : 0, type);
> +    aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);
> +
> +    if (req.ret == -EINPROGRESS) {
>          qemu_coroutine_yield();
>      }
> -    return luringcb.ret;
> -}
> -
> -void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> -{
> -    aio_set_fd_handler(old_context, s->ring.ring_fd,
> -                       NULL, NULL, NULL, NULL, s);
> -    qemu_bh_delete(s->completion_bh);
> -    s->aio_context = NULL;
> -}
> -
> -void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> -{
> -    s->aio_context = new_context;
> -    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
> -    aio_set_fd_handler(s->aio_context, s->ring.ring_fd,
> -                       qemu_luring_completion_cb, NULL,
> -                       qemu_luring_poll_cb, qemu_luring_poll_ready, s);
> -}
> -
> -LuringState *luring_init(Error **errp)
> -{
> -    int rc;
> -    LuringState *s = g_new0(LuringState, 1);
> -    struct io_uring *ring = &s->ring;
> -
> -    trace_luring_init_state(s, sizeof(*s));
> -
> -    rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
> -    if (rc < 0) {
> -        error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
> -        g_free(s);
> -        return NULL;
> -    }
> -
> -    ioq_init(&s->io_q);
> -    return s;
> -
> -}
> -
> -void luring_cleanup(LuringState *s)
> -{
> -    io_uring_queue_exit(&s->ring);
> -    trace_luring_cleanup_state(s);
> -    g_free(s);
> +    return req.ret;
>  }
>  
>  bool luring_has_fua(void)
> diff --git a/stubs/io_uring.c b/stubs/io_uring.c
> deleted file mode 100644
> index 622d1e4648..0000000000
> --- a/stubs/io_uring.c
> +++ /dev/null
> @@ -1,32 +0,0 @@
> -/*
> - * Linux io_uring support.
> - *
> - * Copyright (C) 2009 IBM, Corp.
> - * Copyright (C) 2009 Red Hat, Inc.
> - *
> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
> - * See the COPYING file in the top-level directory.
> - */
> -#include "qemu/osdep.h"
> -#include "block/aio.h"
> -#include "block/raw-aio.h"
> -
> -void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> -{
> -    abort();
> -}
> -
> -void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> -{
> -    abort();
> -}
> -
> -LuringState *luring_init(Error **errp)
> -{
> -    abort();
> -}
> -
> -void luring_cleanup(LuringState *s)
> -{
> -    abort();
> -}
> diff --git a/util/async.c b/util/async.c
> index bba9622e97..d66575acd2 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -383,14 +383,6 @@ aio_ctx_finalize(GSource     *source)
>      }
>  #endif
>  
> -#ifdef CONFIG_LINUX_IO_URING
> -    if (ctx->linux_io_uring) {
> -        luring_detach_aio_context(ctx->linux_io_uring, ctx);
> -        luring_cleanup(ctx->linux_io_uring);
> -        ctx->linux_io_uring = NULL;
> -    }
> -#endif
> -
>      assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
>      qemu_bh_delete(ctx->co_schedule_bh);
>  
> @@ -465,29 +457,6 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  }
>  #endif
>  
> -#ifdef CONFIG_LINUX_IO_URING
> -LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
> -{
> -    if (ctx->linux_io_uring) {
> -        return ctx->linux_io_uring;
> -    }
> -
> -    ctx->linux_io_uring = luring_init(errp);
> -    if (!ctx->linux_io_uring) {
> -        return NULL;
> -    }
> -
> -    luring_attach_aio_context(ctx->linux_io_uring, ctx);
> -    return ctx->linux_io_uring;
> -}
> -
> -LuringState *aio_get_linux_io_uring(AioContext *ctx)
> -{
> -    assert(ctx->linux_io_uring);
> -    return ctx->linux_io_uring;
> -}
> -#endif
> -
>  void aio_notify(AioContext *ctx)
>  {
>      /*
> @@ -611,10 +580,6 @@ AioContext *aio_context_new(Error **errp)
>      ctx->linux_aio = NULL;
>  #endif
>  
> -#ifdef CONFIG_LINUX_IO_URING
> -    ctx->linux_io_uring = NULL;
> -#endif
> -
>      ctx->thread_pool = NULL;
>      qemu_rec_mutex_init(&ctx->lock);
>      timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 03a07a4caf..2c64f80e5f 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -48,6 +48,7 @@
>  #include "qapi/error.h"
>  #include "qemu/rcu_queue.h"
>  #include "aio-posix.h"
> +#include "trace.h"
>  
>  enum {
>      FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
> @@ -174,6 +175,9 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
>  
>      prep_sqe(sqe, opaque);
>      io_uring_sqe_set_data(sqe, cqe_handler);
> +
> +    trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
> +                                 cqe_handler);
>  }
>  
>  static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
> @@ -290,6 +294,8 @@ static void cqe_handler_bh(void *opaque)
>  
>          QSIMPLEQ_REMOVE_HEAD(ready_list, next);
>  
> +        trace_fdmon_io_uring_cqe_handler(ctx, cqe_handler,
> +                                         cqe_handler->cqe.res);
>          cqe_handler->cb(cqe_handler);
>      }
>  
> diff --git a/block/trace-events b/block/trace-events
> index 8e789e1f12..c9b4736ff8 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -62,15 +62,9 @@ qmp_block_stream(void *bs) "bs %p"
>  file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
>  
>  # io_uring.c
> -luring_init_state(void *s, size_t size) "s %p size %zu"
> -luring_cleanup_state(void *s) "%p freed"
> -luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
> -luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
> -luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
> -luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
> -luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
> -luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
> -luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
> +luring_cqe_handler(void *req, int ret) "req %p ret %d"
> +luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
> +luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
>  
>  # qcow2.c
>  qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
> diff --git a/stubs/meson.build b/stubs/meson.build
> index 63392f5e78..d157b06273 100644
> --- a/stubs/meson.build
> +++ b/stubs/meson.build
> @@ -32,9 +32,6 @@ if have_block or have_ga
>    stub_ss.add(files('cpus-virtual-clock.c'))
>    stub_ss.add(files('icount.c'))
>    stub_ss.add(files('graph-lock.c'))
> -  if linux_io_uring.found()
> -    stub_ss.add(files('io_uring.c'))
> -  endif
>    if libaio.found()
>      stub_ss.add(files('linux-aio.c'))
>    endif
> diff --git a/util/trace-events b/util/trace-events
> index bd8f25fb59..540d662507 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -24,6 +24,10 @@ buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes
>  buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
>  buffer_free(const char *buf, size_t len) "%s: capacity %zd"
>  
> +# fdmon-io_uring.c
> +fdmon_io_uring_add_sqe(void *ctx, void *opaque, int opcode, int fd, uint64_t off, void *cqe_handler) "ctx %p opaque %p opcode %d fd %d off %"PRId64" cqe_handler %p"
> +fdmon_io_uring_cqe_handler(void *ctx, void *cqe_handler, int cqe_res) "ctx %p cqe_handler %p cqe_res %d"
> +
>  # filemonitor-inotify.c
>  qemu_file_monitor_add_watch(void *mon, const char *dirpath, const char *filename, void *cb, void *opaque, int64_t id) "File monitor %p add watch dir='%s' file='%s' cb=%p opaque=%p id=%" PRId64
>  qemu_file_monitor_remove_watch(void *mon, const char *dirpath, int64_t id) "File monitor %p remove watch dir='%s' id=%" PRId64
> -- 
> 2.49.0
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 11/11] block/io_uring: use aio_add_sqe()
  2025-05-29 21:11   ` Eric Blake
@ 2025-06-05 18:40     ` Stefan Hajnoczi
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Hajnoczi @ 2025-06-05 18:40 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, qemu-block, hibriansong, Kevin Wolf, Hanna Czenczek

[-- Attachment #1: Type: text/plain, Size: 11791 bytes --]

On Thu, May 29, 2025 at 04:11:21PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:16PM -0400, Stefan Hajnoczi wrote:
> > AioContext has its own io_uring instance for file descriptor monitoring.
> > The disk I/O io_uring code was developed separately. Originally I
> > thought the characteristics of file descriptor monitoring and disk I/O
> > were too different, requiring separate io_uring instances.
> > 
> > Now it has become clear to me that it's feasible to share a single
> > io_uring instance for file descriptor monitoring and disk I/O. We're not
> > using io_uring's IOPOLL feature or anything else that would require a
> > separate instance.
> > 
> > Unify block/io_uring.c and util/fdmon-io_uring.c using the new
> > aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
> > block/io_uring.c just needs to submit readv/writev/fsync and most of the
> > io_uring-specific logic is handled by fdmon-io_uring.c.
> > 
> > There are two immediate advantages:
> > 1. Fewer system calls. There is no need to monitor the disk I/O io_uring
> >    ring fd from the file descriptor monitoring io_uring instance. Disk
> >    I/O completions are now picked up directly. Also, sqes are
> >    accumulated in the sq ring until the end of the event loop iteration
> >    and there are fewer io_uring_enter(2) syscalls.
> > 2. Less code duplication.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> 
> Comments below, but looks sane to me.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> >  include/block/aio.h     |   7 -
> >  include/block/raw-aio.h |   5 -
> >  block/file-posix.c      |  38 ++--
> >  block/io_uring.c        | 489 ++++++++++------------------------------
> >  stubs/io_uring.c        |  32 ---
> >  util/async.c            |  35 ---
> >  util/fdmon-io_uring.c   |   6 +
> >  block/trace-events      |  12 +-
> >  stubs/meson.build       |   3 -
> >  util/trace-events       |   4 +
> >  10 files changed, 139 insertions(+), 492 deletions(-)
> >  delete mode 100644 stubs/io_uring.c
> > 
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index 95beef28c3..fbb45cca74 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -291,8 +291,6 @@ struct AioContext {
> >      struct LinuxAioState *linux_aio;
> >  #endif
> >  #ifdef CONFIG_LINUX_IO_URING
> > -    LuringState *linux_io_uring;
> > -
> >      /* State for file descriptor monitoring using Linux io_uring */
> >      struct io_uring fdmon_io_uring;
> >      AioHandlerSList submit_list;
> > @@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
> >  /* Return the LinuxAioState bound to this AioContext */
> >  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
> >  
> > -/* Setup the LuringState bound to this AioContext */
> > -LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> > -
> > -/* Return the LuringState bound to this AioContext */
> > -LuringState *aio_get_linux_io_uring(AioContext *ctx);
> >  /**
> >   * aio_timer_new_with_attrs:
> >   * @ctx: the aio context
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index 6570244496..30e5fc9a9f 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
> >  #endif
> >  /* io_uring.c - Linux io_uring implementation */
> >  #ifdef CONFIG_LINUX_IO_URING
> > -LuringState *luring_init(Error **errp);
> > -void luring_cleanup(LuringState *s);
> > -
> >  /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
> >  int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
> >                                    QEMUIOVector *qiov, int type,
> >                                    BdrvRequestFlags flags);
> > -void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> > -void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> >  bool luring_has_fua(void);
> >  #else
> >  static inline bool luring_has_fua(void)
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 9b5f08ccb2..d1f1fc3a77 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >      }
> >  #endif /* !defined(CONFIG_LINUX_AIO) */
> >  
> > -#ifndef CONFIG_LINUX_IO_URING
> >      if (s->use_linux_io_uring) {
> > +#ifdef CONFIG_LINUX_IO_URING
> > +        if (!aio_has_io_uring()) {
> 
> Compared to the old code... [1]
> 
> > +            error_setg(errp, "aio=io_uring was specified, but is not "
> > +                             "available (disabled via io_uring_disabled "
> > +                             "sysctl or blocked by container runtime "
> > +                             "seccomp policy?)");
> > +            ret = -EINVAL;
> > +            goto fail;
> > +        }
> > +#else
> >          error_setg(errp, "aio=io_uring was specified, but is not supported "
> >                           "in this build.");
> 
> While here, let's get rid of the trailing '.' in the error_setg call.

Sounds good.

> 
> 
> >          ret = -EINVAL;
> >          goto fail;
> > -    }
> >  #endif /* !defined(CONFIG_LINUX_IO_URING) */
> > +    }
> >  
> >      s->has_discard = true;
> >      s->has_write_zeroes = true;
> > @@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
> >      return true;
> >  }
> >  
> > -#ifdef CONFIG_LINUX_IO_URING
> > -static inline bool raw_check_linux_io_uring(BDRVRawState *s)
> > -{
> > -    Error *local_err = NULL;
> > -    AioContext *ctx;
> > -
> > -    if (!s->use_linux_io_uring) {
> > -        return false;
> > -    }
> > -
> > -    ctx = qemu_get_current_aio_context();
> > -    if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
> 
> [1]... is there a reason you dropped the unlikely() wrapper?

raw_check_linux_io_uring() was called from the I/O code path where a
branch prediction hint might make sense for performance reasons. The
raw_open_common() you mentioned above is not in the I/O code path.

Also, the new code in raw_open_common() is not the same as
raw_check_linux_io_uring(). aio_setup_linux_io_uring() no longer exists
in the new code. The patch removes this code entirely, it doesn't just
drop unlikely().

> 
> > -        error_reportf_err(local_err, "Unable to use linux io_uring, "
> > -                                     "falling back to thread pool: ");
> > -        s->use_linux_io_uring = false;
> > -        return false;
> > -    }
> > -    return true;
> > -}
> > -#endif
> > -
> >  #ifdef CONFIG_LINUX_AIO
> >  static inline bool raw_check_linux_aio(BDRVRawState *s)
> >  {
> > @@ -2595,7 +2583,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
> >      if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
> >          type |= QEMU_AIO_MISALIGNED;
> >  #ifdef CONFIG_LINUX_IO_URING
> > -    } else if (raw_check_linux_io_uring(s)) {
> > +    } else if (s->use_linux_io_uring) {
> >          assert(qiov->size == bytes);
> >          ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
> >          goto out;
> > @@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
> >      };
> >  
> >  #ifdef CONFIG_LINUX_IO_URING
> > -    if (raw_check_linux_io_uring(s)) {
> > +    if (s->use_linux_io_uring) {
> >          return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> >      }
> >  #endif
> > diff --git a/block/io_uring.c b/block/io_uring.c
> > index dd4f304910..dd930ee57e 100644
> > --- a/block/io_uring.c
> > +++ b/block/io_uring.c
> > @@ -11,28 +11,20 @@
> >  #include "qemu/osdep.h"
> >  #include <liburing.h>
> >  #include "block/aio.h"
> > -#include "qemu/queue.h"
> >  #include "block/block.h"
> >  #include "block/raw-aio.h"
> >  #include "qemu/coroutine.h"
> > -#include "qemu/defer-call.h"
> > -#include "qapi/error.h"
> >  #include "system/block-backend.h"
> >  #include "trace.h"
> >  
> > -/* Only used for assertions.  */
> > -#include "qemu/coroutine_int.h"
> > -
> > -/* io_uring ring size */
> > -#define MAX_ENTRIES 128
> > -
> > -typedef struct LuringAIOCB {
> > +typedef struct {
> >      Coroutine *co;
> > -    struct io_uring_sqe sqeq;
> > -    ssize_t ret;
> >      QEMUIOVector *qiov;
> > -    bool is_read;
> > -    QSIMPLEQ_ENTRY(LuringAIOCB) next;
> > +    uint64_t offset;
> > +    ssize_t ret;
> > +    int type;
> > +    int fd;
> > +    BdrvRequestFlags flags;
> >  
> >      /*
> >       * Buffered reads may require resubmission, see
> > @@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
> >       */
> >      int total_read;
> >      QEMUIOVector resubmit_qiov;
> > -} LuringAIOCB;
> >  
> > -typedef struct LuringQueue {
> > -    unsigned int in_queue;
> > -    unsigned int in_flight;
> > -    bool blocked;
> > -    QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
> > -} LuringQueue;
> > +    CqeHandler cqe_handler;
> > +} LuringRequest;
> >  
> > -struct LuringState {
> > -    AioContext *aio_context;
> > -
> > -    struct io_uring ring;
> > -
> > -    /* No locking required, only accessed from AioContext home thread */
> > -    LuringQueue io_q;
> > -
> > -    QEMUBH *completion_bh;
> > -};
> > -
> > -/**
> > - * luring_resubmit:
> > - *
> > - * Resubmit a request by appending it to submit_queue.  The caller must ensure
> > - * that ioq_submit() is called later so that submit_queue requests are started.
> > - */
> > -static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
> > +static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
> >  {
> > -    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
> > -    s->io_q.in_queue++;
> > +    LuringRequest *req = opaque;
> > +    QEMUIOVector *qiov = req->qiov;
> > +    uint64_t offset = req->offset;
> > +    int fd = req->fd;
> > +    BdrvRequestFlags flags = req->flags;
> > +
> > +    switch (req->type) {
> > +    case QEMU_AIO_WRITE:
> > +#ifdef HAVE_IO_URING_PREP_WRITEV2
> > +    {
> > +        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> > +        io_uring_prep_writev2(sqe, fd, qiov->iov,
> > +                              qiov->niov, offset, luring_flags);
> > +    }
> > +#else
> > +        assert(flags == 0);
> > +        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
> 
> Hmm.  'man io_uring_prep_writev2' states:
> 
>        Unless an application explicitly needs to pass in more than  one  iovec,
>        it  is  more  efficient  to  use io_uring_prep_write(3) rather than this
>        function, as no state has to be maintained for  a  non-vectored  IO  re‐
>        quest.
> 
> Obviously, if we want luring_flags of RWF_DSYNC to be set, we have to
> use the newer interface; but if that flag is not present, should we be
> conditionally falling back to the simpler interface when qiov->niov ==
> 1?

Thanks for the suggestion. I will try it out and benchmark it for v2.

> 
> In fact, even when qiov->niov > 1, can we unvector it ourselves into
> multiple io_uring_prep_write() calls, since the whole point of the
> uring is that we aren't making syscalls, so more ops on the uring
> should still be cheaper?  But that's a question for a followup patch
> (still, given that your series is RFC for performance reasons, it may
> be worth investigating).

That's more complicated. I won't pursue that for the time being.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-06-05 18:40 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
2025-05-28 20:29   ` Eric Blake
2025-05-28 19:09 ` [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c Stefan Hajnoczi
2025-05-28 20:34   ` Eric Blake
2025-05-28 19:09 ` [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring Stefan Hajnoczi
2025-05-28 20:40   ` Eric Blake
2025-05-28 19:09 ` [RFC 04/11] aio-posix: integrate fdmon into glib event loop Stefan Hajnoczi
2025-05-28 21:01   ` Eric Blake
2025-05-28 19:09 ` [RFC 05/11] aio: remove aio_context_use_g_source() Stefan Hajnoczi
2025-05-28 21:02   ` Eric Blake
2025-05-28 19:09 ` [RFC 06/11] aio: free AioContext when aio_context_new() fails Stefan Hajnoczi
2025-05-28 21:06   ` Eric Blake
2025-06-05 17:49     ` Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 07/11] aio: add errp argument to aio_context_setup() Stefan Hajnoczi
2025-05-28 21:07   ` Eric Blake
2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
2025-05-28 22:12   ` Eric Blake
2025-05-29 15:38     ` Stefan Hajnoczi
2025-06-03  6:05       ` Markus Armbruster
2025-06-03 18:48         ` Stefan Hajnoczi
2025-06-02 12:26   ` Brian
2025-06-02 20:20     ` Stefan Hajnoczi
2025-06-02 22:37       ` Brian
2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
2025-05-28 22:15   ` Eric Blake
2025-05-29 20:02   ` Eric Blake
2025-06-05 17:52     ` Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh Stefan Hajnoczi
2025-05-29 20:09   ` Eric Blake
2025-05-28 19:09 ` [RFC 11/11] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
2025-05-29 21:11   ` Eric Blake
2025-06-05 18:40     ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).