qemu-devel.nongnu.org archive mirror
* [PULL 00/27] Block layer patches
@ 2023-10-31 18:58 Kevin Wolf
  2023-10-31 23:31 ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Kevin Wolf @ 2023-10-31 18:58 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

The following changes since commit 516fffc9933cb21fad41ca8f7bf465d238d4d375:

  Merge tag 'pull-lu-20231030' of https://gitlab.com/rth7680/qemu into staging (2023-10-31 07:12:40 +0900)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 900e7d413d630ebd3f5d64bae0e6249621ec0c7f:

  iotests: add test for changing mirror's copy_mode (2023-10-31 19:46:51 +0100)

----------------------------------------------------------------
Block layer patches

- virtio-blk: use blk_io_plug_call() instead of notification BH
- mirror: allow switching from background to active mode
- qemu-img rebase: add compression support
- Fix locking in media change monitor commands
- Fix a few blockjob-related deadlocks when using iothread

----------------------------------------------------------------
Andrey Drobyshev (8):
      qemu-img: rebase: stop when reaching EOF of old backing file
      qemu-iotests: 024: add rebasing test case for overlay_size > backing_size
      qemu-img: rebase: use backing files' BlockBackend for buffer alignment
      qemu-img: add chunk size parameter to compare_buffers()
      qemu-img: rebase: avoid unnecessary COW operations
      iotests/{024, 271}: add testcases for qemu-img rebase
      qemu-img: add compression option to rebase subcommand
      iotests: add tests for "qemu-img rebase" with compression

Fiona Ebner (13):
      blockjob: drop AioContext lock before calling bdrv_graph_wrlock()
      block: avoid potential deadlock during bdrv_graph_wrlock() in bdrv_close()
      blockdev: mirror: avoid potential deadlock when using iothread
      blockjob: introduce block-job-change QMP command
      block/mirror: set actively_synced even after the job is ready
      block/mirror: move dirty bitmap to filter
      block/mirror: determine copy_to_target only once
      mirror: implement mirror_change method
      qapi/block-core: use JobType for BlockJobInfo's type
      qapi/block-core: turn BlockJobInfo into a union
      blockjob: query driver-specific info via a new 'query' driver method
      mirror: return mirror-specific information upon query
      iotests: add test for changing mirror's copy_mode

Kevin Wolf (2):
      block: Fix locking in media change monitor commands
      iotests: Test media change with iothreads

Stefan Hajnoczi (4):
      block: rename blk_io_plug_call() API to defer_call()
      util/defer-call: move defer_call() to util/
      virtio: use defer_call() in virtio_irqfd_notify()
      virtio-blk: remove batch notification BH

 qapi/block-core.json                               |  59 ++++++-
 qapi/job.json                                      |   4 +-
 docs/tools/qemu-img.rst                            |   6 +-
 include/block/blockjob.h                           |  11 ++
 include/block/blockjob_int.h                       |  12 ++
 include/qemu/defer-call.h                          |  16 ++
 include/sysemu/block-backend-io.h                  |   4 -
 block.c                                            |   2 +-
 block/blkio.c                                      |   9 +-
 block/io_uring.c                                   |  11 +-
 block/linux-aio.c                                  |   9 +-
 block/mirror.c                                     | 131 ++++++++++----
 block/monitor/block-hmp-cmds.c                     |   4 +-
 block/nvme.c                                       |   5 +-
 block/plug.c                                       | 159 -----------------
 block/qapi-sysemu.c                                |   5 +
 blockdev.c                                         |  28 ++-
 blockjob.c                                         |  30 +++-
 hw/block/dataplane/virtio-blk.c                    |  48 +----
 hw/block/dataplane/xen-block.c                     |  11 +-
 hw/block/virtio-blk.c                              |   5 +-
 hw/scsi/virtio-scsi.c                              |   7 +-
 hw/virtio/virtio.c                                 |  13 +-
 job.c                                              |   1 +
 qemu-img.c                                         | 136 +++++++++++----
 util/defer-call.c                                  | 156 +++++++++++++++++
 util/thread-pool.c                                 |   5 +
 MAINTAINERS                                        |   3 +-
 block/meson.build                                  |   1 -
 hw/virtio/trace-events                             |   1 +
 qemu-img-cmds.hx                                   |   4 +-
 tests/qemu-iotests/024                             | 117 +++++++++++++
 tests/qemu-iotests/024.out                         |  73 ++++++++
 tests/qemu-iotests/109.out                         |  24 +--
 tests/qemu-iotests/118                             |   6 +-
 tests/qemu-iotests/271                             | 131 ++++++++++++++
 tests/qemu-iotests/271.out                         |  82 +++++++++
 tests/qemu-iotests/314                             | 165 ++++++++++++++++++
 tests/qemu-iotests/314.out                         |  75 ++++++++
 tests/qemu-iotests/tests/mirror-change-copy-mode   | 193 +++++++++++++++++++++
 .../qemu-iotests/tests/mirror-change-copy-mode.out |   5 +
 util/meson.build                                   |   1 +
 42 files changed, 1437 insertions(+), 331 deletions(-)
 create mode 100644 include/qemu/defer-call.h
 delete mode 100644 block/plug.c
 create mode 100644 util/defer-call.c
 create mode 100755 tests/qemu-iotests/314
 create mode 100644 tests/qemu-iotests/314.out
 create mode 100755 tests/qemu-iotests/tests/mirror-change-copy-mode
 create mode 100644 tests/qemu-iotests/tests/mirror-change-copy-mode.out



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PULL 00/27] Block layer patches
  2023-10-31 18:58 [PULL 00/27] Block layer patches Kevin Wolf
@ 2023-10-31 23:31 ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2023-10-31 23:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, kwolf, qemu-devel


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/8.2 for any user-visible changes.



* [PULL 00/27] Block layer patches
@ 2025-11-04 17:53 Kevin Wolf
  2025-11-04 17:53 ` [PULL 01/27] aio-posix: fix race between io_uring CQE and AioHandler deletion Kevin Wolf
                   ` (26 more replies)
  0 siblings, 27 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

The following changes since commit a8e63c013016f9ff981689189c5b063551d04559:

  Merge tag 'igvm-20251103--pull-request' of https://gitlab.com/kraxel/qemu into staging (2025-11-03 10:21:01 +0100)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 4d0de416dd06c405906735a61c2521912aa3d72c:

  qcow2, vmdk: Restrict creation with secondary file using protocol (2025-11-04 18:25:47 +0100)

----------------------------------------------------------------
Block layer patches

- stream: Fix potential crash during job completion
- aio: add the aio_add_sqe() io_uring API
- qcow2: put discards in discard queue when discard-no-unref is enabled
- qcow2, vmdk: Restrict creation with secondary file using protocol
- iotests: Run iotests with sanitizers
- iotests: Add more image formats to the thorough testing
- iotests: Improve the dry run list to speed up thorough testing
- Code cleanup

----------------------------------------------------------------
Akihiko Odaki (2):
      qemu-img: Fix amend option parse error handling
      iotests: Run iotests with sanitizers

Eric Blake (2):
      block: Allow drivers to control protocol prefix at creation
      qcow2, vmdk: Restrict creation with secondary file using protocol

Jean-Louis Dupond (2):
      qcow2: rename update_refcount_discard to queue_discard
      qcow2: put discards in discard queue when discard-no-unref is enabled

Kevin Wolf (1):
      iotests: Test resizing file node under raw with size/offset

Stefan Hajnoczi (15):
      aio-posix: fix race between io_uring CQE and AioHandler deletion
      aio-posix: fix fdmon-io_uring.c timeout stack variable lifetime
      aio-posix: fix spurious return from ->wait() due to signals
      aio-posix: keep polling enabled with fdmon-io_uring.c
      tests/unit: skip test-nested-aio-poll with io_uring
      aio-posix: integrate fdmon into glib event loop
      aio: remove aio_context_use_g_source()
      aio: free AioContext when aio_context_new() fails
      aio: add errp argument to aio_context_setup()
      aio-posix: gracefully handle io_uring_queue_init() failure
      aio-posix: unindent fdmon_io_uring_destroy()
      aio-posix: add fdmon_ops->dispatch()
      aio-posix: add aio_add_sqe() API for user-defined io_uring requests
      block/io_uring: use aio_add_sqe()
      block/io_uring: use non-vectored read/write when possible

Thomas Huth (3):
      tests/qemu-iotests/184: Fix skip message for qemu-img without throttle
      tests/qemu-iotests: Improve the dry run list to speed up thorough testing
      tests/qemu-iotest: Add more image formats to the thorough testing

Wesley Hershberger (1):
      block: Drop detach_subchain for bdrv_replace_node

Yeqi Fu (1):
      block: replace TABs with space

 block/qcow2.h                                 |   4 +
 include/block/aio.h                           | 156 +++++++-
 include/block/block-global-state.h            |   3 +-
 include/block/nbd.h                           |   2 +-
 include/block/raw-aio.h                       |   5 -
 util/aio-posix.h                              |  18 +-
 block.c                                       |  42 +--
 block/bochs.c                                 |  14 +-
 block/crypto.c                                |   2 +-
 block/file-posix.c                            |  98 +++--
 block/file-win32.c                            |  38 +-
 block/io_uring.c                              | 505 +++++++-------------------
 block/parallels.c                             |   2 +-
 block/qcow.c                                  |  12 +-
 block/qcow2-cluster.c                         |  16 +-
 block/qcow2-refcount.c                        |  25 +-
 block/qcow2.c                                 |   4 +-
 block/qed.c                                   |   2 +-
 block/raw-format.c                            |   2 +-
 block/vdi.c                                   |   2 +-
 block/vhdx.c                                  |   2 +-
 block/vmdk.c                                  |   2 +-
 block/vpc.c                                   |   2 +-
 qemu-img.c                                    |   2 +-
 stubs/io_uring.c                              |  32 --
 tests/unit/test-aio.c                         |   7 +-
 tests/unit/test-nested-aio-poll.c             |  13 +-
 util/aio-posix.c                              | 137 ++++---
 util/aio-win32.c                              |   7 +-
 util/async.c                                  |  71 ++--
 util/fdmon-epoll.c                            |  34 +-
 util/fdmon-io_uring.c                         | 247 ++++++++++---
 util/fdmon-poll.c                             |  85 ++++-
 tests/qemu-iotests/testrunner.py              |  12 +
 block/trace-events                            |  12 +-
 stubs/meson.build                             |   3 -
 tests/qemu-iotests/184                        |   2 +-
 tests/qemu-iotests/257                        |   8 +-
 tests/qemu-iotests/257.out                    |  14 +-
 tests/qemu-iotests/check                      |  42 ++-
 tests/qemu-iotests/meson.build                |  11 +-
 tests/qemu-iotests/tests/resize-below-raw     |  51 ++-
 tests/qemu-iotests/tests/resize-below-raw.out |   4 +-
 util/trace-events                             |   4 +
 44 files changed, 956 insertions(+), 800 deletions(-)
 delete mode 100644 stubs/io_uring.c




* [PULL 01/27] aio-posix: fix race between io_uring CQE and AioHandler deletion
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 02/27] aio-posix: fix fdmon-io_uring.c timeout stack variable lifetime Kevin Wolf
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

When an AioHandler is enqueued on ctx->submit_list for removal, the
fill_sq_ring() function will submit an io_uring POLL_REMOVE operation to
cancel the in-flight POLL_ADD operation.

There is a race when another thread enqueues an AioHandler for deletion
on ctx->submit_list when the POLL_ADD CQE has already appeared. In that
case POLL_REMOVE is unnecessary. The code already handled this, but
forgot that the AioHandler itself is still on ctx->submit_list when the
POLL_ADD CQE is being processed. It's unsafe to delete the AioHandler at
that point in time (use-after-free).

Solve this problem by keeping the AioHandler alive but setting a flag so
that it will be deleted by fill_sq_ring() when it runs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/fdmon-io_uring.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index b0d68bdc44..ad89160f31 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -52,9 +52,10 @@ enum {
     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
 
     /* AioHandler::flags */
-    FDMON_IO_URING_PENDING  = (1 << 0),
-    FDMON_IO_URING_ADD      = (1 << 1),
-    FDMON_IO_URING_REMOVE   = (1 << 2),
+    FDMON_IO_URING_PENDING            = (1 << 0),
+    FDMON_IO_URING_ADD                = (1 << 1),
+    FDMON_IO_URING_REMOVE             = (1 << 2),
+    FDMON_IO_URING_DELETE_AIO_HANDLER = (1 << 3),
 };
 
 static inline int poll_events_from_pfd(int pfd_events)
@@ -218,6 +219,16 @@ static void fill_sq_ring(AioContext *ctx)
         if (flags & FDMON_IO_URING_REMOVE) {
             add_poll_remove_sqe(ctx, node);
         }
+        if (flags & FDMON_IO_URING_DELETE_AIO_HANDLER) {
+            /*
+             * process_cqe() sets this flag after ADD and REMOVE have been
+             * cleared. They cannot be set again, so they must be clear.
+             */
+            assert(!(flags & FDMON_IO_URING_ADD));
+            assert(!(flags & FDMON_IO_URING_REMOVE));
+
+            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
+        }
     }
 }
 
@@ -241,7 +252,12 @@ static bool process_cqe(AioContext *ctx,
      */
     flags = qatomic_fetch_and(&node->flags, ~FDMON_IO_URING_REMOVE);
     if (flags & FDMON_IO_URING_REMOVE) {
-        QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
+        if (flags & FDMON_IO_URING_PENDING) {
+            /* Still on ctx->submit_list, defer deletion until fill_sq_ring() */
+            qatomic_or(&node->flags, FDMON_IO_URING_DELETE_AIO_HANDLER);
+        } else {
+            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
+        }
         return false;
     }
 
@@ -347,10 +363,13 @@ void fdmon_io_uring_destroy(AioContext *ctx)
             unsigned flags = qatomic_fetch_and(&node->flags,
                     ~(FDMON_IO_URING_PENDING |
                       FDMON_IO_URING_ADD |
-                      FDMON_IO_URING_REMOVE));
+                      FDMON_IO_URING_REMOVE |
+                      FDMON_IO_URING_DELETE_AIO_HANDLER));
 
-            if (flags & FDMON_IO_URING_REMOVE) {
-                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
+            if ((flags & FDMON_IO_URING_REMOVE) ||
+                (flags & FDMON_IO_URING_DELETE_AIO_HANDLER)) {
+                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
+                                      node, node_deleted);
             }
 
             QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
-- 
2.51.1




* [PULL 02/27] aio-posix: fix fdmon-io_uring.c timeout stack variable lifetime
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
  2025-11-04 17:53 ` [PULL 01/27] aio-posix: fix race between io_uring CQE and AioHandler deletion Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 03/27] aio-posix: fix spurious return from ->wait() due to signals Kevin Wolf
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

io_uring_prep_timeout() stashes a pointer to the timespec struct rather
than copying its fields. That means the struct must live until after the
SQE has been submitted by io_uring_enter(2). add_timeout_sqe() violates
this constraint because the SQE is not submitted within the function.

Inline add_timeout_sqe() into fdmon_io_uring_wait() so that the struct
lives at least as long as io_uring_enter(2).

This fixes random hangs (bogus timeout values) when the kernel loads
undefined timespec struct values from userspace after the original
struct on the stack has been destroyed.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-3-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/fdmon-io_uring.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index ad89160f31..b64ce42513 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -188,20 +188,6 @@ static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
     io_uring_sqe_set_data(sqe, NULL);
 }
 
-/* Add a timeout that self-cancels when another cqe becomes ready */
-static void add_timeout_sqe(AioContext *ctx, int64_t ns)
-{
-    struct io_uring_sqe *sqe;
-    struct __kernel_timespec ts = {
-        .tv_sec = ns / NANOSECONDS_PER_SECOND,
-        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
-    };
-
-    sqe = get_sqe(ctx);
-    io_uring_prep_timeout(sqe, &ts, 1, 0);
-    io_uring_sqe_set_data(sqe, NULL);
-}
-
 /* Add sqes from ctx->submit_list for submission */
 static void fill_sq_ring(AioContext *ctx)
 {
@@ -291,13 +277,24 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
 static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
                                int64_t timeout)
 {
+    struct __kernel_timespec ts;
     unsigned wait_nr = 1; /* block until at least one cqe is ready */
     int ret;
 
     if (timeout == 0) {
         wait_nr = 0; /* non-blocking */
     } else if (timeout > 0) {
-        add_timeout_sqe(ctx, timeout);
+        /* Add a timeout that self-cancels when another cqe becomes ready */
+        struct io_uring_sqe *sqe;
+
+        ts = (struct __kernel_timespec){
+            .tv_sec = timeout / NANOSECONDS_PER_SECOND,
+            .tv_nsec = timeout % NANOSECONDS_PER_SECOND,
+        };
+
+        sqe = get_sqe(ctx);
+        io_uring_prep_timeout(sqe, &ts, 1, 0);
+        io_uring_sqe_set_data(sqe, NULL);
     }
 
     fill_sq_ring(ctx);
-- 
2.51.1




* [PULL 03/27] aio-posix: fix spurious return from ->wait() due to signals
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
  2025-11-04 17:53 ` [PULL 01/27] aio-posix: fix race between io_uring CQE and AioHandler deletion Kevin Wolf
  2025-11-04 17:53 ` [PULL 02/27] aio-posix: fix fdmon-io_uring.c timeout stack variable lifetime Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 04/27] aio-posix: keep polling enabled with fdmon-io_uring.c Kevin Wolf
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

io_uring_enter(2) only returns -EINTR in some cases when interrupted by
a signal. Therefore the while loop in fdmon_io_uring_wait() is
incomplete and can lead to a spurious early return.

Handle the case when a signal interrupts io_uring_enter(2) but the
syscall returns the number of SQEs submitted (that takes priority over
-EINTR).

This patch probably makes little difference for QEMU, but the test suite
relies on the exact pattern of aio_poll() return values, so it's best to
hide this io_uring syscall interface quirk.

Here is the strace of test-aio receiving 3 SIGCONT signals after this
fix has been applied. Notice how the io_uring_enter(2) return value is 1
the first time because an SQE was submitted, but -EINTR the other times:

  eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = 9
  io_uring_enter(7, 1, 0, 0, NULL, 8) = 1
  clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe38a46240) = 0
  io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 1
  --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
  io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = -1 EINTR (Interrupted system call)
  --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
  io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 <unfinished ...>
  <... io_uring_enter resumed>) = -1 EINTR (Interrupted system call)
  --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
  io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 <unfinished ...>
  <... io_uring_enter resumed>) = 0

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-4-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/fdmon-io_uring.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index b64ce42513..3d8638b0e5 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -299,9 +299,16 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
 
     fill_sq_ring(ctx);
 
+    /*
+     * Loop to handle signals in both cases:
+     * 1. If no SQEs were submitted, then -EINTR is returned.
+     * 2. If SQEs were submitted then the number of SQEs submitted is returned
+     *    rather than -EINTR.
+     */
     do {
         ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
-    } while (ret == -EINTR);
+    } while (ret == -EINTR ||
+             (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
 
     assert(ret >= 0);
 
-- 
2.51.1




* [PULL 04/27] aio-posix: keep polling enabled with fdmon-io_uring.c
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (2 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 03/27] aio-posix: fix spurious return from ->wait() due to signals Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 05/27] tests/unit: skip test-nested-aio-poll with io_uring Kevin Wolf
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

Commit 816a430c517e ("util/aio: Defer disabling poll mode as long as
possible") kept polling enabled when the event loop timeout is 0. Since
there is no timeout the event loop will continue immediately and the
overhead of disabling and re-enabling polling can be avoided.

fdmon-io_uring.c is unable to take advantage of this optimization
because its ->need_wait() function returns true whenever there are new
io_uring SQEs to submit:

  if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Polling will be disabled even when timeout == 0.

Extend the optimization to handle the case when need_wait() returns true
and timeout == 0.

Cc: Chao Gao <chao.gao@intel.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-5-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/aio-posix.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 2e0a5dadc4..824fdc34cc 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -559,7 +559,14 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
         elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
         max_ns = qemu_soonest_timeout(*timeout, max_ns);
         assert(!(max_ns && progress));
-    } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
+
+        if (ctx->fdmon_ops->need_wait(ctx)) {
+            if (fdmon_supports_polling(ctx)) {
+                *timeout = 0; /* stay in polling mode */
+            }
+            break;
+        }
+    } while (elapsed_time < max_ns);
 
     if (remove_idle_poll_handlers(ctx, ready_list,
                                   start_time + elapsed_time)) {
@@ -722,7 +729,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
          * up IO threads when some work becomes pending. It is essential to
          * avoid hangs or unnecessary latency.
          */
-        if (poll_set_started(ctx, &ready_list, false)) {
+        if (timeout && poll_set_started(ctx, &ready_list, false)) {
             timeout = 0;
             progress = true;
         }
-- 
2.51.1




* [PULL 05/27] tests/unit: skip test-nested-aio-poll with io_uring
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (3 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 04/27] aio-posix: keep polling enabled with fdmon-io_uring.c Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 06/27] aio-posix: integrate fdmon into glib event loop Kevin Wolf
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

test-nested-aio-poll relies on internal details of how fdmon-poll.c
handles AioContext polling. Skip it when other fdmon implementations are
in use.

The reason why fdmon-io_uring.c behaves differently from fdmon-poll.c is
that its fdmon_ops->need_wait() function returns true when
io_uring_enter(2) must be called (e.g. to submit pending SQEs).
AioContext polling is skipped when ->need_wait() returns true, so the
test case will never enter AioContext polling mode with
fdmon-io_uring.c.

Restrict this test to fdmon-poll.c and drop the
aio_context_use_g_source() call since it's no longer necessary.

Note that this test is only built on POSIX systems so it is safe to
include "util/aio-posix.h".

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-6-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/unit/test-nested-aio-poll.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
index d8fd92c43b..d13ecccd8c 100644
--- a/tests/unit/test-nested-aio-poll.c
+++ b/tests/unit/test-nested-aio-poll.c
@@ -15,6 +15,7 @@
 #include "qemu/osdep.h"
 #include "block/aio.h"
 #include "qapi/error.h"
+#include "util/aio-posix.h"
 
 typedef struct {
     AioContext *ctx;
@@ -71,17 +72,17 @@ static void test(void)
         .ctx = aio_context_new(&error_abort),
     };
 
+    if (td.ctx->fdmon_ops != &fdmon_poll_ops) {
+        /* This test is tied to fdmon-poll.c */
+        g_test_skip("fdmon_poll_ops not in use");
+        return;
+    }
+
     qemu_set_current_aio_context(td.ctx);
 
     /* Enable polling */
     aio_context_set_poll_params(td.ctx, 1000000, 2, 2, &error_abort);
 
-    /*
-     * The GSource is unused but this has the side-effect of changing the fdmon
-     * that AioContext uses.
-     */
-    aio_get_g_source(td.ctx);
-
     /* Make the event notifier active (set) right away */
     event_notifier_init(&td.poll_notifier, 1);
     aio_set_event_notifier(td.ctx, &td.poll_notifier,
-- 
2.51.1




* [PULL 06/27] aio-posix: integrate fdmon into glib event loop
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (4 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 05/27] tests/unit: skip test-nested-aio-poll with io_uring Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-05 15:06   ` Richard Henderson
  2025-11-04 17:53 ` [PULL 07/27] aio: remove aio_context_use_g_source() Kevin Wolf
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.

Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.

In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.

A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
  descriptors and timers.
- check() is called to determine whether events are ready to be
  dispatched after waiting.
- dispatch() is called to process events.

More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html

Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.

Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.

Splitting this into multiple commits without temporarily breaking
AioContext proved difficult, so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h   | 36 ++++++++++++++++++
 util/aio-posix.h      |  5 +++
 tests/unit/test-aio.c |  7 +++-
 util/aio-posix.c      | 69 ++++++++---------------------------
 util/fdmon-epoll.c    | 34 ++++++++++++++---
 util/fdmon-io_uring.c | 44 +++++++++++++++++++++-
 util/fdmon-poll.c     | 85 ++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 218 insertions(+), 62 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 99ff48420b..39ed86d14d 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -106,6 +106,38 @@ typedef struct {
      * Returns: true if ->wait() should be called, false otherwise.
      */
     bool (*need_wait)(AioContext *ctx);
+
+    /*
+     * gsource_prepare:
+     * @ctx: the AioContext
+     *
+     * Prepare for the glib event loop to wait for events instead of the usual
+     * ->wait() call. See glib's GSourceFuncs->prepare().
+     */
+    void (*gsource_prepare)(AioContext *ctx);
+
+    /*
+     * gsource_check:
+     * @ctx: the AioContext
+     *
+     * Called by the glib event loop from glib's GSourceFuncs->check() after
+     * waiting for events.
+     *
+     * Returns: true when ready to be dispatched.
+     */
+    bool (*gsource_check)(AioContext *ctx);
+
+    /*
+     * gsource_dispatch:
+     * @ctx: the AioContext
+     * @ready_list: list for handlers that become ready
+     *
+     * Place ready AioHandlers on ready_list. Called as part of the glib event
+     * loop from glib's GSourceFuncs->dispatch().
+     *
+     * Called with list_lock incremented.
+     */
+    void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
 } FDMonOps;
 
 /*
@@ -222,6 +254,7 @@ struct AioContext {
     /* State for file descriptor monitoring using Linux io_uring */
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
+    gpointer io_uring_fd_tag;
 #endif
 
     /* TimerLists for calling timers - one per clock type.  Has its own
@@ -254,6 +287,9 @@ struct AioContext {
     /* epoll(7) state used when built with CONFIG_EPOLL */
     int epollfd;
 
+    /* The GSource unix fd tag for epollfd */
+    gpointer epollfd_tag;
+
     const FDMonOps *fdmon_ops;
 };
 
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 82a0201ea4..f9994ed79e 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -47,9 +47,14 @@ void aio_add_ready_handler(AioHandlerList *ready_list, AioHandler *node,
 
 extern const FDMonOps fdmon_poll_ops;
 
+/* Switch back to poll(2). list_lock must be held. */
+void fdmon_poll_downgrade(AioContext *ctx);
+
 #ifdef CONFIG_EPOLL_CREATE1
 bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd);
 void fdmon_epoll_setup(AioContext *ctx);
+
+/* list_lock must be held */
 void fdmon_epoll_disable(AioContext *ctx);
 #else
 static inline bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
diff --git a/tests/unit/test-aio.c b/tests/unit/test-aio.c
index e77d86be87..010d65b79a 100644
--- a/tests/unit/test-aio.c
+++ b/tests/unit/test-aio.c
@@ -527,7 +527,12 @@ static void test_source_bh_delete_from_cb(void)
     g_assert_cmpint(data1.n, ==, data1.max);
     g_assert(data1.bh == NULL);
 
-    assert(g_main_context_iteration(NULL, false));
+    /*
+     * There may be up to one more iteration due to the aio_notify
+     * EventNotifier.
+     */
+    g_main_context_iteration(NULL, false);
+
     assert(!g_main_context_iteration(NULL, false));
 }
 
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 824fdc34cc..9de05ee7e8 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -70,15 +70,6 @@ static AioHandler *find_aio_handler(AioContext *ctx, int fd)
 
 static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
 {
-    /* If the GSource is in the process of being destroyed then
-     * g_source_remove_poll() causes an assertion failure.  Skip
-     * removal in that case, because glib cleans up its state during
-     * destruction anyway.
-     */
-    if (!g_source_is_destroyed(&ctx->source)) {
-        g_source_remove_poll(&ctx->source, &node->pfd);
-    }
-
     node->pfd.revents = 0;
     node->poll_ready = false;
 
@@ -153,7 +144,6 @@ void aio_set_fd_handler(AioContext *ctx,
         } else {
             new_node->pfd = node->pfd;
         }
-        g_source_add_poll(&ctx->source, &new_node->pfd);
 
         new_node->pfd.events = (io_read ? G_IO_IN | G_IO_HUP | G_IO_ERR : 0);
         new_node->pfd.events |= (io_write ? G_IO_OUT | G_IO_ERR : 0);
@@ -267,37 +257,13 @@ bool aio_prepare(AioContext *ctx)
     poll_set_started(ctx, &ready_list, false);
     /* TODO what to do with this list? */
 
+    ctx->fdmon_ops->gsource_prepare(ctx);
     return false;
 }
 
 bool aio_pending(AioContext *ctx)
 {
-    AioHandler *node;
-    bool result = false;
-
-    /*
-     * We have to walk very carefully in case aio_set_fd_handler is
-     * called while we're walking.
-     */
-    qemu_lockcnt_inc(&ctx->list_lock);
-
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-        int revents;
-
-        /* TODO should this check poll ready? */
-        revents = node->pfd.revents & node->pfd.events;
-        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
-            result = true;
-            break;
-        }
-        if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
-            result = true;
-            break;
-        }
-    }
-    qemu_lockcnt_dec(&ctx->list_lock);
-
-    return result;
+    return ctx->fdmon_ops->gsource_check(ctx);
 }
 
 static void aio_free_deleted_handlers(AioContext *ctx)
@@ -390,10 +356,6 @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
     return progress;
 }
 
-/*
- * If we have a list of ready handlers then this is more efficient than
- * scanning all handlers with aio_dispatch_handlers().
- */
 static bool aio_dispatch_ready_handlers(AioContext *ctx,
                                         AioHandlerList *ready_list,
                                         int64_t block_ns)
@@ -417,24 +379,18 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
     return progress;
 }
 
-/* Slower than aio_dispatch_ready_handlers() but only used via glib */
-static bool aio_dispatch_handlers(AioContext *ctx)
-{
-    AioHandler *node, *tmp;
-    bool progress = false;
-
-    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
-        progress = aio_dispatch_handler(ctx, node) || progress;
-    }
-
-    return progress;
-}
-
 void aio_dispatch(AioContext *ctx)
 {
+    AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
+
     qemu_lockcnt_inc(&ctx->list_lock);
     aio_bh_poll(ctx);
-    aio_dispatch_handlers(ctx);
+
+    ctx->fdmon_ops->gsource_dispatch(ctx, &ready_list);
+
+    /* block_ns is 0 because polling is disabled in the glib event loop */
+    aio_dispatch_ready_handlers(ctx, &ready_list, 0);
+
     aio_free_deleted_handlers(ctx);
     qemu_lockcnt_dec(&ctx->list_lock);
 
@@ -766,6 +722,7 @@ void aio_context_setup(AioContext *ctx)
 {
     ctx->fdmon_ops = &fdmon_poll_ops;
     ctx->epollfd = -1;
+    ctx->epollfd_tag = NULL;
 
     /* Use the fastest fd monitoring implementation if available */
     if (fdmon_io_uring_setup(ctx)) {
@@ -778,7 +735,11 @@ void aio_context_setup(AioContext *ctx)
 void aio_context_destroy(AioContext *ctx)
 {
     fdmon_io_uring_destroy(ctx);
+
+    qemu_lockcnt_lock(&ctx->list_lock);
     fdmon_epoll_disable(ctx);
+    qemu_lockcnt_unlock(&ctx->list_lock);
+
     aio_free_deleted_handlers(ctx);
 }
 
diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
index 9fb8800dde..61118e1ee6 100644
--- a/util/fdmon-epoll.c
+++ b/util/fdmon-epoll.c
@@ -19,8 +19,12 @@ void fdmon_epoll_disable(AioContext *ctx)
         ctx->epollfd = -1;
     }
 
-    /* Switch back */
-    ctx->fdmon_ops = &fdmon_poll_ops;
+    if (ctx->epollfd_tag) {
+        g_source_remove_unix_fd(&ctx->source, ctx->epollfd_tag);
+        ctx->epollfd_tag = NULL;
+    }
+
+    fdmon_poll_downgrade(ctx);
 }
 
 static inline int epoll_events_from_pfd(int pfd_events)
@@ -93,10 +97,29 @@ out:
     return ret;
 }
 
+static void fdmon_epoll_gsource_prepare(AioContext *ctx)
+{
+    /* Do nothing */
+}
+
+static bool fdmon_epoll_gsource_check(AioContext *ctx)
+{
+    return g_source_query_unix_fd(&ctx->source, ctx->epollfd_tag) & G_IO_IN;
+}
+
+static void fdmon_epoll_gsource_dispatch(AioContext *ctx,
+                                         AioHandlerList *ready_list)
+{
+    fdmon_epoll_wait(ctx, ready_list, 0);
+}
+
 static const FDMonOps fdmon_epoll_ops = {
     .update = fdmon_epoll_update,
     .wait = fdmon_epoll_wait,
     .need_wait = aio_poll_disabled,
+    .gsource_prepare = fdmon_epoll_gsource_prepare,
+    .gsource_check = fdmon_epoll_gsource_check,
+    .gsource_dispatch = fdmon_epoll_gsource_dispatch,
 };
 
 static bool fdmon_epoll_try_enable(AioContext *ctx)
@@ -118,6 +141,8 @@ static bool fdmon_epoll_try_enable(AioContext *ctx)
     }
 
     ctx->fdmon_ops = &fdmon_epoll_ops;
+    ctx->epollfd_tag = g_source_add_unix_fd(&ctx->source, ctx->epollfd,
+                                            G_IO_IN);
     return true;
 }
 
@@ -139,12 +164,11 @@ bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
     }
 
     ok = fdmon_epoll_try_enable(ctx);
-
-    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
-
     if (!ok) {
         fdmon_epoll_disable(ctx);
     }
+
+    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
     return ok;
 }
 
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 3d8638b0e5..0a5ec5ead6 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -262,6 +262,11 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
     unsigned num_ready = 0;
     unsigned head;
 
+    /* If the CQ overflowed then fetch CQEs with a syscall */
+    if (io_uring_cq_has_overflow(ring)) {
+        io_uring_get_events(ring);
+    }
+
     io_uring_for_each_cqe(ring, head, cqe) {
         if (process_cqe(ctx, ready_list, cqe)) {
             num_ready++;
@@ -274,6 +279,30 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
     return num_ready;
 }
 
+/* This is where SQEs are submitted in the glib event loop */
+static void fdmon_io_uring_gsource_prepare(AioContext *ctx)
+{
+    fill_sq_ring(ctx);
+    if (io_uring_sq_ready(&ctx->fdmon_io_uring)) {
+        while (io_uring_submit(&ctx->fdmon_io_uring) == -EINTR) {
+            /* Keep trying if syscall was interrupted */
+        }
+    }
+}
+
+static bool fdmon_io_uring_gsource_check(AioContext *ctx)
+{
+    gpointer tag = ctx->io_uring_fd_tag;
+    return g_source_query_unix_fd(&ctx->source, tag) & G_IO_IN;
+}
+
+/* This is where CQEs are processed in the glib event loop */
+static void fdmon_io_uring_gsource_dispatch(AioContext *ctx,
+                                            AioHandlerList *ready_list)
+{
+    process_cq_ring(ctx, ready_list);
+}
+
 static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
                                int64_t timeout)
 {
@@ -339,12 +368,17 @@ static const FDMonOps fdmon_io_uring_ops = {
     .update = fdmon_io_uring_update,
     .wait = fdmon_io_uring_wait,
     .need_wait = fdmon_io_uring_need_wait,
+    .gsource_prepare = fdmon_io_uring_gsource_prepare,
+    .gsource_check = fdmon_io_uring_gsource_check,
+    .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
 };
 
 bool fdmon_io_uring_setup(AioContext *ctx)
 {
     int ret;
 
+    ctx->io_uring_fd_tag = NULL;
+
     ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
     if (ret != 0) {
         return false;
@@ -352,6 +386,9 @@ bool fdmon_io_uring_setup(AioContext *ctx)
 
     QSLIST_INIT(&ctx->submit_list);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
+    ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
+            ctx->fdmon_io_uring.ring_fd, G_IO_IN);
+
     return true;
 }
 
@@ -379,6 +416,11 @@ void fdmon_io_uring_destroy(AioContext *ctx)
             QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
         }
 
-        ctx->fdmon_ops = &fdmon_poll_ops;
+        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+        ctx->io_uring_fd_tag = NULL;
+
+        qemu_lockcnt_lock(&ctx->list_lock);
+        fdmon_poll_downgrade(ctx);
+        qemu_lockcnt_unlock(&ctx->list_lock);
     }
 }
diff --git a/util/fdmon-poll.c b/util/fdmon-poll.c
index 17df917cf9..0ae755cc13 100644
--- a/util/fdmon-poll.c
+++ b/util/fdmon-poll.c
@@ -72,6 +72,11 @@ static int fdmon_poll_wait(AioContext *ctx, AioHandlerList *ready_list,
 
     /* epoll(7) is faster above a certain number of fds */
     if (fdmon_epoll_try_upgrade(ctx, npfd)) {
+        QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+            if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events) {
+                g_source_remove_poll(&ctx->source, &node->pfd);
+            }
+        }
         npfd = 0; /* we won't need pollfds[], reset npfd */
         return ctx->fdmon_ops->wait(ctx, ready_list, timeout);
     }
@@ -97,11 +102,89 @@ static void fdmon_poll_update(AioContext *ctx,
                               AioHandler *old_node,
                               AioHandler *new_node)
 {
-    /* Do nothing, AioHandler already contains the state we'll need */
+    if (old_node) {
+        /*
+         * If the GSource is in the process of being destroyed then
+         * g_source_remove_poll() causes an assertion failure.  Skip removal in
+         * that case, because glib cleans up its state during destruction
+         * anyway.
+         */
+        if (!g_source_is_destroyed(&ctx->source)) {
+            g_source_remove_poll(&ctx->source, &old_node->pfd);
+        }
+    }
+
+    if (new_node) {
+        g_source_add_poll(&ctx->source, &new_node->pfd);
+    }
+}
+
+static void fdmon_poll_gsource_prepare(AioContext *ctx)
+{
+    /* Do nothing */
+}
+
+static bool fdmon_poll_gsource_check(AioContext *ctx)
+{
+    AioHandler *node;
+    bool result = false;
+
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        int revents = node->pfd.revents & node->pfd.events;
+
+        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
+            result = true;
+            break;
+        }
+        if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
+            result = true;
+            break;
+        }
+    }
+
+    qemu_lockcnt_dec(&ctx->list_lock);
+
+    return result;
+}
+
+static void fdmon_poll_gsource_dispatch(AioContext *ctx,
+                                        AioHandlerList *ready_list)
+{
+    AioHandler *node;
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        int revents = node->pfd.revents;
+
+        if (revents) {
+            aio_add_ready_handler(ready_list, node, revents);
+        }
+    }
 }
 
 const FDMonOps fdmon_poll_ops = {
     .update = fdmon_poll_update,
     .wait = fdmon_poll_wait,
     .need_wait = aio_poll_disabled,
+    .gsource_prepare = fdmon_poll_gsource_prepare,
+    .gsource_check = fdmon_poll_gsource_check,
+    .gsource_dispatch = fdmon_poll_gsource_dispatch,
 };
+
+void fdmon_poll_downgrade(AioContext *ctx)
+{
+    AioHandler *node;
+
+    ctx->fdmon_ops = &fdmon_poll_ops;
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+        if (!QLIST_IS_INSERTED(node, node_deleted) && node->pfd.events) {
+            g_source_add_poll(&ctx->source, &node->pfd);
+        }
+    }
+}
-- 
2.51.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PULL 07/27] aio: remove aio_context_use_g_source()
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (5 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 06/27] aio-posix: integrate fdmon into glib event loop Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 08/27] aio: free AioContext when aio_context_new() fails Kevin Wolf
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.

On hosts with io_uring support this now enables fdmon-io_uring.c by
default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the
event loop will use io_uring!

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-8-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h |  3 ---
 util/aio-posix.c    | 12 ------------
 util/aio-win32.c    |  4 ----
 util/async.c        |  1 -
 4 files changed, 20 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 39ed86d14d..1657740a0e 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -728,9 +728,6 @@ void aio_context_setup(AioContext *ctx);
  */
 void aio_context_destroy(AioContext *ctx);
 
-/* Used internally, do not call outside AioContext code */
-void aio_context_use_g_source(AioContext *ctx);
-
 /**
  * aio_context_set_poll_params:
  * @ctx: the aio context
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 9de05ee7e8..bebd9ce3a2 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -743,18 +743,6 @@ void aio_context_destroy(AioContext *ctx)
     aio_free_deleted_handlers(ctx);
 }
 
-void aio_context_use_g_source(AioContext *ctx)
-{
-    /*
-     * Disable io_uring when the glib main loop is used because it doesn't
-     * support mixed glib/aio_poll() usage. It relies on aio_poll() being
-     * called regularly so that changes to the monitored file descriptors are
-     * submitted, otherwise a list of pending fd handlers builds up.
-     */
-    fdmon_io_uring_destroy(ctx);
-    aio_free_deleted_handlers(ctx);
-}
-
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
                                  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/aio-win32.c b/util/aio-win32.c
index c6fbce64c2..18cc9fb7a9 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -427,10 +427,6 @@ void aio_context_destroy(AioContext *ctx)
 {
 }
 
-void aio_context_use_g_source(AioContext *ctx)
-{
-}
-
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
                                  int64_t grow, int64_t shrink, Error **errp)
 {
diff --git a/util/async.c b/util/async.c
index a736d2cd0d..cb72ad3777 100644
--- a/util/async.c
+++ b/util/async.c
@@ -433,7 +433,6 @@ static GSourceFuncs aio_source_funcs = {
 
 GSource *aio_get_g_source(AioContext *ctx)
 {
-    aio_context_use_g_source(ctx);
     g_source_ref(&ctx->source);
     return &ctx->source;
 }
-- 
2.51.1




* [PULL 08/27] aio: free AioContext when aio_context_new() fails
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (6 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 07/27] aio: remove aio_context_use_g_source() Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 09/27] aio: add errp argument to aio_context_setup() Kevin Wolf
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.

Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.

aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-9-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h |  3 +++
 util/async.c        | 31 ++++++++++++++++++++++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 1657740a0e..2760f308f5 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,6 +291,9 @@ struct AioContext {
     gpointer epollfd_tag;
 
     const FDMonOps *fdmon_ops;
+
+    /* Was aio_context_new() successful? */
+    bool initialized;
 };
 
 /**
diff --git a/util/async.c b/util/async.c
index cb72ad3777..7d06ff98f3 100644
--- a/util/async.c
+++ b/util/async.c
@@ -366,12 +366,16 @@ aio_ctx_dispatch(GSource     *source,
 }
 
 static void
-aio_ctx_finalize(GSource     *source)
+aio_ctx_finalize(GSource *source)
 {
     AioContext *ctx = (AioContext *) source;
     QEMUBH *bh;
     unsigned flags;
 
+    if (!ctx->initialized) {
+        return;
+    }
+
     thread_pool_free_aio(ctx->thread_pool);
 
 #ifdef CONFIG_LINUX_AIO
@@ -579,16 +583,35 @@ AioContext *aio_context_new(Error **errp)
     int ret;
     AioContext *ctx;
 
+    /*
+     * ctx is freed by g_source_unref() (e.g. aio_context_unref()). ctx's
+     * resources are freed as follows:
+     *
+     * 1. By aio_ctx_finalize() after aio_context_new() has returned and set
+     *    ->initialized = true.
+     *
+     * 2. By manual cleanup code in this function's error paths before goto
+     *    fail.
+     *
+     * Be careful to free resources in both cases!
+     */
     ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
     QSLIST_INIT(&ctx->bh_list);
     QSIMPLEQ_INIT(&ctx->bh_slice_list);
-    aio_context_setup(ctx);
 
     ret = event_notifier_init(&ctx->notifier, false);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Failed to initialize event notifier");
         goto fail;
     }
+
+    /*
+     * Resources cannot easily be freed manually after aio_context_setup(). If
+     * you add any new resources to AioContext, it's probably best to acquire
+     * them before aio_context_setup().
+     */
+    aio_context_setup(ctx);
+
     g_source_set_can_recurse(&ctx->source, true);
     qemu_lockcnt_init(&ctx->list_lock);
 
@@ -622,9 +645,11 @@ AioContext *aio_context_new(Error **errp)
 
     register_aiocontext(ctx);
 
+    ctx->initialized = true;
+
     return ctx;
 fail:
-    g_source_destroy(&ctx->source);
+    g_source_unref(&ctx->source);
     return NULL;
 }
 
-- 
2.51.1




* [PULL 09/27] aio: add errp argument to aio_context_setup()
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (7 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 08/27] aio: free AioContext when aio_context_new() fails Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 10/27] aio-posix: gracefully handle io_uring_queue_init() failure Kevin Wolf
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.

However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-10-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h | 5 ++++-
 util/aio-posix.c    | 5 +++--
 util/aio-win32.c    | 3 ++-
 util/async.c        | 6 +++++-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 2760f308f5..9562733fa7 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -718,10 +718,13 @@ void qemu_set_current_aio_context(AioContext *ctx);
 /**
  * aio_context_setup:
  * @ctx: the aio context
+ * @errp: error pointer
  *
  * Initialize the aio context.
+ *
+ * Returns: true on success, false otherwise
  */
-void aio_context_setup(AioContext *ctx);
+bool aio_context_setup(AioContext *ctx, Error **errp);
 
 /**
  * aio_context_destroy:
diff --git a/util/aio-posix.c b/util/aio-posix.c
index bebd9ce3a2..9806a75c12 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -718,7 +718,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
     return progress;
 }
 
-void aio_context_setup(AioContext *ctx)
+bool aio_context_setup(AioContext *ctx, Error **errp)
 {
     ctx->fdmon_ops = &fdmon_poll_ops;
     ctx->epollfd = -1;
@@ -726,10 +726,11 @@ void aio_context_setup(AioContext *ctx)
 
     /* Use the fastest fd monitoring implementation if available */
     if (fdmon_io_uring_setup(ctx)) {
-        return;
+        return true;
     }
 
     fdmon_epoll_setup(ctx);
+    return true;
 }
 
 void aio_context_destroy(AioContext *ctx)
diff --git a/util/aio-win32.c b/util/aio-win32.c
index 18cc9fb7a9..6e6f699e4b 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -419,8 +419,9 @@ bool aio_poll(AioContext *ctx, bool blocking)
     return progress;
 }
 
-void aio_context_setup(AioContext *ctx)
+bool aio_context_setup(AioContext *ctx, Error **errp)
 {
+    return true;
 }
 
 void aio_context_destroy(AioContext *ctx)
diff --git a/util/async.c b/util/async.c
index 7d06ff98f3..00e46b99f9 100644
--- a/util/async.c
+++ b/util/async.c
@@ -580,6 +580,7 @@ static void co_schedule_bh_cb(void *opaque)
 
 AioContext *aio_context_new(Error **errp)
 {
+    ERRP_GUARD();
     int ret;
     AioContext *ctx;
 
@@ -610,7 +611,10 @@ AioContext *aio_context_new(Error **errp)
      * you add any new resources to AioContext, it's probably best to acquire
      * them before aio_context_setup().
      */
-    aio_context_setup(ctx);
+    if (!aio_context_setup(ctx, errp)) {
+        event_notifier_cleanup(&ctx->notifier);
+        goto fail;
+    }
 
     g_source_set_can_recurse(&ctx->source, true);
     qemu_lockcnt_init(&ctx->list_lock);
-- 
2.51.1




* [PULL 10/27] aio-posix: gracefully handle io_uring_queue_init() failure
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (8 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 09/27] aio: add errp argument to aio_context_setup() Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:53 ` [PULL 11/27] aio-posix: unindent fdmon_io_uring_destroy() Kevin Wolf
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

io_uring may be unavailable at runtime due to system policies (e.g. the
io_uring_disabled sysctl), or creation may fail due to file descriptor
resource limits.

Handle failure scenarios as follows:

If another AioContext already has io_uring, then fail AioContext
creation so that the aio_add_sqe() API is available uniformly from all
QEMU threads. Otherwise fall back to epoll(7) if io_uring is
unavailable.

Notes:
- Update the comment about selecting the fastest fdmon implementation.
  At this point it's not about speed anymore; it's about aio_add_sqe()
  API availability.
- Uppercase the error message when converting from error_report() to
  error_setg_errno() for consistency (but there are instances of
  lowercase in the codebase).
- It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-11-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/aio-posix.h      | 12 ++----------
 util/aio-posix.c      | 28 +++++++++++++++++++++++++---
 util/fdmon-io_uring.c |  5 +++--
 3 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/util/aio-posix.h b/util/aio-posix.h
index f9994ed79e..dfa1a51c0b 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -18,6 +18,7 @@
 #define AIO_POSIX_H
 
 #include "block/aio.h"
+#include "qapi/error.h"
 
 struct AioHandler {
     GPollFD pfd;
@@ -72,17 +73,8 @@ static inline void fdmon_epoll_disable(AioContext *ctx)
 #endif /* !CONFIG_EPOLL_CREATE1 */
 
 #ifdef CONFIG_LINUX_IO_URING
-bool fdmon_io_uring_setup(AioContext *ctx);
+bool fdmon_io_uring_setup(AioContext *ctx, Error **errp);
 void fdmon_io_uring_destroy(AioContext *ctx);
-#else
-static inline bool fdmon_io_uring_setup(AioContext *ctx)
-{
-    return false;
-}
-
-static inline void fdmon_io_uring_destroy(AioContext *ctx)
-{
-}
 #endif /* !CONFIG_LINUX_IO_URING */
 
 #endif /* AIO_POSIX_H */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 9806a75c12..c0285a26a3 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -16,6 +16,7 @@
 #include "qemu/osdep.h"
 #include "block/block.h"
 #include "block/thread-pool.h"
+#include "qapi/error.h"
 #include "qemu/main-loop.h"
 #include "qemu/lockcnt.h"
 #include "qemu/rcu.h"
@@ -724,10 +725,29 @@ bool aio_context_setup(AioContext *ctx, Error **errp)
     ctx->epollfd = -1;
     ctx->epollfd_tag = NULL;
 
-    /* Use the fastest fd monitoring implementation if available */
-    if (fdmon_io_uring_setup(ctx)) {
-        return true;
+#ifdef CONFIG_LINUX_IO_URING
+    {
+        static bool need_io_uring;
+        Error *local_err = NULL; /* ERRP_GUARD() doesn't handle error_abort */
+
+        /* io_uring takes precedence because it provides aio_add_sqe() support */
+        if (fdmon_io_uring_setup(ctx, &local_err)) {
+            /*
+             * If one AioContext gets io_uring, then all AioContexts need io_uring
+             * so that aio_add_sqe() support is available across all threads.
+             */
+            need_io_uring = true;
+            return true;
+        }
+        if (need_io_uring) {
+            error_propagate(errp, local_err);
+            return false;
+        }
+
+        /* Silently fall back on systems where io_uring is unavailable */
+        error_free(local_err);
     }
+#endif /* CONFIG_LINUX_IO_URING */
 
     fdmon_epoll_setup(ctx);
     return true;
@@ -735,7 +755,9 @@ bool aio_context_setup(AioContext *ctx, Error **errp)
 
 void aio_context_destroy(AioContext *ctx)
 {
+#ifdef CONFIG_LINUX_IO_URING
     fdmon_io_uring_destroy(ctx);
+#endif
 
     qemu_lockcnt_lock(&ctx->list_lock);
     fdmon_epoll_disable(ctx);
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 0a5ec5ead6..9f25d6d6db 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -45,6 +45,7 @@
 
 #include "qemu/osdep.h"
 #include <poll.h>
+#include "qapi/error.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
 
@@ -373,7 +374,7 @@ static const FDMonOps fdmon_io_uring_ops = {
     .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
 };
 
-bool fdmon_io_uring_setup(AioContext *ctx)
+bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
 
@@ -381,6 +382,7 @@ bool fdmon_io_uring_setup(AioContext *ctx)
 
     ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
     if (ret != 0) {
+        error_setg_errno(errp, -ret, "Failed to initialize io_uring");
         return false;
     }
 
@@ -388,7 +390,6 @@ bool fdmon_io_uring_setup(AioContext *ctx)
     ctx->fdmon_ops = &fdmon_io_uring_ops;
     ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
             ctx->fdmon_io_uring.ring_fd, G_IO_IN);
-
     return true;
 }
 
-- 
2.51.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PULL 11/27] aio-posix: unindent fdmon_io_uring_destroy()
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (9 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 10/27] aio-posix: gracefully handle io_uring_queue_init() failure Kevin Wolf
@ 2025-11-04 17:53 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 12/27] aio-posix: add fdmon_ops->dispatch() Kevin Wolf
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:53 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

Reduce the level of indentation to make further code changes easier to
read.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-12-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 util/fdmon-io_uring.c | 54 ++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 9f25d6d6db..a06bbe2715 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -395,33 +395,35 @@ bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 
 void fdmon_io_uring_destroy(AioContext *ctx)
 {
-    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
-        AioHandler *node;
-
-        io_uring_queue_exit(&ctx->fdmon_io_uring);
-
-        /* Move handlers due to be removed onto the deleted list */
-        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
-            unsigned flags = qatomic_fetch_and(&node->flags,
-                    ~(FDMON_IO_URING_PENDING |
-                      FDMON_IO_URING_ADD |
-                      FDMON_IO_URING_REMOVE |
-                      FDMON_IO_URING_DELETE_AIO_HANDLER));
-
-            if ((flags & FDMON_IO_URING_REMOVE) ||
-                (flags & FDMON_IO_URING_DELETE_AIO_HANDLER)) {
-                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
-                                      node, node_deleted);
-            }
-
-            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
-        }
+    AioHandler *node;
+
+    if (ctx->fdmon_ops != &fdmon_io_uring_ops) {
+        return;
+    }
+
+    io_uring_queue_exit(&ctx->fdmon_io_uring);
 
-        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
-        ctx->io_uring_fd_tag = NULL;
+    /* Move handlers due to be removed onto the deleted list */
+    while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
+        unsigned flags = qatomic_fetch_and(&node->flags,
+                ~(FDMON_IO_URING_PENDING |
+                  FDMON_IO_URING_ADD |
+                  FDMON_IO_URING_REMOVE |
+                  FDMON_IO_URING_DELETE_AIO_HANDLER));
 
-        qemu_lockcnt_lock(&ctx->list_lock);
-        fdmon_poll_downgrade(ctx);
-        qemu_lockcnt_unlock(&ctx->list_lock);
+        if ((flags & FDMON_IO_URING_REMOVE) ||
+            (flags & FDMON_IO_URING_DELETE_AIO_HANDLER)) {
+            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
+                                  node, node_deleted);
+        }
+
+        QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
     }
+
+    g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+    ctx->io_uring_fd_tag = NULL;
+
+    qemu_lockcnt_lock(&ctx->list_lock);
+    fdmon_poll_downgrade(ctx);
+    qemu_lockcnt_unlock(&ctx->list_lock);
 }
-- 
2.51.1




* [PULL 12/27] aio-posix: add fdmon_ops->dispatch()
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (10 preceding siblings ...)
  2025-11-04 17:53 ` [PULL 11/27] aio-posix: unindent fdmon_io_uring_destroy() Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 13/27] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Kevin Wolf
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

The ppoll and epoll file descriptor monitoring implementations rely on
the event loop's generic file descriptor, timer, and BH dispatch code to
invoke user callbacks.

The io_uring file descriptor monitoring implementation will need
io_uring-specific dispatch logic for CQE handlers for custom SQEs.

Introduce a new FDMonOps ->dispatch() callback that allows file
descriptor monitoring implementations to invoke user callbacks. The next
patch will use this new callback.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-13-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h | 19 +++++++++++++++++++
 util/aio-posix.c    |  9 +++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 9562733fa7..b266daa58f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -107,6 +107,25 @@ typedef struct {
      */
     bool (*need_wait)(AioContext *ctx);
 
+    /*
+     * dispatch:
+     * @ctx: the AioContext
+     *
+     * Dispatch any work that is specific to this file descriptor monitoring
+     * implementation. Usually the event loop's generic file descriptor
+     * monitoring, BH, and timer dispatching code is sufficient, but file
+     * descriptor monitoring implementations offering additional functionality
+     * may need to implement this function for custom behavior. Called at a
+     * point in the event loop when it is safe to invoke user-defined
+     * callbacks.
+     *
+     * This function is optional and may be NULL.
+     *
+     * Returns: true if progress was made (see aio_poll()'s return value),
+     * false otherwise.
+     */
+    bool (*dispatch)(AioContext *ctx);
+
     /*
      * gsource_prepare:
      * @ctx: the AioContext
diff --git a/util/aio-posix.c b/util/aio-posix.c
index c0285a26a3..6ff36b6e51 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -385,10 +385,15 @@ void aio_dispatch(AioContext *ctx)
     AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
 
     qemu_lockcnt_inc(&ctx->list_lock);
+
     aio_bh_poll(ctx);
 
     ctx->fdmon_ops->gsource_dispatch(ctx, &ready_list);
 
+    if (ctx->fdmon_ops->dispatch) {
+        ctx->fdmon_ops->dispatch(ctx);
+    }
+
     /* block_ns is 0 because polling is disabled in the glib event loop */
     aio_dispatch_ready_handlers(ctx, &ready_list, 0);
 
@@ -707,6 +712,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
         block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
     }
 
+    if (ctx->fdmon_ops->dispatch) {
+        progress |= ctx->fdmon_ops->dispatch(ctx);
+    }
+
     progress |= aio_bh_poll(ctx);
     progress |= aio_dispatch_ready_handlers(ctx, &ready_list, block_ns);
 
-- 
2.51.1




* [PULL 13/27] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (11 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 12/27] aio-posix: add fdmon_ops->dispatch() Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 14/27] block/io_uring: use aio_add_sqe() Kevin Wolf
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.

This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: CQE callbacks
must be placed on a list so that nested event loops can invoke pending
CQE callbacks from parent event loops. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is
why.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-14-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h   |  83 +++++++++++++++++++++++++++++++-
 util/aio-posix.h      |   1 +
 util/aio-posix.c      |   9 ++++
 util/fdmon-io_uring.c | 109 ++++++++++++++++++++++++++++++++++++------
 util/trace-events     |   4 ++
 5 files changed, 190 insertions(+), 16 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index b266daa58f..05d1bf4036 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
 /* Is polling disabled? */
 bool aio_poll_disabled(AioContext *ctx);
 
+#ifdef CONFIG_LINUX_IO_URING
+/*
+ * Each io_uring request must have a unique CqeHandler that processes the cqe.
+ * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
+ * ->cb() invocation.
+ */
+typedef struct CqeHandler CqeHandler;
+struct CqeHandler {
+    /* Called by the AioContext when the request has completed */
+    void (*cb)(CqeHandler *handler);
+
+    /* Used internally, do not access this */
+    QSIMPLEQ_ENTRY(CqeHandler) next;
+
+    /* This field is filled in before ->cb() is called */
+    struct io_uring_cqe cqe;
+};
+
+typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
+#endif /* CONFIG_LINUX_IO_URING */
+
 /* Callbacks for file descriptor monitoring implementations */
 typedef struct {
     /*
@@ -157,6 +178,27 @@ typedef struct {
      * Called with list_lock incremented.
      */
     void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
+
+#ifdef CONFIG_LINUX_IO_URING
+    /**
+     * add_sqe: Add an io_uring sqe for submission.
+     * @prep_sqe: invoked with an sqe that should be prepared for submission
+     * @opaque: user-defined argument to @prep_sqe()
+     * @cqe_handler: the unique cqe handler associated with this request
+     *
+     * The caller's @prep_sqe() function is invoked to fill in the details of
+     * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
+     *
+     * The kernel may see the sqe as soon as @prep_sqe() returns or it may take
+     * until the next event loop iteration.
+     *
+     * This function is called from the current AioContext and is not
+     * thread-safe.
+     */
+    void (*add_sqe)(AioContext *ctx,
+                    void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                    void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
 } FDMonOps;
 
 /*
@@ -274,7 +316,10 @@ struct AioContext {
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
     gpointer io_uring_fd_tag;
-#endif
+
+    /* Pending callback state for cqe handlers */
+    CqeHandlerSimpleQ cqe_handler_ready_list;
+#endif /* CONFIG_LINUX_IO_URING */
 
     /* TimerLists for calling timers - one per clock type.  Has its own
      * locking.
@@ -782,4 +827,40 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch);
  */
 void aio_context_set_thread_pool_params(AioContext *ctx, int64_t min,
                                         int64_t max, Error **errp);
+
+#ifdef CONFIG_LINUX_IO_URING
+/**
+ * aio_has_io_uring: Return whether io_uring is available.
+ *
+ * io_uring is either available in all AioContexts or in none, so this only
+ * needs to be called once from within any thread's AioContext.
+ */
+static inline bool aio_has_io_uring(void)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    return ctx->fdmon_ops->add_sqe;
+}
+
+/**
+ * aio_add_sqe: Add an io_uring sqe for submission.
+ * @prep_sqe: invoked with an sqe that should be prepared for submission
+ * @opaque: user-defined argument to @prep_sqe()
+ * @cqe_handler: the unique cqe handler associated with this request
+ *
+ * The caller's @prep_sqe() function is invoked to fill in the details of the
+ * sqe. Do not call io_uring_sqe_set_data() on this sqe.
+ *
+ * The sqe is submitted by the current AioContext. The kernel may see the sqe
+ * as soon as @prep_sqe() returns or it may take until the next event loop
+ * iteration.
+ *
+ * When the AioContext is destroyed, pending sqes are ignored and their
+ * CqeHandlers are not invoked.
+ *
+ * This function must be called only when aio_has_io_uring() returns true.
+ */
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
+
 #endif
diff --git a/util/aio-posix.h b/util/aio-posix.h
index dfa1a51c0b..babbfa8314 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -36,6 +36,7 @@ struct AioHandler {
 #ifdef CONFIG_LINUX_IO_URING
     QSLIST_ENTRY(AioHandler) node_submitted;
     unsigned flags; /* see fdmon-io_uring.c */
+    CqeHandler internal_cqe_handler; /* used for POLL_ADD/POLL_REMOVE */
 #endif
     int64_t poll_idle_timeout; /* when to stop userspace polling */
     bool poll_ready; /* has polling detected an event? */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 6ff36b6e51..e24b955fd9 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -806,3 +806,12 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch)
 
     aio_notify(ctx);
 }
+
+#ifdef CONFIG_LINUX_IO_URING
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
+}
+#endif /* CONFIG_LINUX_IO_URING */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index a06bbe2715..4230bf33e3 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -46,8 +46,10 @@
 #include "qemu/osdep.h"
 #include <poll.h>
 #include "qapi/error.h"
+#include "qemu/defer-call.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
+#include "trace.h"
 
 enum {
     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
@@ -76,8 +78,8 @@ static inline int pfd_events_from_poll(int poll_events)
 }
 
 /*
- * Returns an sqe for submitting a request.  Only be called within
- * fdmon_io_uring_wait().
+ * Returns an sqe for submitting a request. Only called from the AioContext
+ * thread.
  */
 static struct io_uring_sqe *get_sqe(AioContext *ctx)
 {
@@ -168,23 +170,46 @@ static void fdmon_io_uring_update(AioContext *ctx,
     }
 }
 
+static void fdmon_io_uring_add_sqe(AioContext *ctx,
+        void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+        void *opaque, CqeHandler *cqe_handler)
+{
+    struct io_uring_sqe *sqe = get_sqe(ctx);
+
+    prep_sqe(sqe, opaque);
+    io_uring_sqe_set_data(sqe, cqe_handler);
+
+    trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
+                                 cqe_handler);
+}
+
+static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
+{
+    /*
+     * This is an empty function that is never called. It is used as a function
+     * pointer to distinguish it from ordinary cqe handlers.
+     */
+}
+
 static void add_poll_add_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
     int events = poll_events_from_pfd(node->pfd.events);
 
     io_uring_prep_poll_add(sqe, node->pfd.fd, events);
-    io_uring_sqe_set_data(sqe, node);
+    node->internal_cqe_handler.cb = fdmon_special_cqe_handler;
+    io_uring_sqe_set_data(sqe, &node->internal_cqe_handler);
 }
 
 static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
+    CqeHandler *cqe_handler = &node->internal_cqe_handler;
 
 #ifdef LIBURING_HAVE_DATA64
-    io_uring_prep_poll_remove(sqe, (uintptr_t)node);
+    io_uring_prep_poll_remove(sqe, (uintptr_t)cqe_handler);
 #else
-    io_uring_prep_poll_remove(sqe, node);
+    io_uring_prep_poll_remove(sqe, cqe_handler);
 #endif
     io_uring_sqe_set_data(sqe, NULL);
 }
@@ -219,19 +244,13 @@ static void fill_sq_ring(AioContext *ctx)
     }
 }
 
-/* Returns true if a handler became ready */
-static bool process_cqe(AioContext *ctx,
-                        AioHandlerList *ready_list,
-                        struct io_uring_cqe *cqe)
+static bool process_cqe_aio_handler(AioContext *ctx,
+                                    AioHandlerList *ready_list,
+                                    AioHandler *node,
+                                    struct io_uring_cqe *cqe)
 {
-    AioHandler *node = io_uring_cqe_get_data(cqe);
     unsigned flags;
 
-    /* poll_timeout and poll_remove have a zero user_data field */
-    if (!node) {
-        return false;
-    }
-
     /*
      * Deletion can only happen when IORING_OP_POLL_ADD completes.  If we race
      * with enqueue() here then we can safely clear the FDMON_IO_URING_REMOVE
@@ -255,6 +274,35 @@ static bool process_cqe(AioContext *ctx,
     return true;
 }
 
+/* Returns true if a handler became ready */
+static bool process_cqe(AioContext *ctx,
+                        AioHandlerList *ready_list,
+                        struct io_uring_cqe *cqe)
+{
+    CqeHandler *cqe_handler = io_uring_cqe_get_data(cqe);
+
+    /* poll_timeout and poll_remove have a zero user_data field */
+    if (!cqe_handler) {
+        return false;
+    }
+
+    /*
+     * Special handling for AioHandler cqes. They need ready_list and have a
+     * return value.
+     */
+    if (cqe_handler->cb == fdmon_special_cqe_handler) {
+        AioHandler *node = container_of(cqe_handler, AioHandler,
+                                        internal_cqe_handler);
+        return process_cqe_aio_handler(ctx, ready_list, node, cqe);
+    }
+
+    cqe_handler->cqe = *cqe;
+
+    /* Handlers are invoked later by fdmon_io_uring_dispatch() */
+    QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
+    return false;
+}
+
 static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
 {
     struct io_uring *ring = &ctx->fdmon_io_uring;
@@ -297,6 +345,32 @@ static bool fdmon_io_uring_gsource_check(AioContext *ctx)
     return g_source_query_unix_fd(&ctx->source, tag) & G_IO_IN;
 }
 
+/* Dispatch CQE handlers that are ready */
+static bool fdmon_io_uring_dispatch(AioContext *ctx)
+{
+    CqeHandlerSimpleQ *ready_list = &ctx->cqe_handler_ready_list;
+    bool progress = false;
+
+    /* Handlers may use defer_call() to coalesce frequent operations */
+    defer_call_begin();
+
+    while (!QSIMPLEQ_EMPTY(ready_list)) {
+        CqeHandler *cqe_handler = QSIMPLEQ_FIRST(ready_list);
+
+        QSIMPLEQ_REMOVE_HEAD(ready_list, next);
+
+        trace_fdmon_io_uring_cqe_handler(ctx, cqe_handler,
+                                         cqe_handler->cqe.res);
+        cqe_handler->cb(cqe_handler);
+        progress = true;
+    }
+
+    defer_call_end();
+
+    return progress;
+}
+
+
 /* This is where CQEs are processed in the glib event loop */
 static void fdmon_io_uring_gsource_dispatch(AioContext *ctx,
                                             AioHandlerList *ready_list)
@@ -369,9 +443,11 @@ static const FDMonOps fdmon_io_uring_ops = {
     .update = fdmon_io_uring_update,
     .wait = fdmon_io_uring_wait,
     .need_wait = fdmon_io_uring_need_wait,
+    .dispatch = fdmon_io_uring_dispatch,
     .gsource_prepare = fdmon_io_uring_gsource_prepare,
     .gsource_check = fdmon_io_uring_gsource_check,
     .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
+    .add_sqe = fdmon_io_uring_add_sqe,
 };
 
 bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
@@ -387,6 +463,7 @@ bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
     }
 
     QSLIST_INIT(&ctx->submit_list);
+    QSIMPLEQ_INIT(&ctx->cqe_handler_ready_list);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
     ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
             ctx->fdmon_io_uring.ring_fd, G_IO_IN);
@@ -423,6 +500,8 @@ void fdmon_io_uring_destroy(AioContext *ctx)
     g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
     ctx->io_uring_fd_tag = NULL;
 
+    assert(QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list));
+
     qemu_lockcnt_lock(&ctx->list_lock);
     fdmon_poll_downgrade(ctx);
     qemu_lockcnt_unlock(&ctx->list_lock);
diff --git a/util/trace-events b/util/trace-events
index bd8f25fb59..540d662507 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -24,6 +24,10 @@ buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes
 buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
 buffer_free(const char *buf, size_t len) "%s: capacity %zd"
 
+# fdmon-io_uring.c
+fdmon_io_uring_add_sqe(void *ctx, void *opaque, int opcode, int fd, uint64_t off, void *cqe_handler) "ctx %p opaque %p opcode %d fd %d off %"PRId64" cqe_handler %p"
+fdmon_io_uring_cqe_handler(void *ctx, void *cqe_handler, int cqe_res) "ctx %p cqe_handler %p cqe_res %d"
+
 # filemonitor-inotify.c
 qemu_file_monitor_add_watch(void *mon, const char *dirpath, const char *filename, void *cb, void *opaque, int64_t id) "File monitor %p add watch dir='%s' file='%s' cb=%p opaque=%p id=%" PRId64
 qemu_file_monitor_remove_watch(void *mon, const char *dirpath, int64_t id) "File monitor %p remove watch dir='%s' id=%" PRId64
-- 
2.51.1




* [PULL 14/27] block/io_uring: use aio_add_sqe()
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (12 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 13/27] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 15/27] block/io_uring: use non-vectored read/write when possible Kevin Wolf
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.

Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.

Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.

There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
   ring fd from the file descriptor monitoring io_uring instance. Disk
   I/O completions are now picked up directly. Also, sqes are
   accumulated in the sq ring until the end of the event loop iteration
   and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.

Note that error_setg() messages are not supposed to end with
punctuation, so I removed a '.' for the non-io_uring build error
message.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-15-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/aio.h     |   7 -
 include/block/raw-aio.h |   5 -
 block/file-posix.c      |  40 ++--
 block/io_uring.c        | 489 ++++++++++------------------------------
 stubs/io_uring.c        |  32 ---
 util/async.c            |  35 ---
 block/trace-events      |  12 +-
 stubs/meson.build       |   3 -
 8 files changed, 130 insertions(+), 493 deletions(-)
 delete mode 100644 stubs/io_uring.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 05d1bf4036..540bbc5d60 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -310,8 +310,6 @@ struct AioContext {
     struct LinuxAioState *linux_aio;
 #endif
 #ifdef CONFIG_LINUX_IO_URING
-    LuringState *linux_io_uring;
-
     /* State for file descriptor monitoring using Linux io_uring */
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
@@ -615,11 +613,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
 /* Return the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
 
-/* Setup the LuringState bound to this AioContext */
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
-
-/* Return the LuringState bound to this AioContext */
-LuringState *aio_get_linux_io_uring(AioContext *ctx);
 /**
  * aio_timer_new_with_attrs:
  * @ctx: the aio context
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 6570244496..30e5fc9a9f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
-LuringState *luring_init(Error **errp);
-void luring_cleanup(LuringState *s);
-
 /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
 int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
                                   QEMUIOVector *qiov, int type,
                                   BdrvRequestFlags flags);
-void luring_detach_aio_context(LuringState *s, AioContext *old_context);
-void luring_attach_aio_context(LuringState *s, AioContext *new_context);
 bool luring_has_fua(void);
 #else
 static inline bool luring_has_fua(void)
diff --git a/block/file-posix.c b/block/file-posix.c
index 8c738674ce..8b7c02d19a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif /* !defined(CONFIG_LINUX_AIO) */
 
-#ifndef CONFIG_LINUX_IO_URING
     if (s->use_linux_io_uring) {
+#ifdef CONFIG_LINUX_IO_URING
+        if (!aio_has_io_uring()) {
+            error_setg(errp, "aio=io_uring was specified, but is not "
+                             "available (disabled via io_uring_disabled "
+                             "sysctl or blocked by container runtime "
+                             "seccomp policy?)");
+            ret = -EINVAL;
+            goto fail;
+        }
+#else
         error_setg(errp, "aio=io_uring was specified, but is not supported "
-                         "in this build.");
+                         "in this build");
         ret = -EINVAL;
         goto fail;
-    }
 #endif /* !defined(CONFIG_LINUX_IO_URING) */
+    }
 
     s->has_discard = true;
     s->has_write_zeroes = true;
@@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
     return true;
 }
 
-#ifdef CONFIG_LINUX_IO_URING
-static inline bool raw_check_linux_io_uring(BDRVRawState *s)
-{
-    Error *local_err = NULL;
-    AioContext *ctx;
-
-    if (!s->use_linux_io_uring) {
-        return false;
-    }
-
-    ctx = qemu_get_current_aio_context();
-    if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
-        error_reportf_err(local_err, "Unable to use linux io_uring, "
-                                     "falling back to thread pool: ");
-        s->use_linux_io_uring = false;
-        return false;
-    }
-    return true;
-}
-#endif
-
 #ifdef CONFIG_LINUX_AIO
 static inline bool raw_check_linux_aio(BDRVRawState *s)
 {
@@ -2595,7 +2583,7 @@ raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, uint64_t bytes,
     if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
         type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_IO_URING
-    } else if (raw_check_linux_io_uring(s)) {
+    } else if (s->use_linux_io_uring) {
         assert(qiov->size == bytes);
         ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
         goto out;
@@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
     };
 
 #ifdef CONFIG_LINUX_IO_URING
-    if (raw_check_linux_io_uring(s)) {
+    if (s->use_linux_io_uring) {
         return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
     }
 #endif
diff --git a/block/io_uring.c b/block/io_uring.c
index dd4f304910..dd930ee57e 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -11,28 +11,20 @@
 #include "qemu/osdep.h"
 #include <liburing.h>
 #include "block/aio.h"
-#include "qemu/queue.h"
 #include "block/block.h"
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
-#include "qemu/defer-call.h"
-#include "qapi/error.h"
 #include "system/block-backend.h"
 #include "trace.h"
 
-/* Only used for assertions.  */
-#include "qemu/coroutine_int.h"
-
-/* io_uring ring size */
-#define MAX_ENTRIES 128
-
-typedef struct LuringAIOCB {
+typedef struct {
     Coroutine *co;
-    struct io_uring_sqe sqeq;
-    ssize_t ret;
     QEMUIOVector *qiov;
-    bool is_read;
-    QSIMPLEQ_ENTRY(LuringAIOCB) next;
+    uint64_t offset;
+    ssize_t ret;
+    int type;
+    int fd;
+    BdrvRequestFlags flags;
 
     /*
      * Buffered reads may require resubmission, see
@@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
      */
     int total_read;
     QEMUIOVector resubmit_qiov;
-} LuringAIOCB;
-
-typedef struct LuringQueue {
-    unsigned int in_queue;
-    unsigned int in_flight;
-    bool blocked;
-    QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
-} LuringQueue;
 
-struct LuringState {
-    AioContext *aio_context;
+    CqeHandler cqe_handler;
+} LuringRequest;
 
-    struct io_uring ring;
-
-    /* No locking required, only accessed from AioContext home thread */
-    LuringQueue io_q;
-
-    QEMUBH *completion_bh;
-};
-
-/**
- * luring_resubmit:
- *
- * Resubmit a request by appending it to submit_queue.  The caller must ensure
- * that ioq_submit() is called later so that submit_queue requests are started.
- */
-static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
+static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
 {
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
+    LuringRequest *req = opaque;
+    QEMUIOVector *qiov = req->qiov;
+    uint64_t offset = req->offset;
+    int fd = req->fd;
+    BdrvRequestFlags flags = req->flags;
+
+    switch (req->type) {
+    case QEMU_AIO_WRITE:
+#ifdef HAVE_IO_URING_PREP_WRITEV2
+    {
+        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+        io_uring_prep_writev2(sqe, fd, qiov->iov,
+                              qiov->niov, offset, luring_flags);
+    }
+#else
+        assert(flags == 0);
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+#endif
+        break;
+    case QEMU_AIO_ZONE_APPEND:
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+        break;
+    case QEMU_AIO_READ:
+    {
+        if (req->resubmit_qiov.iov != NULL) {
+            qiov = &req->resubmit_qiov;
+        }
+        io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
+                            offset + req->total_read);
+        break;
+    }
+    case QEMU_AIO_FLUSH:
+        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
+        break;
+    default:
+        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
+                        __func__, req->type);
+        abort();
+    }
 }
 
 /**
@@ -78,385 +85,115 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
  * Short reads are rare but may occur. The remaining read request needs to be
  * resubmitted.
  */
-static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
-                                       int nread)
+static void luring_resubmit_short_read(LuringRequest *req, int nread)
 {
     QEMUIOVector *resubmit_qiov;
     size_t remaining;
 
-    trace_luring_resubmit_short_read(s, luringcb, nread);
+    trace_luring_resubmit_short_read(req, nread);
 
     /* Update read position */
-    luringcb->total_read += nread;
-    remaining = luringcb->qiov->size - luringcb->total_read;
+    req->total_read += nread;
+    remaining = req->qiov->size - req->total_read;
 
     /* Shorten qiov */
-    resubmit_qiov = &luringcb->resubmit_qiov;
+    resubmit_qiov = &req->resubmit_qiov;
     if (resubmit_qiov->iov == NULL) {
-        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
+        qemu_iovec_init(resubmit_qiov, req->qiov->niov);
     } else {
         qemu_iovec_reset(resubmit_qiov);
     }
-    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
-                      remaining);
+    qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
 
-    /* Update sqe */
-    luringcb->sqeq.off += nread;
-    luringcb->sqeq.addr = (uintptr_t)luringcb->resubmit_qiov.iov;
-    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-
-    luring_resubmit(s, luringcb);
+    aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
 }
 
-/**
- * luring_process_completions:
- * @s: AIO state
- *
- * Fetches completed I/O requests, consumes cqes and invokes their callbacks
- * The function is somewhat tricky because it supports nested event loops, for
- * example when a request callback invokes aio_poll().
- *
- * Function schedules BH completion so it  can be called again in a nested
- * event loop.  When there are no events left  to complete the BH is being
- * canceled.
- *
- */
-static void luring_process_completions(LuringState *s)
+static void luring_cqe_handler(CqeHandler *cqe_handler)
 {
-    struct io_uring_cqe *cqes;
-    int total_bytes;
-
-    defer_call_begin();
-
-    /*
-     * Request completion callbacks can run the nested event loop.
-     * Schedule ourselves so the nested event loop will "see" remaining
-     * completed requests and process them.  Without this, completion
-     * callbacks that wait for other requests using a nested event loop
-     * would hang forever.
-     *
-     * This workaround is needed because io_uring uses poll_wait, which
-     * is woken up when new events are added to the uring, thus polling on
-     * the same uring fd will block unless more events are received.
-     *
-     * Other leaf block drivers (drivers that access the data themselves)
-     * are networking based, so they poll sockets for data and run the
-     * correct coroutine.
-     */
-    qemu_bh_schedule(s->completion_bh);
+    LuringRequest *req = container_of(cqe_handler, LuringRequest, cqe_handler);
+    int ret = cqe_handler->cqe.res;
 
-    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
-        LuringAIOCB *luringcb;
-        int ret;
+    trace_luring_cqe_handler(req, ret);
 
-        if (!cqes) {
-            break;
+    if (ret < 0) {
+        /*
+         * Only writev/readv/fsync requests on regular files or host block
+         * devices are submitted. Therefore -EAGAIN is not expected but it's
+         * known to happen sometimes with Linux SCSI. Submit again and hope
+         * the request completes successfully.
+         *
+         * For more information, see:
+         * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
+         *
+         * If the code is changed to submit other types of requests in the
+         * future, then this workaround may need to be extended to deal with
+         * genuine -EAGAIN results that should not be resubmitted
+         * immediately.
+         */
+        if (ret == -EINTR || ret == -EAGAIN) {
+            aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
+            return;
         }
-
-        luringcb = io_uring_cqe_get_data(cqes);
-        ret = cqes->res;
-        io_uring_cqe_seen(&s->ring, cqes);
-        cqes = NULL;
-
-        /* Change counters one-by-one because we can be nested. */
-        s->io_q.in_flight--;
-        trace_luring_process_completion(s, luringcb, ret);
-
+    } else if (req->qiov) {
         /* total_read is non-zero only for resubmitted read requests */
-        total_bytes = ret + luringcb->total_read;
+        int total_bytes = ret + req->total_read;
 
-        if (ret < 0) {
-            /*
-             * Only writev/readv/fsync requests on regular files or host block
-             * devices are submitted. Therefore -EAGAIN is not expected but it's
-             * known to happen sometimes with Linux SCSI. Submit again and hope
-             * the request completes successfully.
-             *
-             * For more information, see:
-             * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
-             *
-             * If the code is changed to submit other types of requests in the
-             * future, then this workaround may need to be extended to deal with
-             * genuine -EAGAIN results that should not be resubmitted
-             * immediately.
-             */
-            if (ret == -EINTR || ret == -EAGAIN) {
-                luring_resubmit(s, luringcb);
-                continue;
-            }
-        } else if (!luringcb->qiov) {
-            goto end;
-        } else if (total_bytes == luringcb->qiov->size) {
+        if (total_bytes == req->qiov->size) {
             ret = 0;
-        /* Only read/write */
         } else {
             /* Short Read/Write */
-            if (luringcb->is_read) {
+            if (req->type == QEMU_AIO_READ) {
                 if (ret > 0) {
-                    luring_resubmit_short_read(s, luringcb, ret);
-                    continue;
-                } else {
-                    /* Pad with zeroes */
-                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
-                                      luringcb->qiov->size - total_bytes);
-                    ret = 0;
+                    luring_resubmit_short_read(req, ret);
+                    return;
                 }
+
+                /* Pad with zeroes */
+                qemu_iovec_memset(req->qiov, total_bytes, 0,
+                                  req->qiov->size - total_bytes);
+                ret = 0;
             } else {
                 ret = -ENOSPC;
             }
         }
-end:
-        luringcb->ret = ret;
-        qemu_iovec_destroy(&luringcb->resubmit_qiov);
-
-        /*
-         * If the coroutine is already entered it must be in ioq_submit()
-         * and will notice luringcb->ret has been filled in when it
-         * eventually runs later. Coroutines cannot be entered recursively
-         * so avoid doing that!
-         */
-        assert(luringcb->co->ctx == s->aio_context);
-        if (!qemu_coroutine_entered(luringcb->co)) {
-            aio_co_wake(luringcb->co);
-        }
-    }
-
-    qemu_bh_cancel(s->completion_bh);
-
-    defer_call_end();
-}
-
-static int ioq_submit(LuringState *s)
-{
-    int ret = 0;
-    LuringAIOCB *luringcb, *luringcb_next;
-
-    while (s->io_q.in_queue > 0) {
-        /*
-         * Try to fetch sqes from the ring for requests waiting in
-         * the overflow queue
-         */
-        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.submit_queue, next,
-                              luringcb_next) {
-            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
-            if (!sqes) {
-                break;
-            }
-            /* Prep sqe for submission */
-            *sqes = luringcb->sqeq;
-            QSIMPLEQ_REMOVE_HEAD(&s->io_q.submit_queue, next);
-        }
-        ret = io_uring_submit(&s->ring);
-        trace_luring_io_uring_submit(s, ret);
-        /* Prevent infinite loop if submission is refused */
-        if (ret <= 0) {
-            if (ret == -EAGAIN || ret == -EINTR) {
-                continue;
-            }
-            break;
-        }
-        s->io_q.in_flight += ret;
-        s->io_q.in_queue  -= ret;
     }
-    s->io_q.blocked = (s->io_q.in_queue > 0);
 
-    if (s->io_q.in_flight) {
-        /*
-         * We can try to complete something just right away if there are
-         * still requests in-flight.
-         */
-        luring_process_completions(s);
-    }
-    return ret;
-}
-
-static void luring_process_completions_and_submit(LuringState *s)
-{
-    luring_process_completions(s);
+    req->ret = ret;
+    qemu_iovec_destroy(&req->resubmit_qiov);
 
-    if (s->io_q.in_queue > 0) {
-        ioq_submit(s);
-    }
-}
-
-static void qemu_luring_completion_bh(void *opaque)
-{
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static void qemu_luring_completion_cb(void *opaque)
-{
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static bool qemu_luring_poll_cb(void *opaque)
-{
-    LuringState *s = opaque;
-
-    return io_uring_cq_ready(&s->ring);
-}
-
-static void qemu_luring_poll_ready(void *opaque)
-{
-    LuringState *s = opaque;
-
-    luring_process_completions_and_submit(s);
-}
-
-static void ioq_init(LuringQueue *io_q)
-{
-    QSIMPLEQ_INIT(&io_q->submit_queue);
-    io_q->in_queue = 0;
-    io_q->in_flight = 0;
-    io_q->blocked = false;
-}
-
-static void luring_deferred_fn(void *opaque)
-{
-    LuringState *s = opaque;
-    trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked && s->io_q.in_queue > 0) {
-        ioq_submit(s);
-    }
-}
-
-/**
- * luring_do_submit:
- * @fd: file descriptor for I/O
- * @luringcb: AIO control block
- * @s: AIO state
- * @offset: offset for request
- * @type: type of request
- *
- * Fetches sqes from ring, adds to pending queue and preps them
- *
- */
-static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
-                            uint64_t offset, int type, BdrvRequestFlags flags)
-{
-    int ret;
-    struct io_uring_sqe *sqes = &luringcb->sqeq;
-
-    switch (type) {
-    case QEMU_AIO_WRITE:
-#ifdef HAVE_IO_URING_PREP_WRITEV2
-    {
-        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
-        io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
-                              luringcb->qiov->niov, offset, luring_flags);
-    }
-#else
-        assert(flags == 0);
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-#endif
-        break;
-    case QEMU_AIO_ZONE_APPEND:
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_READ:
-        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
-                            luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_FLUSH:
-        io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
-        break;
-    default:
-        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
-                        __func__, type);
-        abort();
-    }
-    io_uring_sqe_set_data(sqes, luringcb);
-
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
-    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked) {
-        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
-            ret = ioq_submit(s);
-            trace_luring_do_submit_done(s, ret);
-            return ret;
-        }
-
-        defer_call(luring_deferred_fn, s);
+    /*
+     * If the coroutine is already entered it must be in luring_co_submit() and
+     * will notice req->ret has been filled in when it eventually runs later.
+     * Coroutines cannot be entered recursively so avoid doing that!
+     */
+    if (!qemu_coroutine_entered(req->co)) {
+        aio_co_wake(req->co);
     }
-    return 0;
 }
 
-int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
-                                  QEMUIOVector *qiov, int type,
-                                  BdrvRequestFlags flags)
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd,
+                                  uint64_t offset, QEMUIOVector *qiov,
+                                  int type, BdrvRequestFlags flags)
 {
-    int ret;
-    AioContext *ctx = qemu_get_current_aio_context();
-    LuringState *s = aio_get_linux_io_uring(ctx);
-    LuringAIOCB luringcb = {
+    LuringRequest req = {
         .co         = qemu_coroutine_self(),
-        .ret        = -EINPROGRESS,
         .qiov       = qiov,
-        .is_read    = (type == QEMU_AIO_READ),
+        .ret        = -EINPROGRESS,
+        .type       = type,
+        .fd         = fd,
+        .offset     = offset,
+        .flags      = flags,
     };
-    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
-                           type);
-    ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
 
-    if (ret < 0) {
-        return ret;
-    }
+    req.cqe_handler.cb = luring_cqe_handler;
 
-    if (luringcb.ret == -EINPROGRESS) {
-        qemu_coroutine_yield();
-    }
-    return luringcb.ret;
-}
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    aio_set_fd_handler(old_context, s->ring.ring_fd,
-                       NULL, NULL, NULL, NULL, s);
-    qemu_bh_delete(s->completion_bh);
-    s->aio_context = NULL;
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    s->aio_context = new_context;
-    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
-    aio_set_fd_handler(s->aio_context, s->ring.ring_fd,
-                       qemu_luring_completion_cb, NULL,
-                       qemu_luring_poll_cb, qemu_luring_poll_ready, s);
-}
+    trace_luring_co_submit(bs, &req, fd, offset, qiov ? qiov->size : 0, type);
+    aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);
 
-LuringState *luring_init(Error **errp)
-{
-    int rc;
-    LuringState *s = g_new0(LuringState, 1);
-    struct io_uring *ring = &s->ring;
-
-    trace_luring_init_state(s, sizeof(*s));
-
-    rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
-    if (rc < 0) {
-        error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
-        g_free(s);
-        return NULL;
+    if (req.ret == -EINPROGRESS) {
+        qemu_coroutine_yield();
     }
-
-    ioq_init(&s->io_q);
-    return s;
-
-}
-
-void luring_cleanup(LuringState *s)
-{
-    io_uring_queue_exit(&s->ring);
-    trace_luring_cleanup_state(s);
-    g_free(s);
+    return req.ret;
 }
 
 bool luring_has_fua(void)
diff --git a/stubs/io_uring.c b/stubs/io_uring.c
deleted file mode 100644
index 622d1e4648..0000000000
--- a/stubs/io_uring.c
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Linux io_uring support.
- *
- * Copyright (C) 2009 IBM, Corp.
- * Copyright (C) 2009 Red Hat, Inc.
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-#include "qemu/osdep.h"
-#include "block/aio.h"
-#include "block/raw-aio.h"
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    abort();
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    abort();
-}
-
-LuringState *luring_init(Error **errp)
-{
-    abort();
-}
-
-void luring_cleanup(LuringState *s)
-{
-    abort();
-}
diff --git a/util/async.c b/util/async.c
index 00e46b99f9..a216cf8695 100644
--- a/util/async.c
+++ b/util/async.c
@@ -386,14 +386,6 @@ aio_ctx_finalize(GSource *source)
     }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    if (ctx->linux_io_uring) {
-        luring_detach_aio_context(ctx->linux_io_uring, ctx);
-        luring_cleanup(ctx->linux_io_uring);
-        ctx->linux_io_uring = NULL;
-    }
-#endif
-
     assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
     qemu_bh_delete(ctx->co_schedule_bh);
 
@@ -468,29 +460,6 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
-{
-    if (ctx->linux_io_uring) {
-        return ctx->linux_io_uring;
-    }
-
-    ctx->linux_io_uring = luring_init(errp);
-    if (!ctx->linux_io_uring) {
-        return NULL;
-    }
-
-    luring_attach_aio_context(ctx->linux_io_uring, ctx);
-    return ctx->linux_io_uring;
-}
-
-LuringState *aio_get_linux_io_uring(AioContext *ctx)
-{
-    assert(ctx->linux_io_uring);
-    return ctx->linux_io_uring;
-}
-#endif
-
 void aio_notify(AioContext *ctx)
 {
     /*
@@ -630,10 +599,6 @@ AioContext *aio_context_new(Error **errp)
     ctx->linux_aio = NULL;
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    ctx->linux_io_uring = NULL;
-#endif
-
     ctx->thread_pool = NULL;
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
diff --git a/block/trace-events b/block/trace-events
index 8e789e1f12..c9b4736ff8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -62,15 +62,9 @@ qmp_block_stream(void *bs) "bs %p"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 
 # io_uring.c
-luring_init_state(void *s, size_t size) "s %p size %zu"
-luring_cleanup_state(void *s) "%p freed"
-luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
-luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
-luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
-luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
+luring_cqe_handler(void *req, int ret) "req %p ret %d"
+luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
+luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
 
 # qcow2.c
 qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
diff --git a/stubs/meson.build b/stubs/meson.build
index 27be2dec9f..0b2778c568 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -32,9 +32,6 @@ if have_block or have_ga
   stub_ss.add(files('cpus-virtual-clock.c'))
   stub_ss.add(files('icount.c'))
   stub_ss.add(files('graph-lock.c'))
-  if linux_io_uring.found()
-    stub_ss.add(files('io_uring.c'))
-  endif
   if libaio.found()
     stub_ss.add(files('linux-aio.c'))
   endif
-- 
2.51.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PULL 15/27] block/io_uring: use non-vectored read/write when possible
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (13 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 14/27] block/io_uring: use aio_add_sqe() Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 16/27] block: replace TABs with space Kevin Wolf
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Stefan Hajnoczi <stefanha@redhat.com>

The io_uring_prep_readv2/writev2() man pages recommend using the
non-vectored read/write operations when possible for performance
reasons.

I didn't measure a significant difference but it doesn't hurt to have
this optimization in place.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-16-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io_uring.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index dd930ee57e..f1514cf024 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -46,17 +46,28 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
 
     switch (req->type) {
     case QEMU_AIO_WRITE:
-#ifdef HAVE_IO_URING_PREP_WRITEV2
     {
         int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
-        io_uring_prep_writev2(sqe, fd, qiov->iov,
-                              qiov->niov, offset, luring_flags);
-    }
+        if (luring_flags != 0 || qiov->niov > 1) {
+#ifdef HAVE_IO_URING_PREP_WRITEV2
+            io_uring_prep_writev2(sqe, fd, qiov->iov,
+                                  qiov->niov, offset, luring_flags);
 #else
-        assert(flags == 0);
-        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+            /*
+             * FUA should only be enabled with HAVE_IO_URING_PREP_WRITEV2, see
+             * luring_has_fua().
+             */
+            assert(luring_flags == 0);
+
+            io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
 #endif
+        } else {
+            /* The man page says non-vectored is faster than vectored */
+            struct iovec *iov = qiov->iov;
+            io_uring_prep_write(sqe, fd, iov->iov_base, iov->iov_len, offset);
+        }
         break;
+    }
     case QEMU_AIO_ZONE_APPEND:
         io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
         break;
@@ -65,8 +76,15 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
         if (req->resubmit_qiov.iov != NULL) {
             qiov = &req->resubmit_qiov;
         }
-        io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
-                            offset + req->total_read);
+        if (qiov->niov > 1) {
+            io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
+                                offset + req->total_read);
+        } else {
+            /* The man page says non-vectored is faster than vectored */
+            struct iovec *iov = qiov->iov;
+            io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len,
+                               offset + req->total_read);
+        }
         break;
     }
     case QEMU_AIO_FLUSH:
-- 
2.51.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PULL 16/27] block: replace TABs with space
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (14 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 15/27] block/io_uring: use non-vectored read/write when possible Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 17/27] block: Drop detach_subchain for bdrv_replace_node Kevin Wolf
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Yeqi Fu <fufuyqqqqqq@gmail.com>

Bring the block files in line with the QEMU coding style, with spaces
for indentation. This patch partially resolves issue 371.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371
Signed-off-by: Yeqi Fu <fufuyqqqqqq@gmail.com>
Message-ID: <20230325085224.23842-1-fufuyqqqqqq@gmail.com>
[thuth: Rebased the patch to the current master branch]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251007163511.334178-1-thuth@redhat.com>
[kwolf: Fixed up vertical alignment]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/nbd.h |  2 +-
 block/bochs.c       | 14 +++++------
 block/file-posix.c  | 58 ++++++++++++++++++++++-----------------------
 block/file-win32.c  | 38 ++++++++++++++---------------
 block/qcow.c        | 10 ++++----
 5 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 92987c76fd..ab40842da9 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -296,7 +296,7 @@ enum {
     NBD_CMD_BLOCK_STATUS = 7,
 };
 
-#define NBD_DEFAULT_PORT	10809
+#define NBD_DEFAULT_PORT 10809
 
 /* Maximum size of a single READ/WRITE data buffer */
 #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)
diff --git a/block/bochs.c b/block/bochs.c
index b099fb52fe..bfda88017d 100644
--- a/block/bochs.c
+++ b/block/bochs.c
@@ -300,15 +300,15 @@ static void bochs_close(BlockDriverState *bs)
 }
 
 static BlockDriver bdrv_bochs = {
-    .format_name	= "bochs",
-    .instance_size	= sizeof(BDRVBochsState),
-    .bdrv_probe		= bochs_probe,
-    .bdrv_open		= bochs_open,
+    .format_name         = "bochs",
+    .instance_size       = sizeof(BDRVBochsState),
+    .bdrv_probe          = bochs_probe,
+    .bdrv_open           = bochs_open,
     .bdrv_child_perm     = bdrv_default_perms,
     .bdrv_refresh_limits = bochs_refresh_limits,
-    .bdrv_co_preadv = bochs_co_preadv,
-    .bdrv_close		= bochs_close,
-    .is_format          = true,
+    .bdrv_co_preadv      = bochs_co_preadv,
+    .bdrv_close          = bochs_close,
+    .is_format           = true,
 };
 
 static void bdrv_bochs_init(void)
diff --git a/block/file-posix.c b/block/file-posix.c
index 8b7c02d19a..12d12970fa 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -133,7 +133,7 @@
 #define FTYPE_FILE   0
 #define FTYPE_CD     1
 
-#define MAX_BLOCKSIZE	4096
+#define MAX_BLOCKSIZE 4096
 
 /* Posix file locking bytes. Libvirt takes byte 0, we start from higher bytes,
  * leaving a few more bytes for its future use. */
@@ -4562,20 +4562,20 @@ static void coroutine_fn cdrom_co_lock_medium(BlockDriverState *bs, bool locked)
 }
 
 static BlockDriver bdrv_host_cdrom = {
-    .format_name        = "host_cdrom",
-    .protocol_name      = "host_cdrom",
-    .instance_size      = sizeof(BDRVRawState),
-    .bdrv_needs_filename = true,
-    .bdrv_probe_device	= cdrom_probe_device,
-    .bdrv_parse_filename = cdrom_parse_filename,
-    .bdrv_open          = cdrom_open,
-    .bdrv_close         = raw_close,
-    .bdrv_reopen_prepare = raw_reopen_prepare,
-    .bdrv_reopen_commit  = raw_reopen_commit,
-    .bdrv_reopen_abort   = raw_reopen_abort,
-    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
-    .create_opts         = &bdrv_create_opts_simple,
-    .mutable_opts        = mutable_opts,
+    .format_name            = "host_cdrom",
+    .protocol_name          = "host_cdrom",
+    .instance_size          = sizeof(BDRVRawState),
+    .bdrv_needs_filename    = true,
+    .bdrv_probe_device      = cdrom_probe_device,
+    .bdrv_parse_filename    = cdrom_parse_filename,
+    .bdrv_open              = cdrom_open,
+    .bdrv_close             = raw_close,
+    .bdrv_reopen_prepare    = raw_reopen_prepare,
+    .bdrv_reopen_commit     = raw_reopen_commit,
+    .bdrv_reopen_abort      = raw_reopen_abort,
+    .bdrv_co_create_opts    = bdrv_co_create_opts_simple,
+    .create_opts            = &bdrv_create_opts_simple,
+    .mutable_opts           = mutable_opts,
     .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
 
     .bdrv_co_preadv         = raw_co_preadv,
@@ -4688,20 +4688,20 @@ static void coroutine_fn cdrom_co_lock_medium(BlockDriverState *bs, bool locked)
 }
 
 static BlockDriver bdrv_host_cdrom = {
-    .format_name        = "host_cdrom",
-    .protocol_name      = "host_cdrom",
-    .instance_size      = sizeof(BDRVRawState),
-    .bdrv_needs_filename = true,
-    .bdrv_probe_device	= cdrom_probe_device,
-    .bdrv_parse_filename = cdrom_parse_filename,
-    .bdrv_open          = cdrom_open,
-    .bdrv_close         = raw_close,
-    .bdrv_reopen_prepare = raw_reopen_prepare,
-    .bdrv_reopen_commit  = raw_reopen_commit,
-    .bdrv_reopen_abort   = raw_reopen_abort,
-    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
-    .create_opts         = &bdrv_create_opts_simple,
-    .mutable_opts       = mutable_opts,
+    .format_name            = "host_cdrom",
+    .protocol_name          = "host_cdrom",
+    .instance_size          = sizeof(BDRVRawState),
+    .bdrv_needs_filename    = true,
+    .bdrv_probe_device      = cdrom_probe_device,
+    .bdrv_parse_filename    = cdrom_parse_filename,
+    .bdrv_open              = cdrom_open,
+    .bdrv_close             = raw_close,
+    .bdrv_reopen_prepare    = raw_reopen_prepare,
+    .bdrv_reopen_commit     = raw_reopen_commit,
+    .bdrv_reopen_abort      = raw_reopen_abort,
+    .bdrv_co_create_opts    = bdrv_co_create_opts_simple,
+    .create_opts            = &bdrv_create_opts_simple,
+    .mutable_opts           = mutable_opts,
 
     .bdrv_co_preadv         = raw_co_preadv,
     .bdrv_co_pwritev        = raw_co_pwritev,
diff --git a/block/file-win32.c b/block/file-win32.c
index af9aea631c..0efb609e1d 100644
--- a/block/file-win32.c
+++ b/block/file-win32.c
@@ -741,16 +741,16 @@ static QemuOptsList raw_create_opts = {
 };
 
 BlockDriver bdrv_file = {
-    .format_name	= "file",
-    .protocol_name	= "file",
-    .instance_size	= sizeof(BDRVRawState),
-    .bdrv_needs_filename = true,
-    .bdrv_parse_filename = raw_parse_filename,
-    .bdrv_open          = raw_open,
-    .bdrv_refresh_limits = raw_probe_alignment,
-    .bdrv_close         = raw_close,
-    .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_has_zero_init = bdrv_has_zero_init_1,
+    .format_name            = "file",
+    .protocol_name          = "file",
+    .instance_size          = sizeof(BDRVRawState),
+    .bdrv_needs_filename    = true,
+    .bdrv_parse_filename    = raw_parse_filename,
+    .bdrv_open              = raw_open,
+    .bdrv_refresh_limits    = raw_probe_alignment,
+    .bdrv_close             = raw_close,
+    .bdrv_co_create_opts    = raw_co_create_opts,
+    .bdrv_has_zero_init     = bdrv_has_zero_init_1,
 
     .bdrv_reopen_prepare = raw_reopen_prepare,
     .bdrv_reopen_commit  = raw_reopen_commit,
@@ -914,15 +914,15 @@ done:
 }
 
 static BlockDriver bdrv_host_device = {
-    .format_name	= "host_device",
-    .protocol_name	= "host_device",
-    .instance_size	= sizeof(BDRVRawState),
-    .bdrv_needs_filename = true,
-    .bdrv_parse_filename = hdev_parse_filename,
-    .bdrv_probe_device	= hdev_probe_device,
-    .bdrv_open     	= hdev_open,
-    .bdrv_close		= raw_close,
-    .bdrv_refresh_limits = hdev_refresh_limits,
+    .format_name            = "host_device",
+    .protocol_name          = "host_device",
+    .instance_size          = sizeof(BDRVRawState),
+    .bdrv_needs_filename    = true,
+    .bdrv_parse_filename    = hdev_parse_filename,
+    .bdrv_probe_device      = hdev_probe_device,
+    .bdrv_open              = hdev_open,
+    .bdrv_close             = raw_close,
+    .bdrv_refresh_limits    = hdev_refresh_limits,
 
     .bdrv_aio_preadv    = raw_aio_preadv,
     .bdrv_aio_pwritev   = raw_aio_pwritev,
diff --git a/block/qcow.c b/block/qcow.c
index 8a3e7591a9..b442bfe835 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -1184,11 +1184,11 @@ static const char *const qcow_strong_runtime_opts[] = {
 };
 
 static BlockDriver bdrv_qcow = {
-    .format_name	= "qcow",
-    .instance_size	= sizeof(BDRVQcowState),
-    .bdrv_probe		= qcow_probe,
-    .bdrv_open		= qcow_open,
-    .bdrv_close		= qcow_close,
+    .format_name            = "qcow",
+    .instance_size          = sizeof(BDRVQcowState),
+    .bdrv_probe             = qcow_probe,
+    .bdrv_open              = qcow_open,
+    .bdrv_close             = qcow_close,
     .bdrv_child_perm        = bdrv_default_perms,
     .bdrv_reopen_prepare    = qcow_reopen_prepare,
     .bdrv_co_create         = qcow_co_create,
-- 
2.51.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PULL 17/27] block: Drop detach_subchain for bdrv_replace_node
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (15 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 16/27] block: replace TABs with space Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 18/27] iotests: Test resizing file node under raw with size/offset Kevin Wolf
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Wesley Hershberger <wesley.hershberger@canonical.com>

Detaching filters using detach_subchain=true can cause segfaults as
described in #3149.

More specifically, this was observed when executing concurrent
block-stream and query-named-block-nodes. block-stream adds a
copy-on-read filter as the main BDS for the blockjob; that filter was
dropped with detach_subchain=true but not unref'd until the the blockjob
was free'd. Because query-named-block-nodes assumes that a filter will
always have exactly one child, it caused a segfault when it observed the
detached filter. Stacktrace:

0  bdrv_refresh_filename (bs=0x5efed72f8350)
    at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
1  0x00005efea73cf9dc in bdrv_block_device_info
    (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8)
    at block/qapi.c:62
2  0x00005efea7391ed3 in bdrv_named_nodes_list
    (flat=<optimized out>, errp=0x7ffeb829ebd8)
    at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
3  0x00005efea7471993 in qmp_query_named_block_nodes
    (has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8)
    at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
4  qmp_marshal_query_named_block_nodes
    (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8)
    at qapi/qapi-commands-block-core.c:553
5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0)
    at qapi/qmp-dispatch.c:128
6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430)
    at util/async.c:219
7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430)
    at util/aio-posix.c:436
8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>,
    callback=<optimized out>,user_data=<optimized out>)
    at util/async.c:361
9  0x00007f2b77809bfb in ?? ()
    from /lib/x86_64-linux-gnu/libglib-2.0.so.0
10 0x00007f2b77809e70 in g_main_context_dispatch ()
    from /lib/x86_64-linux-gnu/libglib-2.0.so.0
11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0)
    at system/main.c:50
16 0x00005efea6e76319 in main
    (argc=<optimized out>, argv=<optimized out>)
    at system/main.c:93

As discussed in 20251024-second-fix-3149-v1-1-d997fa3d5ce2@canonical.com,
a filter should not exist without children in the first place; therefore,
drop the parameter entirely as it is only used for filters.

This is a partial revert of 3108a15cf09865456d499b08fe14e3dbec4ccbb3.

After this change, a blockdev-backup job's copy-before-write filter will
hold references to its children until the filter is unref'd. This causes
an additional flush during bdrv_close, so also update iotest 257.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3149
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20251029-third-fix-3149-v2-1-94932bb404f4@canonical.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                    | 38 ++++----------------------------------
 tests/qemu-iotests/257     |  8 ++++++--
 tests/qemu-iotests/257.out | 14 +++++++-------
 3 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/block.c b/block.c
index cf08e64add..0fe0152a7c 100644
--- a/block.c
+++ b/block.c
@@ -5398,17 +5398,13 @@ bdrv_replace_node_noperm(BlockDriverState *from,
  *
  * With auto_skip=false the error is returned if from has a parent which should
  * not be updated.
- *
- * With @detach_subchain=true @to must be in a backing chain of @from. In this
- * case backing link of the cow-parent of @to is removed.
  */
 static int GRAPH_WRLOCK
 bdrv_replace_node_common(BlockDriverState *from, BlockDriverState *to,
-                         bool auto_skip, bool detach_subchain, Error **errp)
+                         bool auto_skip, Error **errp)
 {
     Transaction *tran = tran_new();
     g_autoptr(GSList) refresh_list = NULL;
-    BlockDriverState *to_cow_parent = NULL;
     int ret;
 
     GLOBAL_STATE_CODE();
@@ -5417,17 +5413,6 @@ bdrv_replace_node_common(BlockDriverState *from, BlockDriverState *to,
     assert(to->quiesce_counter);
     assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
 
-    if (detach_subchain) {
-        assert(bdrv_chain_contains(from, to));
-        assert(from != to);
-        for (to_cow_parent = from;
-             bdrv_filter_or_cow_bs(to_cow_parent) != to;
-             to_cow_parent = bdrv_filter_or_cow_bs(to_cow_parent))
-        {
-            ;
-        }
-    }
-
     /*
      * Do the replacement without permission update.
      * Replacement may influence the permissions, we should calculate new
@@ -5439,11 +5424,6 @@ bdrv_replace_node_common(BlockDriverState *from, BlockDriverState *to,
         goto out;
     }
 
-    if (detach_subchain) {
-        /* to_cow_parent is already drained because from is drained */
-        bdrv_remove_child(bdrv_filter_or_cow_child(to_cow_parent), tran);
-    }
-
     refresh_list = g_slist_prepend(refresh_list, to);
     refresh_list = g_slist_prepend(refresh_list, from);
 
@@ -5462,7 +5442,7 @@ out:
 int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
                       Error **errp)
 {
-    return bdrv_replace_node_common(from, to, true, false, errp);
+    return bdrv_replace_node_common(from, to, true, errp);
 }
 
 int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
@@ -5478,7 +5458,7 @@ int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
 
     bdrv_drained_begin(child_bs);
     bdrv_graph_wrlock();
-    ret = bdrv_replace_node_common(bs, child_bs, true, true, errp);
+    ret = bdrv_replace_node_common(bs, child_bs, true, errp);
     bdrv_graph_wrunlock();
     bdrv_drained_end(child_bs);
 
@@ -5929,17 +5909,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
         updated_children = g_slist_prepend(updated_children, c);
     }
 
-    /*
-     * It seems correct to pass detach_subchain=true here, but it triggers
-     * one more yet not fixed bug, when due to nested aio_poll loop we switch to
-     * another drained section, which modify the graph (for example, removing
-     * the child, which we keep in updated_children list). So, it's a TODO.
-     *
-     * Note, bug triggered if pass detach_subchain=true here and run
-     * test-bdrv-drain. test_drop_intermediate_poll() test-case will crash.
-     * That's a FIXME.
-     */
-    bdrv_replace_node_common(top, base, false, false, &local_err);
+    bdrv_replace_node_common(top, base, false, &local_err);
     bdrv_graph_wrunlock();
 
     if (local_err) {
diff --git a/tests/qemu-iotests/257 b/tests/qemu-iotests/257
index 7d3720b8e5..cd0468aaa1 100755
--- a/tests/qemu-iotests/257
+++ b/tests/qemu-iotests/257
@@ -310,14 +310,18 @@ def test_bitmap_sync(bsync_mode, msync_mode='bitmap', failure=None):
                     'state': 1,
                     'new_state': 2
                 }, {
-                    'event': 'read_aio',
+                    'event': 'flush_to_disk',
                     'state': 2,
                     'new_state': 3
+                }, {
+                    'event': "read_aio",
+                    'state': 3,
+                    'new_state': 4
                 }],
                 'inject-error': [{
                     'event': 'read_aio',
                     'errno': 5,
-                    'state': 3,
+                    'state': 4,
                     'immediately': False,
                     'once': True
                 }]
diff --git a/tests/qemu-iotests/257.out b/tests/qemu-iotests/257.out
index c33dd7f3a9..fb28333cb2 100644
--- a/tests/qemu-iotests/257.out
+++ b/tests/qemu-iotests/257.out
@@ -272,7 +272,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -1017,7 +1017,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -1762,7 +1762,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -2507,7 +2507,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -3252,7 +3252,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -3997,7 +3997,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
@@ -4742,7 +4742,7 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fbackup2" ==> Identical, OK!
 
 --- Preparing image & VM ---
 
-{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 3}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "read_aio", "new-state": 3, "state": 2}]}, "node-name": "drive0"}}
+{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "file": {"driver": "blkdebug", "image": {"driver": "file", "filename": "TEST_DIR/PID-img"}, "inject-error": [{"errno": 5, "event": "read_aio", "immediately": false, "once": true, "state": 4}], "set-state": [{"event": "flush_to_disk", "new-state": 2, "state": 1}, {"event": "flush_to_disk", "new-state": 3, "state": 2}, {"event": "read_aio", "new-state": 4, "state": 3}]}, "node-name": "drive0"}}
 {"return": {}}
 
 --- Write #0 ---
-- 
2.51.1




* [PULL 18/27] iotests: Test resizing file node under raw with size/offset
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (16 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 17/27] block: Drop detach_subchain for bdrv_replace_node Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 19/27] qemu-img: Fix amend option parse error handling Kevin Wolf
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

This adds more test cases to the recently added resize-below-raw test,
covering the 'size' and 'offset' options of the raw driver.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251028094328.17919-1-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/tests/resize-below-raw     | 51 +++++++++++++++++--
 tests/qemu-iotests/tests/resize-below-raw.out |  4 +-
 2 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/tests/resize-below-raw b/tests/qemu-iotests/tests/resize-below-raw
index 3c9241c918..41d21c83b7 100755
--- a/tests/qemu-iotests/tests/resize-below-raw
+++ b/tests/qemu-iotests/tests/resize-below-raw
@@ -14,10 +14,19 @@ from iotests import imgfmt, qemu_img_create, QMPTestCase
 image_size = 1 * 1024 * 1024
 image = os.path.join(iotests.test_dir, 'test.img')
 
-class TestResizeBelowRaw(QMPTestCase):
+class BaseResizeBelowRaw(QMPTestCase):
+    raw_size = None
+    raw_offset = None
+
     def setUp(self) -> None:
         qemu_img_create('-f', imgfmt, image, str(image_size))
 
+        extra_options = {}
+        if self.raw_size is not None:
+            extra_options['size'] = str(self.raw_size)
+        if self.raw_offset is not None:
+            extra_options['offset'] = str(self.raw_offset)
+
         self.vm = iotests.VM()
         self.vm.add_blockdev(self.vm.qmp_to_opts({
             'driver': imgfmt,
@@ -26,7 +35,8 @@ class TestResizeBelowRaw(QMPTestCase):
                 'driver': 'file',
                 'filename': image,
                 'node-name': 'file0',
-            }
+            },
+            **extra_options
         }))
         self.vm.launch()
 
@@ -34,14 +44,16 @@ class TestResizeBelowRaw(QMPTestCase):
         self.vm.shutdown()
         os.remove(image)
 
-    def assert_size(self, size: int) -> None:
+    def assert_size(self, size: int, file_size: int|None = None) -> None:
         nodes = self.vm.qmp('query-named-block-nodes', flat=True)['return']
         self.assertEqual(len(nodes), 2)
         for node in nodes:
-            if node['drv'] == 'file':
+            if node['drv'] == 'file' and file_size is not None:
+                self.assertEqual(node['image']['virtual-size'], file_size)
                 continue
             self.assertEqual(node['image']['virtual-size'], size)
 
+class TestResizeBelowUnlimitedRaw(BaseResizeBelowRaw):
     def test_resize_below_raw(self) -> None:
         self.assert_size(image_size)
         self.vm.qmp('block_resize', node_name='file0', size=2*image_size)
@@ -49,5 +61,36 @@ class TestResizeBelowRaw(QMPTestCase):
         self.vm.qmp('block_resize', node_name='node0', size=3*image_size)
         self.assert_size(3*image_size)
 
+# offset = 0 behaves the same as absent offset
+class TestResizeBelowRawWithZeroOffset(TestResizeBelowUnlimitedRaw):
+    raw_offset = 0
+
+class TestResizeBelowRawWithSize(BaseResizeBelowRaw):
+    raw_size = image_size // 2
+
+    def test_resize_below_raw_with_size(self) -> None:
+        self.assert_size(image_size // 2, image_size)
+
+        # This QMP command fails because node0 unshares RESIZE
+        self.vm.qmp('block_resize', node_name='file0', size=2*image_size)
+        self.assert_size(image_size // 2, image_size)
+
+        # This QMP command fails because node0 is a fixed-size disk
+        self.vm.qmp('block_resize', node_name='node0', size=3*image_size)
+        self.assert_size(image_size // 2, image_size)
+
+class TestResizeBelowRawWithOffset(BaseResizeBelowRaw):
+    raw_offset = image_size // 4
+
+    def test_resize_below_raw_with_offset(self) -> None:
+        self.assert_size(image_size * 3 // 4, image_size)
+
+        # This QMP command fails because node0 unshares RESIZE
+        self.vm.qmp('block_resize', node_name='file0', size=2*image_size)
+        self.assert_size(image_size * 3 // 4, image_size)
+
+        self.vm.qmp('block_resize', node_name='node0', size=3*image_size)
+        self.assert_size(3 * image_size, image_size * 13 // 4)
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['raw'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/resize-below-raw.out b/tests/qemu-iotests/tests/resize-below-raw.out
index ae1213e6f8..89968f35d7 100644
--- a/tests/qemu-iotests/tests/resize-below-raw.out
+++ b/tests/qemu-iotests/tests/resize-below-raw.out
@@ -1,5 +1,5 @@
-.
+....
 ----------------------------------------------------------------------
-Ran 1 tests
+Ran 4 tests
 
 OK
-- 
2.51.1




* [PULL 19/27] qemu-img: Fix amend option parse error handling
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (17 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 18/27] iotests: Test resizing file node under raw with size/offset Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 20/27] iotests: Run iotests with sanitizers Kevin Wolf
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>

qemu_opts_del(opts) dereferences opts->list, which is the old amend_opts
pointer; that pointer can be left dangling after
qemu_opts_append(amend_opts, bs->drv->create_opts) executes, causing a
use-after-free.

Fix the potential use-after-free by moving the qemu_opts_del() call
before the qemu_opts_append() call.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Message-ID: <20251023-iotests-v1-1-fab143ca4c2f@rsg.ci.i.u-tokyo.ac.jp>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index a7791896c1..7a32d2d16c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4597,9 +4597,9 @@ static int img_amend(const img_cmd_t *ccmd, int argc, char **argv)
     amend_opts = qemu_opts_append(amend_opts, bs->drv->amend_opts);
     opts = qemu_opts_create(amend_opts, NULL, 0, &error_abort);
     if (!qemu_opts_do_parse(opts, options, NULL, &err)) {
+        qemu_opts_del(opts);
         /* Try to parse options using the create options */
         amend_opts = qemu_opts_append(amend_opts, bs->drv->create_opts);
-        qemu_opts_del(opts);
         opts = qemu_opts_create(amend_opts, NULL, 0, &error_abort);
         if (qemu_opts_do_parse(opts, options, NULL, NULL)) {
             error_append_hint(&err,
-- 
2.51.1




* [PULL 20/27] iotests: Run iotests with sanitizers
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (18 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 19/27] qemu-img: Fix amend option parse error handling Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 21/27] qcow2: rename update_refcount_discard to queue_discard Kevin Wolf
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>

Commit 2cc4d1c5eab1 ("tests/check-block: Skip iotests when sanitizers
are enabled") changed iotests to skip when sanitizers are enabled.
The rationale is that AddressSanitizer emits warnings and reports leaks,
which results in test breakage. Later, sanitizers that are enabled for
production environments (safe-stack and cfi-icall) were exempted.

However, this approach has a few problems.

- It requires a rebuild to disable sanitizers if the existing build has
  them enabled.
- It disables other useful non-production sanitizers.
- The exemption of safe-stack and cfi-icall is not correctly
  implemented, so qemu-iotests are incorrectly enabled whenever either
  safe-stack or cfi-icall is enabled, even if there is another sanitizer
  like AddressSanitizer.

To solve these problems, direct AddressSanitizer warnings to separate
files to avoid changing the test results, and selectively disable
leak detection at runtime instead of requiring all sanitizers to be
disabled at build time.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Message-ID: <20251023-iotests-v1-2-fab143ca4c2f@rsg.ci.i.u-tokyo.ac.jp>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/testrunner.py | 12 ++++++++++++
 tests/qemu-iotests/meson.build   |  8 --------
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
index 14cc8492f9..e2a3658994 100644
--- a/tests/qemu-iotests/testrunner.py
+++ b/tests/qemu-iotests/testrunner.py
@@ -263,10 +263,21 @@ def do_run_test(self, test: str) -> TestResult:
             Path(env[d]).mkdir(parents=True, exist_ok=True)
 
         test_dir = env['TEST_DIR']
+        f_asan = Path(test_dir, f_test.name + '.out.asan')
         f_bad = Path(test_dir, f_test.name + '.out.bad')
         f_notrun = Path(test_dir, f_test.name + '.notrun')
         f_casenotrun = Path(test_dir, f_test.name + '.casenotrun')
 
+        env['ASAN_OPTIONS'] = f'detect_leaks=0:log_path={f_asan}'
+
+        def unlink_asan():
+            with os.scandir(test_dir) as it:
+                for entry in it:
+                    if entry.name.startswith(f_asan.name):
+                        os.unlink(entry)
+
+        unlink_asan()
+
         for p in (f_notrun, f_casenotrun):
             silent_unlink(p)
 
@@ -312,6 +323,7 @@ def do_run_test(self, test: str) -> TestResult:
                               description=f'output mismatch (see {f_bad})',
                               diff=diff, casenotrun=casenotrun)
         else:
+            unlink_asan()
             f_bad.unlink()
             return TestResult(status='pass', elapsed=elapsed,
                               casenotrun=casenotrun)
diff --git a/tests/qemu-iotests/meson.build b/tests/qemu-iotests/meson.build
index fad340ad59..56b0446827 100644
--- a/tests/qemu-iotests/meson.build
+++ b/tests/qemu-iotests/meson.build
@@ -2,14 +2,6 @@ if not have_tools or host_os == 'windows'
   subdir_done()
 endif
 
-foreach cflag: qemu_ldflags
-  if cflag.startswith('-fsanitize') and \
-     not cflag.contains('safe-stack') and not cflag.contains('cfi-icall')
-    message('Sanitizers are enabled ==> Disabled the qemu-iotests.')
-    subdir_done()
-  endif
-endforeach
-
 bash = find_program('bash', required: false, version: '>= 4.0')
 if not bash.found()
   message('bash >= v4.0 not available ==> Disabled the qemu-iotests.')
-- 
2.51.1




* [PULL 21/27] qcow2: rename update_refcount_discard to queue_discard
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (19 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 20/27] iotests: Run iotests with sanitizers Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 22/27] qcow2: put discards in discard queue when discard-no-unref is enabled Kevin Wolf
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Jean-Louis Dupond <jean-louis@dupond.be>

The function just queues discards and doesn't change any refcounts, so
rename it to match what it actually does.

Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Message-ID: <20250513132628.1055549-2-jean-louis@dupond.be>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-refcount.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 0266542cee..8fb210501c 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -754,8 +754,8 @@ void qcow2_process_discards(BlockDriverState *bs, int ret)
     }
 }
 
-static void update_refcount_discard(BlockDriverState *bs,
-                                    uint64_t offset, uint64_t length)
+static void queue_discard(BlockDriverState *bs,
+                          uint64_t offset, uint64_t length)
 {
     BDRVQcow2State *s = bs->opaque;
     Qcow2DiscardRegion *d, *p, *next;
@@ -902,7 +902,7 @@ update_refcount(BlockDriverState *bs, int64_t offset, int64_t length,
             }
 
             if (s->discard_passthrough[type]) {
-                update_refcount_discard(bs, cluster_offset, s->cluster_size);
+                queue_discard(bs, cluster_offset, s->cluster_size);
             }
         }
     }
@@ -3619,7 +3619,7 @@ qcow2_discard_refcount_block(BlockDriverState *bs, uint64_t discard_block_offs)
         /* discard refblock from the cache if refblock is cached */
         qcow2_cache_discard(s->refcount_block_cache, refblock);
     }
-    update_refcount_discard(bs, discard_block_offs, s->cluster_size);
+    queue_discard(bs, discard_block_offs, s->cluster_size);
 
     return 0;
 }
-- 
2.51.1




* [PULL 22/27] qcow2: put discards in discard queue when discard-no-unref is enabled
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (20 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 21/27] qcow2: rename update_refcount_discard to queue_discard Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 23/27] tests/qemu-iotests/184: Fix skip message for qemu-img without throttle Kevin Wolf
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Jean-Louis Dupond <jean-louis@dupond.be>

When discard-no-unref is enabled, discards are not queued as they
should be. This has been broken ever since discard-no-unref was added.

Add a helper function qcow2_discard_cluster() which handles the common
checks and calls queue_discard() if needed to add the discard request
to the queue.
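
The queueing the commit message refers to can be modeled in a few lines
(a simplified Python sketch with illustrative names; the real queue is a
C linked list of Qcow2DiscardRegion entries in block/qcow2-refcount.c,
which handles merging across both neighbours):

```python
def queue_discard(queue, offset, length):
    """Toy model of the qcow2 discard queue: merge a new region with an
    adjacent pending one when possible, otherwise remember it as a new
    entry. (Illustrative only; the real code is C.)"""
    for region in queue:
        if region["offset"] + region["length"] == offset:
            region["length"] += length          # extends an existing region
            return
        if offset + length == region["offset"]:
            region["offset"] = offset           # prepends to an existing region
            region["length"] += length
            return
    queue.append({"offset": offset, "length": length})

pending = []
queue_discard(pending, 65536, 65536)
queue_discard(pending, 131072, 65536)   # adjacent: merged, not appended
print(pending)  # [{'offset': 65536, 'length': 131072}]
```

Queued regions are then issued in one batch later (qcow2_process_discards
in the real code), instead of one bdrv_pdiscard() call per cluster.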

Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Message-ID: <20250513132628.1055549-3-jean-louis@dupond.be>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.h          |  4 ++++
 block/qcow2-cluster.c  | 16 ++++++----------
 block/qcow2-refcount.c | 17 +++++++++++++++++
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index a9e3481c6e..547bb2b814 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -880,6 +880,10 @@ void GRAPH_RDLOCK qcow2_free_clusters(BlockDriverState *bs,
 void GRAPH_RDLOCK
 qcow2_free_any_cluster(BlockDriverState *bs, uint64_t l2_entry,
                        enum qcow2_discard_type type);
+void GRAPH_RDLOCK
+qcow2_discard_cluster(BlockDriverState *bs, uint64_t offset,
+                      uint64_t length, QCow2ClusterType ctype,
+                      enum qcow2_discard_type dtype);
 
 int GRAPH_RDLOCK
 qcow2_update_snapshot_refcount(BlockDriverState *bs, int64_t l1_table_offset,
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ce8c0076b3..c655bf6df4 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1978,12 +1978,10 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
         if (!keep_reference) {
             /* Then decrease the refcount */
             qcow2_free_any_cluster(bs, old_l2_entry, type);
-        } else if (s->discard_passthrough[type] &&
-                   (cluster_type == QCOW2_CLUSTER_NORMAL ||
-                    cluster_type == QCOW2_CLUSTER_ZERO_ALLOC)) {
+        } else {
             /* If we keep the reference, pass on the discard still */
-            bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
-                          s->cluster_size);
+            qcow2_discard_cluster(bs, old_l2_entry & L2E_OFFSET_MASK,
+                                  s->cluster_size, cluster_type, type);
         }
     }
 
@@ -2092,12 +2090,10 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
             if (!keep_reference) {
                 /* Then decrease the refcount */
                 qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
-            } else if (s->discard_passthrough[QCOW2_DISCARD_REQUEST] &&
-                       (type == QCOW2_CLUSTER_NORMAL ||
-                        type == QCOW2_CLUSTER_ZERO_ALLOC)) {
+            } else {
                 /* If we keep the reference, pass on the discard still */
-                bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
-                            s->cluster_size);
+                qcow2_discard_cluster(bs, old_l2_entry & L2E_OFFSET_MASK,
+                            s->cluster_size, type, QCOW2_DISCARD_REQUEST);
             }
         }
     }
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8fb210501c..6512cda407 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1205,6 +1205,23 @@ void qcow2_free_any_cluster(BlockDriverState *bs, uint64_t l2_entry,
     }
 }
 
+void qcow2_discard_cluster(BlockDriverState *bs, uint64_t offset,
+                           uint64_t length, QCow2ClusterType ctype,
+                           enum qcow2_discard_type dtype)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    if (s->discard_passthrough[dtype] &&
+        (ctype == QCOW2_CLUSTER_NORMAL ||
+         ctype == QCOW2_CLUSTER_ZERO_ALLOC)) {
+        if (has_data_file(bs)) {
+            bdrv_pdiscard(s->data_file, offset, length);
+        } else {
+            queue_discard(bs, offset, length);
+        }
+    }
+}
+
 int qcow2_write_caches(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
-- 
2.51.1




* [PULL 23/27] tests/qemu-iotests/184: Fix skip message for qemu-img without throttle
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (21 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 22/27] qcow2: put discards in discard queue when discard-no-unref is enabled Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 24/27] tests/qemu-iotests: Improve the dry run list to speed up thorough testing Kevin Wolf
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Thomas Huth <thuth@redhat.com>

If qemu-img does not support throttling, test 184 currently skips
with the message:

  not suitable for this image format: raw

But that's wrong: it's not about the image format, it's about throttling
not being available in qemu-img. Fix this by using _notrun with a proper
message instead.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-2-thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/184 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/184 b/tests/qemu-iotests/184
index 6d0afe9d38..9248b3265d 100755
--- a/tests/qemu-iotests/184
+++ b/tests/qemu-iotests/184
@@ -51,7 +51,7 @@ run_qemu()
 }
 
 test_throttle=$($QEMU_IMG --help|grep throttle)
-[ "$test_throttle" = "" ] && _supported_fmt throttle
+[ "$test_throttle" = "" ] && _notrun "qemu-img does not support throttle"
 
 echo
 echo "== checking interface =="
-- 
2.51.1




* [PULL 24/27] tests/qemu-iotests: Improve the dry run list to speed up thorough testing
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (22 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 23/27] tests/qemu-iotests/184: Fix skip message for qemu-img without throttle Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 25/27] tests/qemu-iotest: Add more image formats to the " Kevin Wolf
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Thomas Huth <thuth@redhat.com>

When running the tests in thorough mode, e.g. with:

 make -j$(nproc) check SPEED=thorough

we currently always get a huge number of tests in total that the test
runner tries to execute (2457 in my case), but a large share of them
are only skipped (1099 in my case, meaning that only 1358 got
executed). This happens because we try to run the whole set of iotests
for multiple image formats while many of the tests can only run with
one certain format and thus are marked as SKIP during execution. This
is quite a waste of time during each test run, and it also
unnecessarily blows up the displayed list of executed tests in the
console output.

Thus let's try to be a little bit smarter: If the "check" script is run
with "-n" and an image format switch (like "-qed") at the same time (which
is what we do for discovering the tests for the meson test runner already),
only report the tests that likely support the given format instead of
providing the whole list of all tests. We can determine whether a test
supports a format or not by looking at the lines in the file that contain
a "supported_fmt" or "unsupported_fmt" statement. This is only a heuristic,
of course, but it is good enough for running the iotests via "make
check-block" - I double-checked that the list of executed tests does not
get changed by this patch, it's only the tests that are skipped anyway that
are now not run anymore.
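
As a rough illustration of why the start/end delimiters in the heuristic
matter (the example line is made up, but sd/ed are the same patterns the
patch uses):

```python
import re

sd = "[ \t'\"]"           # start delimiter, as in the patch
ed = "([ \t'\"]|$)"       # end delimiter, as in the patch

# A made-up line as it might appear in a test script:
line = "_supported_fmt qcow2 qed\n"

print(bool(re.search(sd + "qed" + ed, line)))   # True: matches as a whole word
print(bool(re.search(sd + "qcow" + ed, line)))  # False: only a prefix of "qcow2"
```

Without the delimiters, checking for "qcow" would wrongly report a match
in a test that only supports qcow2.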

This way the total number of tests drops from 2457 to 1432 for me, and
the number of skipped tests drops from 1099 to just 74 (meaning that we
still properly run 1432 - 74 = 1358 tests, as before).

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-3-thuth@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/check | 42 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index d9b7c1d598..3941eac8e2 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -17,6 +17,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 import os
+import re
 import sys
 import argparse
 import shutil
@@ -82,7 +83,7 @@ def make_argparser() -> argparse.ArgumentParser:
     g_env.add_argument('-i', dest='aiomode', default='threads',
                        help='sets AIOMODE environment variable')
 
-    p.set_defaults(imgfmt='raw', imgproto='file')
+    p.set_defaults(imgproto='file')
 
     format_list = ['raw', 'bochs', 'cloop', 'parallels', 'qcow', 'qcow2',
                    'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg', 'vvfat']
@@ -137,15 +138,50 @@ def make_argparser() -> argparse.ArgumentParser:
     return p
 
 
+def dry_run_list(test_dir, imgfmt, testlist):
+    for t in testlist:
+        if not imgfmt:
+            print('\n'.join([os.path.basename(t)]))
+            continue
+        # If a format has been given, we look for the "supported_fmt"
+        # and the "unsupported_fmt" lines in the test and try to find out
+        # whether the format is supported or not. This is only heuristics
+        # (it can e.g. fail if the "unsupported_fmts" and "supported_fmts"
+        # statements are in the same line), but it should be good enough
+        # to get a proper list for "make check-block"
+        with open(os.path.join(test_dir, t), 'r', encoding='utf-8') as fh:
+            supported = True
+            check_next_line = False
+            sd = "[ \t'\"]"           # Start delimiter
+            ed = "([ \t'\"]|$)"       # End delimiter
+            for line in fh:
+                if 'unsupported_fmt' in line:
+                    if re.search(sd + imgfmt + ed, line):
+                        supported = False
+                        break
+                elif 'supported_fmt' in line or check_next_line:
+                    if re.search(sd + 'generic' + ed, line):
+                        continue      # Might be followed by "unsupported" line
+                    supported = re.search(sd + imgfmt + ed, line)
+                    check_next_line = not ']' in line and \
+                            ('supported_fmts=[' in line or check_next_line)
+                    if supported or not check_next_line:
+                        break
+            if supported:
+                print('\n'.join([os.path.basename(t)]))
+
+
 if __name__ == '__main__':
     warnings.simplefilter("default")
     os.environ["PYTHONWARNINGS"] = "default"
 
     args = make_argparser().parse_args()
 
+    image_format = args.imgfmt or 'raw'
+
     env = TestEnv(source_dir=args.source_dir,
                   build_dir=args.build_dir,
-                  imgfmt=args.imgfmt, imgproto=args.imgproto,
+                  imgfmt=image_format, imgproto=args.imgproto,
                   aiomode=args.aiomode, cachemode=args.cachemode,
                   imgopts=args.imgopts, misalign=args.misalign,
                   debug=args.debug, valgrind=args.valgrind,
@@ -189,7 +225,7 @@ if __name__ == '__main__':
 
     if args.dry_run:
         with env:
-            print('\n'.join([os.path.basename(t) for t in tests]))
+            dry_run_list(env.source_iotests, args.imgfmt, tests)
     else:
         with TestRunner(env, tap=args.tap,
                         color=args.color) as tr:
-- 
2.51.1




* [PULL 25/27] tests/qemu-iotest: Add more image formats to the thorough testing
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (23 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 24/27] tests/qemu-iotests: Improve the dry run list to speed up thorough testing Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 26/27] block: Allow drivers to control protocol prefix at creation Kevin Wolf
  2025-11-04 17:54 ` [PULL 27/27] qcow2, vmdk: Restrict creation with secondary file using protocol Kevin Wolf
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Thomas Huth <thuth@redhat.com>

Now that the "check" script is a little bit smarter with providing
a list of tests that are supported for an image format, we can also
add more image formats that can be used for generic block layer
testing. (Note: qcow1 and luks are not added because some tests
there currently fail, and other formats like bochs, cloop, dmg and
vvfat do not work with the generic tests and thus would only get
skipped if we tried to add them here.)

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-4-thuth@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/meson.build | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/meson.build b/tests/qemu-iotests/meson.build
index 56b0446827..d7bae71ced 100644
--- a/tests/qemu-iotests/meson.build
+++ b/tests/qemu-iotests/meson.build
@@ -13,7 +13,10 @@ qemu_iotests_env = {'PYTHON': python.full_path()}
 qemu_iotests_formats = {
   'qcow2': 'quick',
   'raw': 'slow',
+  'parallels': 'thorough',
   'qed': 'thorough',
+  'vdi': 'thorough',
+  'vhdx': 'thorough',
   'vmdk': 'thorough',
   'vpc': 'thorough'
 }
-- 
2.51.1




* [PULL 26/27] block: Allow drivers to control protocol prefix at creation
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (24 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 25/27] tests/qemu-iotest: Add more image formats to the " Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  2025-11-04 17:54 ` [PULL 27/27] qcow2, vmdk: Restrict creation with secondary file using protocol Kevin Wolf
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Eric Blake <eblake@redhat.com>

This patch is pure refactoring: instead of hard-coding permission to
use a protocol prefix when creating an image, the drivers can now pass
in a parameter, comparable to what they could already do for opening a
pre-existing image.  This patch is purely mechanical (all drivers pass
in true for now), but it will enable the next patch to cater to
drivers that want to differ in behavior for the primary image vs. any
secondary images that are opened at the same time as creating the
primary image.
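
Whether a filename carries a protocol prefix is, roughly, a question of
whether a ':' appears before any '/'. A simplified sketch (illustrative
only; QEMU's real check in C also special-cases things like Windows
drive letters):

```python
def path_has_protocol(filename):
    """Rough model of protocol-prefix detection: 'nbd://host/...' or
    'file:foo' has a prefix, while plain relative or absolute paths do
    not. (Simplified; not the actual QEMU implementation.)"""
    colon = filename.find(':')
    slash = filename.find('/')
    return colon != -1 and (slash == -1 or colon < slash)

print(path_has_protocol("nbd://localhost:10809/"))   # True
print(path_has_protocol("/var/lib/images/a.qcow2"))  # False
print(path_has_protocol("relative-name.qcow2"))      # False
```

With allow_protocol_prefix=false, a name that would match a protocol is
instead treated as a plain (and here nonexistent) local file.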

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-5-eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-global-state.h | 3 ++-
 block.c                            | 4 ++--
 block/crypto.c                     | 2 +-
 block/parallels.c                  | 2 +-
 block/qcow.c                       | 2 +-
 block/qcow2.c                      | 4 ++--
 block/qed.c                        | 2 +-
 block/raw-format.c                 | 2 +-
 block/vdi.c                        | 2 +-
 block/vhdx.c                       | 2 +-
 block/vmdk.c                       | 2 +-
 block/vpc.c                        | 2 +-
 12 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 62da83c616..479ca2858e 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -65,7 +65,8 @@ int co_wrapper bdrv_create(BlockDriver *drv, const char *filename,
                            QemuOpts *opts, Error **errp);
 
 int coroutine_fn GRAPH_UNLOCKED
-bdrv_co_create_file(const char *filename, QemuOpts *opts, Error **errp);
+bdrv_co_create_file(const char *filename, QemuOpts *opts,
+                    bool allow_protocol_prefix, Error **errp);
 
 BlockDriverState *bdrv_new(void);
 int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
diff --git a/block.c b/block.c
index 0fe0152a7c..4f1581cedf 100644
--- a/block.c
+++ b/block.c
@@ -693,7 +693,7 @@ out:
 }
 
 int coroutine_fn bdrv_co_create_file(const char *filename, QemuOpts *opts,
-                                     Error **errp)
+                                     bool allow_protocol_prefix, Error **errp)
 {
     QemuOpts *protocol_opts;
     BlockDriver *drv;
@@ -702,7 +702,7 @@ int coroutine_fn bdrv_co_create_file(const char *filename, QemuOpts *opts,
 
     GLOBAL_STATE_CODE();
 
-    drv = bdrv_find_protocol(filename, true, errp);
+    drv = bdrv_find_protocol(filename, allow_protocol_prefix, errp);
     if (drv == NULL) {
         return -ENOENT;
     }
diff --git a/block/crypto.c b/block/crypto.c
index 7c37b23e36..b97d027444 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -835,7 +835,7 @@ block_crypto_co_create_opts_luks(BlockDriver *drv, const char *filename,
     }
 
     /* Create protocol layer */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto fail;
     }
diff --git a/block/parallels.c b/block/parallels.c
index 3a375e2a8a..7a90fb5220 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1117,7 +1117,7 @@ parallels_co_create_opts(BlockDriver *drv, const char *filename,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto done;
     }
diff --git a/block/qcow.c b/block/qcow.c
index b442bfe835..3d37d26ee8 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -978,7 +978,7 @@ qcow_co_create_opts(BlockDriver *drv, const char *filename,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto fail;
     }
diff --git a/block/qcow2.c b/block/qcow2.c
index 4aa9f9e068..ec72e27214 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3956,7 +3956,7 @@ qcow2_co_create_opts(BlockDriver *drv, const char *filename, QemuOpts *opts,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto finish;
     }
@@ -3971,7 +3971,7 @@ qcow2_co_create_opts(BlockDriver *drv, const char *filename, QemuOpts *opts,
     /* Create and open an external data file (protocol layer) */
     val = qdict_get_try_str(qdict, BLOCK_OPT_DATA_FILE);
     if (val) {
-        ret = bdrv_co_create_file(val, opts, errp);
+        ret = bdrv_co_create_file(val, opts, true, errp);
         if (ret < 0) {
             goto finish;
         }
diff --git a/block/qed.c b/block/qed.c
index 4a36fb3929..da23a83d62 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -788,7 +788,7 @@ bdrv_qed_co_create_opts(BlockDriver *drv, const char *filename,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto fail;
     }
diff --git a/block/raw-format.c b/block/raw-format.c
index df16ac1ea2..a57c2922d5 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -463,7 +463,7 @@ static int coroutine_fn GRAPH_UNLOCKED
 raw_co_create_opts(BlockDriver *drv, const char *filename,
                    QemuOpts *opts, Error **errp)
 {
-    return bdrv_co_create_file(filename, opts, errp);
+    return bdrv_co_create_file(filename, opts, true, errp);
 }
 
 static int raw_open(BlockDriverState *bs, QDict *options, int flags,
diff --git a/block/vdi.c b/block/vdi.c
index 3ddc62a569..87b874a7ef 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -938,7 +938,7 @@ vdi_co_create_opts(BlockDriver *drv, const char *filename,
     qdict = qemu_opts_to_qdict_filtered(opts, NULL, &vdi_create_opts, true);
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto done;
     }
diff --git a/block/vhdx.c b/block/vhdx.c
index b2a4b813a0..c16e4a00c8 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -2096,7 +2096,7 @@ vhdx_co_create_opts(BlockDriver *drv, const char *filename,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto fail;
     }
diff --git a/block/vmdk.c b/block/vmdk.c
index 7b98debc2b..eb3c174eca 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2334,7 +2334,7 @@ vmdk_create_extent(const char *filename, int64_t filesize, bool flat,
     int ret;
     BlockBackend *blk = NULL;
 
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto exit;
     }
diff --git a/block/vpc.c b/block/vpc.c
index 801ff5793f..07e8ae0309 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -1118,7 +1118,7 @@ vpc_co_create_opts(BlockDriver *drv, const char *filename,
     }
 
     /* Create and open the file (protocol layer) */
-    ret = bdrv_co_create_file(filename, opts, errp);
+    ret = bdrv_co_create_file(filename, opts, true, errp);
     if (ret < 0) {
         goto fail;
     }
-- 
2.51.1




* [PULL 27/27] qcow2, vmdk: Restrict creation with secondary file using protocol
  2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
                   ` (25 preceding siblings ...)
  2025-11-04 17:54 ` [PULL 26/27] block: Allow drivers to control protocol prefix at creation Kevin Wolf
@ 2025-11-04 17:54 ` Kevin Wolf
  26 siblings, 0 replies; 31+ messages in thread
From: Kevin Wolf @ 2025-11-04 17:54 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Eric Blake <eblake@redhat.com>

Ever since CVE-2024-4467 (see commit 7ead9469 in qemu v9.1.0), we have
intentionally treated the opening of secondary files whose name is
specified in the contents of the primary file, such as a qcow2
data_file, as something that must be a local file and not a protocol
prefix (it is still possible to open a qcow2 file that wraps an NBD
data image by using QMP commands, but that is from the explicit action
of the QMP overriding any string encoded in the qcow2 file).  At the
time, we did not prevent the use of protocol prefixes on the secondary
image while creating a qcow2 file, but doing so results in a qcow2 file
that records an empty string for the data_file, rather than the
protocol passed in during creation:

$ qemu-img create -f raw datastore.raw 2G
$ qemu-nbd -e 0 -t -f raw datastore.raw &
$ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \
  datastore_nbd.qcow2 2G
Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16
$ qemu-img info datastore_nbd.qcow2 | grep data
image: datastore_nbd.qcow2
    data file:
    data file raw: false
    filename: datastore_nbd.qcow2

And since an empty string was recorded in the file, attempting to open
the image without using QMP to supply the NBD data store fails, with a
somewhat confusing error message:

$ qemu-io -f qcow2 datastore_nbd.qcow2
qemu-io: can't open device datastore_nbd.qcow2: The 'file' block driver requires a file name

Although the ability to create an image with a convenience reference
to a protocol data file is not a security hole (unlike the case with
open, the image is not untrusted if we are the ones creating it), the
above demo shows that it is still inconsistent.  Thus, it makes more
sense if we also insist that image creation rejects a protocol prefix
when using the same syntax.  Now, the above attempt produces:

$ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \
  datastore_nbd.qcow2 2G
Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16
qemu-img: datastore_nbd.qcow2: Could not create 'nbd://localhost:10809/': No such file or directory

with datastore_nbd.qcow2 no longer created.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-6-eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 2 +-
 block/vmdk.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ec72e27214..cb0bdb32ec 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3971,7 +3971,7 @@ qcow2_co_create_opts(BlockDriver *drv, const char *filename, QemuOpts *opts,
     /* Create and open an external data file (protocol layer) */
     val = qdict_get_try_str(qdict, BLOCK_OPT_DATA_FILE);
     if (val) {
-        ret = bdrv_co_create_file(val, opts, true, errp);
+        ret = bdrv_co_create_file(val, opts, false, errp);
         if (ret < 0) {
             goto finish;
         }
diff --git a/block/vmdk.c b/block/vmdk.c
index eb3c174eca..3b35b63cb5 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2334,7 +2334,7 @@ vmdk_create_extent(const char *filename, int64_t filesize, bool flat,
     int ret;
     BlockBackend *blk = NULL;
 
-    ret = bdrv_co_create_file(filename, opts, true, errp);
+    ret = bdrv_co_create_file(filename, opts, false, errp);
     if (ret < 0) {
         goto exit;
     }
-- 
2.51.1




* Re: [PULL 06/27] aio-posix: integrate fdmon into glib event loop
  2025-11-04 17:53 ` [PULL 06/27] aio-posix: integrate fdmon into glib event loop Kevin Wolf
@ 2025-11-05 15:06   ` Richard Henderson
  0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2025-11-05 15:06 UTC (permalink / raw)
  To: qemu-devel

On 11/4/25 18:53, Kevin Wolf wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
> 
> AioContext's glib integration only supports ppoll(2) file descriptor
> monitoring. epoll(7) and io_uring(7) disable themselves and switch back
> to ppoll(2) when the glib event loop is used. The main loop thread
> cannot use epoll(7) or io_uring(7) because it always uses the glib event
> loop.
> 
> Future QEMU features may require io_uring(7). One example is uring_cmd
> support in FUSE exports. Each feature could create its own io_uring(7)
> context and integrate it into the event loop, but this is inefficient
> due to extra syscalls. It would be more efficient to reuse the
> AioContext's existing fdmon-io_uring.c io_uring(7) context because
> fdmon-io_uring.c will already be active on systems where Linux io_uring
> is available.
> 
> In order to keep fdmon-io_uring.c's AioContext operational even when the
> glib event loop is used, extend FDMonOps with an API similar to
> GSourceFuncs so that file descriptor monitoring can integrate into the
> glib event loop.
> 
> A quick summary of the GSourceFuncs API:
> - prepare() is called each event loop iteration before waiting for file
>    descriptors and timers.
> - check() is called to determine whether events are ready to be
>    dispatched after waiting.
> - dispatch() is called to process events.
> 
> More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html
> 
> Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
> also implement epoll(7)- and io_uring(7)-specific file descriptor
> monitoring code for glib event loops.
> 
> Note that it's still faster to use aio_poll() rather than the glib event
> loop since glib waits for file descriptor activity with ppoll(2) and
> does not support adaptive polling. But at least epoll(7) and io_uring(7)
> now work in glib event loops.
> 
> Splitting this into multiple commits without temporarily breaking
> AioContext proved difficult so this commit makes all the changes. The
> next commit will remove the aio_context_use_g_source() API because it is
> no longer needed.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/block/aio.h   | 36 ++++++++++++++++++
>   util/aio-posix.h      |  5 +++
>   tests/unit/test-aio.c |  7 +++-
>   util/aio-posix.c      | 69 ++++++++---------------------------
>   util/fdmon-epoll.c    | 34 ++++++++++++++---
>   util/fdmon-io_uring.c | 44 +++++++++++++++++++++-
>   util/fdmon-poll.c     | 85 ++++++++++++++++++++++++++++++++++++++++++-
>   7 files changed, 218 insertions(+), 62 deletions(-)
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 99ff48420b..39ed86d14d 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -106,6 +106,38 @@ typedef struct {
>        * Returns: true if ->wait() should be called, false otherwise.
>        */
>       bool (*need_wait)(AioContext *ctx);
> +
> +    /*
> +     * gsource_prepare:
> +     * @ctx: the AioContext
> +     *
> +     * Prepare for the glib event loop to wait for events instead of the usual
> +     * ->wait() call. See glib's GSourceFuncs->prepare().
> +     */
> +    void (*gsource_prepare)(AioContext *ctx);
> +
> +    /*
> +     * gsource_check:
> +     * @ctx: the AioContext
> +     *
> +     * Called by the glib event loop from glib's GSourceFuncs->check() after
> +     * waiting for events.
> +     *
> +     * Returns: true when ready to be dispatched.
> +     */
> +    bool (*gsource_check)(AioContext *ctx);
> +
> +    /*
> +     * gsource_dispatch:
> +     * @ctx: the AioContext
> +     * @ready_list: list for handlers that become ready
> +     *
> +     * Place ready AioHandlers on ready_list. Called as part of the glib event
> +     * loop from glib's GSourceFuncs->dispatch().
> +     *
> +     * Called with list_lock incremented.
> +     */
> +    void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
>   } FDMonOps;
>   
>   /*
> @@ -222,6 +254,7 @@ struct AioContext {
>       /* State for file descriptor monitoring using Linux io_uring */
>       struct io_uring fdmon_io_uring;
>       AioHandlerSList submit_list;
> +    gpointer io_uring_fd_tag;
>   #endif
>   
>       /* TimerLists for calling timers - one per clock type.  Has its own
> @@ -254,6 +287,9 @@ struct AioContext {
>       /* epoll(7) state used when built with CONFIG_EPOLL */
>       int epollfd;
>   
> +    /* The GSource unix fd tag for epollfd */
> +    gpointer epollfd_tag;
> +
>       const FDMonOps *fdmon_ops;
>   };
>   
> diff --git a/util/aio-posix.h b/util/aio-posix.h
> index 82a0201ea4..f9994ed79e 100644
> --- a/util/aio-posix.h
> +++ b/util/aio-posix.h
> @@ -47,9 +47,14 @@ void aio_add_ready_handler(AioHandlerList *ready_list, AioHandler *node,
>   
>   extern const FDMonOps fdmon_poll_ops;
>   
> +/* Switch back to poll(2). list_lock must be held. */
> +void fdmon_poll_downgrade(AioContext *ctx);
> +
>   #ifdef CONFIG_EPOLL_CREATE1
>   bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd);
>   void fdmon_epoll_setup(AioContext *ctx);
> +
> +/* list_lock must be held */
>   void fdmon_epoll_disable(AioContext *ctx);
>   #else
>   static inline bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
> diff --git a/tests/unit/test-aio.c b/tests/unit/test-aio.c
> index e77d86be87..010d65b79a 100644
> --- a/tests/unit/test-aio.c
> +++ b/tests/unit/test-aio.c
> @@ -527,7 +527,12 @@ static void test_source_bh_delete_from_cb(void)
>       g_assert_cmpint(data1.n, ==, data1.max);
>       g_assert(data1.bh == NULL);
>   
> -    assert(g_main_context_iteration(NULL, false));
> +    /*
> +     * There may be up to one more iteration due to the aio_notify
> +     * EventNotifier.
> +     */
> +    g_main_context_iteration(NULL, false);
> +
>       assert(!g_main_context_iteration(NULL, false));
>   }
>   
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 824fdc34cc..9de05ee7e8 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -70,15 +70,6 @@ static AioHandler *find_aio_handler(AioContext *ctx, int fd)
>   
>   static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
>   {
> -    /* If the GSource is in the process of being destroyed then
> -     * g_source_remove_poll() causes an assertion failure.  Skip
> -     * removal in that case, because glib cleans up its state during
> -     * destruction anyway.
> -     */
> -    if (!g_source_is_destroyed(&ctx->source)) {
> -        g_source_remove_poll(&ctx->source, &node->pfd);
> -    }
> -
>       node->pfd.revents = 0;
>       node->poll_ready = false;
>   
> @@ -153,7 +144,6 @@ void aio_set_fd_handler(AioContext *ctx,
>           } else {
>               new_node->pfd = node->pfd;
>           }
> -        g_source_add_poll(&ctx->source, &new_node->pfd);
>   
>           new_node->pfd.events = (io_read ? G_IO_IN | G_IO_HUP | G_IO_ERR : 0);
>           new_node->pfd.events |= (io_write ? G_IO_OUT | G_IO_ERR : 0);
> @@ -267,37 +257,13 @@ bool aio_prepare(AioContext *ctx)
>       poll_set_started(ctx, &ready_list, false);
>       /* TODO what to do with this list? */
>   
> +    ctx->fdmon_ops->gsource_prepare(ctx);
>       return false;
>   }
>   
>   bool aio_pending(AioContext *ctx)
>   {
> -    AioHandler *node;
> -    bool result = false;
> -
> -    /*
> -     * We have to walk very carefully in case aio_set_fd_handler is
> -     * called while we're walking.
> -     */
> -    qemu_lockcnt_inc(&ctx->list_lock);
> -
> -    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
> -        int revents;
> -
> -        /* TODO should this check poll ready? */
> -        revents = node->pfd.revents & node->pfd.events;
> -        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
> -            result = true;
> -            break;
> -        }
> -        if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
> -            result = true;
> -            break;
> -        }
> -    }
> -    qemu_lockcnt_dec(&ctx->list_lock);
> -
> -    return result;
> +    return ctx->fdmon_ops->gsource_check(ctx);
>   }
>   
>   static void aio_free_deleted_handlers(AioContext *ctx)
> @@ -390,10 +356,6 @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
>       return progress;
>   }
>   
> -/*
> - * If we have a list of ready handlers then this is more efficient than
> - * scanning all handlers with aio_dispatch_handlers().
> - */
>   static bool aio_dispatch_ready_handlers(AioContext *ctx,
>                                           AioHandlerList *ready_list,
>                                           int64_t block_ns)
> @@ -417,24 +379,18 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
>       return progress;
>   }
>   
> -/* Slower than aio_dispatch_ready_handlers() but only used via glib */
> -static bool aio_dispatch_handlers(AioContext *ctx)
> -{
> -    AioHandler *node, *tmp;
> -    bool progress = false;
> -
> -    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
> -        progress = aio_dispatch_handler(ctx, node) || progress;
> -    }
> -
> -    return progress;
> -}
> -
>   void aio_dispatch(AioContext *ctx)
>   {
> +    AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
> +
>       qemu_lockcnt_inc(&ctx->list_lock);
>       aio_bh_poll(ctx);
> -    aio_dispatch_handlers(ctx);
> +
> +    ctx->fdmon_ops->gsource_dispatch(ctx, &ready_list);
> +
> +    /* block_ns is 0 because polling is disabled in the glib event loop */
> +    aio_dispatch_ready_handlers(ctx, &ready_list, 0);
> +
>       aio_free_deleted_handlers(ctx);
>       qemu_lockcnt_dec(&ctx->list_lock);
>   
> @@ -766,6 +722,7 @@ void aio_context_setup(AioContext *ctx)
>   {
>       ctx->fdmon_ops = &fdmon_poll_ops;
>       ctx->epollfd = -1;
> +    ctx->epollfd_tag = NULL;
>   
>       /* Use the fastest fd monitoring implementation if available */
>       if (fdmon_io_uring_setup(ctx)) {
> @@ -778,7 +735,11 @@ void aio_context_setup(AioContext *ctx)
>   void aio_context_destroy(AioContext *ctx)
>   {
>       fdmon_io_uring_destroy(ctx);
> +
> +    qemu_lockcnt_lock(&ctx->list_lock);
>       fdmon_epoll_disable(ctx);
> +    qemu_lockcnt_unlock(&ctx->list_lock);
> +
>       aio_free_deleted_handlers(ctx);
>   }
>   
> diff --git a/util/fdmon-epoll.c b/util/fdmon-epoll.c
> index 9fb8800dde..61118e1ee6 100644
> --- a/util/fdmon-epoll.c
> +++ b/util/fdmon-epoll.c
> @@ -19,8 +19,12 @@ void fdmon_epoll_disable(AioContext *ctx)
>           ctx->epollfd = -1;
>       }
>   
> -    /* Switch back */
> -    ctx->fdmon_ops = &fdmon_poll_ops;
> +    if (ctx->epollfd_tag) {
> +        g_source_remove_unix_fd(&ctx->source, ctx->epollfd_tag);
> +        ctx->epollfd_tag = NULL;
> +    }
> +
> +    fdmon_poll_downgrade(ctx);
>   }
>   
>   static inline int epoll_events_from_pfd(int pfd_events)
> @@ -93,10 +97,29 @@ out:
>       return ret;
>   }
>   
> +static void fdmon_epoll_gsource_prepare(AioContext *ctx)
> +{
> +    /* Do nothing */
> +}
> +
> +static bool fdmon_epoll_gsource_check(AioContext *ctx)
> +{
> +    return g_source_query_unix_fd(&ctx->source, ctx->epollfd_tag) & G_IO_IN;
> +}
> +
> +static void fdmon_epoll_gsource_dispatch(AioContext *ctx,
> +                                         AioHandlerList *ready_list)
> +{
> +    fdmon_epoll_wait(ctx, ready_list, 0);
> +}
> +
>   static const FDMonOps fdmon_epoll_ops = {
>       .update = fdmon_epoll_update,
>       .wait = fdmon_epoll_wait,
>       .need_wait = aio_poll_disabled,
> +    .gsource_prepare = fdmon_epoll_gsource_prepare,
> +    .gsource_check = fdmon_epoll_gsource_check,
> +    .gsource_dispatch = fdmon_epoll_gsource_dispatch,
>   };
>   
>   static bool fdmon_epoll_try_enable(AioContext *ctx)
> @@ -118,6 +141,8 @@ static bool fdmon_epoll_try_enable(AioContext *ctx)
>       }
>   
>       ctx->fdmon_ops = &fdmon_epoll_ops;
> +    ctx->epollfd_tag = g_source_add_unix_fd(&ctx->source, ctx->epollfd,
> +                                            G_IO_IN);
>       return true;
>   }
>   
> @@ -139,12 +164,11 @@ bool fdmon_epoll_try_upgrade(AioContext *ctx, unsigned npfd)
>       }
>   
>       ok = fdmon_epoll_try_enable(ctx);
> -
> -    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
> -
>       if (!ok) {
>           fdmon_epoll_disable(ctx);
>       }
> +
> +    qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
>       return ok;
>   }
>   
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 3d8638b0e5..0a5ec5ead6 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -262,6 +262,11 @@ static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
>       unsigned num_ready = 0;
>       unsigned head;
>   
> +    /* If the CQ overflowed then fetch CQEs with a syscall */
> +    if (io_uring_cq_has_overflow(ring)) {
> +        io_uring_get_events(ring);
> +    }


https://gitlab.com/qemu-project/qemu/-/jobs/11984045425#L2379


../util/fdmon-io_uring.c: In function 'process_cq_ring':
../util/fdmon-io_uring.c:315:9: error: implicit declaration of function 
'io_uring_cq_has_overflow' [-Werror=implicit-function-declaration]
   315 |     if (io_uring_cq_has_overflow(ring)) {
       |         ^~~~~~~~~~~~~~~~~~~~~~~~
../util/fdmon-io_uring.c:315:9: error: nested extern declaration of 
'io_uring_cq_has_overflow' [-Werror=nested-externs]
../util/fdmon-io_uring.c:316:9: error: implicit declaration of function 
'io_uring_get_events'; did you mean 'io_uring_get_sqe'? 
[-Werror=implicit-function-declaration]
   316 |         io_uring_get_events(ring);
       |         ^~~~~~~~~~~~~~~~~~~
       |         io_uring_get_sqe
../util/fdmon-io_uring.c:316:9: error: nested extern declaration of 'io_uring_get_events' 
[-Werror=nested-externs]


r~



Thread overview: 31+ messages
2025-11-04 17:53 [PULL 00/27] Block layer patches Kevin Wolf
2025-11-04 17:53 ` [PULL 01/27] aio-posix: fix race between io_uring CQE and AioHandler deletion Kevin Wolf
2025-11-04 17:53 ` [PULL 02/27] aio-posix: fix fdmon-io_uring.c timeout stack variable lifetime Kevin Wolf
2025-11-04 17:53 ` [PULL 03/27] aio-posix: fix spurious return from ->wait() due to signals Kevin Wolf
2025-11-04 17:53 ` [PULL 04/27] aio-posix: keep polling enabled with fdmon-io_uring.c Kevin Wolf
2025-11-04 17:53 ` [PULL 05/27] tests/unit: skip test-nested-aio-poll with io_uring Kevin Wolf
2025-11-04 17:53 ` [PULL 06/27] aio-posix: integrate fdmon into glib event loop Kevin Wolf
2025-11-05 15:06   ` Richard Henderson
2025-11-04 17:53 ` [PULL 07/27] aio: remove aio_context_use_g_source() Kevin Wolf
2025-11-04 17:53 ` [PULL 08/27] aio: free AioContext when aio_context_new() fails Kevin Wolf
2025-11-04 17:53 ` [PULL 09/27] aio: add errp argument to aio_context_setup() Kevin Wolf
2025-11-04 17:53 ` [PULL 10/27] aio-posix: gracefully handle io_uring_queue_init() failure Kevin Wolf
2025-11-04 17:53 ` [PULL 11/27] aio-posix: unindent fdmon_io_uring_destroy() Kevin Wolf
2025-11-04 17:54 ` [PULL 12/27] aio-posix: add fdmon_ops->dispatch() Kevin Wolf
2025-11-04 17:54 ` [PULL 13/27] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Kevin Wolf
2025-11-04 17:54 ` [PULL 14/27] block/io_uring: use aio_add_sqe() Kevin Wolf
2025-11-04 17:54 ` [PULL 15/27] block/io_uring: use non-vectored read/write when possible Kevin Wolf
2025-11-04 17:54 ` [PULL 16/27] block: replace TABs with space Kevin Wolf
2025-11-04 17:54 ` [PULL 17/27] block: Drop detach_subchain for bdrv_replace_node Kevin Wolf
2025-11-04 17:54 ` [PULL 18/27] iotests: Test resizing file node under raw with size/offset Kevin Wolf
2025-11-04 17:54 ` [PULL 19/27] qemu-img: Fix amend option parse error handling Kevin Wolf
2025-11-04 17:54 ` [PULL 20/27] iotests: Run iotests with sanitizers Kevin Wolf
2025-11-04 17:54 ` [PULL 21/27] qcow2: rename update_refcount_discard to queue_discard Kevin Wolf
2025-11-04 17:54 ` [PULL 22/27] qcow2: put discards in discard queue when discard-no-unref is enabled Kevin Wolf
2025-11-04 17:54 ` [PULL 23/27] tests/qemu-iotests/184: Fix skip message for qemu-img without throttle Kevin Wolf
2025-11-04 17:54 ` [PULL 24/27] tests/qemu-iotests: Improve the dry run list to speed up thorough testing Kevin Wolf
2025-11-04 17:54 ` [PULL 25/27] tests/qemu-iotest: Add more image formats to the " Kevin Wolf
2025-11-04 17:54 ` [PULL 26/27] block: Allow drivers to control protocol prefix at creation Kevin Wolf
2025-11-04 17:54 ` [PULL 27/27] qcow2, vmdk: Restrict creation with secondary file using protocol Kevin Wolf
  -- strict thread matches above, loose matches on Subject: below --
2023-10-31 18:58 [PULL 00/27] Block layer patches Kevin Wolf
2023-10-31 23:31 ` Stefan Hajnoczi
