* [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
@ 2025-08-30  2:50 Brian Song
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Brian Song @ 2025-08-30  2:50 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
	stefanha

Hi all,

This is a GSoC project. More details are available here:
https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports

This patch series includes:
- Add a round-robin mechanism to distribute the kernel-required Ring
Queues to FUSE Queues
- Support multiple in-flight requests (multiple ring entries)
- Add tests for FUSE-over-io_uring

More detail in the v2 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html

And in the v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html


Brian Song (4):
  export/fuse: add opt to enable FUSE-over-io_uring
  export/fuse: process FUSE-over-io_uring requests
  export/fuse: Safe termination for FUSE-uring
  iotests: add tests for FUSE-over-io_uring

 block/export/fuse.c                  | 838 +++++++++++++++++++++------
 docs/tools/qemu-storage-daemon.rst   |  11 +-
 qapi/block-export.json               |   5 +-
 storage-daemon/qemu-storage-daemon.c |   1 +
 tests/qemu-iotests/check             |   2 +
 tests/qemu-iotests/common.rc         |  45 +-
 util/fdmon-io_uring.c                |   5 +-
 7 files changed, 717 insertions(+), 190 deletions(-)

-- 
2.45.2




* [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
@ 2025-08-30  2:50 ` Brian Song
  2025-09-03 10:53   ` Stefan Hajnoczi
                     ` (2 more replies)
  2025-08-30  2:50 ` [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests Brian Song
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 38+ messages in thread
From: Brian Song @ 2025-08-30  2:50 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
	stefanha

This patch adds a new export option for storage-export-daemon to enable
FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
It also implements the protocol handshake with the Linux kernel
during the FUSE-over-io_uring initialization phase.

See: https://docs.kernel.org/filesystems/fuse-io-uring.html

The kernel documentation describes in detail how FUSE-over-io_uring
works. This patch implements the Initial SQE stage shown in the diagram:
it initializes one queue per IOThread, each currently supporting a
single submission queue entry (SQE). When the FUSE driver sends the
first FUSE request (FUSE_INIT), storage-export-daemon calls
fuse_uring_start() to complete initialization, ultimately submitting
the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
successful initialization with the kernel.
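
In io_uring terms, each registration boils down to a single
IORING_OP_URING_CMD SQE. A condensed sketch of what the patch below
prepares (see fuse_uring_prep_sqe_register() for the real code):

    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd     = q->fuse_fd;                 /* the /dev/fuse fd */
    sqe->cmd_op = FUSE_IO_URING_CMD_REGISTER; /* handshake command */
    sqe->addr   = (uint64_t)ent->iov;         /* [req header, payload] */
    sqe->len    = 2;                          /* two iovec entries */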

We also add support for multiple IOThreads. The current Linux kernel
requires registering $(nproc) queues when setting up FUSE-over-io_uring.
To let users customize the number of FUSE Queues (i.e., IOThreads),
we first create nproc Ring Queues as required by the kernel, then
distribute them in a round-robin manner across the FUSE Queues for
registration. In addition, to support multiple in-flight requests,
each Ring Queue is configured with FUSE_DEFAULT_RING_QUEUE_DEPTH
entries/requests.
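
For illustration, the distribution reduces to a simple modulo walk (a
minimal sketch with illustrative names ring_queues/fuse_queues; the
real code is fuse_distribute_ring_queues() in the patch below):

    /* Sketch: spread nproc Ring Queues across the user's FUSE Queues. */
    for (int i = 0; i < num_ring_queues; i++) {
        ring_queues[i].q = &fuse_queues[i % num_fuse_queues];
    }

With 8 CPUs and 3 FUSE Queues, for example, Ring Queues 0..7 land on
FUSE Queues 0,1,2,0,1,2,0,1.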

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Brian Song <hibriansong@gmail.com>
---
 block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
 docs/tools/qemu-storage-daemon.rst   |  11 +-
 qapi/block-export.json               |   5 +-
 storage-daemon/qemu-storage-daemon.c |   1 +
 util/fdmon-io_uring.c                |   5 +-
 5 files changed, 309 insertions(+), 23 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index c0ad4696ce..19bf9e5f74 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -48,6 +48,9 @@
 #include <linux/fs.h>
 #endif
 
+/* room needed in buffer to accommodate header */
+#define FUSE_BUFFER_HEADER_SIZE 0x1000
+
 /* Prevent overly long bounce buffer allocations */
 #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
 /*
@@ -63,12 +66,59 @@
     (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
 
 typedef struct FuseExport FuseExport;
+typedef struct FuseQueue FuseQueue;
+
+#ifdef CONFIG_LINUX_IO_URING
+#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
+#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
+
+typedef struct FuseRingQueue FuseRingQueue;
+typedef struct FuseRingEnt {
+    /* back pointer */
+    FuseRingQueue *rq;
+
+    /* commit id of a fuse request */
+    uint64_t req_commit_id;
+
+    /* fuse request header and payload */
+    struct fuse_uring_req_header req_header;
+    void *op_payload;
+    size_t req_payload_sz;
+
+    /* The vector passed to the kernel */
+    struct iovec iov[2];
+
+    CqeHandler fuse_cqe_handler;
+} FuseRingEnt;
+
+struct FuseRingQueue {
+    int rqid;
+
+    /* back pointer */
+    FuseQueue *q;
+    FuseRingEnt *ent;
+
+    /* List entry for ring_queues */
+    QLIST_ENTRY(FuseRingQueue) next;
+};
+
+/*
+ * Round-robin distribution of ring queues across FUSE queues.
+ * This structure manages the mapping between kernel ring queues and user
+ * FUSE queues.
+ */
+typedef struct FuseRingQueueManager {
+    FuseRingQueue *ring_queues;
+    int num_ring_queues;
+    int num_fuse_queues;
+} FuseRingQueueManager;
+#endif
 
 /*
  * One FUSE "queue", representing one FUSE FD from which requests are fetched
  * and processed.  Each queue is tied to an AioContext.
  */
-typedef struct FuseQueue {
+struct FuseQueue {
     FuseExport *exp;
 
     AioContext *ctx;
@@ -109,15 +159,11 @@ typedef struct FuseQueue {
      * Free this buffer with qemu_vfree().
      */
     void *spillover_buf;
-} FuseQueue;
 
-/*
- * Verify that FuseQueue.request_buf plus the spill-over buffer together
- * are big enough to be accepted by the FUSE kernel driver.
- */
-QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
-                  FUSE_SPILLOVER_BUF_SIZE <
-                  FUSE_MIN_READ_BUFFER);
+#ifdef CONFIG_LINUX_IO_URING
+    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
+#endif
+};
 
 struct FuseExport {
     BlockExport common;
@@ -133,7 +179,7 @@ struct FuseExport {
      */
     bool halted;
 
-    int num_queues;
+    size_t num_queues;
     FuseQueue *queues;
     /*
      * True if this export should follow the generic export's AioContext.
@@ -149,6 +195,12 @@ struct FuseExport {
     /* Whether allow_other was used as a mount option or not */
     bool allow_other;
 
+#ifdef CONFIG_LINUX_IO_URING
+    bool is_uring;
+    size_t ring_queue_depth;
+    FuseRingQueueManager *ring_queue_manager;
+#endif
+
     mode_t st_mode;
     uid_t st_uid;
     gid_t st_gid;
@@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
         return;
     }
 
-    for (int i = 0; i < exp->num_queues; i++) {
+    for (size_t i = 0; i < exp->num_queues; i++) {
         aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
                            read_from_fuse_fd, NULL, NULL, NULL,
                            &exp->queues[i]);
@@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
     .drained_poll  = fuse_export_drained_poll,
 };
 
+#ifdef CONFIG_LINUX_IO_URING
+static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
+                    const unsigned int rqid,
+                    const unsigned int commit_id)
+{
+    req->qid = rqid;
+    req->commit_id = commit_id;
+    req->flags = 0;
+}
+
+static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
+               __u32 cmd_op)
+{
+    sqe->opcode = IORING_OP_URING_CMD;
+
+    sqe->fd = q->fuse_fd;
+    sqe->rw_flags = 0;
+    sqe->ioprio = 0;
+    sqe->off = 0;
+
+    sqe->cmd_op = cmd_op;
+    sqe->__pad1 = 0;
+}
+
+static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
+{
+    FuseRingEnt *ent = opaque;
+    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
+
+    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
+
+    sqe->addr = (uint64_t)(ent->iov);
+    sqe->len = 2;
+
+    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
+}
+
+static void fuse_uring_submit_register(void *opaque)
+{
+    FuseRingEnt *ent = opaque;
+    FuseExport *exp = ent->rq->q->exp;
+
+
+    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
+}
+
+/**
+ * Distribute ring queues across FUSE queues using round-robin algorithm.
+ * This ensures even distribution of kernel ring queues across user-specified
+ * FUSE queues.
+ */
+static
+FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
+                                                    size_t ring_queue_depth,
+                                                    size_t bufsize)
+{
+    int num_ring_queues = get_nprocs();
+    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
+
+    if (!manager) {
+        return NULL;
+    }
+
+    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
+    manager->num_ring_queues = num_ring_queues;
+    manager->num_fuse_queues = num_fuse_queues;
+
+    if (!manager->ring_queues) {
+        g_free(manager);
+        return NULL;
+    }
+
+    for (int i = 0; i < num_ring_queues; i++) {
+        FuseRingQueue *rq = &manager->ring_queues[i];
+        rq->rqid = i;
+        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
+
+        if (!rq->ent) {
+            for (int j = 0; j < i; j++) {
+                g_free(manager->ring_queues[j].ent);
+            }
+            g_free(manager->ring_queues);
+            g_free(manager);
+            return NULL;
+        }
+
+        for (size_t j = 0; j < ring_queue_depth; j++) {
+            FuseRingEnt *ent = &rq->ent[j];
+            ent->rq = rq;
+            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
+            ent->op_payload = g_malloc0(ent->req_payload_sz);
+
+            if (!ent->op_payload) {
+                for (size_t k = 0; k < j; k++) {
+                    g_free(rq->ent[k].op_payload);
+                }
+                g_free(rq->ent);
+                for (int k = 0; k < i; k++) {
+                    g_free(manager->ring_queues[k].ent);
+                }
+                g_free(manager->ring_queues);
+                g_free(manager);
+                return NULL;
+            }
+
+            ent->iov[0] = (struct iovec) {
+                &(ent->req_header),
+                sizeof(struct fuse_uring_req_header)
+            };
+            ent->iov[1] = (struct iovec) {
+                ent->op_payload,
+                ent->req_payload_sz
+            };
+
+            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
+        }
+    }
+
+    return manager;
+}
+
+static
+void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
+{
+    int queue_index = 0;
+
+    for (int i = 0; i < manager->num_ring_queues; i++) {
+        FuseRingQueue *rq = &manager->ring_queues[i];
+
+        rq->q = &exp->queues[queue_index];
+        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
+
+        queue_index = (queue_index + 1) % manager->num_fuse_queues;
+    }
+}
+
+static
+void fuse_schedule_ring_queue_registrations(FuseExport *exp,
+                                            FuseRingQueueManager *manager)
+{
+    for (int i = 0; i < manager->num_fuse_queues; i++) {
+        FuseQueue *q = &exp->queues[i];
+        FuseRingQueue *rq;
+
+        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
+            for (int j = 0; j < exp->ring_queue_depth; j++) {
+                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
+                                        &(rq->ent[j]));
+            }
+        }
+    }
+}
+
+static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
+{
+    /*
+     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
+     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
+     * the kernel by default. Also, max_write should not exceed
+     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
+     */
+    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
+
+    if (!(out->flags & FUSE_MAX_PAGES)) {
+        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
+                         + FUSE_BUFFER_HEADER_SIZE;
+    }
+
+    exp->ring_queue_manager = fuse_ring_queue_manager_create(
+        exp->num_queues, exp->ring_queue_depth, bufsize);
+
+    if (!exp->ring_queue_manager) {
+        error_report("Failed to create ring queue manager");
+        return;
+    }
+
+    /* Distribute ring queues across FUSE queues using round-robin */
+    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
+
+    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
+}
+#endif
+
 static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
                               AioContext *const *multithread,
@@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
 
+#ifdef CONFIG_LINUX_IO_URING
+    exp->is_uring = args->io_uring;
+    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
+#endif
+
     if (multithread) {
         /* Guaranteed by common export code */
         assert(mt_count >= 1);
@@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
                 .exp = exp,
                 .ctx = multithread[i],
                 .fuse_fd = -1,
+#ifdef CONFIG_LINUX_IO_URING
+                .ring_queue_list =
+                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
+#endif
             };
         }
     } else {
@@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
             .exp = exp,
             .ctx = exp->common.ctx,
             .fuse_fd = -1,
+#ifdef CONFIG_LINUX_IO_URING
+            .ring_queue_list =
+                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
+#endif
         };
     }
 
@@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
  */
 static ssize_t coroutine_fn
 fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
-             uint32_t max_readahead, uint32_t flags)
+             uint32_t max_readahead, const struct fuse_init_in *in)
 {
-    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
+                                     | FUSE_INIT_EXT;
+    uint64_t outargflags = 0;
+    uint64_t inargflags = in->flags;
+
+    ssize_t ret = 0;
+
+    if (inargflags & FUSE_INIT_EXT) {
+        inargflags = inargflags | (uint64_t) in->flags2 << 32;
+    }
+
+#ifdef CONFIG_LINUX_IO_URING
+    if (exp->is_uring) {
+        if (inargflags & FUSE_OVER_IO_URING) {
+            supported_flags |= FUSE_OVER_IO_URING;
+        } else {
+            exp->is_uring = false;
+            ret = -ENODEV;
+        }
+    }
+#endif
+
+    outargflags = inargflags & supported_flags;
 
     *out = (struct fuse_init_out) {
         .major = FUSE_KERNEL_VERSION,
         .minor = FUSE_KERNEL_MINOR_VERSION,
         .max_readahead = max_readahead,
         .max_write = FUSE_MAX_WRITE_BYTES,
-        .flags = flags & supported_flags,
-        .flags2 = 0,
+        .flags = outargflags,
+        .flags2 = outargflags >> 32,
 
         /* libfuse maximum: 2^16 - 1 */
         .max_background = UINT16_MAX,
@@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
         .map_alignment = 0,
     };
 
-    return sizeof(*out);
+    return ret < 0 ? ret : sizeof(*out);
 }
 
 /**
@@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
         fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
                                 out_data_buffer, ret);
         qemu_vfree(out_data_buffer);
+#ifdef CONFIG_LINUX_IO_URING
+    /* Handle FUSE-over-io_uring initialization */
+    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
+        struct fuse_init_out *out =
+            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
+        fuse_uring_start(exp, out);
+    }
+#endif
     } else {
         fuse_write_response(q->fuse_fd, req_id, out_hdr,
                             ret < 0 ? ret : 0,
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
index 35ab2d7807..c5076101e0 100644
--- a/docs/tools/qemu-storage-daemon.rst
+++ b/docs/tools/qemu-storage-daemon.rst
@@ -78,7 +78,7 @@ Standard options:
 .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
   --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
   --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
-  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
+  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off]
   --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
 
   is a block export definition. ``node-name`` is the block node that should be
@@ -111,10 +111,11 @@ Standard options:
   that enabling this option as a non-root user requires enabling the
   user_allow_other option in the global fuse.conf configuration file.  Setting
   ``allow-other`` to auto (the default) will try enabling this option, and on
-  error fall back to disabling it.
-
-  The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
-  to create the VDUSE device.
+  error fall back to disabling it. When ``io-uring`` is enabled (off by
+  default), the export bypasses the traditional /dev/fuse communication
+  mechanism and instead uses io_uring to handle FUSE operations. The
+  ``vduse-blk`` export type takes a ``name`` (must be unique across the
+  host) to create the VDUSE device.
   ``num-queues`` sets the number of virtqueues (the default is 1).
   ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
 
diff --git a/qapi/block-export.json b/qapi/block-export.json
index 9ae703ad01..37f2fc47e2 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -184,12 +184,15 @@
 #     mount the export with allow_other, and if that fails, try again
 #     without.  (since 6.1; default: auto)
 #
+# @io-uring: Use FUSE-over-io_uring.  (since 10.2; default: false)
+#
 # Since: 6.0
 ##
 { 'struct': 'BlockExportOptionsFuse',
   'data': { 'mountpoint': 'str',
             '*growable': 'bool',
-            '*allow-other': 'FuseExportAllowOther' },
+            '*allow-other': 'FuseExportAllowOther',
+            '*io-uring': 'bool' },
   'if': 'CONFIG_FUSE' }
 
 ##
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index eb72561358..0cd4cd2b58 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -107,6 +107,7 @@ static void help(void)
 #ifdef CONFIG_FUSE
 "  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n"
 "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
+"           [,io-uring=on|off]"
 "                         export the specified block node over FUSE\n"
 "\n"
 #endif /* CONFIG_FUSE */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index d2433d1d99..68d3fe8e01 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
 void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
+    int flags;
 
     ctx->io_uring_fd_tag = NULL;
+    flags = IORING_SETUP_SQE128;
 
-    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
+                            &ctx->fdmon_io_uring, flags);
     if (ret != 0) {
         error_setg_errno(errp, -ret, "Failed to initialize io_uring");
         return;
-- 
2.45.2




* [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
@ 2025-08-30  2:50 ` Brian Song
  2025-09-03 11:51   ` Stefan Hajnoczi
  2025-09-19 13:54   ` Kevin Wolf
  2025-08-30  2:50 ` [PATCH 3/4] export/fuse: Safe termination for FUSE-uring Brian Song
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Brian Song @ 2025-08-30  2:50 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
	stefanha

See: https://docs.kernel.org/filesystems/fuse-io-uring.html

As described in the kernel documentation, after FUSE-over-io_uring
initialization and handshake, FUSE interacts with the kernel using
SQE/CQE to send requests and receive responses. This corresponds to
the "Sending requests with CQEs" section in the docs.

This patch implements three key parts: registering the CQE handler
(fuse_uring_cqe_handler), processing FUSE requests
(fuse_uring_co_process_request), and sending response results
(fuse_uring_send_response). It also merges the traditional /dev/fuse
request handling with the FUSE-over-io_uring handling functions.
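
Conceptually, each ring entry now runs the following loop (a sketch
with hypothetical helper names; the real path is
fuse_uring_cqe_handler() -> fuse_uring_co_process_request() ->
fuse_uring_send_response()):

    /* Sketch of one ring entry's life cycle: */
    for (;;) {
        wait_for_cqe(ent);          /* kernel filled req_header/payload */
        dispatch_fuse_request(ent); /* shared opcode switch */
        commit_and_fetch(ent);      /* COMMIT_AND_FETCH SQE re-arms entry */
    }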

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Brian Song <hibriansong@gmail.com>
---
 block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
 1 file changed, 309 insertions(+), 148 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 19bf9e5f74..07f74fc8ec 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
 };
 
 #ifdef CONFIG_LINUX_IO_URING
+static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
+
+static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
+{
+    FuseRingEnt *ent = opaque;
+    FuseExport *exp = ent->rq->q->exp;
+
+    /* Going to process requests */
+    fuse_inc_in_flight(exp);
+
+    /* A ring entry returned */
+    fuse_uring_co_process_request(ent);
+
+    /* Finished processing requests */
+    fuse_dec_in_flight(exp);
+}
+
+static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
+{
+    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
+    Coroutine *co;
+    FuseExport *exp = ent->rq->q->exp;
+
+    if (unlikely(exp->halted)) {
+        return;
+    }
+
+    int err = cqe_handler->cqe.res;
+
+    if (err != 0) {
+        /* -ENOTCONN is ok on umount  */
+        if (err != -EINTR && err != -EAGAIN &&
+            err != -ENOTCONN) {
+            fuse_export_halt(exp);
+        }
+    } else {
+        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
+        qemu_coroutine_enter(co);
+    }
+}
+
 static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
                     const unsigned int rqid,
                     const unsigned int commit_id)
@@ -1213,6 +1254,9 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
  * Data in @in_place_buf is assumed to be overwritten after yielding, so will
  * be copied to a bounce buffer beforehand.  @spillover_buf in contrast is
  * assumed to be exclusively owned and will be used as-is.
+ * In FUSE-over-io_uring mode, the actual op_payload content is stored in
+ * @spillover_buf. To ensure this buffer is used for writing, @in_place_buf
+ * is explicitly set to NULL.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
 static ssize_t coroutine_fn
@@ -1220,8 +1264,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
               uint64_t offset, uint32_t size,
               const void *in_place_buf, const void *spillover_buf)
 {
-    size_t in_place_size;
-    void *copied;
+    size_t in_place_size = 0;
+    void *copied = NULL;
     int64_t blk_len;
     int ret;
     struct iovec iov[2];
@@ -1236,10 +1280,12 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
         return -EACCES;
     }
 
-    /* Must copy to bounce buffer before potentially yielding */
-    in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
-    copied = blk_blockalign(exp->common.blk, in_place_size);
-    memcpy(copied, in_place_buf, in_place_size);
+    if (in_place_buf) {
+        /* Must copy to bounce buffer before potentially yielding */
+        in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
+        copied = blk_blockalign(exp->common.blk, in_place_size);
+        memcpy(copied, in_place_buf, in_place_size);
+    }
 
     /**
      * Clients will expect short writes at EOF, so we have to limit
@@ -1263,26 +1309,38 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
         }
     }
 
-    iov[0] = (struct iovec) {
-        .iov_base = copied,
-        .iov_len = in_place_size,
-    };
-    if (size > FUSE_IN_PLACE_WRITE_BYTES) {
-        assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
-        iov[1] = (struct iovec) {
-            .iov_base = (void *)spillover_buf,
-            .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+    if (in_place_buf) {
+        iov[0] = (struct iovec) {
+            .iov_base = copied,
+            .iov_len = in_place_size,
         };
-        qemu_iovec_init_external(&qiov, iov, 2);
+        if (size > FUSE_IN_PLACE_WRITE_BYTES) {
+            assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
+            iov[1] = (struct iovec) {
+                .iov_base = (void *)spillover_buf,
+                .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+            };
+            qemu_iovec_init_external(&qiov, iov, 2);
+        } else {
+            qemu_iovec_init_external(&qiov, iov, 1);
+        }
     } else {
+        /* fuse over io_uring */
+        iov[0] = (struct iovec) {
+            .iov_base = (void *)spillover_buf,
+            .iov_len = size,
+        };
         qemu_iovec_init_external(&qiov, iov, 1);
     }
+
     ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
     if (ret < 0) {
         goto fail_free_buffer;
     }
 
-    qemu_vfree(copied);
+    if (in_place_buf) {
+        qemu_vfree(copied);
+    }
 
     *out = (struct fuse_write_out) {
         .size = size,
@@ -1290,7 +1348,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
     return sizeof(*out);
 
 fail_free_buffer:
-    qemu_vfree(copied);
+    if (in_place_buf) {
+        qemu_vfree(copied);
+    }
     return ret;
 }
 
@@ -1578,173 +1638,151 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
     }
 }
 
-/*
- * For use in fuse_co_process_request():
- * Returns a pointer to the parameter object for the given operation (inside of
- * queue->request_buf, which is assumed to hold a fuse_in_header first).
- * Verifies that the object is complete (queue->request_buf is large enough to
- * hold it in one piece, and the request length includes the whole object).
- *
- * Note that queue->request_buf may be overwritten after yielding, so the
- * returned pointer must not be used across a function that may yield!
- */
-#define FUSE_IN_OP_STRUCT(op_name, queue) \
+#define FUSE_IN_OP_STRUCT_LEGACY(in_buf) \
     ({ \
-        const struct fuse_in_header *__in_hdr = \
-            (const struct fuse_in_header *)(queue)->request_buf; \
-        const struct fuse_##op_name##_in *__in = \
-            (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
-        const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
-        uint32_t __req_len; \
-        \
-        QEMU_BUILD_BUG_ON(sizeof((queue)->request_buf) < __param_len); \
-        \
-        __req_len = __in_hdr->len; \
-        if (__req_len < __param_len) { \
-            warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \
-                        __req_len, __param_len); \
-            ret = -EINVAL; \
-            break; \
-        } \
-        __in; \
+        (void *)(((struct fuse_in_header *)in_buf) + 1); \
     })
 
-/*
- * For use in fuse_co_process_request():
- * Returns a pointer to the return object for the given operation (inside of
- * out_buf, which is assumed to hold a fuse_out_header first).
- * Verifies that out_buf is large enough to hold the whole object.
- *
- * (out_buf should be a char[] array.)
- */
-#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
+#define FUSE_OUT_OP_STRUCT_LEGACY(out_buf) \
     ({ \
-        struct fuse_out_header *__out_hdr = \
-            (struct fuse_out_header *)(out_buf); \
-        struct fuse_##op_name##_out *__out = \
-            (struct fuse_##op_name##_out *)(__out_hdr + 1); \
-        \
-        QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
-                          sizeof(out_buf)); \
-        \
-        __out; \
+        (void *)(((struct fuse_out_header *)out_buf) + 1); \
     })
 
-/**
- * Process a FUSE request, incl. writing the response.
- *
- * Note that yielding in any request-processing function can overwrite the
- * contents of q->request_buf.  Anything that takes a buffer needs to take
- * care that the content is copied before yielding.
- *
- * @spillover_buf can contain the tail of a write request too large to fit into
- * q->request_buf.  This function takes ownership of it (i.e. will free it),
- * which assumes that its contents will not be overwritten by concurrent
- * requests (as opposed to q->request_buf).
+
+/*
+ * Shared helper for FUSE request processing. Handles both legacy and io_uring
+ * paths.
  */
-static void coroutine_fn
-fuse_co_process_request(FuseQueue *q, void *spillover_buf)
+static void coroutine_fn fuse_co_process_request_common(
+    FuseExport *exp,
+    uint32_t opcode,
+    uint64_t req_id,
+    void *in_buf,
+    void *spillover_buf,
+    void *out_buf,
+    int fd, /* -1 for uring */
+    void (*send_response)(void *opaque, uint32_t req_id, ssize_t ret,
+                         const void *buf, void *out_buf),
+    void *opaque /* FuseQueue* or FuseRingEnt* */)
 {
-    FuseExport *exp = q->exp;
-    uint32_t opcode;
-    uint64_t req_id;
-    /*
-     * Return buffer.  Must be large enough to hold all return headers, but does
-     * not include space for data returned by read requests.
-     * (FUSE_IN_OP_STRUCT() verifies at compile time that out_buf is indeed
-     * large enough.)
-     */
-    char out_buf[sizeof(struct fuse_out_header) +
-                 MAX_CONST(sizeof(struct fuse_init_out),
-                 MAX_CONST(sizeof(struct fuse_open_out),
-                 MAX_CONST(sizeof(struct fuse_attr_out),
-                 MAX_CONST(sizeof(struct fuse_write_out),
-                           sizeof(struct fuse_lseek_out)))))];
-    struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
-    /* For read requests: Data to be returned */
     void *out_data_buffer = NULL;
-    ssize_t ret;
+    ssize_t ret = 0;
 
-    /* Limit scope to ensure pointer is no longer used after yielding */
-    {
-        const struct fuse_in_header *in_hdr =
-            (const struct fuse_in_header *)q->request_buf;
+    void *op_in_buf = (void *)FUSE_IN_OP_STRUCT_LEGACY(in_buf);
+    void *op_out_buf = (void *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf);
 
-        opcode = in_hdr->opcode;
-        req_id = in_hdr->unique;
+#ifdef CONFIG_LINUX_IO_URING
+    if (opcode != FUSE_INIT && exp->is_uring) {
+        op_in_buf = (void *)in_buf;
+        op_out_buf = (void *)out_buf;
     }
+#endif
 
     switch (opcode) {
     case FUSE_INIT: {
-        const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
-        ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
-                           in->max_readahead, in->flags);
+        const struct fuse_init_in *in =
+            (const struct fuse_init_in *)FUSE_IN_OP_STRUCT_LEGACY(in_buf);
+
+        struct fuse_init_out *out =
+            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf);
+
+        ret = fuse_co_init(exp, out, in->max_readahead, in);
         break;
     }
 
-    case FUSE_OPEN:
-        ret = fuse_co_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
+    case FUSE_OPEN: {
+        struct fuse_open_out *out =
+            (struct fuse_open_out *)op_out_buf;
+
+        ret = fuse_co_open(exp, out);
         break;
+    }
 
     case FUSE_RELEASE:
         ret = 0;
         break;
 
     case FUSE_LOOKUP:
-        ret = -ENOENT; /* There is no node but the root node */
+        ret = -ENOENT;
         break;
 
-    case FUSE_GETATTR:
-        ret = fuse_co_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
+    case FUSE_GETATTR: {
+        struct fuse_attr_out *out =
+            (struct fuse_attr_out *)op_out_buf;
+
+        ret = fuse_co_getattr(exp, out);
         break;
+    }
 
     case FUSE_SETATTR: {
-        const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
-        ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
-                              in->valid, in->size, in->mode, in->uid, in->gid);
+        const struct fuse_setattr_in *in =
+            (const struct fuse_setattr_in *)op_in_buf;
+
+        struct fuse_attr_out *out =
+            (struct fuse_attr_out *)op_out_buf;
+
+        ret = fuse_co_setattr(exp, out, in->valid, in->size, in->mode,
+                              in->uid, in->gid);
         break;
     }
 
     case FUSE_READ: {
-        const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
+        const struct fuse_read_in *in =
+            (const struct fuse_read_in *)op_in_buf;
+
         ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
         break;
     }
 
     case FUSE_WRITE: {
-        const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
-        uint32_t req_len;
-
-        req_len = ((const struct fuse_in_header *)q->request_buf)->len;
-        if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
-                               in->size)) {
-            warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
-                        req_len - sizeof(struct fuse_in_header) - sizeof(*in),
-                        in->size);
-            ret = -EINVAL;
-            break;
+        const struct fuse_write_in *in =
+            (const struct fuse_write_in *)op_in_buf;
+
+        struct fuse_write_out *out =
+            (struct fuse_write_out *)op_out_buf;
+
+#ifdef CONFIG_LINUX_IO_URING
+        if (!exp->is_uring) {
+#endif
+            uint32_t req_len = ((const struct fuse_in_header *)in_buf)->len;
+
+            if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
+                        in->size)) {
+                warn_report("FUSE WRITE truncated; received %zu bytes of %"
+                    PRIu32,
+                    req_len - sizeof(struct fuse_in_header) - sizeof(*in),
+                    in->size);
+                ret = -EINVAL;
+                break;
+            }
+#ifdef CONFIG_LINUX_IO_URING
+        } else {
+            assert(in->size <=
+                ((FuseRingEnt *)opaque)->req_header.ring_ent_in_out.payload_sz);
         }
+#endif
 
-        /*
-         * poll_fuse_fd() has checked that in_hdr->len matches the number of
-         * bytes read, which cannot exceed the max_write value we set
-         * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
-         * in_hdr->len >= in->size + X, so this assertion must hold.
-         */
         assert(in->size <= FUSE_MAX_WRITE_BYTES);
 
-        /*
-         * Passing a pointer to `in` (i.e. the request buffer) is fine because
-         * fuse_co_write() takes care to copy its contents before potentially
-         * yielding.
-         */
-        ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
-                            in->offset, in->size, in + 1, spillover_buf);
+        const void *in_place_buf = in + 1;
+        const void *spill_buf = spillover_buf;
+
+#ifdef CONFIG_LINUX_IO_URING
+        if (exp->is_uring) {
+            in_place_buf = NULL;
+            spill_buf = out_buf;
+        }
+#endif
+
+        ret = fuse_co_write(exp, out, in->offset, in->size,
+                            in_place_buf, spill_buf);
         break;
     }
 
     case FUSE_FALLOCATE: {
-        const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
+        const struct fuse_fallocate_in *in =
+            (const struct fuse_fallocate_in *)op_in_buf;
+
         ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
         break;
     }
@@ -1759,9 +1797,13 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
 
 #ifdef CONFIG_FUSE_LSEEK
     case FUSE_LSEEK: {
-        const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q);
-        ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
-                            in->offset, in->whence);
+        const struct fuse_lseek_in *in =
+            (const struct fuse_lseek_in *)op_in_buf;
+
+        struct fuse_lseek_out *out =
+            (struct fuse_lseek_out *)op_out_buf;
+
+        ret = fuse_co_lseek(exp, out, in->offset, in->whence);
         break;
     }
 #endif
@@ -1770,28 +1812,147 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
         ret = -ENOSYS;
     }
 
-    /* Ignore errors from fuse_write*(), nothing we can do anyway */
+    send_response(opaque, req_id, ret, out_data_buffer, out_buf);
+
     if (out_data_buffer) {
-        assert(ret >= 0);
-        fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
-                                out_data_buffer, ret);
         qemu_vfree(out_data_buffer);
+    }
+
+    if (fd != -1) {
+        qemu_vfree(spillover_buf);
+    }
+
 #ifdef CONFIG_LINUX_IO_URING
+    /* Handle FUSE initialization errors */
+    if (unlikely(opcode == FUSE_INIT && ret == -ENODEV)) {
+        error_report("System doesn't support FUSE-over-io_uring");
+        fuse_export_halt(exp);
+        return;
+    }
+
     /* Handle FUSE-over-io_uring initialization */
-    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
+    if (unlikely(opcode == FUSE_INIT && exp->is_uring && fd != -1)) {
         struct fuse_init_out *out =
-            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
+            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf);
         fuse_uring_start(exp, out);
     }
 #endif
+}
+
+/* Helper to send response for legacy */
+static void send_response_legacy(void *opaque, uint32_t req_id, ssize_t ret,
+                            const void *buf, void *out_buf)
+{
+    FuseQueue *q = (FuseQueue *)opaque;
+    struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
+    if (buf) {
+        assert(ret >= 0);
+        fuse_write_buf_response(q->fuse_fd, req_id, out_hdr, buf, ret);
     } else {
         fuse_write_response(q->fuse_fd, req_id, out_hdr,
                             ret < 0 ? ret : 0,
                             ret < 0 ? 0 : ret);
     }
+}
 
-    qemu_vfree(spillover_buf);
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf)
+{
+    FuseExport *exp = q->exp;
+    uint32_t opcode;
+    uint64_t req_id;
+
+    /*
+     * Return buffer.  Must be large enough to hold all return headers, but does
+     * not include space for data returned by read requests.
+     */
+    char out_buf[sizeof(struct fuse_out_header) +
+        MAX_CONST(sizeof(struct fuse_init_out),
+        MAX_CONST(sizeof(struct fuse_open_out),
+        MAX_CONST(sizeof(struct fuse_attr_out),
+        MAX_CONST(sizeof(struct fuse_write_out),
+                  sizeof(struct fuse_lseek_out)))))] = {0};
+
+    /* Limit scope to ensure pointer is no longer used after yielding */
+    {
+        const struct fuse_in_header *in_hdr =
+            (const struct fuse_in_header *)q->request_buf;
+
+        opcode = in_hdr->opcode;
+        req_id = in_hdr->unique;
+    }
+
+    fuse_co_process_request_common(exp, opcode, req_id, q->request_buf,
+        spillover_buf, out_buf, q->fuse_fd, send_response_legacy, q);
+}
+
+#ifdef CONFIG_LINUX_IO_URING
+static void fuse_uring_prep_sqe_commit(struct io_uring_sqe *sqe, void *opaque)
+{
+    FuseRingEnt *ent = opaque;
+    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
+
+    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_COMMIT_AND_FETCH);
+    fuse_uring_sqe_set_req_data(req, ent->rq->rqid,
+                                     ent->req_commit_id);
+}
+
+static void
+fuse_uring_send_response(FuseRingEnt *ent, uint32_t req_id, ssize_t ret,
+                         const void *out_data_buffer)
+{
+    FuseExport *exp = ent->rq->q->exp;
+
+    struct fuse_uring_req_header *rrh = &ent->req_header;
+    struct fuse_out_header *out_header = (struct fuse_out_header *)&rrh->in_out;
+    struct fuse_uring_ent_in_out *ent_in_out =
+        (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out;
+
+    /* FUSE_READ */
+    if (out_data_buffer && ret > 0) {
+        memcpy(ent->op_payload, out_data_buffer, ret);
+    }
+
+    out_header->error  = ret < 0 ? ret : 0;
+    out_header->unique = req_id;
+    /* out_header->len = ret > 0 ? ret : 0; */
+    ent_in_out->payload_sz = ret > 0 ? ret : 0;
+    aio_add_sqe(fuse_uring_prep_sqe_commit, ent,
+                    &ent->fuse_cqe_handler);
+}
+
+/* Helper to send response for uring */
+static void send_response_uring(void *opaque, uint32_t req_id, ssize_t ret,
+                        const void *out_data_buffer, void *payload)
+{
+    FuseRingEnt *ent = (FuseRingEnt *)opaque;
+
+    fuse_uring_send_response(ent, req_id, ret, out_data_buffer);
+}
+
+static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent)
+{
+    FuseExport *exp = ent->rq->q->exp;
+    struct fuse_uring_req_header *rrh = &ent->req_header;
+    struct fuse_uring_ent_in_out *ent_in_out =
+        (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out;
+    struct fuse_in_header *in_hdr =
+        (struct fuse_in_header *)&rrh->in_out;
+    uint32_t opcode = in_hdr->opcode;
+    uint64_t req_id = in_hdr->unique;
+    ent->req_commit_id = ent_in_out->commit_id;
+
+    if (unlikely(ent->req_commit_id == 0)) {
+        error_report("If this happens kernel will not find the response - "
+            "it will be stuck forever - better to abort immediately.");
+        fuse_export_halt(exp);
+        return;
+    }
+
+    fuse_co_process_request_common(exp, opcode, req_id, &rrh->op_in,
+        NULL, ent->op_payload, -1, send_response_uring, ent);
 }
+#endif
 
 const BlockExportDriver blk_exp_fuse = {
     .type               = BLOCK_EXPORT_TYPE_FUSE,
-- 
2.45.2




* [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
  2025-08-30  2:50 ` [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests Brian Song
@ 2025-08-30  2:50 ` Brian Song
  2025-09-09 19:33   ` Stefan Hajnoczi
  2025-08-30  2:50 ` [PATCH 4/4] iotests: add tests for FUSE-over-io_uring Brian Song
  2025-08-30 12:00 ` [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
  4 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-08-30  2:50 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
	stefanha

When the user sends a termination signal, storage-export-daemon stops
the export, exits the main loop (main_loop_wait), and begins cleaning
up associated resources. At this point, some SQEs submitted via
FUSE_IO_URING_CMD_COMMIT_AND_FETCH may still be pending in the kernel,
waiting for incoming FUSE requests, and their completions can trigger
CQE handlers in user space.

Currently, there is no way to manually cancel these pending operations
in the kernel. As a result, after export termination, the related data
structures might be deleted before the pending CQEs return, causing a
CQE handler to be invoked after the structures it uses have been freed,
which may lead to a segfault.

As a workaround, whenever we submit an SQE to the kernel, we take an
export reference (blk_exp_ref) to keep the export and its data
structures alive across export termination. Once the corresponding CQE
is received, we drop the reference (blk_exp_unref).
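
The pattern, in short (a sketch; the exact call sites are in the diff
below):

    blk_exp_ref(&exp->common);      /* pin the export before submitting */
    aio_add_sqe(prep_fn, ent, &ent->fuse_cqe_handler);
    ...
    /* later, in the CQE handler: */
    blk_exp_unref(&exp->common);    /* entry returned, drop the pin */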

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Brian Song <hibriansong@gmail.com>
---
 block/export/fuse.c | 75 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 10 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 07f74fc8ec..ab2eb895ad 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -39,6 +39,7 @@
 
 #include "standard-headers/linux/fuse.h"
 #include <sys/ioctl.h>
+#include <sys/sysinfo.h>
 
 #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
@@ -321,6 +322,8 @@ static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
     fuse_inc_in_flight(exp);
 
     /* A ring entry returned */
+    blk_exp_unref(&exp->common);
+
     fuse_uring_co_process_request(ent);
 
     /* Finished processing requests */
@@ -345,6 +348,9 @@ static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
             err != -ENOTCONN) {
             fuse_export_halt(exp);
         }
+
+        /* A ring entry returned */
+        blk_exp_unref(&exp->common);
     } else {
         co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
         qemu_coroutine_enter(co);
@@ -392,6 +398,8 @@ static void fuse_uring_submit_register(void *opaque)
     FuseRingEnt *ent = opaque;
     FuseExport *exp = ent->rq->q->exp;
 
+    /* Commit and fetch a ring entry */
+    blk_exp_ref(&exp->common);
 
     aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
 }
@@ -886,6 +894,38 @@ static void read_from_fuse_fd(void *opaque)
     qemu_coroutine_enter(co);
 }
 
+#ifdef CONFIG_LINUX_IO_URING
+static void fuse_ring_queue_manager_destroy(FuseRingQueueManager *manager)
+{
+    if (!manager) {
+        return;
+    }
+
+    for (int i = 0; i < manager->num_ring_queues; i++) {
+        FuseRingQueue *rq = &manager->ring_queues[i];
+
+        for (int j = 0; j < FUSE_DEFAULT_RING_QUEUE_DEPTH; j++) {
+            g_free(rq->ent[j].op_payload);
+        }
+        g_free(rq->ent);
+    }
+
+    g_free(manager->ring_queues);
+    g_free(manager);
+}
+
+static void fuse_export_delete_uring(FuseExport *exp)
+{
+    exp->is_uring = false;
+
+    /* Clean up ring queue manager */
+    if (exp->ring_queue_manager) {
+        fuse_ring_queue_manager_destroy(exp->ring_queue_manager);
+        exp->ring_queue_manager = NULL;
+    }
+}
+#endif
+
 static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
@@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
          */
         g_hash_table_remove(exports, exp->mountpoint);
     }
-}
-
-static void fuse_export_delete(BlockExport *blk_exp)
-{
-    FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
-    for (int i = 0; i < exp->num_queues; i++) {
+    for (size_t i = 0; i < exp->num_queues; i++) {
         FuseQueue *q = &exp->queues[i];
 
         /* Queue 0's FD belongs to the FUSE session */
         if (i > 0 && q->fuse_fd >= 0) {
             close(q->fuse_fd);
         }
-        if (q->spillover_buf) {
-            qemu_vfree(q->spillover_buf);
-        }
     }
-    g_free(exp->queues);
 
     if (exp->fuse_session) {
         if (exp->mounted) {
@@ -927,8 +958,29 @@ static void fuse_export_delete(BlockExport *blk_exp)
 
         fuse_session_destroy(exp->fuse_session);
     }
+}
+
+static void fuse_export_delete(BlockExport *blk_exp)
+{
+    FuseExport *exp = container_of(blk_exp, FuseExport, common);
+
+    for (size_t i = 0; i < exp->num_queues; i++) {
+        FuseQueue *q = &exp->queues[i];
+
+        if (q->spillover_buf) {
+            qemu_vfree(q->spillover_buf);
+        }
+    }
 
     g_free(exp->mountpoint);
+
+#ifdef CONFIG_LINUX_IO_URING
+    if (exp->is_uring) {
+        fuse_export_delete_uring(exp);
+    }
+#endif
+
+    g_free(exp->queues);
 }
 
 /**
@@ -1917,6 +1969,9 @@ fuse_uring_send_response(FuseRingEnt *ent, uint32_t req_id, ssize_t ret,
     out_header->unique = req_id;
     /* out_header->len = ret > 0 ? ret : 0; */
     ent_in_out->payload_sz = ret > 0 ? ret : 0;
+
+    /* Commit and fetch a ring entry */
+    blk_exp_ref(&exp->common);
     aio_add_sqe(fuse_uring_prep_sqe_commit, ent,
                     &ent->fuse_cqe_handler);
 }
-- 
2.45.2




* [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
                   ` (2 preceding siblings ...)
  2025-08-30  2:50 ` [PATCH 3/4] export/fuse: Safe termination for FUSE-uring Brian Song
@ 2025-08-30  2:50 ` Brian Song
  2025-09-09 19:38   ` Stefan Hajnoczi
  2025-08-30 12:00 ` [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
  4 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-08-30  2:50 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
	stefanha

To test FUSE-over-io_uring, set the environment variable
FUSE_OVER_IO_URING=1. This applies only when using the
'fuse' protocol.

$ FUSE_OVER_IO_URING=1 ./check -fuse

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Brian Song <hibriansong@gmail.com>
---
 tests/qemu-iotests/check     |  2 ++
 tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
 2 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index 545f9ec7bd..c6fa0f9e3d 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
         mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
                         const=fmt, help=f'test {fmt}')
 
+    # To test FUSE-over-io_uring, set the environment variable
+    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
     protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
     g_prt = p.add_argument_group(
         '  image protocol options',
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index e977cb4eb6..f8b79c3810 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -539,17 +539,38 @@ _make_test_img()
         touch "$export_mp"
         rm -f "$SOCK_DIR/fuse-output"
 
-        # Usually, users would export formatted nodes.  But we present fuse as a
-        # protocol-level driver here, so we have to leave the format to the
-        # client.
-        # Switch off allow-other, because in general we do not need it for
-        # iotests.  The default allow-other=auto has the downside of printing a
-        # fusermount error on its first attempt if allow_other is not
-        # permissible, which we would need to filter.
-        QSD_NEED_PID=y $QSD \
-              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
-              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
-              &
+        if [ -n "$FUSE_OVER_IO_URING" ]; then
+            nr_cpu=$(nproc 2>/dev/null || echo 1)
+            nr_iothreads=$((nr_cpu / 2))
+            if [ $nr_iothreads -lt 1 ]; then
+                nr_iothreads=1
+            fi
+
+            iothread_args=""
+            iothread_export_args=""
+            for ((i=0; i<$nr_iothreads; i++)); do
+                iothread_args="$iothread_args --object iothread,id=iothread$i"
+                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
+            done
+
+            QSD_NEED_PID=y $QSD \
+                    $iothread_args \
+                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
+                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
+                &
+        else
+            # Usually, users would export formatted nodes.  But we present fuse as a
+            # protocol-level driver here, so we have to leave the format to the
+            # client.
+            # Switch off allow-other, because in general we do not need it for
+            # iotests.  The default allow-other=auto has the downside of printing a
+            # fusermount error on its first attempt if allow_other is not
+            # permissible, which we would need to filter.
+            QSD_NEED_PID=y $QSD \
+                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
+                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
+                &
+        fi
 
         pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
 
@@ -592,6 +613,8 @@ _rm_test_img()
 
         kill "${FUSE_PIDS[index]}"
 
+        sleep 1
+
         # Wait until the mount is gone
         timeout=10 # *0.5 s
         while true; do
-- 
2.45.2




* Re: [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
  2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
                   ` (3 preceding siblings ...)
  2025-08-30  2:50 ` [PATCH 4/4] iotests: add tests for FUSE-over-io_uring Brian Song
@ 2025-08-30 12:00 ` Brian Song
  2025-09-03  9:49   ` Stefan Hajnoczi
  2025-09-04 19:32   ` Stefan Hajnoczi
  4 siblings, 2 replies; 38+ messages in thread
From: Brian Song @ 2025-08-30 12:00 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, armbru, bernd, fam, hreitz, kwolf, stefanha

We used fio to test a 1 GB file under both traditional FUSE and
FUSE-over-io_uring modes. The experiments were conducted with the
following iodepth-numjobs configurations: 1-1, 64-1, 1-4, and 64-4,
with a 70% read / 30% write mix, resulting in eight test cases in total
(four configurations per mode), measuring both latency and throughput.

Test results:

https://gist.github.com/hibriansong/a4849903387b297516603e83b53bbde4




On 8/29/25 10:50 PM, Brian Song wrote:
> Hi all,
>
> This is a GSoC project. More details are available here:
> https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
>
> This patch series includes:
> - Add a round-robin mechanism to distribute the kernel-required Ring
> Queues to FUSE Queues
> - Support multiple in-flight requests (multiple ring entries)
> - Add tests for FUSE-over-io_uring
>
> More detail in the v2 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html
>
> And in the v1 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
>
>
> Brian Song (4):
>    export/fuse: add opt to enable FUSE-over-io_uring
>    export/fuse: process FUSE-over-io_uring requests
>    export/fuse: Safe termination for FUSE-uring
>    iotests: add tests for FUSE-over-io_uring
>
>   block/export/fuse.c                  | 838 +++++++++++++++++++++------
>   docs/tools/qemu-storage-daemon.rst   |  11 +-
>   qapi/block-export.json               |   5 +-
>   storage-daemon/qemu-storage-daemon.c |   1 +
>   tests/qemu-iotests/check             |   2 +
>   tests/qemu-iotests/common.rc         |  45 +-
>   util/fdmon-io_uring.c                |   5 +-
>   7 files changed, 717 insertions(+), 190 deletions(-)
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
  2025-08-30 12:00 ` [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
@ 2025-09-03  9:49   ` Stefan Hajnoczi
  2025-09-03 18:11     ` Brian Song
  2025-09-04 19:32   ` Stefan Hajnoczi
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-03  9:49 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 2424 bytes --]

On Sat, Aug 30, 2025 at 08:00:00AM -0400, Brian Song wrote:
> We used fio to test a 1 GB file under both traditional FUSE and
> FUSE-over-io_uring modes. The experiments were run with four
> iodepth/numjobs configurations (1-1, 64-1, 1-4, and 64-4) at a 70%
> read / 30% write mix, giving eight test cases in total across the two
> modes, measuring both latency and throughput.
> 
> Test results:
> 
> https://gist.github.com/hibriansong/a4849903387b297516603e83b53bbde4

Hanna: You benchmarked the FUSE export coroutine implementation a little
while ago. What do you think about these results with
FUSE-over-io_uring?

What stands out to me is that iodepth=1 numjobs=4 already saturates the
system, so increasing iodepth to 64 does not improve the results much.

Brian: What is the qemu-storage-daemon command-line for the benchmark
and what are the details of /mnt/tmp/ (e.g. a preallocated 10 GB file
with an XFS file system mounted from the FUSE image)?

Thanks,
Stefan

> 
> 
> 
> 
> On 8/29/25 10:50 PM, Brian Song wrote:
> > Hi all,
> >
> > This is a GSoC project. More details are available here:
> > https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
> >
> > This patch series includes:
> > - Add a round-robin mechanism to distribute the kernel-required Ring
> > Queues to FUSE Queues
> > - Support multiple in-flight requests (multiple ring entries)
> > - Add tests for FUSE-over-io_uring
> >
> > More detail in the v2 cover letter:
> > https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html
> >
> > And in the v1 cover letter:
> > https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
> >
> >
> > Brian Song (4):
> >    export/fuse: add opt to enable FUSE-over-io_uring
> >    export/fuse: process FUSE-over-io_uring requests
> >    export/fuse: Safe termination for FUSE-uring
> >    iotests: add tests for FUSE-over-io_uring
> >
> >   block/export/fuse.c                  | 838 +++++++++++++++++++++------
> >   docs/tools/qemu-storage-daemon.rst   |  11 +-
> >   qapi/block-export.json               |   5 +-
> >   storage-daemon/qemu-storage-daemon.c |   1 +
> >   tests/qemu-iotests/check             |   2 +
> >   tests/qemu-iotests/common.rc         |  45 +-
> >   util/fdmon-io_uring.c                |   5 +-
> >   7 files changed, 717 insertions(+), 190 deletions(-)
> >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
@ 2025-09-03 10:53   ` Stefan Hajnoczi
  2025-09-03 18:00     ` Brian Song
  2025-09-03 11:26   ` Stefan Hajnoczi
  2025-09-16 19:08   ` Kevin Wolf
  2 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-03 10:53 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 23280 bytes --]

On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
> This patch adds a new export option for storage-export-daemon to enable
> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
> It also implements the protocol handshake with the Linux kernel
> during the FUSE-over-io_uring initialization phase.
> 
> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
> 
> The kernel documentation describes in detail how FUSE-over-io_uring
> works. This patch implements the Initial SQE stage shown in the diagram:
> it initializes one queue per IOThread, each currently supporting a
> single submission queue entry (SQE). When the FUSE driver sends the
> first FUSE request (FUSE_INIT), storage-export-daemon calls
> fuse_uring_start() to complete initialization, ultimately submitting
> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
> successful initialization with the kernel.
> 
> We also added support for multiple IOThreads. The current Linux kernel
> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
> To let users customize the number of FUSE Queues (i.e., IOThreads),
> we first create nproc Ring Queues as required by the kernel, then
> distribute them in a round-robin manner to the FUSE Queues for
> registration. In addition, to support multiple in-flight requests,
> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
> entries/requests.

The previous paragraph says "each currently supporting a single
submission queue entry (SQE)" whereas this paragraph says "we configure
each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
Maybe this paragraph was squashed into the commit description in a later
step and the previous paragraph can be updated to reflect that multiple
SQEs are submitted?

> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>
> ---
>  block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>  docs/tools/qemu-storage-daemon.rst   |  11 +-
>  qapi/block-export.json               |   5 +-
>  storage-daemon/qemu-storage-daemon.c |   1 +
>  util/fdmon-io_uring.c                |   5 +-
>  5 files changed, 309 insertions(+), 23 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index c0ad4696ce..19bf9e5f74 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -48,6 +48,9 @@
>  #include <linux/fs.h>
>  #endif
>  
> +/* room needed in buffer to accommodate header */
> +#define FUSE_BUFFER_HEADER_SIZE 0x1000

Is it possible to write this in a way that shows how the constant is
calculated? That way the constant would automatically adjust on systems
where the underlying assumptions have changed (e.g. page size, header
struct size). This approach is also self-documenting so it's possible to
understand where the magic number comes from.

For example:

  #define FUSE_BUFFER_HEADER_SIZE ROUND_UP(sizeof(struct fuse_uring_req_header), qemu_real_host_page_size())

(I'm guessing what the formula you used is, so this example may be
incorrect...)

> +
>  /* Prevent overly long bounce buffer allocations */
>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>  /*
> @@ -63,12 +66,59 @@
>      (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>  
>  typedef struct FuseExport FuseExport;
> +typedef struct FuseQueue FuseQueue;
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> +
> +typedef struct FuseRingQueue FuseRingQueue;
> +typedef struct FuseRingEnt {
> +    /* back pointer */
> +    FuseRingQueue *rq;
> +
> +    /* commit id of a fuse request */
> +    uint64_t req_commit_id;

This field is not used in this commit. Please introduce it in the commit
that uses it so it's easier to review and understand the purpose of this
field.

> +
> +    /* fuse request header and payload */
> +    struct fuse_uring_req_header req_header;
> +    void *op_payload;
> +    size_t req_payload_sz;

op_payload and req_payload_sz refer to the same buffer, and they are
submitted alongside req_header. It would be nice to name the fields
consistently:

  struct fuse_uring_req_header req_header;
  void *req_payload;
  size_t req_payload_sz;

req_payload and req_payload_sz could be eliminated since they are also
stored in iov[1].iov_base and .iov_len. If you feel that would be harder
to understand, then it's okay to keep the duplicate fields.

> +
> +    /* The vector passed to the kernel */
> +    struct iovec iov[2];
> +
> +    CqeHandler fuse_cqe_handler;
> +} FuseRingEnt;
> +
> +struct FuseRingQueue {

A comment would be nice here to explain that the kernel requires one
FuseRingQueue per host CPU and this concept is independent of /dev/fuse
(FuseQueue).

> +    int rqid;
> +
> +    /* back pointer */
> +    FuseQueue *q;
> +    FuseRingEnt *ent;
> +
> +    /* List entry for ring_queues */
> +    QLIST_ENTRY(FuseRingQueue) next;
> +};
> +
> +/*
> + * Round-robin distribution of ring queues across FUSE queues.
> + * This structure manages the mapping between kernel ring queues and user
> + * FUSE queues.
> + */
> +typedef struct FuseRingQueueManager {
> +    FuseRingQueue *ring_queues;
> +    int num_ring_queues;
> +    int num_fuse_queues;
> +} FuseRingQueueManager;
> +#endif

It's easy to forget which #ifdef we're inside after a few lines, so it
helps to indicate that in a comment:

#endif /* CONFIG_LINUX_IO_URING */

>  
>  /*
>   * One FUSE "queue", representing one FUSE FD from which requests are fetched
>   * and processed.  Each queue is tied to an AioContext.
>   */
> -typedef struct FuseQueue {
> +struct FuseQueue {
>      FuseExport *exp;
>  
>      AioContext *ctx;
> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>       * Free this buffer with qemu_vfree().
>       */
>      void *spillover_buf;
> -} FuseQueue;
>  
> -/*
> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
> - * are big enough to be accepted by the FUSE kernel driver.
> - */
> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> -                  FUSE_SPILLOVER_BUF_SIZE <
> -                  FUSE_MIN_READ_BUFFER);

Why was this removed, it's probably still necessary in the non-io_uring
case (which is compiled in even when CONFIG_LINUX_IO_URING is defined)?

> +#ifdef CONFIG_LINUX_IO_URING
> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
> +#endif
> +};
>  
>  struct FuseExport {
>      BlockExport common;
> @@ -133,7 +179,7 @@ struct FuseExport {
>       */
>      bool halted;
>  
> -    int num_queues;
> +    size_t num_queues;
>      FuseQueue *queues;
>      /*
>       * True if this export should follow the generic export's AioContext.
> @@ -149,6 +195,12 @@ struct FuseExport {
>      /* Whether allow_other was used as a mount option or not */
>      bool allow_other;
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    bool is_uring;
> +    size_t ring_queue_depth;
> +    FuseRingQueueManager *ring_queue_manager;
> +#endif
> +
>      mode_t st_mode;
>      uid_t st_uid;
>      gid_t st_gid;
> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>          return;
>      }
>  
> -    for (int i = 0; i < exp->num_queues; i++) {
> +    for (size_t i = 0; i < exp->num_queues; i++) {
>          aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>                             read_from_fuse_fd, NULL, NULL, NULL,
>                             &exp->queues[i]);
> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>      .drained_poll  = fuse_export_drained_poll,
>  };
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> +                    const unsigned int rqid,
> +                    const unsigned int commit_id)
> +{
> +    req->qid = rqid;
> +    req->commit_id = commit_id;
> +    req->flags = 0;
> +}
> +
> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
> +               __u32 cmd_op)
> +{
> +    sqe->opcode = IORING_OP_URING_CMD;
> +
> +    sqe->fd = q->fuse_fd;
> +    sqe->rw_flags = 0;
> +    sqe->ioprio = 0;
> +    sqe->off = 0;
> +
> +    sqe->cmd_op = cmd_op;
> +    sqe->__pad1 = 0;
> +}
> +
> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> +
> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
> +
> +    sqe->addr = (uint64_t)(ent->iov);
> +    sqe->len = 2;
> +
> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
> +}
> +
> +static void fuse_uring_submit_register(void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    FuseExport *exp = ent->rq->q->exp;

This variable is unused in this commit? Does this commit compile for
you? Usually the compiler warns about unused variables.

> +
> +
> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
> +}
> +
> +/**
> + * Distribute ring queues across FUSE queues using round-robin algorithm.
> + * This ensures even distribution of kernel ring queues across user-specified
> + * FUSE queues.
> + */
> +static
> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
> +                                                    size_t ring_queue_depth,
> +                                                    size_t bufsize)
> +{
> +    int num_ring_queues = get_nprocs();

The kernel code uses num_possible_cpus() in
fs/fuse/dev_uring.c:fuse_uring_create() so I think this should be
get_nprocs_conf() instead of get_nprocs().
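
The difference is visible on hosts with offlined CPUs; a minimal sketch
(both calls are GNU extensions from <sys/sysinfo.h>):

  #include <stdio.h>
  #include <sys/sysinfo.h>

  int main(void)
  {
      /* CPUs currently online; shrinks when CPUs are hot-unplugged */
      printf("online:     %d\n", get_nprocs());
      /* CPUs the system is configured with; the closer match to the
       * kernel's num_possible_cpus() */
      printf("configured: %d\n", get_nprocs_conf());
      return 0;
  }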

> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
> +
> +    if (!manager) {

g_new() never returns NULL, so you can remove this if statement. If
memory cannot be allocated then the process will abort.

> +        return NULL;
> +    }
> +
> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
> +    manager->num_ring_queues = num_ring_queues;
> +    manager->num_fuse_queues = num_fuse_queues;
> +
> +    if (!manager->ring_queues) {

Same here.

> +        g_free(manager);
> +        return NULL;
> +    }
> +
> +    for (int i = 0; i < num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +        rq->rqid = i;
> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
> +
> +        if (!rq->ent) {

Same here.

> +            for (int j = 0; j < i; j++) {
> +                g_free(manager->ring_queues[j].ent);
> +            }
> +            g_free(manager->ring_queues);
> +            g_free(manager);
> +            return NULL;
> +        }
> +
> +        for (size_t j = 0; j < ring_queue_depth; j++) {
> +            FuseRingEnt *ent = &rq->ent[j];
> +            ent->rq = rq;
> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
> +
> +            if (!ent->op_payload) {

Same here.

> +                for (size_t k = 0; k < j; k++) {
> +                    g_free(rq->ent[k].op_payload);
> +                }
> +                g_free(rq->ent);
> +                for (int k = 0; k < i; k++) {
> +                    g_free(manager->ring_queues[k].ent);
> +                }
> +                g_free(manager->ring_queues);
> +                g_free(manager);

Where are these structures freed in the normal lifecycle of a FUSE
export? I only see this error handling code, but nothing is freed when
the export is shut down.
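
Something along these lines in the export shutdown path would plug the
leak (a sketch only; fuse_ring_queue_manager_destroy() is a
hypothetical helper, the depth would have to be passed in or stored in
the manager, and unlinking from ring_queue_list is omitted):

  static void fuse_ring_queue_manager_destroy(FuseRingQueueManager *manager,
                                              size_t ring_queue_depth)
  {
      if (!manager) {
          return;
      }
      for (int i = 0; i < manager->num_ring_queues; i++) {
          FuseRingQueue *rq = &manager->ring_queues[i];

          for (size_t j = 0; j < ring_queue_depth; j++) {
              g_free(rq->ent[j].op_payload);
          }
          g_free(rq->ent);
      }
      g_free(manager->ring_queues);
      g_free(manager);
  }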

> +                return NULL;
> +            }
> +
> +            ent->iov[0] = (struct iovec) {
> +                &(ent->req_header),
> +                sizeof(struct fuse_uring_req_header)
> +            };
> +            ent->iov[1] = (struct iovec) {
> +                ent->op_payload,
> +                ent->req_payload_sz
> +            };
> +
> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
> +        }
> +    }
> +
> +    return manager;
> +}
> +
> +static
> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
> +{
> +    int queue_index = 0;
> +
> +    for (int i = 0; i < manager->num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +
> +        rq->q = &exp->queues[queue_index];
> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
> +
> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
> +    }
> +}
> +
> +static
> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
> +                                            FuseRingQueueManager *manager)
> +{
> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
> +        FuseQueue *q = &exp->queues[i];
> +        FuseRingQueue *rq;
> +
> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
> +                                        &(rq->ent[j]));
> +            }
> +        }
> +    }
> +}
> +
> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> +{
> +    /*
> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> +     * the kernel by default. Also, max_write should not exceed
> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> +     */
> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> +
> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> +                         + FUSE_BUFFER_HEADER_SIZE;
> +    }
> +
> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
> +        exp->num_queues, exp->ring_queue_depth, bufsize);
> +
> +    if (!exp->ring_queue_manager) {
> +        error_report("Failed to create ring queue manager");
> +        return;
> +    }
> +
> +    /* Distribute ring queues across FUSE queues using round-robin */
> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
> +
> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
> +}
> +#endif
> +
>  static int fuse_export_create(BlockExport *blk_exp,
>                                BlockExportOptions *blk_exp_args,
>                                AioContext *const *multithread,
> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>  
>      assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    exp->is_uring = args->io_uring;
> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
> +#endif
> +
>      if (multithread) {
>          /* Guaranteed by common export code */
>          assert(mt_count >= 1);
> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>                  .exp = exp,
>                  .ctx = multithread[i],
>                  .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +                .ring_queue_list =
> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
> +#endif
>              };
>          }
>      } else {
> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>              .exp = exp,
>              .ctx = exp->common.ctx,
>              .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +            .ring_queue_list =
> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
> +#endif
>          };
>      }
>  
> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>   */
>  static ssize_t coroutine_fn
>  fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> -             uint32_t max_readahead, uint32_t flags)
> +             uint32_t max_readahead, const struct fuse_init_in *in)
>  {
> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
> +                                     | FUSE_INIT_EXT;
> +    uint64_t outargflags = 0;
> +    uint64_t inargflags = in->flags;
> +
> +    ssize_t ret = 0;
> +
> +    if (inargflags & FUSE_INIT_EXT) {
> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
> +    }
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (exp->is_uring) {
> +        if (inargflags & FUSE_OVER_IO_URING) {
> +            supported_flags |= FUSE_OVER_IO_URING;
> +        } else {
> +            exp->is_uring = false;
> +            ret = -ENODEV;
> +        }
> +    }
> +#endif
> +
> +    outargflags = inargflags & supported_flags;
>  
>      *out = (struct fuse_init_out) {
>          .major = FUSE_KERNEL_VERSION,
>          .minor = FUSE_KERNEL_MINOR_VERSION,
>          .max_readahead = max_readahead,
>          .max_write = FUSE_MAX_WRITE_BYTES,
> -        .flags = flags & supported_flags,
> -        .flags2 = 0,
> +        .flags = outargflags,
> +        .flags2 = outargflags >> 32,
>  
>          /* libfuse maximum: 2^16 - 1 */
>          .max_background = UINT16_MAX,
> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>          .map_alignment = 0,
>      };
>  
> -    return sizeof(*out);
> +    return ret < 0 ? ret : sizeof(*out);
>  }
>  
>  /**
> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>          fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>                                  out_data_buffer, ret);
>          qemu_vfree(out_data_buffer);
> +#ifdef CONFIG_LINUX_IO_URING
> +    /* Handle FUSE-over-io_uring initialization */
> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
> +        struct fuse_init_out *out =
> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
> +        fuse_uring_start(exp, out);

Is there any scenario where FUSE_INIT can be received multiple times?
Maybe if the FUSE file system is umounted and mounted again? I want to
check that this doesn't leak previously allocated ring state.
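
If it can happen, a guard at the top of fuse_uring_start() would make
the behavior explicit either way (sketch):

  if (exp->ring_queue_manager) {
      /* FUSE_INIT already seen; don't leak the existing ring state */
      return;
  }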

> +    }
> +#endif
>      } else {
>          fuse_write_response(q->fuse_fd, req_id, out_hdr,
>                              ret < 0 ? ret : 0,
> diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
> index 35ab2d7807..c5076101e0 100644
> --- a/docs/tools/qemu-storage-daemon.rst
> +++ b/docs/tools/qemu-storage-daemon.rst
> @@ -78,7 +78,7 @@ Standard options:
>  .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
> -  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
> +  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off]
>    --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
>  
>    is a block export definition. ``node-name`` is the block node that should be
> @@ -111,10 +111,11 @@ Standard options:
>    that enabling this option as a non-root user requires enabling the
>    user_allow_other option in the global fuse.conf configuration file.  Setting
>    ``allow-other`` to auto (the default) will try enabling this option, and on
> -  error fall back to disabling it.
> -
> -  The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
> -  to create the VDUSE device.
> +  error fall back to disabling it. Once ``io-uring`` is enabled (off by default),
> +  the FUSE-over-io_uring-related settings will be initialized to bypass the
> +  traditional /dev/fuse communication mechanism and instead use io_uring to
> +  handle FUSE operations. The ``vduse-blk`` export type takes a ``name``
> +  (must be unique across the host) to create the VDUSE device.
>    ``num-queues`` sets the number of virtqueues (the default is 1).
>    ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
>  
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index 9ae703ad01..37f2fc47e2 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -184,12 +184,15 @@
>  #     mount the export with allow_other, and if that fails, try again
>  #     without.  (since 6.1; default: auto)
>  #
> +# @io-uring: Use FUSE-over-io-uring.  (since 10.2; default: false)
> +#
>  # Since: 6.0
>  ##
>  { 'struct': 'BlockExportOptionsFuse',
>    'data': { 'mountpoint': 'str',
>              '*growable': 'bool',
> -            '*allow-other': 'FuseExportAllowOther' },
> +            '*allow-other': 'FuseExportAllowOther',
> +            '*io-uring': 'bool' },
>    'if': 'CONFIG_FUSE' }
>  
>  ##
> diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
> index eb72561358..0cd4cd2b58 100644
> --- a/storage-daemon/qemu-storage-daemon.c
> +++ b/storage-daemon/qemu-storage-daemon.c
> @@ -107,6 +107,7 @@ static void help(void)
>  #ifdef CONFIG_FUSE
>  "  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n"
>  "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
> +"           [,io-uring=on|off]"
>  "                         export the specified block node over FUSE\n"
>  "\n"
>  #endif /* CONFIG_FUSE */
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index d2433d1d99..68d3fe8e01 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
>  void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>  {
>      int ret;
> +    int flags;
>  
>      ctx->io_uring_fd_tag = NULL;
> +    flags = IORING_SETUP_SQE128;

Please add /* needed by FUSE-over-io_uring */ so it's clear who the user
is.
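
I.e.:

  flags = IORING_SETUP_SQE128; /* needed by FUSE-over-io_uring */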

>  
> -    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
> +    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
> +                            &ctx->fdmon_io_uring, flags);
>      if (ret != 0) {
>          error_setg_errno(errp, -ret, "Failed to initialize io_uring");
>          return;
> -- 
> 2.45.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
  2025-09-03 10:53   ` Stefan Hajnoczi
@ 2025-09-03 11:26   ` Stefan Hajnoczi
  2025-09-16 19:08   ` Kevin Wolf
  2 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-03 11:26 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 20861 bytes --]

On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
> This patch adds a new export option for storage-export-daemon to enable
> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
> It also implements the protocol handshake with the Linux kernel
> during the FUSE-over-io_uring initialization phase.
> 
> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
> 
> The kernel documentation describes in detail how FUSE-over-io_uring
> works. This patch implements the Initial SQE stage shown in the diagram:
> it initializes one queue per IOThread, each currently supporting a
> single submission queue entry (SQE). When the FUSE driver sends the
> first FUSE request (FUSE_INIT), storage-export-daemon calls
> fuse_uring_start() to complete initialization, ultimately submitting
> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
> successful initialization with the kernel.
> 
> We also added support for multiple IOThreads. The current Linux kernel
> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
> To let users customize the number of FUSE Queues (i.e., IOThreads),
> we first create nproc Ring Queues as required by the kernel, then
> distribute them in a round-robin manner to the FUSE Queues for
> registration. In addition, to support multiple in-flight requests,
> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
> entries/requests.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>
> ---
>  block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>  docs/tools/qemu-storage-daemon.rst   |  11 +-
>  qapi/block-export.json               |   5 +-
>  storage-daemon/qemu-storage-daemon.c |   1 +
>  util/fdmon-io_uring.c                |   5 +-
>  5 files changed, 309 insertions(+), 23 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index c0ad4696ce..19bf9e5f74 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -48,6 +48,9 @@
>  #include <linux/fs.h>
>  #endif
>  
> +/* room needed in buffer to accommodate header */
> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> +
>  /* Prevent overly long bounce buffer allocations */
>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>  /*
> @@ -63,12 +66,59 @@
>      (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>  
>  typedef struct FuseExport FuseExport;
> +typedef struct FuseQueue FuseQueue;
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> +
> +typedef struct FuseRingQueue FuseRingQueue;
> +typedef struct FuseRingEnt {
> +    /* back pointer */
> +    FuseRingQueue *rq;
> +
> +    /* commit id of a fuse request */
> +    uint64_t req_commit_id;
> +
> +    /* fuse request header and payload */
> +    struct fuse_uring_req_header req_header;
> +    void *op_payload;
> +    size_t req_payload_sz;
> +
> +    /* The vector passed to the kernel */
> +    struct iovec iov[2];
> +
> +    CqeHandler fuse_cqe_handler;
> +} FuseRingEnt;
> +
> +struct FuseRingQueue {
> +    int rqid;
> +
> +    /* back pointer */
> +    FuseQueue *q;
> +    FuseRingEnt *ent;
> +
> +    /* List entry for ring_queues */
> +    QLIST_ENTRY(FuseRingQueue) next;
> +};
> +
> +/*
> + * Round-robin distribution of ring queues across FUSE queues.
> + * This structure manages the mapping between kernel ring queues and user
> + * FUSE queues.
> + */
> +typedef struct FuseRingQueueManager {
> +    FuseRingQueue *ring_queues;
> +    int num_ring_queues;
> +    int num_fuse_queues;
> +} FuseRingQueueManager;
> +#endif
>  
>  /*
>   * One FUSE "queue", representing one FUSE FD from which requests are fetched
>   * and processed.  Each queue is tied to an AioContext.
>   */
> -typedef struct FuseQueue {
> +struct FuseQueue {
>      FuseExport *exp;
>  
>      AioContext *ctx;
> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>       * Free this buffer with qemu_vfree().
>       */
>      void *spillover_buf;
> -} FuseQueue;
>  
> -/*
> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
> - * are big enough to be accepted by the FUSE kernel driver.
> - */
> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> -                  FUSE_SPILLOVER_BUF_SIZE <
> -                  FUSE_MIN_READ_BUFFER);
> +#ifdef CONFIG_LINUX_IO_URING
> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
> +#endif
> +};
>  
>  struct FuseExport {
>      BlockExport common;
> @@ -133,7 +179,7 @@ struct FuseExport {
>       */
>      bool halted;
>  
> -    int num_queues;
> +    size_t num_queues;
>      FuseQueue *queues;
>      /*
>       * True if this export should follow the generic export's AioContext.
> @@ -149,6 +195,12 @@ struct FuseExport {
>      /* Whether allow_other was used as a mount option or not */
>      bool allow_other;
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    bool is_uring;
> +    size_t ring_queue_depth;
> +    FuseRingQueueManager *ring_queue_manager;
> +#endif
> +
>      mode_t st_mode;
>      uid_t st_uid;
>      gid_t st_gid;
> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>          return;
>      }
>  
> -    for (int i = 0; i < exp->num_queues; i++) {
> +    for (size_t i = 0; i < exp->num_queues; i++) {
>          aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>                             read_from_fuse_fd, NULL, NULL, NULL,
>                             &exp->queues[i]);
> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>      .drained_poll  = fuse_export_drained_poll,
>  };
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> +                    const unsigned int rqid,
> +                    const unsigned int commit_id)
> +{
> +    req->qid = rqid;
> +    req->commit_id = commit_id;
> +    req->flags = 0;
> +}
> +
> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
> +               __u32 cmd_op)
> +{
> +    sqe->opcode = IORING_OP_URING_CMD;
> +
> +    sqe->fd = q->fuse_fd;
> +    sqe->rw_flags = 0;
> +    sqe->ioprio = 0;
> +    sqe->off = 0;
> +
> +    sqe->cmd_op = cmd_op;
> +    sqe->__pad1 = 0;
> +}
> +
> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> +
> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
> +
> +    sqe->addr = (uint64_t)(ent->iov);
> +    sqe->len = 2;
> +
> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
> +}
> +
> +static void fuse_uring_submit_register(void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +
> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
> +}
> +
> +/**
> + * Distribute ring queues across FUSE queues using round-robin algorithm.
> + * This ensures even distribution of kernel ring queues across user-specified
> + * FUSE queues.
> + */
> +static
> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
> +                                                    size_t ring_queue_depth,
> +                                                    size_t bufsize)
> +{
> +    int num_ring_queues = get_nprocs();
> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
> +
> +    if (!manager) {
> +        return NULL;
> +    }
> +
> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
> +    manager->num_ring_queues = num_ring_queues;
> +    manager->num_fuse_queues = num_fuse_queues;
> +
> +    if (!manager->ring_queues) {
> +        g_free(manager);
> +        return NULL;
> +    }
> +
> +    for (int i = 0; i < num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +        rq->rqid = i;
> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
> +
> +        if (!rq->ent) {
> +            for (int j = 0; j < i; j++) {
> +                g_free(manager->ring_queues[j].ent);
> +            }
> +            g_free(manager->ring_queues);
> +            g_free(manager);
> +            return NULL;
> +        }
> +
> +        for (size_t j = 0; j < ring_queue_depth; j++) {
> +            FuseRingEnt *ent = &rq->ent[j];
> +            ent->rq = rq;
> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
> +
> +            if (!ent->op_payload) {
> +                for (size_t k = 0; k < j; k++) {
> +                    g_free(rq->ent[k].op_payload);
> +                }
> +                g_free(rq->ent);
> +                for (int k = 0; k < i; k++) {
> +                    g_free(manager->ring_queues[k].ent);
> +                }
> +                g_free(manager->ring_queues);
> +                g_free(manager);
> +                return NULL;
> +            }
> +
> +            ent->iov[0] = (struct iovec) {
> +                &(ent->req_header),
> +                sizeof(struct fuse_uring_req_header)
> +            };
> +            ent->iov[1] = (struct iovec) {
> +                ent->op_payload,
> +                ent->req_payload_sz
> +            };
> +
> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;

I just noticed this commit won't compile because
fuse_uring_cqe_handler() is introduced in the next commit. There are
several options for resolving this. I suggest squashing the next commit
into this one.

The reason why every commit must compile is that git-bisect(1) is only
useful when the code compiles and passes tests at every commit. If there
are broken commits then bisection becomes impractical because you have
to troubleshoot intermediate commits that may be broken due to issues
unrelated to your bisection.
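
For example, a build-only bisection looks like:

  git bisect start <bad-commit> <good-commit>
  git bisect run make -j$(nproc)

git-bisect(1) classifies each commit by the run command's exit status,
so a commit that fails to build derails the search for the actual
regression.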

> +        }
> +    }
> +
> +    return manager;
> +}
> +
> +static
> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
> +{
> +    int queue_index = 0;
> +
> +    for (int i = 0; i < manager->num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +
> +        rq->q = &exp->queues[queue_index];
> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
> +
> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
> +    }
> +}
> +
> +static
> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
> +                                            FuseRingQueueManager *manager)
> +{
> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
> +        FuseQueue *q = &exp->queues[i];
> +        FuseRingQueue *rq;
> +
> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
> +                                        &(rq->ent[j]));
> +            }
> +        }
> +    }
> +}
> +
> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> +{
> +    /*
> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> +     * the kernel by default. Also, max_write should not exceed
> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> +     */
> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> +
> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> +                         + FUSE_BUFFER_HEADER_SIZE;
> +    }
> +
> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
> +        exp->num_queues, exp->ring_queue_depth, bufsize);
> +
> +    if (!exp->ring_queue_manager) {
> +        error_report("Failed to create ring queue manager");
> +        return;
> +    }
> +
> +    /* Distribute ring queues across FUSE queues using round-robin */
> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
> +
> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
> +}
> +#endif
> +
>  static int fuse_export_create(BlockExport *blk_exp,
>                                BlockExportOptions *blk_exp_args,
>                                AioContext *const *multithread,
> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>  
>      assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    exp->is_uring = args->io_uring;
> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
> +#endif
> +
>      if (multithread) {
>          /* Guaranteed by common export code */
>          assert(mt_count >= 1);
> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>                  .exp = exp,
>                  .ctx = multithread[i],
>                  .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +                .ring_queue_list =
> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
> +#endif
>              };
>          }
>      } else {
> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>              .exp = exp,
>              .ctx = exp->common.ctx,
>              .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +            .ring_queue_list =
> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
> +#endif
>          };
>      }
>  
> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>   */
>  static ssize_t coroutine_fn
>  fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> -             uint32_t max_readahead, uint32_t flags)
> +             uint32_t max_readahead, const struct fuse_init_in *in)
>  {
> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
> +                                     | FUSE_INIT_EXT;
> +    uint64_t outargflags = 0;
> +    uint64_t inargflags = in->flags;
> +
> +    ssize_t ret = 0;
> +
> +    if (inargflags & FUSE_INIT_EXT) {
> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
> +    }
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (exp->is_uring) {
> +        if (inargflags & FUSE_OVER_IO_URING) {
> +            supported_flags |= FUSE_OVER_IO_URING;
> +        } else {
> +            exp->is_uring = false;
> +            ret = -ENODEV;
> +        }
> +    }
> +#endif
> +
> +    outargflags = inargflags & supported_flags;
>  
>      *out = (struct fuse_init_out) {
>          .major = FUSE_KERNEL_VERSION,
>          .minor = FUSE_KERNEL_MINOR_VERSION,
>          .max_readahead = max_readahead,
>          .max_write = FUSE_MAX_WRITE_BYTES,
> -        .flags = flags & supported_flags,
> -        .flags2 = 0,
> +        .flags = outargflags,
> +        .flags2 = outargflags >> 32,
>  
>          /* libfuse maximum: 2^16 - 1 */
>          .max_background = UINT16_MAX,
> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>          .map_alignment = 0,
>      };
>  
> -    return sizeof(*out);
> +    return ret < 0 ? ret : sizeof(*out);
>  }
>  
>  /**
> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>          fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>                                  out_data_buffer, ret);
>          qemu_vfree(out_data_buffer);
> +#ifdef CONFIG_LINUX_IO_URING
> +    /* Handle FUSE-over-io_uring initialization */
> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
> +        struct fuse_init_out *out =
> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
> +        fuse_uring_start(exp, out);
> +    }
> +#endif
>      } else {
>          fuse_write_response(q->fuse_fd, req_id, out_hdr,
>                              ret < 0 ? ret : 0,
> diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
> index 35ab2d7807..c5076101e0 100644
> --- a/docs/tools/qemu-storage-daemon.rst
> +++ b/docs/tools/qemu-storage-daemon.rst
> @@ -78,7 +78,7 @@ Standard options:
>  .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
> -  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
> +  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off]
>    --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
>  
>    is a block export definition. ``node-name`` is the block node that should be
> @@ -111,10 +111,11 @@ Standard options:
>    that enabling this option as a non-root user requires enabling the
>    user_allow_other option in the global fuse.conf configuration file.  Setting
>    ``allow-other`` to auto (the default) will try enabling this option, and on
> -  error fall back to disabling it.
> -
> -  The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
> -  to create the VDUSE device.
> +  error fall back to disabling it. Once ``io-uring`` is enabled (off by default),
> +  the FUSE-over-io_uring-related settings will be initialized to bypass the
> +  traditional /dev/fuse communication mechanism and instead use io_uring to
> +  handle FUSE operations. The ``vduse-blk`` export type takes a ``name``
> +  (must be unique across the host) to create the VDUSE device.
>    ``num-queues`` sets the number of virtqueues (the default is 1).
>    ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
>  
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index 9ae703ad01..37f2fc47e2 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -184,12 +184,15 @@
>  #     mount the export with allow_other, and if that fails, try again
>  #     without.  (since 6.1; default: auto)
>  #
> +# @io-uring: Use FUSE-over-io-uring.  (since 10.2; default: false)
> +#
>  # Since: 6.0
>  ##
>  { 'struct': 'BlockExportOptionsFuse',
>    'data': { 'mountpoint': 'str',
>              '*growable': 'bool',
> -            '*allow-other': 'FuseExportAllowOther' },
> +            '*allow-other': 'FuseExportAllowOther',
> +            '*io-uring': 'bool' },
>    'if': 'CONFIG_FUSE' }
>  
>  ##
> diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
> index eb72561358..0cd4cd2b58 100644
> --- a/storage-daemon/qemu-storage-daemon.c
> +++ b/storage-daemon/qemu-storage-daemon.c
> @@ -107,6 +107,7 @@ static void help(void)
>  #ifdef CONFIG_FUSE
>  "  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n"
>  "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
> +"           [,io-uring=on|off]"
>  "                         export the specified block node over FUSE\n"
>  "\n"
>  #endif /* CONFIG_FUSE */
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index d2433d1d99..68d3fe8e01 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
>  void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>  {
>      int ret;
> +    int flags;
>  
>      ctx->io_uring_fd_tag = NULL;
> +    flags = IORING_SETUP_SQE128;
>  
> -    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
> +    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
> +                            &ctx->fdmon_io_uring, flags);
>      if (ret != 0) {
>          error_setg_errno(errp, -ret, "Failed to initialize io_uring");
>          return;
> -- 
> 2.45.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-08-30  2:50 ` [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests Brian Song
@ 2025-09-03 11:51   ` Stefan Hajnoczi
  2025-09-08 19:09     ` Brian Song
  2025-09-19 13:54   ` Kevin Wolf
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-03 11:51 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

On Fri, Aug 29, 2025 at 10:50:23PM -0400, Brian Song wrote:
> https://docs.kernel.org/filesystems/fuse-io-uring.html
> 
> As described in the kernel documentation, after FUSE-over-io_uring
> initialization and handshake, FUSE interacts with the kernel using
> SQE/CQE to send requests and receive responses. This corresponds to
> the "Sending requests with CQEs" section in the docs.
> 
> This patch implements three key parts: registering the CQE handler
> (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
> process_request), and sending response results (fuse_uring_send_
> response). It also merges the traditional /dev/fuse request handling
> with the FUSE-over-io_uring handling functions.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>
> ---
>  block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 309 insertions(+), 148 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 19bf9e5f74..07f74fc8ec 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>  };
>  
>  #ifdef CONFIG_LINUX_IO_URING
> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> +
> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)

This function appears to handle exactly one cqe. A singular function
name would be clearer than a plural: co_fuse_uring_queue_handle_cqe().

> +{
> +    FuseRingEnt *ent = opaque;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +    /* Going to process requests */
> +    fuse_inc_in_flight(exp);

What is the rationale for taking a reference here? Normally something
already holds a reference (e.g. the request itself) and it will be
dropped somewhere inside a function we're about to call, but we still
need to access exp afterwards, so we temporarily take a reference.
Please document the specifics in a comment.

I think blk_exp_ref()/blk_exp_unref() are appropriate instead of
fuse_inc_in_flight()/fuse_dec_in_flight() since we only need to hold
onto the export and don't care about drain behavior.
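
I.e. something like (a sketch of the suggested pattern, not tested):

  static void coroutine_fn co_fuse_uring_queue_handle_cqe(void *opaque)
  {
      FuseRingEnt *ent = opaque;
      FuseExport *exp = ent->rq->q->exp;

      /* Keep the export alive until request processing has finished */
      blk_exp_ref(&exp->common);

      fuse_uring_co_process_request(ent);

      blk_exp_unref(&exp->common);
  }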

> +
> +    /* A ring entry returned */
> +    fuse_uring_co_process_request(ent);
> +
> +    /* Finished processing requests */
> +    fuse_dec_in_flight(exp);
> +}
> +
> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> +{
> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> +    Coroutine *co;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +    if (unlikely(exp->halted)) {
> +        return;
> +    }
> +
> +    int err = cqe_handler->cqe.res;
> +
> +    if (err != 0) {
> +        /* -ENOTCONN is ok on umount  */
> +        if (err != -EINTR && err != -EAGAIN &&
> +            err != -ENOTCONN) {
> +            fuse_export_halt(exp);
> +        }

How are EINTR and EAGAIN handled if they are silently ignored? When did
you encounter these error codes?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-03 10:53   ` Stefan Hajnoczi
@ 2025-09-03 18:00     ` Brian Song
  2025-09-09 14:48       ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-03 18:00 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/3/25 6:53 AM, Stefan Hajnoczi wrote:
> On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
>> This patch adds a new export option for storage-export-daemon to enable
>> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
>> It also implements the protocol handshake with the Linux kernel
>> during the FUSE-over-io_uring initialization phase.
>>
>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
>>
>> The kernel documentation describes in detail how FUSE-over-io_uring
>> works. This patch implements the Initial SQE stage shown in the diagram:
>> it initializes one queue per IOThread, each currently supporting a
>> single submission queue entry (SQE). When the FUSE driver sends the
>> first FUSE request (FUSE_INIT), storage-export-daemon calls
>> fuse_uring_start() to complete initialization, ultimately submitting
>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
>> successful initialization with the kernel.
>>
>> We also added support for multiple IOThreads. The current Linux kernel
>> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
>> To let users customize the number of FUSE Queues (i.e., IOThreads),
>> we first create nproc Ring Queues as required by the kernel, then
>> distribute them in a round-robin manner to the FUSE Queues for
>> registration. In addition, to support multiple in-flight requests,
>> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
>> entries/requests.
> 
> The previous paragraph says "each currently supporting a single
> submission queue entry (SQE)" whereas this paragraph says "we configure
> each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
> Maybe this paragraph was squashed into the commit description in a later
> step and the previous paragraph can be updated to reflect that multiple
> SQEs are submitted?
> 
>>
>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>> ---
>>   block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>>   docs/tools/qemu-storage-daemon.rst   |  11 +-
>>   qapi/block-export.json               |   5 +-
>>   storage-daemon/qemu-storage-daemon.c |   1 +
>>   util/fdmon-io_uring.c                |   5 +-
>>   5 files changed, 309 insertions(+), 23 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index c0ad4696ce..19bf9e5f74 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -48,6 +48,9 @@
>>   #include <linux/fs.h>
>>   #endif
>>   
>> +/* room needed in buffer to accommodate header */
>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> 
> Is it possible to write this in a way that shows how the constant is
> calculated? That way the constant would automatically adjust on systems
> where the underlying assumptions have changed (e.g. page size, header
> struct size). This approach is also self-documenting so it's possible to
> understand where the magic number comes from.
> 
> For example:
> 
>    #define FUSE_BUFFER_HEADER_SIZE ROUND_UP(sizeof(struct fuse_uring_req_header), qemu_real_host_page_size())
> 
> (I'm guessing what the formula you used is, so this example may be
> incorrect...)
> 

In libfuse, the bufsize (for req_payload) is calculated the same way as
in this patch. The request header size differs between request types,
but it should never exceed a certain bound. Is that why libfuse uses
this kind of magic number?
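
If we keep the fixed value, a compile-time check would at least pin
down the assumption behind the magic number (a sketch, assuming the
header area only ever has to hold struct fuse_uring_req_header):

  QEMU_BUILD_BUG_ON(sizeof(struct fuse_uring_req_header) >
                    FUSE_BUFFER_HEADER_SIZE);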

>> +
>>   /* Prevent overly long bounce buffer allocations */
>>   #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>>   /*
>> @@ -63,12 +66,59 @@
>>       (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>>   
>>   typedef struct FuseExport FuseExport;
>> +typedef struct FuseQueue FuseQueue;
>> +
>> +#ifdef CONFIG_LINUX_IO_URING
>> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
>> +
>> +typedef struct FuseRingQueue FuseRingQueue;
>> +typedef struct FuseRingEnt {
>> +    /* back pointer */
>> +    FuseRingQueue *rq;
>> +
>> +    /* commit id of a fuse request */
>> +    uint64_t req_commit_id;
> 
> This field is not used in this commit. Please introduce it in the commit
> that uses it so it's easier to review and understand the purpose of this
> field.
> 
>> +
>> +    /* fuse request header and payload */
>> +    struct fuse_uring_req_header req_header;
>> +    void *op_payload;
>> +    size_t req_payload_sz;
> 
> op_payload and req_payload_sz refer to the same buffer, and they are
> submitted alongside req_header. It would be nice to name the fields
> consistently:
> 
>    struct fuse_uring_req_header req_header;
>    void *req_payload;
>    size_t req_payload_sz;
> 
> req_payload and req_payload_sz could be eliminated since they are also
> stored in iov[1].iov_base and .iov_len. If you feel that would be harder
> to understand, then it's okay to keep the duplicate fields.
> 

Makes sense. I followed the design in libfuse. It's probably best to
just leave them in the struct for readability.

>> +
>> +    /* The vector passed to the kernel */
>> +    struct iovec iov[2];
>> +
>> +    CqeHandler fuse_cqe_handler;
>> +} FuseRingEnt;
>> +
>> +struct FuseRingQueue {
> 
> A comment would be nice here to explain that the kernel requires one
> FuseRingQueue per host CPU and this concept is independent of /dev/fuse
> (FuseQueue).
> 
>> +    int rqid;
>> +
>> +    /* back pointer */
>> +    FuseQueue *q;
>> +    FuseRingEnt *ent;
>> +
>> +    /* List entry for ring_queues */
>> +    QLIST_ENTRY(FuseRingQueue) next;
>> +};
>> +
>> +/*
>> + * Round-robin distribution of ring queues across FUSE queues.
>> + * This structure manages the mapping between kernel ring queues and user
>> + * FUSE queues.
>> + */
>> +typedef struct FuseRingQueueManager {
>> +    FuseRingQueue *ring_queues;
>> +    int num_ring_queues;
>> +    int num_fuse_queues;
>> +} FuseRingQueueManager;
>> +#endif
> 
> It's easy to forget which #ifdef we're inside after a few lines, so it
> helps to indicate that in a comment:
> 
> #endif /* CONFIG_LINUX_IO_URING */
> 
>>   
>>   /*
>>    * One FUSE "queue", representing one FUSE FD from which requests are fetched
>>    * and processed.  Each queue is tied to an AioContext.
>>    */
>> -typedef struct FuseQueue {
>> +struct FuseQueue {
>>       FuseExport *exp;
>>   
>>       AioContext *ctx;
>> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>>        * Free this buffer with qemu_vfree().
>>        */
>>       void *spillover_buf;
>> -} FuseQueue;
>>   
>> -/*
>> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
>> - * are big enough to be accepted by the FUSE kernel driver.
>> - */
>> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
>> -                  FUSE_SPILLOVER_BUF_SIZE <
>> -                  FUSE_MIN_READ_BUFFER);
> 
> Why was this removed, it's probably still necessary in the non-io_uring
> case (which is compiled in even when CONFIG_LINUX_IO_URING is defined)?
> 

You can check Hanna’s patch. In fuse_co_process_request, Hanna 
introduced this check when using FUSE_OUT_OP_STRUCT to cast void *buf 
into the corresponding in/out header for the given operation.

But in the v2 patch, we merged the legacy process_request and the uring 
version into one. As a result, the legacy path now passes the array into 
the common function, where it decays to a pointer, so the buf header 
size check only ever checks the size of that pointer.

#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
     ({ \
         struct fuse_out_header *__out_hdr = \
             (struct fuse_out_header *)(out_buf); \
         struct fuse_##op_name##_out *__out = \
             (struct fuse_##op_name##_out *)(__out_hdr + 1); \
         \
         QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
                           sizeof(out_buf)); \
         \
         __out; \
     })
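
A minimal standalone illustration of that decay (hypothetical code, not 
from the patch; compilers even warn about sizeof on an array parameter):

#include <stdio.h>

static void takes_pointer(char buf[8192])   /* adjusts to 'char *buf' */
{
    printf("%zu\n", sizeof(buf));           /* 8 on 64-bit: pointer size */
}

int main(void)
{
    char request_buf[8192];
    printf("%zu\n", sizeof(request_buf));   /* 8192: array type visible */
    takes_pointer(request_buf);
    return 0;
}

So once request_buf is passed into the merged function, the build-time 
check can only see a pointer and no longer verifies the buffer size.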


>> +#ifdef CONFIG_LINUX_IO_URING
>> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
>> +#endif
>> +};
>>   
>>   struct FuseExport {
>>       BlockExport common;
>> @@ -133,7 +179,7 @@ struct FuseExport {
>>        */
>>       bool halted;
>>   
>> -    int num_queues;
>> +    size_t num_queues;
>>       FuseQueue *queues;
>>       /*
>>        * True if this export should follow the generic export's AioContext.
>> @@ -149,6 +195,12 @@ struct FuseExport {
>>       /* Whether allow_other was used as a mount option or not */
>>       bool allow_other;
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    bool is_uring;
>> +    size_t ring_queue_depth;
>> +    FuseRingQueueManager *ring_queue_manager;
>> +#endif
>> +
>>       mode_t st_mode;
>>       uid_t st_uid;
>>       gid_t st_gid;
>> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>>           return;
>>       }
>>   
>> -    for (int i = 0; i < exp->num_queues; i++) {
>> +    for (size_t i = 0; i < exp->num_queues; i++) {
>>           aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>>                              read_from_fuse_fd, NULL, NULL, NULL,
>>                              &exp->queues[i]);
>> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>       .drained_poll  = fuse_export_drained_poll,
>>   };
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
>> +                    const unsigned int rqid,
>> +                    const unsigned int commit_id)
>> +{
>> +    req->qid = rqid;
>> +    req->commit_id = commit_id;
>> +    req->flags = 0;
>> +}
>> +
>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
>> +               __u32 cmd_op)
>> +{
>> +    sqe->opcode = IORING_OP_URING_CMD;
>> +
>> +    sqe->fd = q->fuse_fd;
>> +    sqe->rw_flags = 0;
>> +    sqe->ioprio = 0;
>> +    sqe->off = 0;
>> +
>> +    sqe->cmd_op = cmd_op;
>> +    sqe->__pad1 = 0;
>> +}
>> +
>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
>> +{
>> +    FuseRingEnt *ent = opaque;
>> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
>> +
>> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
>> +
>> +    sqe->addr = (uint64_t)(ent->iov);
>> +    sqe->len = 2;
>> +
>> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
>> +}
>> +
>> +static void fuse_uring_submit_register(void *opaque)
>> +{
>> +    FuseRingEnt *ent = opaque;
>> +    FuseExport *exp = ent->rq->q->exp;
> 
> This variable is unused in this commit? Does this commit compile for
> you? Usually the compiler warns about unused variables.
> 

The first version was a large single patch. I split it with git, and 
this variable is now used in a different patch.

>> +
>> +
>> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
>> +}
>> +
>> +/**
>> + * Distribute ring queues across FUSE queues using round-robin algorithm.
>> + * This ensures even distribution of kernel ring queues across user-specified
>> + * FUSE queues.
>> + */
>> +static
>> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
>> +                                                    size_t ring_queue_depth,
>> +                                                    size_t bufsize)
>> +{
>> +    int num_ring_queues = get_nprocs();
> 
> The kernel code uses num_possible_cpus() in
> fs/fuse/dev_uring.c:fuse_uring_create() so I think this should be
> get_nprocs_conf() instead of get_nprocs().
> 
>> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
>> +
>> +    if (!manager) {
> 
> g_new() never returns NULL, so you can remove this if statement. If
> memory cannot be allocated then the process will abort.
> 
>> +        return NULL;
>> +    }
>> +
>> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
>> +    manager->num_ring_queues = num_ring_queues;
>> +    manager->num_fuse_queues = num_fuse_queues;
>> +
>> +    if (!manager->ring_queues) {
> 
> Same here.
> 
>> +        g_free(manager);
>> +        return NULL;
>> +    }
>> +
>> +    for (int i = 0; i < num_ring_queues; i++) {
>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>> +        rq->rqid = i;
>> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
>> +
>> +        if (!rq->ent) {
> 
> Same here.
> 
>> +            for (int j = 0; j < i; j++) {
>> +                g_free(manager->ring_queues[j].ent);
>> +            }
>> +            g_free(manager->ring_queues);
>> +            g_free(manager);
>> +            return NULL;
>> +        }
>> +
>> +        for (size_t j = 0; j < ring_queue_depth; j++) {
>> +            FuseRingEnt *ent = &rq->ent[j];
>> +            ent->rq = rq;
>> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
>> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
>> +
>> +            if (!ent->op_payload) {
> 
> Same here.
> 
>> +                for (size_t k = 0; k < j; k++) {
>> +                    g_free(rq->ent[k].op_payload);
>> +                }
>> +                g_free(rq->ent);
>> +                for (int k = 0; k < i; k++) {
>> +                    g_free(manager->ring_queues[k].ent);
>> +                }
>> +                g_free(manager->ring_queues);
>> +                g_free(manager);
> 
> Where are these structures freed in the normal lifecycle of a FUSE
> export? I only see this error handling code, but nothing is freed when
> the export is shut down.


Same here. The first version was a large single patch. I split it with 
git, and the cleanup is done in a different patch.

> 
>> +                return NULL;
>> +            }
>> +
>> +            ent->iov[0] = (struct iovec) {
>> +                &(ent->req_header),
>> +                sizeof(struct fuse_uring_req_header)
>> +            };
>> +            ent->iov[1] = (struct iovec) {
>> +                ent->op_payload,
>> +                ent->req_payload_sz
>> +            };
>> +
>> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
>> +        }
>> +    }
>> +
>> +    return manager;
>> +}
>> +
>> +static
>> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
>> +{
>> +    int queue_index = 0;
>> +
>> +    for (int i = 0; i < manager->num_ring_queues; i++) {
>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>> +
>> +        rq->q = &exp->queues[queue_index];
>> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
>> +
>> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
>> +    }
>> +}
>> +
>> +static
>> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
>> +                                            FuseRingQueueManager *manager)
>> +{
>> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
>> +        FuseQueue *q = &exp->queues[i];
>> +        FuseRingQueue *rq;
>> +
>> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
>> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
>> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
>> +                                        &(rq->ent[j]));
>> +            }
>> +        }
>> +    }
>> +}
>> +
>> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
>> +{
>> +    /*
>> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
>> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
>> +     * the kernel by default. Also, max_write should not exceed
>> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
>> +     */
>> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
>> +
>> +    if (!(out->flags & FUSE_MAX_PAGES)) {
>> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
>> +                         + FUSE_BUFFER_HEADER_SIZE;
>> +    }
>> +
>> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
>> +        exp->num_queues, exp->ring_queue_depth, bufsize);
>> +
>> +    if (!exp->ring_queue_manager) {
>> +        error_report("Failed to create ring queue manager");
>> +        return;
>> +    }
>> +
>> +    /* Distribute ring queues across FUSE queues using round-robin */
>> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
>> +
>> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
>> +}
>> +#endif
>> +
>>   static int fuse_export_create(BlockExport *blk_exp,
>>                                 BlockExportOptions *blk_exp_args,
>>                                 AioContext *const *multithread,
>> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>>   
>>       assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    exp->is_uring = args->io_uring;
>> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
>> +#endif
>> +
>>       if (multithread) {
>>           /* Guaranteed by common export code */
>>           assert(mt_count >= 1);
>> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>                   .exp = exp,
>>                   .ctx = multithread[i],
>>                   .fuse_fd = -1,
>> +#ifdef CONFIG_LINUX_IO_URING
>> +                .ring_queue_list =
>> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
>> +#endif
>>               };
>>           }
>>       } else {
>> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>               .exp = exp,
>>               .ctx = exp->common.ctx,
>>               .fuse_fd = -1,
>> +#ifdef CONFIG_LINUX_IO_URING
>> +            .ring_queue_list =
>> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
>> +#endif
>>           };
>>       }
>>   
>> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>>    */
>>   static ssize_t coroutine_fn
>>   fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>> -             uint32_t max_readahead, uint32_t flags)
>> +             uint32_t max_readahead, const struct fuse_init_in *in)
>>   {
>> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
>> +                                     | FUSE_INIT_EXT;
>> +    uint64_t outargflags = 0;
>> +    uint64_t inargflags = in->flags;
>> +
>> +    ssize_t ret = 0;
>> +
>> +    if (inargflags & FUSE_INIT_EXT) {
>> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
>> +    }
>> +
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    if (exp->is_uring) {
>> +        if (inargflags & FUSE_OVER_IO_URING) {
>> +            supported_flags |= FUSE_OVER_IO_URING;
>> +        } else {
>> +            exp->is_uring = false;
>> +            ret = -ENODEV;
>> +        }
>> +    }
>> +#endif
>> +
>> +    outargflags = inargflags & supported_flags;
>>   
>>       *out = (struct fuse_init_out) {
>>           .major = FUSE_KERNEL_VERSION,
>>           .minor = FUSE_KERNEL_MINOR_VERSION,
>>           .max_readahead = max_readahead,
>>           .max_write = FUSE_MAX_WRITE_BYTES,
>> -        .flags = flags & supported_flags,
>> -        .flags2 = 0,
>> +        .flags = outargflags,
>> +        .flags2 = outargflags >> 32,
>>   
>>           /* libfuse maximum: 2^16 - 1 */
>>           .max_background = UINT16_MAX,
>> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>           .map_alignment = 0,
>>       };
>>   
>> -    return sizeof(*out);
>> +    return ret < 0 ? ret : sizeof(*out);
>>   }
>>   
>>   /**
>> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>>           fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>>                                   out_data_buffer, ret);
>>           qemu_vfree(out_data_buffer);
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    /* Handle FUSE-over-io_uring initialization */
>> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
>> +        struct fuse_init_out *out =
>> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
>> +        fuse_uring_start(exp, out);
> 
> Is there any scenario where FUSE_INIT can be received multiple times?
> Maybe if the FUSE file system is umounted and mounted again? I want to
> check that this doesn't leak previously allocated ring state.
> 

I don't think so. Even in a multi-threaded FUSE setup, the kernel only 
sends a single FUSE_INIT to userspace. In legacy mode, whichever thread 
receives that request can handle it and initialize FUSE-over-io_uring.
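
That said, if that assumption ever changes, a cheap guard at the top of 
fuse_uring_start() would avoid re-registering and leaking ring state (a 
sketch, not part of the patch):

    if (exp->ring_queue_manager) {
        /* FUSE_INIT already handled; keep the existing ring state */
        return;
    }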




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
  2025-09-03  9:49   ` Stefan Hajnoczi
@ 2025-09-03 18:11     ` Brian Song
  2025-09-16 12:18       ` Kevin Wolf
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-03 18:11 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/3/25 5:49 AM, Stefan Hajnoczi wrote:
> On Sat, Aug 30, 2025 at 08:00:00AM -0400, Brian Song wrote:
>> We used fio to test a 1 GB file under both traditional FUSE and
>> FUSE-over-io_uring modes. The experiments were conducted with the
>> following iodepth and numjobs configurations: 1-1, 64-1, 1-4, and 64-4,
>> with 70% read and 30% write, resulting in a total of eight test cases,
>> measuring both latency and throughput.
>>
>> Test results:
>>
>> https://gist.github.com/hibriansong/a4849903387b297516603e83b53bbde4
> 
> Hanna: You benchmarked the FUSE export coroutine implementation a little
> while ago. What do you think about these results with
> FUSE-over-io_uring?
> 
> What stands out to me is that iodepth=1 numjobs=4 already saturates the
> system, so increasing iodepth to 64 does not improve the results much.
> 
> Brian: What is the qemu-storage-daemon command-line for the benchmark
> and what are the details of /mnt/tmp/ (e.g. a preallocated 10 GB file
> with an XFS file system mounted from the FUSE image)?

QMP script:
https://gist.github.com/hibriansong/399f9564a385cfb94db58669e63611f8

Or:
### NORMAL
./qemu/build/storage-daemon/qemu-storage-daemon \
   --object iothread,id=iothread1 \
   --object iothread,id=iothread2 \
   --object iothread,id=iothread3 \
   --object iothread,id=iothread4 \
   --blockdev node-name=prot-node,driver=file,filename=ubuntu.qcow2 \
   --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
   --export type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on,iothread.0=iothread1,iothread.1=iothread2,iothread.2=iothread3,iothread.3=iothread4

### URING
echo Y > /sys/module/fuse/parameters/enable_uring

./qemu/build/storage-daemon/qemu-storage-daemon \
   --object iothread,id=iothread1 \
   --object iothread,id=iothread2 \
   --object iothread,id=iothread3 \
   --object iothread,id=iothread4 \
   --blockdev node-name=prot-node,driver=file,filename=ubuntu.qcow2 \
   --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
   --export type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on,io-uring=on,iothread.0=iothread1,iothread.1=iothread2,iothread.2=iothread3,iothread.3=iothread4

ubuntu.qcow2 has been preallocated and enlarged to 100 GB with:

$ qemu-img resize ubuntu.qcow2 100G
$ virt-customize \
    --run-command '/bin/bash /bin/growpart /dev/sda 1' \
    --run-command 'resize2fs /dev/sda1' -a ubuntu.qcow2

The image file, formatted with an Ext4 filesystem, was mounted at 
/mnt/tmp on my PC, which is equipped with a Kingston PCIe 4.0 NVMe SSD:

$ sudo kpartx -av mount-point
$ sudo mount /dev/mapper/loop31p1 /mnt/tmp/


Unmount the partition when done using it.

$ sudo umount /mnt/tmp
$ sudo kpartx -dv mount-point

> 
> Thanks,
> Stefan
> 
>>
>>
>>
>>
>> On 8/29/25 10:50 PM, Brian Song wrote:
>>> Hi all,
>>>
>>> This is a GSoC project. More details are available here:
>>> https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
>>>
>>> This patch series includes:
>>> - Add a round-robin mechanism to distribute the kernel-required Ring
>>> Queues to FUSE Queues
>>> - Support multiple in-flight requests (multiple ring entries)
>>> - Add tests for FUSE-over-io_uring
>>>
>>> More detail in the v2 cover letter:
>>> https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html
>>>
>>> And in the v1 cover letter:
>>> https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
>>>
>>>
>>> Brian Song (4):
>>>     export/fuse: add opt to enable FUSE-over-io_uring
>>>     export/fuse: process FUSE-over-io_uring requests
>>>     export/fuse: Safe termination for FUSE-uring
>>>     iotests: add tests for FUSE-over-io_uring
>>>
>>>    block/export/fuse.c                  | 838 +++++++++++++++++++++------
>>>    docs/tools/qemu-storage-daemon.rst   |  11 +-
>>>    qapi/block-export.json               |   5 +-
>>>    storage-daemon/qemu-storage-daemon.c |   1 +
>>>    tests/qemu-iotests/check             |   2 +
>>>    tests/qemu-iotests/common.rc         |  45 +-
>>>    util/fdmon-io_uring.c                |   5 +-
>>>    7 files changed, 717 insertions(+), 190 deletions(-)
>>>
>>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
  2025-08-30 12:00 ` [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
  2025-09-03  9:49   ` Stefan Hajnoczi
@ 2025-09-04 19:32   ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-04 19:32 UTC (permalink / raw)
  To: Brian Song
  Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf,
	eperezma

[-- Attachment #1: Type: text/plain, Size: 2006 bytes --]

On Sat, Aug 30, 2025 at 08:00:00AM -0400, Brian Song wrote:
> We used fio to test a 1 GB file under both traditional FUSE and
> FUSE-over-io_uring modes. The experiments were conducted with the
> following iodepth and numjobs configurations: 1-1, 64-1, 1-4, and 64-4,
> with 70% read and 30% write, resulting in a total of eight test cases,
> measuring both latency and throughput.
> 
> Test results:
> 
> https://gist.github.com/hibriansong/a4849903387b297516603e83b53bbde4

CCing Eugenio, who is looking at optimizing FUSE server performance
using virtiofs with VDUSE.

> 
> 
> 
> 
> On 8/29/25 10:50 PM, Brian Song wrote:
> > Hi all,
> >
> > This is a GSoC project. More details are available here:
> > https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
> >
> > This patch series includes:
> > - Add a round-robin mechanism to distribute the kernel-required Ring
> > Queues to FUSE Queues
> > - Support multiple in-flight requests (multiple ring entries)
> > - Add tests for FUSE-over-io_uring
> >
> > More detail in the v2 cover letter:
> > https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html
> >
> > And in the v1 cover letter:
> > https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
> >
> >
> > Brian Song (4):
> >    export/fuse: add opt to enable FUSE-over-io_uring
> >    export/fuse: process FUSE-over-io_uring requests
> >    export/fuse: Safe termination for FUSE-uring
> >    iotests: add tests for FUSE-over-io_uring
> >
> >   block/export/fuse.c                  | 838 +++++++++++++++++++++------
> >   docs/tools/qemu-storage-daemon.rst   |  11 +-
> >   qapi/block-export.json               |   5 +-
> >   storage-daemon/qemu-storage-daemon.c |   1 +
> >   tests/qemu-iotests/check             |   2 +
> >   tests/qemu-iotests/common.rc         |  45 +-
> >   util/fdmon-io_uring.c                |   5 +-
> >   7 files changed, 717 insertions(+), 190 deletions(-)
> >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-09-03 11:51   ` Stefan Hajnoczi
@ 2025-09-08 19:09     ` Brian Song
  2025-09-08 19:45       ` Bernd Schubert
  2025-09-09 15:26       ` Stefan Hajnoczi
  0 siblings, 2 replies; 38+ messages in thread
From: Brian Song @ 2025-09-08 19:09 UTC (permalink / raw)
  To: Stefan Hajnoczi, Bernd Schubert
  Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/3/25 7:51 AM, Stefan Hajnoczi wrote:
> On Fri, Aug 29, 2025 at 10:50:23PM -0400, Brian Song wrote:
>> https://docs.kernel.org/filesystems/fuse-io-uring.html
>>
>> As described in the kernel documentation, after FUSE-over-io_uring
>> initialization and handshake, FUSE interacts with the kernel using
>> SQE/CQE to send requests and receive responses. This corresponds to
>> the "Sending requests with CQEs" section in the docs.
>>
>> This patch implements three key parts: registering the CQE handler
>> (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
>> process_request), and sending response results (fuse_uring_send_
>> response). It also merges the traditional /dev/fuse request handling
>> with the FUSE-over-io_uring handling functions.
>>
>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>> ---
>>   block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
>>   1 file changed, 309 insertions(+), 148 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 19bf9e5f74..07f74fc8ec 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>   };
>>   
>>   #ifdef CONFIG_LINUX_IO_URING
>> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
>> +
>> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> 
> This function appears to handle exactly one cqe. A singular function
> name would be clearer than a plural: co_fuse_uring_queue_handle_cqe().
> 
>> +{
>> +    FuseRingEnt *ent = opaque;
>> +    FuseExport *exp = ent->rq->q->exp;
>> +
>> +    /* Going to process requests */
>> +    fuse_inc_in_flight(exp);
> 
> What is the rationale for taking a reference here? Normally something
> already holds a reference (e.g. the request itself) and it will be
> dropped somewhere inside a function we're about to call, but we still
> need to access exp afterwards, so we temporarily take a reference.
> Please document the specifics in a comment.
> 
> I think blk_exp_ref()/blk_exp_unref() are appropriate instead of
> fuse_inc_in_flight()/fuse_dec_in_flight() since we only need to hold
> onto the export and don't care about drain behavior.
> 

Stefan:

When handling FUSE requests, we don’t want the FuseExport to be 
accidentally deleted. Therefore, we use fuse_inc_in_flight in the CQE 
handler to increment the in_flight counter, and when a request is 
completed, we call fuse_dec_in_flight to decrement it. Once the last 
request has been processed, fuse_dec_in_flight brings the in_flight 
counter down to 0, indicating that the export can safely be deleted. The 
usage of in_flight follows the same logic as in traditional FUSE request 
handling.

Since submitted SQEs for FUSE cannot be canceled, once we register or 
commit them we must wait for the kernel to return a CQE. Otherwise, the 
kernel may deliver a CQE and invoke its handler after the export has 
already been deleted. For this reason, we directly call blk_exp_ref and 
blk_exp_unref when submitting an SQE and when receiving its CQE, to 
explicitly control the export reference and prevent accidental deletion.

The doc/comment for co_fuse_uring_queue_handle_cqe:

Protect FuseExport from premature deletion while handling FUSE requests. 
CQE handlers inc/dec the in_flight counter; when it reaches 0, the 
export can be freed. This follows the same logic as traditional FUSE.

Since FUSE SQEs cannot be canceled, a CQE may arrive after commit even 
if the export is deleted. To prevent this, we ref/unref the export 
explicitly at SQE submission and CQE completion.
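
Concretely, the pairing would look something like this sketch (reusing 
this series' aio_add_sqe() API and QEMU's blk_exp_ref()/blk_exp_unref(); 
not the exact patch code):

static void fuse_uring_submit_register(void *opaque)
{
    FuseRingEnt *ent = opaque;
    FuseExport *exp = ent->rq->q->exp;

    /* Pin the export: the kernel may complete this SQE at any time */
    blk_exp_ref(&exp->common);
    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &ent->fuse_cqe_handler);
}

static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
{
    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt,
                                    fuse_cqe_handler);
    FuseExport *exp = ent->rq->q->exp;

    /* ... process the CQE ... */

    /* The kernel is done with this entry; drop the submission reference */
    blk_exp_unref(&exp->common);
}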

>> +
>> +    /* A ring entry returned */
>> +    fuse_uring_co_process_request(ent);
>> +
>> +    /* Finished processing requests */
>> +    fuse_dec_in_flight(exp);
>> +}
>> +
>> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
>> +{
>> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
>> +    Coroutine *co;
>> +    FuseExport *exp = ent->rq->q->exp;
>> +
>> +    if (unlikely(exp->halted)) {
>> +        return;
>> +    }
>> +
>> +    int err = cqe_handler->cqe.res;
>> +
>> +    if (err != 0) {
>> +        /* -ENOTCONN is ok on umount  */
>> +        if (err != -EINTR && err != -EAGAIN &&
>> +            err != -ENOTCONN) {
>> +            fuse_export_halt(exp);
>> +        }
> 
> How are EINTR and EAGAIN handled if they are silently ignored? When did
> you encounter these error codes?

Bernd:

I have the same question about this. As for how the kernel returns 
errors, I haven’t studied each case yet. In libfuse it’s implemented the 
same way. Could you briefly explain why we choose to ignore these two 
errors, and under what circumstances we might encounter them?

Thanks,
Brian


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-09-08 19:09     ` Brian Song
@ 2025-09-08 19:45       ` Bernd Schubert
  2025-09-09  1:10         ` Brian Song
  2025-09-09 15:26       ` Stefan Hajnoczi
  1 sibling, 1 reply; 38+ messages in thread
From: Bernd Schubert @ 2025-09-08 19:45 UTC (permalink / raw)
  To: Brian Song, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, armbru, fam, hreitz, kwolf



On 9/8/25 21:09, Brian Song wrote:
> 
> 
> On 9/3/25 7:51 AM, Stefan Hajnoczi wrote:
>> On Fri, Aug 29, 2025 at 10:50:23PM -0400, Brian Song wrote:
>>> https://docs.kernel.org/filesystems/fuse-io-uring.html
>>>
>>> As described in the kernel documentation, after FUSE-over-io_uring
>>> initialization and handshake, FUSE interacts with the kernel using
>>> SQE/CQE to send requests and receive responses. This corresponds to
>>> the "Sending requests with CQEs" section in the docs.
>>>
>>> This patch implements three key parts: registering the CQE handler
>>> (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
>>> process_request), and sending response results (fuse_uring_send_
>>> response). It also merges the traditional /dev/fuse request handling
>>> with the FUSE-over-io_uring handling functions.
>>>
>>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>>> ---
>>>   block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
>>>   1 file changed, 309 insertions(+), 148 deletions(-)
>>>
>>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>>> index 19bf9e5f74..07f74fc8ec 100644
>>> --- a/block/export/fuse.c
>>> +++ b/block/export/fuse.c
>>> @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>>   };
>>>   
>>>   #ifdef CONFIG_LINUX_IO_URING
>>> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
>>> +
>>> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
>>
>> This function appears to handle exactly one cqe. A singular function
>> name would be clearer than a plural: co_fuse_uring_queue_handle_cqe().
>>
>>> +{
>>> +    FuseRingEnt *ent = opaque;
>>> +    FuseExport *exp = ent->rq->q->exp;
>>> +
>>> +    /* Going to process requests */
>>> +    fuse_inc_in_flight(exp);
>>
>> What is the rationale for taking a reference here? Normally something
>> already holds a reference (e.g. the request itself) and it will be
>> dropped somewhere inside a function we're about to call, but we still
>> need to access exp afterwards, so we temporarily take a reference.
>> Please document the specifics in a comment.
>>
>> I think blk_exp_ref()/blk_exp_unref() are appropriate instead of
>> fuse_inc_in_flight()/fuse_dec_in_flight() since we only need to hold
>> onto the export and don't care about drain behavior.
>>
> 
> Stefan:
> 
> When handling FUSE requests, we don’t want the FuseExport to be 
> accidentally deleted. Therefore, we use fuse_inc_in_flight in the CQE 
> handler to increment the in_flight counter, and when a request is 
> completed, we call fuse_dec_in_flight to decrement it. Once the last 
> request has been processed, fuse_dec_in_flight brings the in_flight 
> counter down to 0, indicating that the export can safely be deleted. The 
> usage of in_flight follows the same logic as in traditional FUSE request 
> handling.
> 
> Since submitted SQEs for FUSE cannot be canceled, once we register or 
> commit them we must wait for the kernel to return a CQE. Otherwise, the 
> kernel may deliver a CQE and invoke its handler after the export has 
> already been deleted. For this reason, we directly call blk_exp_ref and 
> blk_exp_unref when submitting an SQE and when receiving its CQE, to 
> explicitly control the export reference and prevent accidental deletion.
> 
> The doc/comment for co_fuse_uring_queue_handle_cqe:
> 
> Protect FuseExport from premature deletion while handling FUSE requests. 
> CQE handlers inc/dec the in_flight counter; when it reaches 0, the 
> export can be freed. This follows the same logic as traditional FUSE.
> 
> Since FUSE SQEs cannot be canceled, a CQE may arrive after commit even 
> if the export is deleted. To prevent this, we ref/unref the export 
> explicitly at SQE submission and CQE completion.
> 
>>> +
>>> +    /* A ring entry returned */
>>> +    fuse_uring_co_process_request(ent);
>>> +
>>> +    /* Finished processing requests */
>>> +    fuse_dec_in_flight(exp);
>>> +}
>>> +
>>> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
>>> +{
>>> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
>>> +    Coroutine *co;
>>> +    FuseExport *exp = ent->rq->q->exp;
>>> +
>>> +    if (unlikely(exp->halted)) {
>>> +        return;
>>> +    }
>>> +
>>> +    int err = cqe_handler->cqe.res;
>>> +
>>> +    if (err != 0) {
>>> +        /* -ENOTCONN is ok on umount  */
>>> +        if (err != -EINTR && err != -EAGAIN &&
>>> +            err != -ENOTCONN) {
>>> +            fuse_export_halt(exp);
>>> +        }
>>
>> How are EINTR and EAGAIN handled if they are silently ignored? When did
>> you encounter these error codes?
> 
> Bernd:
> 
> I have the same question about this. As for how the kernel returns 
> errors, I haven’t studied each case yet. In libfuse it’s implemented the 
> same way, could you briefly explain why we choose to ignore these two 
> errors, and under what circumstances we might encounter them?


I think I remember why I had added these. Initially the ring threads
didn't inherit the signal handlers that libfuse worker threads have. I
fixed that later, and these error conditions are a leftover.
The idea in libfuse is that the main thread gets all signals and then
sets se->exited; worker threads, including ring threads, are not supposed
to get or handle signals at all, but have to monitor se->exited.

Good catch Stefan, I think I can remove these conditions in libfuse.


Thanks,
Bernd



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-09-08 19:45       ` Bernd Schubert
@ 2025-09-09  1:10         ` Brian Song
  0 siblings, 0 replies; 38+ messages in thread
From: Brian Song @ 2025-09-09  1:10 UTC (permalink / raw)
  To: Bernd Schubert, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, armbru, fam, hreitz, kwolf



On 9/8/25 3:45 PM, Bernd Schubert wrote:
> 
> 
> On 9/8/25 21:09, Brian Song wrote:
>>
>>
>> On 9/3/25 7:51 AM, Stefan Hajnoczi wrote:
>>> On Fri, Aug 29, 2025 at 10:50:23PM -0400, Brian Song wrote:
>>>> https://docs.kernel.org/filesystems/fuse-io-uring.html
>>>>
>>>> As described in the kernel documentation, after FUSE-over-io_uring
>>>> initialization and handshake, FUSE interacts with the kernel using
>>>> SQE/CQE to send requests and receive responses. This corresponds to
>>>> the "Sending requests with CQEs" section in the docs.
>>>>
>>>> This patch implements three key parts: registering the CQE handler
>>>> (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
>>>> process_request), and sending response results (fuse_uring_send_
>>>> response). It also merges the traditional /dev/fuse request handling
>>>> with the FUSE-over-io_uring handling functions.
>>>>
>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>>>> ---
>>>>    block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
>>>>    1 file changed, 309 insertions(+), 148 deletions(-)
>>>>
>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>> index 19bf9e5f74..07f74fc8ec 100644
>>>> --- a/block/export/fuse.c
>>>> +++ b/block/export/fuse.c
>>>> @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>>>    };
>>>>    
>>>>    #ifdef CONFIG_LINUX_IO_URING
>>>> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
>>>> +
>>>> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
>>>
>>> This function appears to handle exactly one cqe. A singular function
>>> name would be clearer than a plural: co_fuse_uring_queue_handle_cqe().
>>>
>>>> +{
>>>> +    FuseRingEnt *ent = opaque;
>>>> +    FuseExport *exp = ent->rq->q->exp;
>>>> +
>>>> +    /* Going to process requests */
>>>> +    fuse_inc_in_flight(exp);
>>>
>>> What is the rationale for taking a reference here? Normally something
>>> already holds a reference (e.g. the request itself) and it will be
>>> dropped somewhere inside a function we're about to call, but we still
>>> need to access exp afterwards, so we temporarily take a reference.
>>> Please document the specifics in a comment.
>>>
>>> I think blk_exp_ref()/blk_exp_unref() are appropriate instead of
>>> fuse_inc_in_flight()/fuse_dec_in_flight() since we only need to hold
>>> onto the export and don't care about drain behavior.
>>>
>>
>> Stefan:
>>
>> When handling FUSE requests, we don’t want the FuseExport to be
>> accidentally deleted. Therefore, we use fuse_inc_in_flight in the CQE
>> handler to increment the in_flight counter, and when a request is
>> completed, we call fuse_dec_in_flight to decrement it. Once the last
>> request has been processed, fuse_dec_in_flight brings the in_flight
>> counter down to 0, indicating that the export can safely be deleted. The
>> usage of in_flight follows the same logic as in traditional FUSE request
>> handling.
>>
>> Since submitted SQEs for FUSE cannot be canceled, once we register or
>> commit them we must wait for the kernel to return a CQE. Otherwise, the
>> kernel may deliver a CQE and invoke its handler after the export has
>> already been deleted. For this reason, we directly call blk_exp_ref and
>> blk_exp_unref when submitting an SQE and when receiving its CQE, to
>> explicitly control the export reference and prevent accidental deletion.
>>
>> The doc/comment for co_fuse_uring_queue_handle_cqe:
>>
>> Protect FuseExport from premature deletion while handling FUSE requests.
>> CQE handlers inc/dec the in_flight counter; when it reaches 0, the
>> export can be freed. This follows the same logic as traditional FUSE.
>>
>> Since FUSE SQEs cannot be canceled, a CQE may arrive after commit even
>> if the export is deleted. To prevent this, we ref/unref the export
>> explicitly at SQE submission and CQE completion.
>>
>>>> +
>>>> +    /* A ring entry returned */
>>>> +    fuse_uring_co_process_request(ent);
>>>> +
>>>> +    /* Finished processing requests */
>>>> +    fuse_dec_in_flight(exp);
>>>> +}
>>>> +
>>>> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
>>>> +{
>>>> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
>>>> +    Coroutine *co;
>>>> +    FuseExport *exp = ent->rq->q->exp;
>>>> +
>>>> +    if (unlikely(exp->halted)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    int err = cqe_handler->cqe.res;
>>>> +
>>>> +    if (err != 0) {
>>>> +        /* -ENOTCONN is ok on umount  */
>>>> +        if (err != -EINTR && err != -EAGAIN &&
>>>> +            err != -ENOTCONN) {
>>>> +            fuse_export_halt(exp);
>>>> +        }
>>>
>>> How are EINTR and EAGAIN handled if they are silently ignored? When did
>>> you encounter these error codes?
>>
>> Bernd:
>>
>> I have the same question about this. As for how the kernel returns
>> errors, I haven’t studied each case yet. In libfuse it’s implemented the
>> same way, could you briefly explain why we choose to ignore these two
>> errors, and under what circumstances we might encounter them?
> 
> 
> I think I remember why I had added these. Initially the ring threads
> didn't inherit the signal handlers libfuse worker threads have. I had
> fixed that later and these error conditions are a left over.
> In libfuse idea is that the main thread gets all signals and then sets
> se->exited - worker thread, include ring threads are not supposed to get
> or handle signals at all, but have to monitor se->exited.
> 
> Good catch Stefan, I think I can remove these conditions in libfuse.
> 
> 
> Thanks,
> Bernd
> 

In libfuse:

static int fuse_uring_queue_handle_cqes(struct fuse_ring_queue *queue)
{
	struct fuse_ring_pool *ring_pool = queue->ring_pool;
	struct fuse_session *se = ring_pool->se;
	size_t num_completed = 0;
	struct io_uring_cqe *cqe;
	unsigned int head;
	int ret = 0;

	io_uring_for_each_cqe(&queue->ring, head, cqe) {
		int err = 0;

		num_completed++;

		err = cqe->res;
		if (err != 0) {
			if (err > 0 && ((uintptr_t)io_uring_cqe_get_data(cqe) ==
					(unsigned int)queue->eventfd)) {
				/* teardown from eventfd */
				return -ENOTCONN;
			}

			// XXX: Needs rate limited logs, otherwise log spam
			//fuse_log(FUSE_LOG_ERR, "cqe res: %d\n", cqe->res);

			/* -ENOTCONN is ok on umount  */
			if (err != -EINTR &&
			    err != -EAGAIN && err != -ENOTCONN) {
				se->error = cqe->res;

				/* return first error */
				if (ret == 0)
					ret = err;
			}

		} else {
			fuse_uring_handle_cqe(queue, cqe);
		}
	}

	if (num_completed)
		io_uring_cq_advance(&queue->ring, num_completed);

	return ret == 0 ? 0 : num_completed;
}

If err > 0 && ((uintptr_t)io_uring_cqe_get_data(cqe) == (unsigned 
int)queue->eventfd), it returns -ENOTCONN so that the caller sets 
se->exited = 1. But under what circumstances is err > 0, and when is it 
negative? The current code also doesn't seem to handle the case where 
err is negative?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-03 18:00     ` Brian Song
@ 2025-09-09 14:48       ` Stefan Hajnoczi
  2025-09-09 17:46         ` Brian Song
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-09 14:48 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 23896 bytes --]

On Wed, Sep 03, 2025 at 02:00:55PM -0400, Brian Song wrote:
> 
> 
> On 9/3/25 6:53 AM, Stefan Hajnoczi wrote:
> > On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
> > > This patch adds a new export option for storage-export-daemon to enable
> > > FUSE-over-io_uring via the switch io-uring=on|off (disableby default).
> > > It also implements the protocol handshake with the Linux kernel
> > > during the FUSE-over-io_uring initialization phase.
> > > 
> > > See: https://docs.kernel.org/filesystems/fuse-io-uring.html
> > > 
> > > The kernel documentation describes in detail how FUSE-over-io_uring
> > > works. This patch implements the Initial SQE stage shown in thediagram:
> > > it initializes one queue per IOThread, each currently supporting a
> > > single submission queue entry (SQE). When the FUSE driver sends the
> > > first FUSE request (FUSE_INIT), storage-export-daemon calls
> > > fuse_uring_start() to complete initialization, ultimately submitting
> > > the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
> > > successful initialization with the kernel.
> > > 
> > > We also added support for multiple IOThreads. The current Linux kernel
> > > requires registering $(nproc) queues when setting up FUSE-over-io_uring
> > > To let users customize the number of FUSE Queues (i.e., IOThreads),
> > > we first create nproc Ring Queues as required by the kernel, then
> > > distribute them in a round-robin manner to the FUSE Queues for
> > > registration. In addition, to support multiple in-flight requests,
> > > we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
> > > entries/requests.
> > 
> > The previous paragraph says "each currently supporting a single
> > submission queue entry (SQE)" whereas this paragraph says "we configure
> > each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
> > Maybe this paragraph was squashed into the commit description in a later
> > step and the previous paragraph can be updated to reflect that multiple
> > SQEs are submitted?
> > 
> > > 
> > > Suggested-by: Kevin Wolf <kwolf@redhat.com>
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Brian Song <hibriansong@gmail.com>
> > > ---
> > >   block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
> > >   docs/tools/qemu-storage-daemon.rst   |  11 +-
> > >   qapi/block-export.json               |   5 +-
> > >   storage-daemon/qemu-storage-daemon.c |   1 +
> > >   util/fdmon-io_uring.c                |   5 +-
> > >   5 files changed, 309 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > > index c0ad4696ce..19bf9e5f74 100644
> > > --- a/block/export/fuse.c
> > > +++ b/block/export/fuse.c
> > > @@ -48,6 +48,9 @@
> > >   #include <linux/fs.h>
> > >   #endif
> > > +/* room needed in buffer to accommodate header */
> > > +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> > 
> > Is it possible to write this in a way that shows how the constant is
> > calculated? That way the constant would automatically adjust on systems
> > where the underlying assumptions have changed (e.g. page size, header
> > struct size). This approach is also self-documenting so it's possible to
> > understand where the magic number comes from.
> > 
> > For example:
> > 
> >    #define FUSE_BUFFER_HEADER_SIZE DIV_ROUND_UP(sizeof(struct fuse_uring_req_header), qemu_real_host_page_size())
> > 
> > (I'm guessing what the formula you used is, so this example may be
> > incorrect...)
> > 
> 
> In libfuse, the way to calculate the bufsize (for req_payload) is the same
> as in this patch. For different requests, the request header sizes are not
> the same, but they should never exceed a certain value. So is that why
> libfuse has this kind of magic number?

From <linux/fuse.h>:

  #define FUSE_URING_IN_OUT_HEADER_SZ 128
  #define FUSE_URING_OP_IN_OUT_SZ 128
  ...
  struct fuse_uring_req_header {
          /* struct fuse_in_header / struct fuse_out_header */
          char in_out[FUSE_URING_IN_OUT_HEADER_SZ];

          /* per op code header */
          char op_in[FUSE_URING_OP_IN_OUT_SZ];

          struct fuse_uring_ent_in_out ring_ent_in_out;
  };

The size of struct fuse_uring_req_header is 128 + 128 + (4 * 8) = 288
bytes. A single 4 KB page easily fits this. I guess that's why 0x1000
was chosen in libfuse.
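
If that relationship is worth enforcing rather than assuming, a 
build-time assertion would also document it (a sketch):

    QEMU_BUILD_BUG_ON(sizeof(struct fuse_uring_req_header) >
                      FUSE_BUFFER_HEADER_SIZE);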

> 
> > > +
> > >   /* Prevent overly long bounce buffer allocations */
> > >   #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> > >   /*
> > > @@ -63,12 +66,59 @@
> > >       (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
> > >   typedef struct FuseExport FuseExport;
> > > +typedef struct FuseQueue FuseQueue;
> > > +
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
> > > +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> > > +
> > > +typedef struct FuseRingQueue FuseRingQueue;
> > > +typedef struct FuseRingEnt {
> > > +    /* back pointer */
> > > +    FuseRingQueue *rq;
> > > +
> > > +    /* commit id of a fuse request */
> > > +    uint64_t req_commit_id;
> > 
> > This field is not used in this commit. Please introduce it in the commit
> > that uses it so it's easier to review and understand the purpose of this
> > field.
> > 
> > > +
> > > +    /* fuse request header and payload */
> > > +    struct fuse_uring_req_header req_header;
> > > +    void *op_payload;
> > > +    size_t req_payload_sz;
> > 
> > op_payload and req_payload_sz refer to the same buffer, and they are
> > submitted alongside req_header. It would be nice to name the fields
> > consistently:
> > 
> >    struct fuse_uring_req_header req_header;
> >    void *req_payload;
> >    size_t req_payload_sz;
> > 
> > req_payload and req_payload_sz could be eliminated since they are also
> > stored in iov[1].iov_base and .iov_len. If you feel that would be harder
> > to understand, then it's okay to keep the duplicate fields.
> > 
> 
> Makes sense. I followed the design in libfuse. Probably best to just leave
> them in the struct for readability
> 
> > > +
> > > +    /* The vector passed to the kernel */
> > > +    struct iovec iov[2];
> > > +
> > > +    CqeHandler fuse_cqe_handler;
> > > +} FuseRingEnt;
> > > +
> > > +struct FuseRingQueue {
> > 
> > A comment would be nice here to explain that the kernel requires one
> > FuseRingQueue per host CPU and this concept is independent of /dev/fuse
> > (FuseQueue).
> > 
> > > +    int rqid;
> > > +
> > > +    /* back pointer */
> > > +    FuseQueue *q;
> > > +    FuseRingEnt *ent;
> > > +
> > > +    /* List entry for ring_queues */
> > > +    QLIST_ENTRY(FuseRingQueue) next;
> > > +};
> > > +
> > > +/*
> > > + * Round-robin distribution of ring queues across FUSE queues.
> > > + * This structure manages the mapping between kernel ring queues and user
> > > + * FUSE queues.
> > > + */
> > > +typedef struct FuseRingQueueManager {
> > > +    FuseRingQueue *ring_queues;
> > > +    int num_ring_queues;
> > > +    int num_fuse_queues;
> > > +} FuseRingQueueManager;
> > > +#endif
> > 
> > It's easy to forget which #ifdef we're inside after a few lines, so it
> > helps to indicate that in a comment:
> > 
> > #endif /* CONFIG_LINUX_IO_URING */
> > 
> > >   /*
> > >    * One FUSE "queue", representing one FUSE FD from which requests are fetched
> > >    * and processed.  Each queue is tied to an AioContext.
> > >    */
> > > -typedef struct FuseQueue {
> > > +struct FuseQueue {
> > >       FuseExport *exp;
> > >       AioContext *ctx;
> > > @@ -109,15 +159,11 @@ typedef struct FuseQueue {
> > >        * Free this buffer with qemu_vfree().
> > >        */
> > >       void *spillover_buf;
> > > -} FuseQueue;
> > > -/*
> > > - * Verify that FuseQueue.request_buf plus the spill-over buffer together
> > > - * are big enough to be accepted by the FUSE kernel driver.
> > > - */
> > > -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> > > -                  FUSE_SPILLOVER_BUF_SIZE <
> > > -                  FUSE_MIN_READ_BUFFER);
> > 
> > Why was this removed, it's probably still necessary in the non-io_uring
> > case (which is compiled in even when CONFIG_LINUX_IO_URING is defined)?
> > 
> 
> You can check Hanna’s patch. In fuse_co_process_request, Hanna introduced
> this check when using FUSE_OUT_OP_STRUCT to cast void *buf into the
> corresponding in/out header for the given operation.
> 
> But in the v2 patch, we merged the legacy process_request and the uring
> version into one. This caused the legacy path to pass the array into the
> common function as a pointer. Now, when we do the buf header size check,
> what gets checked is just the pointer size.
> 
> #define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
>     ({ \
>         struct fuse_out_header *__out_hdr = \
>             (struct fuse_out_header *)(out_buf); \
>         struct fuse_##op_name##_out *__out = \
>             (struct fuse_##op_name##_out *)(__out_hdr + 1); \
>         \
>         QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
>                           sizeof(out_buf)); \
>         \
>         __out; \
>     })

Your patch does not change how ->request_buf is used by the non-io_uring
code path. ->request_buf needs to fit at least FUSE_MIN_READ_BUFFER
bytes so I think this QEMU_BUILD_BUG_ON() should not be deleted.
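
Since the file-scope form evaluates sizeof on the struct member rather 
than on a decayed function parameter, the pre-patch check can simply be 
restored as-is:

    QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
                      FUSE_SPILLOVER_BUF_SIZE <
                      FUSE_MIN_READ_BUFFER);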

> 
> 
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
> > > +#endif
> > > +};
> > >   struct FuseExport {
> > >       BlockExport common;
> > > @@ -133,7 +179,7 @@ struct FuseExport {
> > >        */
> > >       bool halted;
> > > -    int num_queues;
> > > +    size_t num_queues;
> > >       FuseQueue *queues;
> > >       /*
> > >        * True if this export should follow the generic export's AioContext.
> > > @@ -149,6 +195,12 @@ struct FuseExport {
> > >       /* Whether allow_other was used as a mount option or not */
> > >       bool allow_other;
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    bool is_uring;
> > > +    size_t ring_queue_depth;
> > > +    FuseRingQueueManager *ring_queue_manager;
> > > +#endif
> > > +
> > >       mode_t st_mode;
> > >       uid_t st_uid;
> > >       gid_t st_gid;
> > > @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
> > >           return;
> > >       }
> > > -    for (int i = 0; i < exp->num_queues; i++) {
> > > +    for (size_t i = 0; i < exp->num_queues; i++) {
> > >           aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> > >                              read_from_fuse_fd, NULL, NULL, NULL,
> > >                              &exp->queues[i]);
> > > @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> > >       .drained_poll  = fuse_export_drained_poll,
> > >   };
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> > > +                    const unsigned int rqid,
> > > +                    const unsigned int commit_id)
> > > +{
> > > +    req->qid = rqid;
> > > +    req->commit_id = commit_id;
> > > +    req->flags = 0;
> > > +}
> > > +
> > > +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
> > > +               __u32 cmd_op)
> > > +{
> > > +    sqe->opcode = IORING_OP_URING_CMD;
> > > +
> > > +    sqe->fd = q->fuse_fd;
> > > +    sqe->rw_flags = 0;
> > > +    sqe->ioprio = 0;
> > > +    sqe->off = 0;
> > > +
> > > +    sqe->cmd_op = cmd_op;
> > > +    sqe->__pad1 = 0;
> > > +}
> > > +
> > > +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> > > +{
> > > +    FuseRingEnt *ent = opaque;
> > > +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> > > +
> > > +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
> > > +
> > > +    sqe->addr = (uint64_t)(ent->iov);
> > > +    sqe->len = 2;
> > > +
> > > +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
> > > +}
> > > +
> > > +static void fuse_uring_submit_register(void *opaque)
> > > +{
> > > +    FuseRingEnt *ent = opaque;
> > > +    FuseExport *exp = ent->rq->q->exp;
> > 
> > This variable is unused in this commit? Does this commit compile for
> > you? Usually the compiler warns about unused variables.
> > 
> 
> The first version was a large single patch. I split it with git, and this
> variable is now used in a different patch
> 
> > > +
> > > +
> > > +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
> > > +}
> > > +
> > > +/**
> > > + * Distribute ring queues across FUSE queues using round-robin algorithm.
> > > + * This ensures even distribution of kernel ring queues across user-specified
> > > + * FUSE queues.
> > > + */
> > > +static
> > > +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
> > > +                                                    size_t ring_queue_depth,
> > > +                                                    size_t bufsize)
> > > +{
> > > +    int num_ring_queues = get_nprocs();
> > 
> > The kernel code uses num_possible_cpus() in
> > fs/fuse/dev_uring.c:fuse_uring_create() so I think this should be
> > get_nprocs_conf() instead of get_nprocs().
> > 
> > > +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
> > > +
> > > +    if (!manager) {
> > 
> > g_new() never returns NULL, so you can remove this if statement. If
> > memory cannot be allocated then the process will abort.
> > 
> > > +        return NULL;
> > > +    }
> > > +
> > > +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
> > > +    manager->num_ring_queues = num_ring_queues;
> > > +    manager->num_fuse_queues = num_fuse_queues;
> > > +
> > > +    if (!manager->ring_queues) {
> > 
> > Same here.
> > 
> > > +        g_free(manager);
> > > +        return NULL;
> > > +    }
> > > +
> > > +    for (int i = 0; i < num_ring_queues; i++) {
> > > +        FuseRingQueue *rq = &manager->ring_queues[i];
> > > +        rq->rqid = i;
> > > +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
> > > +
> > > +        if (!rq->ent) {
> > 
> > Same here.
> > 
> > > +            for (int j = 0; j < i; j++) {
> > > +                g_free(manager->ring_queues[j].ent);
> > > +            }
> > > +            g_free(manager->ring_queues);
> > > +            g_free(manager);
> > > +            return NULL;
> > > +        }
> > > +
> > > +        for (size_t j = 0; j < ring_queue_depth; j++) {
> > > +            FuseRingEnt *ent = &rq->ent[j];
> > > +            ent->rq = rq;
> > > +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
> > > +            ent->op_payload = g_malloc0(ent->req_payload_sz);
> > > +
> > > +            if (!ent->op_payload) {
> > 
> > Same here.
> > 
> > > +                for (size_t k = 0; k < j; k++) {
> > > +                    g_free(rq->ent[k].op_payload);
> > > +                }
> > > +                g_free(rq->ent);
> > > +                for (int k = 0; k < i; k++) {
> > > +                    g_free(manager->ring_queues[k].ent);
> > > +                }
> > > +                g_free(manager->ring_queues);
> > > +                g_free(manager);
> > 
> > Where are these structures freed in the normal lifecycle of a FUSE
> > export? I only see this error handling code, but nothing is freed when
> > the export is shut down.
> 
> 
> Same here. The first version was a large single patch. I split it with git,
> and we do cleanup in a different patch.

It's easier for reviewers and safer for backports if each patch is
self-contained with the cleanup code included in the same patch where
the resource is created. If you make changes to the patch organization
in the next revision then it would be nice to include the cleanup in
this patch.

> 
> > 
> > > +                return NULL;
> > > +            }
> > > +
> > > +            ent->iov[0] = (struct iovec) {
> > > +                &(ent->req_header),
> > > +                sizeof(struct fuse_uring_req_header)
> > > +            };
> > > +            ent->iov[1] = (struct iovec) {
> > > +                ent->op_payload,
> > > +                ent->req_payload_sz
> > > +            };
> > > +
> > > +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
> > > +        }
> > > +    }
> > > +
> > > +    return manager;
> > > +}
> > > +
> > > +static
> > > +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
> > > +{
> > > +    int queue_index = 0;
> > > +
> > > +    for (int i = 0; i < manager->num_ring_queues; i++) {
> > > +        FuseRingQueue *rq = &manager->ring_queues[i];
> > > +
> > > +        rq->q = &exp->queues[queue_index];
> > > +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
> > > +
> > > +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
> > > +    }
> > > +}
> > > +
> > > +static
> > > +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
> > > +                                            FuseRingQueueManager *manager)
> > > +{
> > > +    for (int i = 0; i < manager->num_fuse_queues; i++) {
> > > +        FuseQueue *q = &exp->queues[i];
> > > +        FuseRingQueue *rq;
> > > +
> > > +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
> > > +            for (int j = 0; j < exp->ring_queue_depth; j++) {
> > > +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
> > > +                                        &(rq->ent[j]));
> > > +            }
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> > > +{
> > > +    /*
> > > +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> > > +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> > > +     * the kernel by default. Also, max_write should not exceed
> > > +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> > > +     */
> > > +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> > > +
> > > +    if (!(out->flags & FUSE_MAX_PAGES)) {
> > > +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> > > +                         + FUSE_BUFFER_HEADER_SIZE;
> > > +    }
> > > +
> > > +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
> > > +        exp->num_queues, exp->ring_queue_depth, bufsize);
> > > +
> > > +    if (!exp->ring_queue_manager) {
> > > +        error_report("Failed to create ring queue manager");
> > > +        return;
> > > +    }
> > > +
> > > +    /* Distribute ring queues across FUSE queues using round-robin */
> > > +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
> > > +
> > > +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
> > > +}
> > > +#endif
> > > +
> > >   static int fuse_export_create(BlockExport *blk_exp,
> > >                                 BlockExportOptions *blk_exp_args,
> > >                                 AioContext *const *multithread,
> > > @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
> > >       assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    exp->is_uring = args->io_uring;
> > > +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
> > > +#endif
> > > +
> > >       if (multithread) {
> > >           /* Guaranteed by common export code */
> > >           assert(mt_count >= 1);
> > > @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
> > >                   .exp = exp,
> > >                   .ctx = multithread[i],
> > >                   .fuse_fd = -1,
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +                .ring_queue_list =
> > > +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
> > > +#endif
> > >               };
> > >           }
> > >       } else {
> > > @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
> > >               .exp = exp,
> > >               .ctx = exp->common.ctx,
> > >               .fuse_fd = -1,
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +            .ring_queue_list =
> > > +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
> > > +#endif
> > >           };
> > >       }
> > > @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
> > >    */
> > >   static ssize_t coroutine_fn
> > >   fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> > > -             uint32_t max_readahead, uint32_t flags)
> > > +             uint32_t max_readahead, const struct fuse_init_in *in)
> > >   {
> > > -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> > > +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
> > > +                                     | FUSE_INIT_EXT;
> > > +    uint64_t outargflags = 0;
> > > +    uint64_t inargflags = in->flags;
> > > +
> > > +    ssize_t ret = 0;
> > > +
> > > +    if (inargflags & FUSE_INIT_EXT) {
> > > +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
> > > +    }
> > > +
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    if (exp->is_uring) {
> > > +        if (inargflags & FUSE_OVER_IO_URING) {
> > > +            supported_flags |= FUSE_OVER_IO_URING;
> > > +        } else {
> > > +            exp->is_uring = false;
> > > +            ret = -ENODEV;
> > > +        }
> > > +    }
> > > +#endif
> > > +
> > > +    outargflags = inargflags & supported_flags;
> > >       *out = (struct fuse_init_out) {
> > >           .major = FUSE_KERNEL_VERSION,
> > >           .minor = FUSE_KERNEL_MINOR_VERSION,
> > >           .max_readahead = max_readahead,
> > >           .max_write = FUSE_MAX_WRITE_BYTES,
> > > -        .flags = flags & supported_flags,
> > > -        .flags2 = 0,
> > > +        .flags = outargflags,
> > > +        .flags2 = outargflags >> 32,
> > >           /* libfuse maximum: 2^16 - 1 */
> > >           .max_background = UINT16_MAX,
> > > @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> > >           .map_alignment = 0,
> > >       };
> > > -    return sizeof(*out);
> > > +    return ret < 0 ? ret : sizeof(*out);
> > >   }
> > >   /**
> > > @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> > >           fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
> > >                                   out_data_buffer, ret);
> > >           qemu_vfree(out_data_buffer);
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    /* Handle FUSE-over-io_uring initialization */
> > > +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
> > > +        struct fuse_init_out *out =
> > > +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
> > > +        fuse_uring_start(exp, out);
> > 
> > Is there any scenario where FUSE_INIT can be received multiple times?
> > Maybe if the FUSE file system is umounted and mounted again? I want to
> > check that this doesn't leak previously allocated ring state.
> > 
> 
> I don't think so; even in a multi-threaded FUSE setup, the kernel only sends
> a single FUSE_INIT to userspace. In legacy mode, whichever thread
> receives that request can handle it and initialize FUSE-over-io_uring.

Okay. Please add an assertion to fuse_uring_start() to catch the case
where it is called twice.
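
Something along these lines at the top of fuse_uring_start() would do (a
sketch; it relies on exp->ring_queue_manager only ever being set by this
function):

    /*
     * FUSE_INIT is only delivered once per session, so the ring state
     * must not already exist.
     */
    assert(!exp->ring_queue_manager);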

Thanks,
Stefan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-09-08 19:09     ` Brian Song
  2025-09-08 19:45       ` Bernd Schubert
@ 2025-09-09 15:26       ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-09 15:26 UTC (permalink / raw)
  To: Brian Song
  Cc: Bernd Schubert, qemu-block, qemu-devel, armbru, fam, hreitz,
	kwolf


On Mon, Sep 08, 2025 at 03:09:57PM -0400, Brian Song wrote:
> 
> 
> On 9/3/25 7:51 AM, Stefan Hajnoczi wrote:
> > On Fri, Aug 29, 2025 at 10:50:23PM -0400, Brian Song wrote:
> > > https://docs.kernel.org/filesystems/fuse-io-uring.html
> > > 
> > > As described in the kernel documentation, after FUSE-over-io_uring
> > > initialization and handshake, FUSE interacts with the kernel using
> > > SQE/CQE to send requests and receive responses. This corresponds to
> > > the "Sending requests with CQEs" section in the docs.
> > > 
> > > This patch implements three key parts: registering the CQE handler
> > > (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
> > > process_request), and sending response results (fuse_uring_send_
> > > response). It also merges the traditional /dev/fuse request handling
> > > with the FUSE-over-io_uring handling functions.
> > > 
> > > Suggested-by: Kevin Wolf <kwolf@redhat.com>
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Brian Song <hibriansong@gmail.com>
> > > ---
> > >   block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
> > >   1 file changed, 309 insertions(+), 148 deletions(-)
> > > 
> > > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > > index 19bf9e5f74..07f74fc8ec 100644
> > > --- a/block/export/fuse.c
> > > +++ b/block/export/fuse.c
> > > @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> > >   };
> > >   #ifdef CONFIG_LINUX_IO_URING
> > > +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> > > +
> > > +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> > 
> > This function appears to handle exactly one cqe. A singular function
> > name would be clearer than a plural: co_fuse_uring_queue_handle_cqe().
> > 
> > > +{
> > > +    FuseRingEnt *ent = opaque;
> > > +    FuseExport *exp = ent->rq->q->exp;
> > > +
> > > +    /* Going to process requests */
> > > +    fuse_inc_in_flight(exp);
> > 
> > What is the rationale for taking a reference here? Normally something
> > already holds a reference (e.g. the request itself) and it will be
> > dropped somewhere inside a function we're about to call, but we still
> > need to access exp afterwards, so we temporarily take a reference.
> > Please document the specifics in a comment.
> > 
> > I think blk_exp_ref()/blk_exp_unref() are appropriate instead of
> > fuse_inc_in_flight()/fuse_dec_in_flight() since we only need to hold
> > onto the export and don't care about drain behavior.
> > 
> 
> Stefan:
> 
> When handling FUSE requests, we don’t want the FuseExport to be accidentally
> deleted. Therefore, we use fuse_inc_in_flight in the CQE handler to
> increment the in_flight counter, and when a request is completed, we call
> fuse_dec_in_flight to decrement it. Once the last request has been
> processed, fuse_dec_in_flight brings the in_flight counter down to 0,
> indicating that the export can safely be deleted. The usage of in_flight
> follows the same logic as in traditional FUSE request handling.
> 
> Since submitted SQEs for FUSE cannot be canceled, once we register or commit
> them we must wait for the kernel to return a CQE. Otherwise, the kernel may
> deliver a CQE and invoke its handler after the export has already been
> deleted. For this reason, we directly call blk_exp_ref and blk_exp_unref
> when submitting an SQE and when receiving its CQE, to explicitly control the
> export reference and prevent accidental deletion.
> 
> The doc/comment for co_fuse_uring_queue_handle_cqe:
> 
> Protect FuseExport from premature deletion while handling FUSE requests. CQE
> handlers inc/dec the in_flight counter; when it reaches 0, the export can be
> freed. This follows the same logic as traditional FUSE.
> 
> Since FUSE SQEs cannot be canceled, a CQE may arrive after commit even if
> the export is deleted. To prevent this, we ref/unref the export explicitly
> at SQE submission and CQE completion.

I looked at your "final" branch on GitHub and the refcount changes there
match what I was thinking of.

In case it helps for writing comments, I'll try to describe my mental
model of the refcounts:

- fuse_inc_in_flight()/fuse_dec_in_flight() must wrap the lifecycle of
  FUSE requests that the server is processing. This ensures that the
  block layer's drain operation waits for requests to complete and that
  the export cannot be deleted while the requests are still in progress.

- blk_exp_ref()/blk_exp_unref() prevents the export from being deleted
  while something that still depends on it remains outstanding.

How this maps to FUSE-over-io_uring:

- When an SQE is submitted blk_exp_ref() must be called. After the CQE
  has been processed, blk_exp_unref() must be called. This way the
  export cannot be deleted before all CQEs have been handled.

- The coroutine that processes a FUSE request must call
  fuse_inc_in_flight() before processing begins and fuse_dec_in_flight()
  after processing ends.
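
A rough sketch of how this pairing could map onto the functions in this
series (using names from the patches; not the final implementation):

    static void fuse_uring_submit_register(void *opaque)
    {
        FuseRingEnt *ent = opaque;
        FuseExport *exp = ent->rq->q->exp;

        /* Keep the export alive until the CQE for this SQE is handled */
        blk_exp_ref(&exp->common);

        aio_add_sqe(fuse_uring_prep_sqe_register, ent, &ent->fuse_cqe_handler);
    }

    static void coroutine_fn co_fuse_uring_queue_handle_cqe(void *opaque)
    {
        FuseRingEnt *ent = opaque;
        FuseExport *exp = ent->rq->q->exp;

        /* Make drain wait for this request and block export deletion */
        fuse_inc_in_flight(exp);
        fuse_uring_co_process_request(ent);
        fuse_dec_in_flight(exp);

        /* Matches the blk_exp_ref() taken when the SQE was submitted */
        blk_exp_unref(&exp->common);
    }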

Thanks,
Stefan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-09 14:48       ` Stefan Hajnoczi
@ 2025-09-09 17:46         ` Brian Song
  2025-09-09 18:05           ` Bernd Schubert
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-09 17:46 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/9/25 10:48 AM, Stefan Hajnoczi wrote:
> On Wed, Sep 03, 2025 at 02:00:55PM -0400, Brian Song wrote:
>>
>>
>> On 9/3/25 6:53 AM, Stefan Hajnoczi wrote:
>>> On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
>>>> This patch adds a new export option for storage-export-daemon to enable
>>>>> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
>>>> It also implements the protocol handshake with the Linux kernel
>>>> during the FUSE-over-io_uring initialization phase.
>>>>
>>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
>>>>
>>>> The kernel documentation describes in detail how FUSE-over-io_uring
>>>>> works. This patch implements the Initial SQE stage shown in the diagram:
>>>> it initializes one queue per IOThread, each currently supporting a
>>>> single submission queue entry (SQE). When the FUSE driver sends the
>>>> first FUSE request (FUSE_INIT), storage-export-daemon calls
>>>> fuse_uring_start() to complete initialization, ultimately submitting
>>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
>>>> successful initialization with the kernel.
>>>>
>>>> We also added support for multiple IOThreads. The current Linux kernel
>>>>> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
>>>> To let users customize the number of FUSE Queues (i.e., IOThreads),
>>>> we first create nproc Ring Queues as required by the kernel, then
>>>> distribute them in a round-robin manner to the FUSE Queues for
>>>> registration. In addition, to support multiple in-flight requests,
>>>> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
>>>> entries/requests.
>>>
>>> The previous paragraph says "each currently supporting a single
>>> submission queue entry (SQE)" whereas this paragraph says "we configure
>>> each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
>>> Maybe this paragraph was squashed into the commit description in a later
>>> step and the previous paragraph can be updated to reflect that multiple
>>> SQEs are submitted?
>>>
>>>>
>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>>>> ---
>>>>    block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>>>>    docs/tools/qemu-storage-daemon.rst   |  11 +-
>>>>    qapi/block-export.json               |   5 +-
>>>>    storage-daemon/qemu-storage-daemon.c |   1 +
>>>>    util/fdmon-io_uring.c                |   5 +-
>>>>    5 files changed, 309 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>> index c0ad4696ce..19bf9e5f74 100644
>>>> --- a/block/export/fuse.c
>>>> +++ b/block/export/fuse.c
>>>> @@ -48,6 +48,9 @@
>>>>    #include <linux/fs.h>
>>>>    #endif
>>>> +/* room needed in buffer to accommodate header */
>>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
>>>
>>> Is it possible to write this in a way that shows how the constant is
>>> calculated? That way the constant would automatically adjust on systems
>>> where the underlying assumptions have changed (e.g. page size, header
>>> struct size). This approach is also self-documenting so it's possible to
>>> understand where the magic number comes from.
>>>
>>> For example:
>>>
>>>     #define FUSE_BUFFER_HEADER_SIZE DIV_ROUND_UP(sizeof(struct fuse_uring_req_header), qemu_real_host_page_size())
>>>
>>> (I'm guessing what the formula you used is, so this example may be
>>> incorrect...)
>>>
>>
>> In libfuse, the way to calculate the bufsize (for req_payload) is the same
>> as in this patch. For different requests, the request header sizes are not
>> the same, but they should never exceed a certain value. So is that why
>> libfuse has this kind of magic number?
> 
>  From <linux/fuse.h>:
> 
>    #define FUSE_URING_IN_OUT_HEADER_SZ 128
>    #define FUSE_URING_OP_IN_OUT_SZ 128
>    ...
>    struct fuse_uring_req_header {
>            /* struct fuse_in_header / struct fuse_out_header */
>            char in_out[FUSE_URING_IN_OUT_HEADER_SZ];
> 
>            /* per op code header */
>            char op_in[FUSE_URING_OP_IN_OUT_SZ];
> 
>            struct fuse_uring_ent_in_out ring_ent_in_out;
>    };
> 
> The size of struct fuse_uring_req_header is 128 + 128 + (4 * 8) = 288
> bytes. A single 4 KB page easily fits this. I guess that's why 0x1000
> was chosen in libfuse.
> 

Yes, the two iovecs in the ring entry: one refers to the general request 
header (fuse_uring_req_header) and the other refers to the payload. The 
variable bufsize represents the space for these two objects and is used 
to calculate the payload size in case max_write changes.

Alright, let me document the buffer usage. It's been a while since I 
started this, so I don’t fully remember how the buffer works here.
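
For reference, a more self-documenting definition could replace the
hardcoded 0x1000 (a sketch; ROUND_UP and qemu_real_host_page_size() are
existing QEMU helpers, and this assumes the header buffer should stay
page-aligned):

    /*
     * Room in the buffer for struct fuse_uring_req_header (288 bytes at
     * present), rounded up to a whole page as libfuse does.
     */
    #define FUSE_BUFFER_HEADER_SIZE \
        ROUND_UP(sizeof(struct fuse_uring_req_header), \
                 qemu_real_host_page_size())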

>>
>>>> +
>>>>    /* Prevent overly long bounce buffer allocations */
>>>>    #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>>>>    /*
>>>> @@ -63,12 +66,59 @@
>>>>        (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>>>>    typedef struct FuseExport FuseExport;
>>>> +typedef struct FuseQueue FuseQueue;
>>>> +
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
>>>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
>>>> +
>>>> +typedef struct FuseRingQueue FuseRingQueue;
>>>> +typedef struct FuseRingEnt {
>>>> +    /* back pointer */
>>>> +    FuseRingQueue *rq;
>>>> +
>>>> +    /* commit id of a fuse request */
>>>> +    uint64_t req_commit_id;
>>>
>>> This field is not used in this commit. Please introduce it in the commit
>>> that uses it so it's easier to review and understand the purpose of this
>>> field.
>>>
>>>> +
>>>> +    /* fuse request header and payload */
>>>> +    struct fuse_uring_req_header req_header;
>>>> +    void *op_payload;
>>>> +    size_t req_payload_sz;
>>>
>>> op_payload and req_payload_sz refer to the same buffer, and they are
>>> submitted alongside req_header. It would be nice to name the fields
>>> consistently:
>>>
>>>     struct fuse_uring_req_header req_header;
>>>     void *req_payload;
>>>     size_t req_payload_sz;
>>>
>>> req_payload and req_payload_sz could be eliminated since they are also
>>> stored in iov[1].iov_base and .iov_len. If you feel that would be harder
>>> to understand, then it's okay to keep the duplicate fields.
>>>
>>
>> Makes sense. I followed the design in libfuse. Probably best to just leave
>> them in the struct for readability.
>>
>>>> +
>>>> +    /* The vector passed to the kernel */
>>>> +    struct iovec iov[2];
>>>> +
>>>> +    CqeHandler fuse_cqe_handler;
>>>> +} FuseRingEnt;
>>>> +
>>>> +struct FuseRingQueue {
>>>
>>> A comment would be nice here to explain that the kernel requires one
>>> FuseRingQueue per host CPU and this concept is independent of /dev/fuse
>>> (FuseQueue).
>>>
>>>> +    int rqid;
>>>> +
>>>> +    /* back pointer */
>>>> +    FuseQueue *q;
>>>> +    FuseRingEnt *ent;
>>>> +
>>>> +    /* List entry for ring_queues */
>>>> +    QLIST_ENTRY(FuseRingQueue) next;
>>>> +};
>>>> +
>>>> +/*
>>>> + * Round-robin distribution of ring queues across FUSE queues.
>>>> + * This structure manages the mapping between kernel ring queues and user
>>>> + * FUSE queues.
>>>> + */
>>>> +typedef struct FuseRingQueueManager {
>>>> +    FuseRingQueue *ring_queues;
>>>> +    int num_ring_queues;
>>>> +    int num_fuse_queues;
>>>> +} FuseRingQueueManager;
>>>> +#endif
>>>
>>> It's easy to forget which #ifdef we're inside after a few lines, so it
>>> helps to indicate that in a comment:
>>>
>>> #endif /* CONFIG_LINUX_IO_URING */
>>>
>>>>    /*
>>>>     * One FUSE "queue", representing one FUSE FD from which requests are fetched
>>>>     * and processed.  Each queue is tied to an AioContext.
>>>>     */
>>>> -typedef struct FuseQueue {
>>>> +struct FuseQueue {
>>>>        FuseExport *exp;
>>>>        AioContext *ctx;
>>>> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>>>>         * Free this buffer with qemu_vfree().
>>>>         */
>>>>        void *spillover_buf;
>>>> -} FuseQueue;
>>>> -/*
>>>> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
>>>> - * are big enough to be accepted by the FUSE kernel driver.
>>>> - */
>>>> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
>>>> -                  FUSE_SPILLOVER_BUF_SIZE <
>>>> -                  FUSE_MIN_READ_BUFFER);
>>>
>>> Why was this removed? It's probably still necessary in the non-io_uring
>>> case (which is compiled in even when CONFIG_LINUX_IO_URING is defined)?
>>>
>>
>> You can check Hanna’s patch. In fuse_co_process_request, Hanna introduced
>> this check when using FUSE_OUT_OP_STRUCT to cast void *buf into the
>> corresponding in/out header for the given operation.
>>
>> But in the v2 patch, we merged the legacy process_request and the uring
>> version into one. This caused the legacy path to pass the array into the
>> common function as a pointer. Now, when we do the buf header size check,
>> what gets checked is just the pointer size.
>>
>> #define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
>>      ({ \
>>          struct fuse_out_header *__out_hdr = \
>>              (struct fuse_out_header *)(out_buf); \
>>          struct fuse_##op_name##_out *__out = \
>>              (struct fuse_##op_name##_out *)(__out_hdr + 1); \
>>          \
>>          QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
>>                            sizeof(out_buf)); \
>>          \
>>          __out; \
>>      })
> 
> Your patch does not change how ->request_buf is used by the non-io_uring
> code path. ->request_buf needs to fit at least FUSE_MIN_READ_BUFFER
> bytes so I think this QEMU_BUILD_BUG_ON() should not be deleted.
> 

Oh, I misread and thought you were mentioning the QEMU_BUILD_BUG_ON 
deleted in FUSE_IN/OUT_OP_STRUCT_LEGACY. Yes, I mistakenly deleted the 
static assertion for the read buffer and will put it back.
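
That is, restoring the assertion that was dropped (copied from the removed
hunk quoted above):

    /*
     * Verify that FuseQueue.request_buf plus the spill-over buffer together
     * are big enough to be accepted by the FUSE kernel driver.
     */
    QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
                      FUSE_SPILLOVER_BUF_SIZE <
                      FUSE_MIN_READ_BUFFER);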

>>
>>
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
>>>> +#endif
>>>> +};
>>>>    struct FuseExport {
>>>>        BlockExport common;
>>>> @@ -133,7 +179,7 @@ struct FuseExport {
>>>>         */
>>>>        bool halted;
>>>> -    int num_queues;
>>>> +    size_t num_queues;
>>>>        FuseQueue *queues;
>>>>        /*
>>>>         * True if this export should follow the generic export's AioContext.
>>>> @@ -149,6 +195,12 @@ struct FuseExport {
>>>>        /* Whether allow_other was used as a mount option or not */
>>>>        bool allow_other;
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +    bool is_uring;
>>>> +    size_t ring_queue_depth;
>>>> +    FuseRingQueueManager *ring_queue_manager;
>>>> +#endif
>>>> +
>>>>        mode_t st_mode;
>>>>        uid_t st_uid;
>>>>        gid_t st_gid;
>>>> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>>>>            return;
>>>>        }
>>>> -    for (int i = 0; i < exp->num_queues; i++) {
>>>> +    for (size_t i = 0; i < exp->num_queues; i++) {
>>>>            aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>>>>                               read_from_fuse_fd, NULL, NULL, NULL,
>>>>                               &exp->queues[i]);
>>>> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>>>        .drained_poll  = fuse_export_drained_poll,
>>>>    };
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
>>>> +                    const unsigned int rqid,
>>>> +                    const unsigned int commit_id)
>>>> +{
>>>> +    req->qid = rqid;
>>>> +    req->commit_id = commit_id;
>>>> +    req->flags = 0;
>>>> +}
>>>> +
>>>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
>>>> +               __u32 cmd_op)
>>>> +{
>>>> +    sqe->opcode = IORING_OP_URING_CMD;
>>>> +
>>>> +    sqe->fd = q->fuse_fd;
>>>> +    sqe->rw_flags = 0;
>>>> +    sqe->ioprio = 0;
>>>> +    sqe->off = 0;
>>>> +
>>>> +    sqe->cmd_op = cmd_op;
>>>> +    sqe->__pad1 = 0;
>>>> +}
>>>> +
>>>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
>>>> +{
>>>> +    FuseRingEnt *ent = opaque;
>>>> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
>>>> +
>>>> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
>>>> +
>>>> +    sqe->addr = (uint64_t)(ent->iov);
>>>> +    sqe->len = 2;
>>>> +
>>>> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
>>>> +}
>>>> +
>>>> +static void fuse_uring_submit_register(void *opaque)
>>>> +{
>>>> +    FuseRingEnt *ent = opaque;
>>>> +    FuseExport *exp = ent->rq->q->exp;
>>>
>>> This variable is unused in this commit? Does this commit compile for
>>> you? Usually the compiler warns about unused variables.
>>>
>>
>> The first version was a large single patch. I split it with git, and this
>> variable is now used in a different patch.
>>
>>>> +
>>>> +
>>>> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
>>>> +}
>>>> +
>>>> +/**
>>>> + * Distribute ring queues across FUSE queues using round-robin algorithm.
>>>> + * This ensures even distribution of kernel ring queues across user-specified
>>>> + * FUSE queues.
>>>> + */
>>>> +static
>>>> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
>>>> +                                                    size_t ring_queue_depth,
>>>> +                                                    size_t bufsize)
>>>> +{
>>>> +    int num_ring_queues = get_nprocs();
>>>
>>> The kernel code uses num_possible_cpus() in
>>> fs/fuse/dev_uring.c:fuse_uring_create() so I think this should be
>>> get_nprocs_conf() instead of get_nprocs().
>>>
>>>> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
>>>> +
>>>> +    if (!manager) {
>>>
>>> g_new() never returns NULL, so you can remove this if statement. If
>>> memory cannot be allocated then the process will abort.
>>>
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
>>>> +    manager->num_ring_queues = num_ring_queues;
>>>> +    manager->num_fuse_queues = num_fuse_queues;
>>>> +
>>>> +    if (!manager->ring_queues) {
>>>
>>> Same here.
>>>
>>>> +        g_free(manager);
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    for (int i = 0; i < num_ring_queues; i++) {
>>>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>>>> +        rq->rqid = i;
>>>> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
>>>> +
>>>> +        if (!rq->ent) {
>>>
>>> Same here.
>>>
>>>> +            for (int j = 0; j < i; j++) {
>>>> +                g_free(manager->ring_queues[j].ent);
>>>> +            }
>>>> +            g_free(manager->ring_queues);
>>>> +            g_free(manager);
>>>> +            return NULL;
>>>> +        }
>>>> +
>>>> +        for (size_t j = 0; j < ring_queue_depth; j++) {
>>>> +            FuseRingEnt *ent = &rq->ent[j];
>>>> +            ent->rq = rq;
>>>> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
>>>> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
>>>> +
>>>> +            if (!ent->op_payload) {
>>>
>>> Same here.
>>>
>>>> +                for (size_t k = 0; k < j; k++) {
>>>> +                    g_free(rq->ent[k].op_payload);
>>>> +                }
>>>> +                g_free(rq->ent);
>>>> +                for (int k = 0; k < i; k++) {
>>>> +                    g_free(manager->ring_queues[k].ent);
>>>> +                }
>>>> +                g_free(manager->ring_queues);
>>>> +                g_free(manager);
>>>
>>> Where are these structures freed in the normal lifecycle of a FUSE
>>> export? I only see this error handling code, but nothing is freed when
>>> the export is shut down.
>>
>>
>> Same here. The first version was a large single patch. I split it with git,
>> and we do cleanup in a different patch.
> 
> It's easier for reviewers and safer for backports if each patch is
> self-contained with the cleanup code included in the same patch where
> the resource is created. If you make changes to the patch organization
> in the next revision then it would be nice to include the cleanup in
> this patch.
> 
>>
>>>
>>>> +                return NULL;
>>>> +            }
>>>> +
>>>> +            ent->iov[0] = (struct iovec) {
>>>> +                &(ent->req_header),
>>>> +                sizeof(struct fuse_uring_req_header)
>>>> +            };
>>>> +            ent->iov[1] = (struct iovec) {
>>>> +                ent->op_payload,
>>>> +                ent->req_payload_sz
>>>> +            };
>>>> +
>>>> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return manager;
>>>> +}
>>>> +
>>>> +static
>>>> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
>>>> +{
>>>> +    int queue_index = 0;
>>>> +
>>>> +    for (int i = 0; i < manager->num_ring_queues; i++) {
>>>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>>>> +
>>>> +        rq->q = &exp->queues[queue_index];
>>>> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
>>>> +
>>>> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
>>>> +    }
>>>> +}
>>>> +
>>>> +static
>>>> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
>>>> +                                            FuseRingQueueManager *manager)
>>>> +{
>>>> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
>>>> +        FuseQueue *q = &exp->queues[i];
>>>> +        FuseRingQueue *rq;
>>>> +
>>>> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
>>>> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
>>>> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
>>>> +                                        &(rq->ent[j]));
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
>>>> +{
>>>> +    /*
>>>> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
>>>> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
>>>> +     * the kernel by default. Also, max_write should not exceed
>>>> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
>>>> +     */
>>>> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
>>>> +
>>>> +    if (!(out->flags & FUSE_MAX_PAGES)) {
>>>> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
>>>> +                         + FUSE_BUFFER_HEADER_SIZE;
>>>> +    }
>>>> +
>>>> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
>>>> +        exp->num_queues, exp->ring_queue_depth, bufsize);
>>>> +
>>>> +    if (!exp->ring_queue_manager) {
>>>> +        error_report("Failed to create ring queue manager");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Distribute ring queues across FUSE queues using round-robin */
>>>> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
>>>> +
>>>> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
>>>> +}
>>>> +#endif
>>>> +
>>>>    static int fuse_export_create(BlockExport *blk_exp,
>>>>                                  BlockExportOptions *blk_exp_args,
>>>>                                  AioContext *const *multithread,
>>>> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>>>>        assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +    exp->is_uring = args->io_uring;
>>>> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
>>>> +#endif
>>>> +
>>>>        if (multithread) {
>>>>            /* Guaranteed by common export code */
>>>>            assert(mt_count >= 1);
>>>> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>>>                    .exp = exp,
>>>>                    .ctx = multithread[i],
>>>>                    .fuse_fd = -1,
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +                .ring_queue_list =
>>>> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
>>>> +#endif
>>>>                };
>>>>            }
>>>>        } else {
>>>> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>>>                .exp = exp,
>>>>                .ctx = exp->common.ctx,
>>>>                .fuse_fd = -1,
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +            .ring_queue_list =
>>>> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
>>>> +#endif
>>>>            };
>>>>        }
>>>> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>>>>     */
>>>>    static ssize_t coroutine_fn
>>>>    fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>>> -             uint32_t max_readahead, uint32_t flags)
>>>> +             uint32_t max_readahead, const struct fuse_init_in *in)
>>>>    {
>>>> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>>>> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
>>>> +                                     | FUSE_INIT_EXT;
>>>> +    uint64_t outargflags = 0;
>>>> +    uint64_t inargflags = in->flags;
>>>> +
>>>> +    ssize_t ret = 0;
>>>> +
>>>> +    if (inargflags & FUSE_INIT_EXT) {
>>>> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
>>>> +    }
>>>> +
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +    if (exp->is_uring) {
>>>> +        if (inargflags & FUSE_OVER_IO_URING) {
>>>> +            supported_flags |= FUSE_OVER_IO_URING;
>>>> +        } else {
>>>> +            exp->is_uring = false;
>>>> +            ret = -ENODEV;
>>>> +        }
>>>> +    }
>>>> +#endif
>>>> +
>>>> +    outargflags = inargflags & supported_flags;
>>>>        *out = (struct fuse_init_out) {
>>>>            .major = FUSE_KERNEL_VERSION,
>>>>            .minor = FUSE_KERNEL_MINOR_VERSION,
>>>>            .max_readahead = max_readahead,
>>>>            .max_write = FUSE_MAX_WRITE_BYTES,
>>>> -        .flags = flags & supported_flags,
>>>> -        .flags2 = 0,
>>>> +        .flags = outargflags,
>>>> +        .flags2 = outargflags >> 32,
>>>>            /* libfuse maximum: 2^16 - 1 */
>>>>            .max_background = UINT16_MAX,
>>>> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>>>            .map_alignment = 0,
>>>>        };
>>>> -    return sizeof(*out);
>>>> +    return ret < 0 ? ret : sizeof(*out);
>>>>    }
>>>>    /**
>>>> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>>>>            fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>>>>                                    out_data_buffer, ret);
>>>>            qemu_vfree(out_data_buffer);
>>>> +#ifdef CONFIG_LINUX_IO_URING
>>>> +    /* Handle FUSE-over-io_uring initialization */
>>>> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
>>>> +        struct fuse_init_out *out =
>>>> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
>>>> +        fuse_uring_start(exp, out);
>>>
>>> Is there any scenario where FUSE_INIT can be received multiple times?
>>> Maybe if the FUSE file system is umounted and mounted again? I want to
>>> check that this doesn't leak previously allocated ring state.
>>>
>>
>> I don't think so; even in a multi-threaded FUSE setup, the kernel only sends
>> a single FUSE_INIT to userspace. In legacy mode, whichever thread
>> receives that request can handle it and initialize FUSE-over-io_uring.
> 
> Okay. Please add an assertion to fuse_uring_start() to catch the case
> where it is called twice.
> 
> Thanks,
> Stefan



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-09 17:46         ` Brian Song
@ 2025-09-09 18:05           ` Bernd Schubert
  0 siblings, 0 replies; 38+ messages in thread
From: Bernd Schubert @ 2025-09-09 18:05 UTC (permalink / raw)
  To: Brian Song, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, armbru, fam, hreitz, kwolf



On 9/9/25 19:46, Brian Song wrote:
> 
> 
> On 9/9/25 10:48 AM, Stefan Hajnoczi wrote:
>> On Wed, Sep 03, 2025 at 02:00:55PM -0400, Brian Song wrote:
>>>
>>>
>>> On 9/3/25 6:53 AM, Stefan Hajnoczi wrote:
>>>> On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
>>>>> This patch adds a new export option for storage-export-daemon to enable
>>>>> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
>>>>> It also implements the protocol handshake with the Linux kernel
>>>>> during the FUSE-over-io_uring initialization phase.
>>>>>
>>>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
>>>>>
>>>>> The kernel documentation describes in detail how FUSE-over-io_uring
>>>>> works. This patch implements the Initial SQE stage shown in the diagram:
>>>>> it initializes one queue per IOThread, each currently supporting a
>>>>> single submission queue entry (SQE). When the FUSE driver sends the
>>>>> first FUSE request (FUSE_INIT), storage-export-daemon calls
>>>>> fuse_uring_start() to complete initialization, ultimately submitting
>>>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
>>>>> successful initialization with the kernel.
>>>>>
>>>>> We also added support for multiple IOThreads. The current Linux kernel
>>>>> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
>>>>> To let users customize the number of FUSE Queues (i.e., IOThreads),
>>>>> we first create nproc Ring Queues as required by the kernel, then
>>>>> distribute them in a round-robin manner to the FUSE Queues for
>>>>> registration. In addition, to support multiple in-flight requests,
>>>>> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
>>>>> entries/requests.
>>>>
>>>> The previous paragraph says "each currently supporting a single
>>>> submission queue entry (SQE)" whereas this paragraph says "we configure
>>>> each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
>>>> Maybe this paragraph was squashed into the commit description in a later
>>>> step and the previous paragraph can be updated to reflect that multiple
>>>> SQEs are submitted?
>>>>
>>>>>
>>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>>>>> ---
>>>>>    block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>>>>>    docs/tools/qemu-storage-daemon.rst   |  11 +-
>>>>>    qapi/block-export.json               |   5 +-
>>>>>    storage-daemon/qemu-storage-daemon.c |   1 +
>>>>>    util/fdmon-io_uring.c                |   5 +-
>>>>>    5 files changed, 309 insertions(+), 23 deletions(-)
>>>>>
>>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>>> index c0ad4696ce..19bf9e5f74 100644
>>>>> --- a/block/export/fuse.c
>>>>> +++ b/block/export/fuse.c
>>>>> @@ -48,6 +48,9 @@
>>>>>    #include <linux/fs.h>
>>>>>    #endif
>>>>> +/* room needed in buffer to accommodate header */
>>>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
>>>>
>>>> Is it possible to write this in a way that shows how the constant is
>>>> calculated? That way the constant would automatically adjust on systems
>>>> where the underlying assumptions have changed (e.g. page size, header
>>>> struct size). This approach is also self-documenting so it's possible to
>>>> understand where the magic number comes from.
>>>>
>>>> For example:
>>>>
>>>>     #define FUSE_BUFFER_HEADER_SIZE DIV_ROUND_UP(sizeof(struct fuse_uring_req_header), qemu_real_host_page_size())
>>>>
>>>> (I'm guessing what the formula you used is, so this example may be
>>>> incorrect...)
>>>>
>>>
>>> In libfuse, the way to calculate the bufsize (for req_payload) is the same
>>> as in this patch. For different requests, the request header sizes are not
>>> the same, but they should never exceed a certain value. So is that why
>>> libfuse has this kind of magic number?
>>
>>  From <linux/fuse.h>:
>>
>>    #define FUSE_URING_IN_OUT_HEADER_SZ 128
>>    #define FUSE_URING_OP_IN_OUT_SZ 128
>>    ...
>>    struct fuse_uring_req_header {
>>            /* struct fuse_in_header / struct fuse_out_header */
>>            char in_out[FUSE_URING_IN_OUT_HEADER_SZ];
>>
>>            /* per op code header */
>>            char op_in[FUSE_URING_OP_IN_OUT_SZ];
>>
>>            struct fuse_uring_ent_in_out ring_ent_in_out;
>>    };
>>
>> The size of struct fuse_uring_req_header is 128 + 128 + (4 * 8) = 288
>> bytes. A single 4 KB page easily fits this. I guess that's why 0x1000
>> was chosen in libfuse.
>>
> 
> Yes, the two iovecs in the ring entry: one refers to the general request 
> header (fuse_uring_req_header) and the other refers to the payload. The 
> variable bufsize represents the space for these two objects and is used 
> to calculate the payload size in case max_write changes.
> 
> Alright, let me document the buffer usage. It's been a while since I 
> started this, so I don’t fully remember how the buffer works here.

For the current kernel code we could make this a 288-byte allocation for the
header. That just does not work with page pinning, which we are using at DDN
(kernel patches not upstreamed yet).

Maybe I should make the header allocation method depend on whether page
pinning is used; there is a bit of overhead with 4K headers, although 4K
doesn't sound too bad, even with many queues.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-08-30  2:50 ` [PATCH 3/4] export/fuse: Safe termination for FUSE-uring Brian Song
@ 2025-09-09 19:33   ` Stefan Hajnoczi
  2025-09-09 20:51     ` Brian Song
  2025-09-15  5:43     ` Brian Song
  0 siblings, 2 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-09 19:33 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf


On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
>           */
>          g_hash_table_remove(exports, exp->mountpoint);
>      }
> -}
> -
> -static void fuse_export_delete(BlockExport *blk_exp)
> -{
> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
>  
> -    for (int i = 0; i < exp->num_queues; i++) {
> +    for (size_t i = 0; i < exp->num_queues; i++) {
>          FuseQueue *q = &exp->queues[i];
>  
>          /* Queue 0's FD belongs to the FUSE session */
>          if (i > 0 && q->fuse_fd >= 0) {
>              close(q->fuse_fd);

This changes the behavior of the non-io_uring code. Now all fuse fds and
fuse_session are closed while requests are potentially still being
processed.

There is a race condition: if an IOThread is processing a request here
then it may invoke a system call on q->fuse_fd just after it has been
closed but not set to -1. If another thread has also opened a new file
then the fd could be reused, resulting in an accidental write(2) to the
new file. I'm not sure whether there is a way to trigger this in
practice, but it looks like a problem waiting to happen.

Simply setting q->fuse_fd to -1 here doesn't fix the race. It would be
necessary to stop processing fuse_fd in the thread before closing it
here or to schedule a BH in each thread so that fuse_fd can be closed
in the thread that uses the fd.
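
The BH variant could take roughly this shape (a sketch;
fuse_queue_close_fd_bh is a hypothetical helper scheduled in each queue's
AioContext):

    static void fuse_queue_close_fd_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /*
         * Runs in q->ctx after the fd handler has been removed, so no
         * request processing in this thread can still use the fd.
         */
        if (q->fuse_fd >= 0) {
            close(q->fuse_fd);
            q->fuse_fd = -1;
        }
    }

    /* In fuse_export_shutdown(), for queues other than queue 0: */
    aio_bh_schedule_oneshot(q->ctx, fuse_queue_close_fd_bh, q);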


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-08-30  2:50 ` [PATCH 4/4] iotests: add tests for FUSE-over-io_uring Brian Song
@ 2025-09-09 19:38   ` Stefan Hajnoczi
  2025-09-09 20:51     ` Brian Song
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-09 19:38 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf


On Fri, Aug 29, 2025 at 10:50:25PM -0400, Brian Song wrote:
> To test FUSE-over-io_uring, set the environment variable
> FUSE_OVER_IO_URING=1. This applies only when using the
> 'fuse' protocol.
> 
> $ FUSE_OVER_IO_URING=1 ./check -fuse
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>
> ---
>  tests/qemu-iotests/check     |  2 ++
>  tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
>  2 files changed, 36 insertions(+), 11 deletions(-)
> 
> diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
> index 545f9ec7bd..c6fa0f9e3d 100755
> --- a/tests/qemu-iotests/check
> +++ b/tests/qemu-iotests/check
> @@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
>          mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
>                          const=fmt, help=f'test {fmt}')
>  
> +    # To test FUSE-over-io_uring, set the environment variable
> +    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
>      protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
>      g_prt = p.add_argument_group(
>          '  image protocol options',
> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> index e977cb4eb6..f8b79c3810 100644
> --- a/tests/qemu-iotests/common.rc
> +++ b/tests/qemu-iotests/common.rc
> @@ -539,17 +539,38 @@ _make_test_img()
>          touch "$export_mp"
>          rm -f "$SOCK_DIR/fuse-output"
>  
> -        # Usually, users would export formatted nodes.  But we present fuse as a
> -        # protocol-level driver here, so we have to leave the format to the
> -        # client.
> -        # Switch off allow-other, because in general we do not need it for
> -        # iotests.  The default allow-other=auto has the downside of printing a
> -        # fusermount error on its first attempt if allow_other is not
> -        # permissible, which we would need to filter.

This comment applies to both branches of the if statement. I think
keeping it here is slightly better.

> -        QSD_NEED_PID=y $QSD \
> -              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> -              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> -              &
> +        if [ -n "$FUSE_OVER_IO_URING" ]; then
> +            nr_cpu=$(nproc 2>/dev/null || echo 1)
> +            nr_iothreads=$((nr_cpu / 2))
> +            if [ $nr_iothreads -lt 1 ]; then
> +                nr_iothreads=1
> +            fi

Please add a comment explaining that the purpose of this configuration
based on the number of CPUs is to test multiple IOThreads when the host
allows it, since that is a more interesting case than just 1 IOThread.
Many other configurations are possible as well, but not all of them can
be tested because the test matrix would be large.

> +
> +            iothread_args=""
> +            iothread_export_args=""
> +            for ((i=0; i<$nr_iothreads; i++)); do
> +                iothread_args="$iothread_args --object iothread,id=iothread$i"
> +                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
> +            done
> +
> +            QSD_NEED_PID=y $QSD \
> +                    $iothread_args \
> +                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> +                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
> +                &
> +        else
> +            # Usually, users would export formatted nodes.  But we present fuse as a
> +            # protocol-level driver here, so we have to leave the format to the
> +            # client.
> +            # Switch off allow-other, because in general we do not need it for
> +            # iotests.  The default allow-other=auto has the downside of printing a
> +            # fusermount error on its first attempt if allow_other is not
> +            # permissible, which we would need to filter.
> +            QSD_NEED_PID=y $QSD \
> +                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> +                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> +                &
> +        fi
>  
>          pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
>  
> @@ -592,6 +613,8 @@ _rm_test_img()
>  
>          kill "${FUSE_PIDS[index]}"
>  
> +        sleep 1
> +

What is the purpose of this sleep command?

>          # Wait until the mount is gone
>          timeout=10 # *0.5 s
>          while true; do
> -- 
> 2.45.2
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-09-09 19:38   ` Stefan Hajnoczi
@ 2025-09-09 20:51     ` Brian Song
  2025-09-10 13:14       ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-09 20:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/9/25 3:38 PM, Stefan Hajnoczi wrote:
> On Fri, Aug 29, 2025 at 10:50:25PM -0400, Brian Song wrote:
>> To test FUSE-over-io_uring, set the environment variable
>> FUSE_OVER_IO_URING=1. This applies only when using the
>> 'fuse' protocol.
>>
>> $ FUSE_OVER_IO_URING=1 ./check -fuse
>>
>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>> ---
>>   tests/qemu-iotests/check     |  2 ++
>>   tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
>>   2 files changed, 36 insertions(+), 11 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
>> index 545f9ec7bd..c6fa0f9e3d 100755
>> --- a/tests/qemu-iotests/check
>> +++ b/tests/qemu-iotests/check
>> @@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
>>           mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
>>                           const=fmt, help=f'test {fmt}')
>>   
>> +    # To test FUSE-over-io_uring, set the environment variable
>> +    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
>>       protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
>>       g_prt = p.add_argument_group(
>>           '  image protocol options',
>> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
>> index e977cb4eb6..f8b79c3810 100644
>> --- a/tests/qemu-iotests/common.rc
>> +++ b/tests/qemu-iotests/common.rc
>> @@ -539,17 +539,38 @@ _make_test_img()
>>           touch "$export_mp"
>>           rm -f "$SOCK_DIR/fuse-output"
>>   
>> -        # Usually, users would export formatted nodes.  But we present fuse as a
>> -        # protocol-level driver here, so we have to leave the format to the
>> -        # client.
>> -        # Switch off allow-other, because in general we do not need it for
>> -        # iotests.  The default allow-other=auto has the downside of printing a
>> -        # fusermount error on its first attempt if allow_other is not
>> -        # permissible, which we would need to filter.
> 
> This comment applies to both branches of the if statement. I think
> keeping it here is slightly better.
> 
>> -        QSD_NEED_PID=y $QSD \
>> -              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>> -              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
>> -              &
>> +        if [ -n "$FUSE_OVER_IO_URING" ]; then
>> +            nr_cpu=$(nproc 2>/dev/null || echo 1)
>> +            nr_iothreads=$((nr_cpu / 2))
>> +            if [ $nr_iothreads -lt 1 ]; then
>> +                nr_iothreads=1
>> +            fi
> 
> Please add a comment explaining that the purpose of this configuration
> based on the number of CPUs is to test multiple IOThreads when the host
> allows it, since that is a more interesting case than just 1 IOThread.
> Many other configurations are possible as well, but not all of them can
> be tested because the test matrix would be large.
> 
>> +
>> +            iothread_args=""
>> +            iothread_export_args=""
>> +            for ((i=0; i<$nr_iothreads; i++)); do
>> +                iothread_args="$iothread_args --object iothread,id=iothread$i"
>> +                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
>> +            done
>> +
>> +            QSD_NEED_PID=y $QSD \
>> +                    $iothread_args \
>> +                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>> +                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
>> +                &
>> +        else
>> +            # Usually, users would export formatted nodes.  But we present fuse as a
>> +            # protocol-level driver here, so we have to leave the format to the
>> +            # client.
>> +            # Switch off allow-other, because in general we do not need it for
>> +            # iotests.  The default allow-other=auto has the downside of printing a
>> +            # fusermount error on its first attempt if allow_other is not
>> +            # permissible, which we would need to filter.
>> +            QSD_NEED_PID=y $QSD \
>> +                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>> +                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
>> +                &
>> +        fi
>>   
>>           pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
>>   
>> @@ -592,6 +613,8 @@ _rm_test_img()
>>   
>>           kill "${FUSE_PIDS[index]}"
>>   
>> +        sleep 1
>> +
> 
> What is the purpose of this sleep command?
> 

I don’t exactly remember why. It might get stuck if there’s no sleep 
here. I remember we discussed this problem in earlier emails.

>>           # Wait until the mount is gone
>>           timeout=10 # *0.5 s
>>           while true; do
>> -- 
>> 2.45.2
>>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-09 19:33   ` Stefan Hajnoczi
@ 2025-09-09 20:51     ` Brian Song
  2025-09-10 13:17       ` Stefan Hajnoczi
  2025-09-15  5:43     ` Brian Song
  1 sibling, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-09 20:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
> On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
>> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
>>            */
>>           g_hash_table_remove(exports, exp->mountpoint);
>>       }
>> -}
>> -
>> -static void fuse_export_delete(BlockExport *blk_exp)
>> -{
>> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
>>   
>> -    for (int i = 0; i < exp->num_queues; i++) {
>> +    for (size_t i = 0; i < exp->num_queues; i++) {
>>           FuseQueue *q = &exp->queues[i];
>>   
>>           /* Queue 0's FD belongs to the FUSE session */
>>           if (i > 0 && q->fuse_fd >= 0) {
>>               close(q->fuse_fd);
> 
> This changes the behavior of the non-io_uring code. Now all fuse fds and
> fuse_session are closed while requests are potentially still being
> processed.
> 
> There is a race condition: if an IOThread is processing a request here
> then it may invoke a system call on q->fuse_fd just after it has been
> closed but not set to -1. If another thread has also opened a new file
> then the fd could be reused, resulting in an accidental write(2) to the
> new file. I'm not sure whether there is a way to trigger this in
> practice, but it looks like a problem waiting to happen.
> 
> Simply setting q->fuse_fd to -1 here doesn't fix the race. It would be
> necessary to stop processing fuse_fd in the thread before closing it
> here or to schedule a BH in each thread so that fuse_fd can be closed
> in the thread that uses the fd.

I get what you mean. This newly introduced cleanup code was originally 
in the deletion section, after the reconf counter decreased to 0, and it 
was meant to cancel the pending SQEs. But now we've moved it to the 
shutdown section, which introduces a potential problem. How do you 
think we should fix it? This is the last week of GSoC, and I'm already 
busy on weekdays since the new term has started.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-09-09 20:51     ` Brian Song
@ 2025-09-10 13:14       ` Stefan Hajnoczi
  2025-09-12  2:22         ` Brian Song
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-10 13:14 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 5652 bytes --]

On Tue, Sep 09, 2025 at 04:51:12PM -0400, Brian Song wrote:
> 
> 
> On 9/9/25 3:38 PM, Stefan Hajnoczi wrote:
> > On Fri, Aug 29, 2025 at 10:50:25PM -0400, Brian Song wrote:
> > > To test FUSE-over-io_uring, set the environment variable
> > > FUSE_OVER_IO_URING=1. This applies only when using the
> > > 'fuse' protocol.
> > > 
> > > $ FUSE_OVER_IO_URING=1 ./check -fuse
> > > 
> > > Suggested-by: Kevin Wolf <kwolf@redhat.com>
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Brian Song <hibriansong@gmail.com>
> > > ---
> > >   tests/qemu-iotests/check     |  2 ++
> > >   tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
> > >   2 files changed, 36 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
> > > index 545f9ec7bd..c6fa0f9e3d 100755
> > > --- a/tests/qemu-iotests/check
> > > +++ b/tests/qemu-iotests/check
> > > @@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
> > >           mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
> > >                           const=fmt, help=f'test {fmt}')
> > > +    # To test FUSE-over-io_uring, set the environment variable
> > > +    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
> > >       protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
> > >       g_prt = p.add_argument_group(
> > >           '  image protocol options',
> > > diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> > > index e977cb4eb6..f8b79c3810 100644
> > > --- a/tests/qemu-iotests/common.rc
> > > +++ b/tests/qemu-iotests/common.rc
> > > @@ -539,17 +539,38 @@ _make_test_img()
> > >           touch "$export_mp"
> > >           rm -f "$SOCK_DIR/fuse-output"
> > > -        # Usually, users would export formatted nodes.  But we present fuse as a
> > > -        # protocol-level driver here, so we have to leave the format to the
> > > -        # client.
> > > -        # Switch off allow-other, because in general we do not need it for
> > > -        # iotests.  The default allow-other=auto has the downside of printing a
> > > -        # fusermount error on its first attempt if allow_other is not
> > > -        # permissible, which we would need to filter.
> > 
> > This comment applies to both branches of the if statement. I think
> > keeping it here is slightly better.
> > 
> > > -        QSD_NEED_PID=y $QSD \
> > > -              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > -              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> > > -              &
> > > +        if [ -n "$FUSE_OVER_IO_URING" ]; then
> > > +            nr_cpu=$(nproc 2>/dev/null || echo 1)
> > > +            nr_iothreads=$((nr_cpu / 2))
> > > +            if [ $nr_iothreads -lt 1 ]; then
> > > +                nr_iothreads=1
> > > +            fi
> > 
> > Please add a comment explaining that the purpose of this configuration
> > based on the number of CPUs is to test multiple IOThreads when the host
> > allows it, since that is a more interesting case than just 1 IOThread.
> > Many other configurations are possible as well, but not all of them can
> > be tested because the test matrix would be large.
> > 
> > > +
> > > +            iothread_args=""
> > > +            iothread_export_args=""
> > > +            for ((i=0; i<$nr_iothreads; i++)); do
> > > +                iothread_args="$iothread_args --object iothread,id=iothread$i"
> > > +                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
> > > +            done
> > > +
> > > +            QSD_NEED_PID=y $QSD \
> > > +                    $iothread_args \
> > > +                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > +                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
> > > +                &
> > > +        else
> > > +            # Usually, users would export formatted nodes.  But we present fuse as a
> > > +            # protocol-level driver here, so we have to leave the format to the
> > > +            # client.
> > > +            # Switch off allow-other, because in general we do not need it for
> > > +            # iotests.  The default allow-other=auto has the downside of printing a
> > > +            # fusermount error on its first attempt if allow_other is not
> > > +            # permissible, which we would need to filter.
> > > +            QSD_NEED_PID=y $QSD \
> > > +                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > +                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> > > +                &
> > > +        fi
> > >           pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
> > > @@ -592,6 +613,8 @@ _rm_test_img()
> > >           kill "${FUSE_PIDS[index]}"
> > > +        sleep 1
> > > +
> > 
> > What is the purpose of this sleep command?
> > 
> 
> I don’t exactly remember why. It might get stuck if there’s no sleep here. I
> remember we discussed this problem in earlier emails.

The purpose needs to be understood. Otherwise there is a good chance
that the test will fail randomly in a continuous integration environment
where things sometimes take a long time due to CPU contention.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-09 20:51     ` Brian Song
@ 2025-09-10 13:17       ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-10 13:17 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

On Tue, Sep 09, 2025 at 04:51:32PM -0400, Brian Song wrote:
> 
> 
> On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
> > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
> > > @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
> > >            */
> > >           g_hash_table_remove(exports, exp->mountpoint);
> > >       }
> > > -}
> > > -
> > > -static void fuse_export_delete(BlockExport *blk_exp)
> > > -{
> > > -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
> > > -    for (int i = 0; i < exp->num_queues; i++) {
> > > +    for (size_t i = 0; i < exp->num_queues; i++) {
> > >           FuseQueue *q = &exp->queues[i];
> > >           /* Queue 0's FD belongs to the FUSE session */
> > >           if (i > 0 && q->fuse_fd >= 0) {
> > >               close(q->fuse_fd);
> > 
> > This changes the behavior of the non-io_uring code. Now all fuse fds and
> > fuse_session are closed while requests are potentially still being
> > processed.
> > 
> > There is a race condition: if an IOThread is processing a request here
> > then it may invoke a system call on q->fuse_fd just after it has been
> > closed but not set to -1. If another thread has also opened a new file
> > then the fd could be reused, resulting in an accidental write(2) to the
> > new file. I'm not sure whether there is a way to trigger this in
> > practice, but it looks like a problem waiting to happen.
> > 
> > Simply setting q->fuse_fd to -1 here doesn't fix the race. It would be
> > necessary to stop processing fuse_fd in the thread before closing it
> > here or to schedule a BH in each thread so that fuse_fd can be closed
> > in the thread that uses the fd.
> 
> I get what you mean. This newly introduced cleanup code was originally in
> the deletion section, after the reconf counter decreased to 0, and it was
> meant to cancel the pending SQEs. But now we've moved it to the shutdown
> section, which introduces a potential problem. How do you think we should
> fix it? This is the last week of GSoC, and I'm already busy on weekdays since
> the new term has started.

Hi Brian,
Two issues:
1. Change of behavior for non-io_uring code. It would be safer to keep
   the old behavior for non-io_uring code.
2. The race condition. Schedule a BH in each queue's IOThread and call
   close(fuse_fd) from the BH function. That way there is no race
   between threads.
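
   A minimal sketch of (2), assuming the FuseQueue fields from this
   series; fuse_queue_close_fd_bh is a made-up name:

    static void fuse_queue_close_fd_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /* runs in q->ctx, so nothing else races on q->fuse_fd */
        if (q->fuse_fd >= 0) {
            close(q->fuse_fd);
            q->fuse_fd = -1;
        }
    }

    /* in the shutdown path, for the queues that own their FD: */
    for (size_t i = 1; i < exp->num_queues; i++) {
        aio_bh_schedule_oneshot(exp->queues[i].ctx,
                                fuse_queue_close_fd_bh, &exp->queues[i]);
    }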

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-09-10 13:14       ` Stefan Hajnoczi
@ 2025-09-12  2:22         ` Brian Song
  2025-09-15 17:41           ` Stefan Hajnoczi
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-12  2:22 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf



On 9/10/25 9:14 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 09, 2025 at 04:51:12PM -0400, Brian Song wrote:
>>
>>
>> On 9/9/25 3:38 PM, Stefan Hajnoczi wrote:
>>> On Fri, Aug 29, 2025 at 10:50:25PM -0400, Brian Song wrote:
>>>> To test FUSE-over-io_uring, set the environment variable
>>>> FUSE_OVER_IO_URING=1. This applies only when using the
>>>> 'fuse' protocol.
>>>>
>>>> $ FUSE_OVER_IO_URING=1 ./check -fuse
>>>>
>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>>>> ---
>>>>    tests/qemu-iotests/check     |  2 ++
>>>>    tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
>>>>    2 files changed, 36 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
>>>> index 545f9ec7bd..c6fa0f9e3d 100755
>>>> --- a/tests/qemu-iotests/check
>>>> +++ b/tests/qemu-iotests/check
>>>> @@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
>>>>            mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
>>>>                            const=fmt, help=f'test {fmt}')
>>>> +    # To test FUSE-over-io_uring, set the environment variable
>>>> +    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
>>>>        protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
>>>>        g_prt = p.add_argument_group(
>>>>            '  image protocol options',
>>>> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
>>>> index e977cb4eb6..f8b79c3810 100644
>>>> --- a/tests/qemu-iotests/common.rc
>>>> +++ b/tests/qemu-iotests/common.rc
>>>> @@ -539,17 +539,38 @@ _make_test_img()
>>>>            touch "$export_mp"
>>>>            rm -f "$SOCK_DIR/fuse-output"
>>>> -        # Usually, users would export formatted nodes.  But we present fuse as a
>>>> -        # protocol-level driver here, so we have to leave the format to the
>>>> -        # client.
>>>> -        # Switch off allow-other, because in general we do not need it for
>>>> -        # iotests.  The default allow-other=auto has the downside of printing a
>>>> -        # fusermount error on its first attempt if allow_other is not
>>>> -        # permissible, which we would need to filter.
>>>
>>> This comment applies to both branches of the if statement. I think
>>> keeping it here is slightly better.
>>>
>>>> -        QSD_NEED_PID=y $QSD \
>>>> -              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>>>> -              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
>>>> -              &
>>>> +        if [ -n "$FUSE_OVER_IO_URING" ]; then
>>>> +            nr_cpu=$(nproc 2>/dev/null || echo 1)
>>>> +            nr_iothreads=$((nr_cpu / 2))
>>>> +            if [ $nr_iothreads -lt 1 ]; then
>>>> +                nr_iothreads=1
>>>> +            fi
>>>
>>> Please add a comment explaining that the purpose of this configuration
>>> based on the number of CPUs is to test multiple IOThreads when the host
> > > allows it, since that is a more interesting case than just 1 IOThread.
>>> Many other configurations are possible as well, but not all of them can
>>> be tested because the test matrix would be large.
>>>
>>>> +
>>>> +            iothread_args=""
>>>> +            iothread_export_args=""
>>>> +            for ((i=0; i<$nr_iothreads; i++)); do
>>>> +                iothread_args="$iothread_args --object iothread,id=iothread$i"
>>>> +                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
>>>> +            done
>>>> +
>>>> +            QSD_NEED_PID=y $QSD \
>>>> +                    $iothread_args \
>>>> +                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>>>> +                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
>>>> +                &
>>>> +        else
>>>> +            # Usually, users would export formatted nodes.  But we present fuse as a
>>>> +            # protocol-level driver here, so we have to leave the format to the
>>>> +            # client.
>>>> +            # Switch off allow-other, because in general we do not need it for
>>>> +            # iotests.  The default allow-other=auto has the downside of printing a
>>>> +            # fusermount error on its first attempt if allow_other is not
>>>> +            # permissible, which we would need to filter.
>>>> +            QSD_NEED_PID=y $QSD \
>>>> +                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
>>>> +                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
>>>> +                &
>>>> +        fi
>>>>            pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
>>>> @@ -592,6 +613,8 @@ _rm_test_img()
>>>>            kill "${FUSE_PIDS[index]}"
>>>> +        sleep 1
>>>> +
>>>
>>> What is the purpose of this sleep command?
>>>
>>
>> I don’t exactly remember why. It might get stuck if there’s no sleep here. I
>> remember we discussed this problem in earlier emails.
> 
> The purpose needs to be understood. Otherwise there is a good chance
> that the test will fail randomly in a continuous integration environment
> where things sometimes take a long time due to CPU contention.
> 
> Stefan

I think the issue lies in our current approach of using df to check 
whether the FUSE mount has been unmounted.

When we traced df with strace, we found that its logic for checking the 
mount point is:
=> Call mount to read the system's mount information
=> Use statfs() to get the filesystem statistics

But our current test code exits with the following sequence:
=> Kill the FUSE process
=> The kernel starts cleaning up the FUSE mount point
=> df calls statfs(), which requires communication with the FUSE process.
But the FUSE process might still be cleaning up, causing the
communication to fail
=> df then returns an error or stale information
=> Our detection logic misinterprets this and immediately deletes the 
mounted image

Since we only need to check the system's mount information, we can just 
call mount and grep "$img" to verify whether the image has been 
successfully unmounted.
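
Roughly, as a sketch (keeping the conventions of the existing timeout
loop in common.rc; the grep pattern would need anchoring so one image
name doesn't match another):

        # Wait until the mount is gone
        timeout=10 # *0.5 s
        while mount | grep -q "$img"; do
            sleep 0.5
            timeout=$((timeout - 1))
            if [ $timeout -eq 0 ]; then
                echo "Timed out waiting for FUSE unmount of $img"
                break
            fi
        done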

Does it make sense?

Brian


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-09 19:33   ` Stefan Hajnoczi
  2025-09-09 20:51     ` Brian Song
@ 2025-09-15  5:43     ` Brian Song
  2025-09-17 13:01       ` Hanna Czenczek
  1 sibling, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-15  5:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

Hi Hanna,

Stefan raised the above issue and proposed a preliminary solution: keep
closing the file descriptor in the delete section, but perform
umount separately for FUSE uring and traditional FUSE in the shutdown
and delete sections respectively. This approach avoids the race
condition on the file descriptor.

In the case of FUSE uring, umount must be performed in the shutdown
section. The reason is that the kernel currently lacks an interface to
explicitly cancel submitted SQEs. Performing umount forces the kernel to
flush all pending SQEs and return their CQEs. Without this step, CQEs
may arrive after the export has already been deleted, and invoking the
CQE handler at that point would dereference freed memory and trigger a
segmentation fault.

I’m curious about traditional FUSE: is it strictly necessary to perform
umount in the delete section, or could it also be done in shutdown?
Additionally, what is the correct ordering between close(fd) and
umount: does one need to precede the other?

Thanks,
Brian

On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
 > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
 >> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 >>            */
 >>           g_hash_table_remove(exports, exp->mountpoint);
 >>       }
 >> -}
 >> -
 >> -static void fuse_export_delete(BlockExport *blk_exp)
 >> -{
 >> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
 >>
 >> -    for (int i = 0; i < exp->num_queues; i++) {
 >> +    for (size_t i = 0; i < exp->num_queues; i++) {
 >>           FuseQueue *q = &exp->queues[i];
 >>
 >>           /* Queue 0's FD belongs to the FUSE session */
 >>           if (i > 0 && q->fuse_fd >= 0) {
 >>               close(q->fuse_fd);
 >
 > This changes the behavior of the non-io_uring code. Now all fuse fds and
 > fuse_session are closed while requests are potentially still being
 > processed.
 >
 > There is a race condition: if an IOThread is processing a request here
 > then it may invoke a system call on q->fuse_fd just after it has been
 > closed but not set to -1. If another thread has also opened a new file
 > then the fd could be reused, resulting in an accidental write(2) to the
 > new file. I'm not sure whether there is a way to trigger this in
 > practice, but it looks like a problem waiting to happen.
 >
 > Simply setting q->fuse_fd to -1 here doesn't fix the race. It would be
 > necessary to stop processing fuse_fd in the thread before closing it
 > here or to schedule a BH in each thread so that fuse_fd can be closed
 > in the thread that uses the fd.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/4] iotests: add tests for FUSE-over-io_uring
  2025-09-12  2:22         ` Brian Song
@ 2025-09-15 17:41           ` Stefan Hajnoczi
  0 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-15 17:41 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

[-- Attachment #1: Type: text/plain, Size: 7706 bytes --]

On Thu, Sep 11, 2025 at 10:22:24PM -0400, Brian Song wrote:
> 
> 
> On 9/10/25 9:14 AM, Stefan Hajnoczi wrote:
> > On Tue, Sep 09, 2025 at 04:51:12PM -0400, Brian Song wrote:
> > > 
> > > 
> > > On 9/9/25 3:38 PM, Stefan Hajnoczi wrote:
> > > > On Fri, Aug 29, 2025 at 10:50:25PM -0400, Brian Song wrote:
> > > > > To test FUSE-over-io_uring, set the environment variable
> > > > > FUSE_OVER_IO_URING=1. This applies only when using the
> > > > > 'fuse' protocol.
> > > > > 
> > > > > $ FUSE_OVER_IO_URING=1 ./check -fuse
> > > > > 
> > > > > Suggested-by: Kevin Wolf <kwolf@redhat.com>
> > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Brian Song <hibriansong@gmail.com>
> > > > > ---
> > > > >    tests/qemu-iotests/check     |  2 ++
> > > > >    tests/qemu-iotests/common.rc | 45 +++++++++++++++++++++++++++---------
> > > > >    2 files changed, 36 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
> > > > > index 545f9ec7bd..c6fa0f9e3d 100755
> > > > > --- a/tests/qemu-iotests/check
> > > > > +++ b/tests/qemu-iotests/check
> > > > > @@ -94,6 +94,8 @@ def make_argparser() -> argparse.ArgumentParser:
> > > > >            mg.add_argument('-' + fmt, dest='imgfmt', action='store_const',
> > > > >                            const=fmt, help=f'test {fmt}')
> > > > > +    # To test FUSE-over-io_uring, set the environment variable
> > > > > +    # FUSE_OVER_IO_URING=1. This applies only when using the 'fuse' protocol
> > > > >        protocol_list = ['file', 'rbd', 'nbd', 'ssh', 'nfs', 'fuse']
> > > > >        g_prt = p.add_argument_group(
> > > > >            '  image protocol options',
> > > > > diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> > > > > index e977cb4eb6..f8b79c3810 100644
> > > > > --- a/tests/qemu-iotests/common.rc
> > > > > +++ b/tests/qemu-iotests/common.rc
> > > > > @@ -539,17 +539,38 @@ _make_test_img()
> > > > >            touch "$export_mp"
> > > > >            rm -f "$SOCK_DIR/fuse-output"
> > > > > -        # Usually, users would export formatted nodes.  But we present fuse as a
> > > > > -        # protocol-level driver here, so we have to leave the format to the
> > > > > -        # client.
> > > > > -        # Switch off allow-other, because in general we do not need it for
> > > > > -        # iotests.  The default allow-other=auto has the downside of printing a
> > > > > -        # fusermount error on its first attempt if allow_other is not
> > > > > -        # permissible, which we would need to filter.
> > > > 
> > > > This comment applies to both branches of the if statement. I think
> > > > keeping it here is slightly better.
> > > > 
> > > > > -        QSD_NEED_PID=y $QSD \
> > > > > -              --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > > > -              --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> > > > > -              &
> > > > > +        if [ -n "$FUSE_OVER_IO_URING" ]; then
> > > > > +            nr_cpu=$(nproc 2>/dev/null || echo 1)
> > > > > +            nr_iothreads=$((nr_cpu / 2))
> > > > > +            if [ $nr_iothreads -lt 1 ]; then
> > > > > +                nr_iothreads=1
> > > > > +            fi
> > > > 
> > > > Please add a comment explaining that the purpose of this configuration
> > > > based on the number of CPUs is to test multiple IOThreads when the host
> > > > allows it, since that is a more interesting case than just 1 IOThread.
> > > > Many other configurations are possible as well, but not all of them can
> > > > be tested because the test matrix would be large.
> > > > 
> > > > > +
> > > > > +            iothread_args=""
> > > > > +            iothread_export_args=""
> > > > > +            for ((i=0; i<$nr_iothreads; i++)); do
> > > > > +                iothread_args="$iothread_args --object iothread,id=iothread$i"
> > > > > +                iothread_export_args="$iothread_export_args,iothread.$i=iothread$i"
> > > > > +            done
> > > > > +
> > > > > +            QSD_NEED_PID=y $QSD \
> > > > > +                    $iothread_args \
> > > > > +                    --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > > > +                    --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off,io-uring=on$iothread_export_args \
> > > > > +                &
> > > > > +        else
> > > > > +            # Usually, users would export formatted nodes.  But we present fuse as a
> > > > > +            # protocol-level driver here, so we have to leave the format to the
> > > > > +            # client.
> > > > > +            # Switch off allow-other, because in general we do not need it for
> > > > > +            # iotests.  The default allow-other=auto has the downside of printing a
> > > > > +            # fusermount error on its first attempt if allow_other is not
> > > > > +            # permissible, which we would need to filter.
> > > > > +            QSD_NEED_PID=y $QSD \
> > > > > +                --blockdev file,node-name=export-node,filename=$img_name,discard=unmap \
> > > > > +                --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=on,allow-other=off \
> > > > > +                &
> > > > > +        fi
> > > > >            pidfile="$QEMU_TEST_DIR/qemu-storage-daemon.pid"
> > > > > @@ -592,6 +613,8 @@ _rm_test_img()
> > > > >            kill "${FUSE_PIDS[index]}"
> > > > > +        sleep 1
> > > > > +
> > > > 
> > > > What is the purpose of this sleep command?
> > > > 
> > > 
> > > I don’t exactly remember why. It might get stuck if there’s no sleep here. I
> > > remember we discussed this problem in earlier emails.
> > 
> > The purpose needs to be understood. Otherwise there is a good chance
> > that the test will fail randomly in a continuous integration environment
> > where things sometimes take a long time due to CPU contention.
> > 
> > Stefan
> 
> I think the issue lies in our current approach of using df to check whether
> the FUSE mount has been unmounted.
> 
> When we traced df with strace, we found that its logic for checking the
> mount point is:
> => Call mount to read the system's mount information
> => Use statfs() to get the filesystem statistics
> 
> But our current test code exits with the following sequence:
> => Kill the FUSE process
> => The kernel starts cleaning up the FUSE mount point
> => df calls statfs(), which requires communication with the FUSE process. But
> the FUSE process might still be cleaning up, causing the communication to
> fail
> => df then returns an error or stale information
> => Our detection logic misinterprets this and immediately deletes the
> mounted image
> 
> Since we only need to check the system's mount information, we can just call
> mount and grep "$img" to verify whether the image has been successfully
> unmounted.
> 
> Does it make sense?

It's worth trying. Hanna wrote the existing code that uses df(1), so
maybe she has thoughts on this too.

I looked at waiting for FUSE_PIDS[] or using the QMP monitor to shut
down cleanly. Those approaches have their own issues. Sending a
`block-export-del` QMP command and waiting for it to return, followed by
a `quit` QMP command should work well. But it's more complex than
adjusting the existing loop and still needs a timeout. So I think the
mount(8) approach is worth a shot.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports
  2025-09-03 18:11     ` Brian Song
@ 2025-09-16 12:18       ` Kevin Wolf
  0 siblings, 0 replies; 38+ messages in thread
From: Kevin Wolf @ 2025-09-16 12:18 UTC (permalink / raw)
  To: Brian Song
  Cc: Stefan Hajnoczi, qemu-block, qemu-devel, armbru, bernd, fam,
	hreitz

Am 03.09.2025 um 20:11 hat Brian Song geschrieben:
> 
> 
> On 9/3/25 5:49 AM, Stefan Hajnoczi wrote:
> > On Sat, Aug 30, 2025 at 08:00:00AM -0400, Brian Song wrote:
> > > We used fio to test a 1 GB file under both traditional FUSE and
> > > FUSE-over-io_uring modes. The experiments were conducted with the
> > > following iodepth and numjobs configurations: 1-1, 64-1, 1-4, and 64-4,
> > > with 70% read and 30% write, resulting in a total of eight test cases,
> > > measuring both latency and throughput.
> > > 
> > > Test results:
> > > 
> > > https://gist.github.com/hibriansong/a4849903387b297516603e83b53bbde4
> > 
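(For reference, a fio invocation matching that description might look
roughly like this; the exact command used is not shown in the thread:

    $ fio --name=fuse-test --directory=/mnt/tmp --size=1G \
          --rw=randrw --rwmixread=70 --direct=1 --ioengine=libaio \
          --iodepth=64 --numjobs=4 --runtime=60 --time_based \
          --group_reporting
)
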
> > Hanna: You benchmarked the FUSE export coroutine implementation a little
> > while ago. What do you think about these results with
> > FUSE-over-io_uring?
> > 
> > What stands out to me is that iodepth=1 numjobs=4 already saturates the
> > system, so increasing iodepth to 64 does not improve the results much.
> > 
> > Brian: What is the qemu-storage-daemon command-line for the benchmark
> > and what are the details of /mnt/tmp/ (e.g. a preallocated 10 GB file
> > with an XFS file system mounted from the FUSE image)?
> 
> QMP script:
> https://gist.github.com/hibriansong/399f9564a385cfb94db58669e63611f8
> 
> Or:
> ### NORMAL
> ./qemu/build/storage-daemon/qemu-storage-daemon \
>   --object iothread,id=iothread1 \
>   --object iothread,id=iothread2 \
>   --object iothread,id=iothread3 \
>   --object iothread,id=iothread4 \
>   --blockdev node-name=prot-node,driver=file,filename=ubuntu.qcow2 \

This uses the default AIO and most importantly cache mode, which means
that the host kernel page cache is used. This makes it hard to tell how
much of the I/O only touched RAM on the host and how much really went to
the disk, so the results are difficult to interpret correctly.

For benchmarks, it's generally best to use cache.direct=on and I think
I'd also prefer aio=native (or aio=io_uring).

>   --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
>   --export type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on,iothread.0=iothread1,iothread.1=iothread2,iothread.2=iothread3,iothread.3=iothread4
> 
> ### URING
> echo Y > /sys/module/fuse/parameters/enable_uring
> 
> ./qemu/build/storage-daemon/qemu-storage-daemon \
>   --object iothread,id=iothread1 \
>   --object iothread,id=iothread2 \
>   --object iothread,id=iothread3 \
>   --object iothread,id=iothread4 \
>   --blockdev node-name=prot-node,driver=file,filename=ubuntu.qcow2 \
>   --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
>   --export type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on,io-uring=on,iothread.0=iothread1,iothread.1=iothread2,iothread.2=iothread3,iothread.3=iothread4
> 
> ubuntu.qcow2 has been preallocated and enlarged to 100GB by
> 
> $ qemu-img resize ubuntu.qcow2 100G

I think this doesn't preallocate the newly added space, you should add
--preallocation=falloc at least.

> $ virt-customize \
>    --run-command '/bin/bash /bin/growpart /dev/sda 1' \
>    --run-command 'resize2fs /dev/sda1' -a ubuntu.qcow2
> 
> The image file, formatted with an Ext4 filesystem, was mounted on /mnt/tmp
> on my PC equipped with a Kingston PCIe 4.0 NVMe SSD
> 
> $ sudo kpartx -av mount-point
> $ sudo mount /dev/mapper/loop31p1 /mnt/tmp/
> 
> 
> Unmount the partition after done using it.
> 
> $ sudo umount /mnt/tmp
> # sudo kpartx -dv mount-point

What I would personally use to benchmark performance is just a clean
preallocated raw image without a guest on it. I wouldn't even partition
it or necessarily put a filesystem on it, but just run the benchmark
directly on the FUSE export's mountpoint.

The other thing I'd consider for benchmarking is the null-co block
driver so that the FUSE overhead really dominates and isn't dwarfed by a
slow disk. (A null block device is one where you can't have a filesystem
even if you wanted to.)
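
For example (a sketch; the size and iothread wiring follow the commands
above, with null-co in place of the qcow2 image):

    ./qemu/build/storage-daemon/qemu-storage-daemon \
      --object iothread,id=iothread1 \
      --blockdev node-name=null-node,driver=null-co,size=10737418240,read-zeroes=on \
      --export type=fuse,id=exp0,node-name=null-node,mountpoint=mount-point,writable=on,io-uring=on,iothread.0=iothread1

and for the file-based runs, something like cache.direct=on,aio=native
on the --blockdev file line.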

Kevin

> > > On 8/29/25 10:50 PM, Brian Song wrote:
> > > > Hi all,
> > > > 
> > > > This is a GSoC project. More details are available here:
> > > > https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
> > > > 
> > > > This patch series includes:
> > > > - Add a round-robin mechanism to distribute the kernel-required Ring
> > > > Queues to FUSE Queues
> > > > - Support multiple in-flight requests (multiple ring entries)
> > > > - Add tests for FUSE-over-io_uring
> > > > 
> > > > More detail in the v2 cover letter:
> > > > https://lists.nongnu.org/archive/html/qemu-block/2025-08/msg00140.html
> > > > 
> > > > And in the v1 cover letter:
> > > > https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
> > > > 
> > > > 
> > > > Brian Song (4):
> > > >     export/fuse: add opt to enable FUSE-over-io_uring
> > > >     export/fuse: process FUSE-over-io_uring requests
> > > >     export/fuse: Safe termination for FUSE-uring
> > > >     iotests: add tests for FUSE-over-io_uring
> > > > 
> > > >    block/export/fuse.c                  | 838 +++++++++++++++++++++------
> > > >    docs/tools/qemu-storage-daemon.rst   |  11 +-
> > > >    qapi/block-export.json               |   5 +-
> > > >    storage-daemon/qemu-storage-daemon.c |   1 +
> > > >    tests/qemu-iotests/check             |   2 +
> > > >    tests/qemu-iotests/common.rc         |  45 +-
> > > >    util/fdmon-io_uring.c                |   5 +-
> > > >    7 files changed, 717 insertions(+), 190 deletions(-)
> > > > 
> > > 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
  2025-09-03 10:53   ` Stefan Hajnoczi
  2025-09-03 11:26   ` Stefan Hajnoczi
@ 2025-09-16 19:08   ` Kevin Wolf
  2025-09-17 19:47     ` Brian Song
  2 siblings, 1 reply; 38+ messages in thread
From: Kevin Wolf @ 2025-09-16 19:08 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, stefanha

Am 30.08.2025 um 04:50 hat Brian Song geschrieben:
> This patch adds a new export option for storage-export-daemon to enable
> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
> It also implements the protocol handshake with the Linux kernel
> during the FUSE-over-io_uring initialization phase.
> 
> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
> 
> The kernel documentation describes in detail how FUSE-over-io_uring
> works. This patch implements the Initial SQE stage shown in the diagram:
> it initializes one queue per IOThread, each currently supporting a
> single submission queue entry (SQE). When the FUSE driver sends the
> first FUSE request (FUSE_INIT), storage-export-daemon calls
> fuse_uring_start() to complete initialization, ultimately submitting
> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
> successful initialization with the kernel.
> 
> We also added support for multiple IOThreads. The current Linux kernel
> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
> To let users customize the number of FUSE Queues (i.e., IOThreads),
> we first create nproc Ring Queues as required by the kernel, then
> distribute them in a round-robin manner to the FUSE Queues for
> registration. In addition, to support multiple in-flight requests,
> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
> entries/requests.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>
> ---
>  block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>  docs/tools/qemu-storage-daemon.rst   |  11 +-
>  qapi/block-export.json               |   5 +-
>  storage-daemon/qemu-storage-daemon.c |   1 +
>  util/fdmon-io_uring.c                |   5 +-
>  5 files changed, 309 insertions(+), 23 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index c0ad4696ce..19bf9e5f74 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -48,6 +48,9 @@
>  #include <linux/fs.h>
>  #endif
>  
> +/* room needed in buffer to accommodate header */
> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> +
>  /* Prevent overly long bounce buffer allocations */
>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>  /*
> @@ -63,12 +66,59 @@
>      (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>  
>  typedef struct FuseExport FuseExport;
> +typedef struct FuseQueue FuseQueue;
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32

Maybe it would be a little clearer if the next few types has URing in
their name instead of just Ring.

> +typedef struct FuseRingQueue FuseRingQueue;
> +typedef struct FuseRingEnt {
> +    /* back pointer */
> +    FuseRingQueue *rq;
> +
> +    /* commit id of a fuse request */
> +    uint64_t req_commit_id;
> +
> +    /* fuse request header and payload */
> +    struct fuse_uring_req_header req_header;
> +    void *op_payload;
> +    size_t req_payload_sz;
> +
> +    /* The vector passed to the kernel */
> +    struct iovec iov[2];
> +
> +    CqeHandler fuse_cqe_handler;
> +} FuseRingEnt;
> +
> +struct FuseRingQueue {

It would be good to have a comment here that explains the difference
between FuseQueue and FuseRingQueue.

Is this a distinction that should remain in the long run or would we
always have a 1:1 mapping between FuseQueue and FuseRingQueue once the
pending kernel changes are merged that allow a number of uring queues
different from the number of CPUs?

> +    int rqid;
> +
> +    /* back pointer */
> +    FuseQueue *q;
> +    FuseRingEnt *ent;
> +
> +    /* List entry for ring_queues */
> +    QLIST_ENTRY(FuseRingQueue) next;
> +};
> +
> +/*
> + * Round-robin distribution of ring queues across FUSE queues.
> + * This structure manages the mapping between kernel ring queues and user
> + * FUSE queues.
> + */
> +typedef struct FuseRingQueueManager {
> +    FuseRingQueue *ring_queues;
> +    int num_ring_queues;
> +    int num_fuse_queues;
> +} FuseRingQueueManager;

This isn't a manager, it's just the set of queues the export uses.

num_fuse_queues duplicates exp->num_queues, there is no reason for it to
exist. All users also have access to the FuseExport itself.

The other two fields can just be merged directly into FuseExport,
preferably renamed to uring_queues and num_uring_queues.

> +#endif
>  
>  /*
>   * One FUSE "queue", representing one FUSE FD from which requests are fetched
>   * and processed.  Each queue is tied to an AioContext.
>   */
> -typedef struct FuseQueue {
> +struct FuseQueue {
>      FuseExport *exp;
>  
>      AioContext *ctx;
> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>       * Free this buffer with qemu_vfree().
>       */
>      void *spillover_buf;
> -} FuseQueue;
>  
> -/*
> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
> - * are big enough to be accepted by the FUSE kernel driver.
> - */
> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> -                  FUSE_SPILLOVER_BUF_SIZE <
> -                  FUSE_MIN_READ_BUFFER);
> +#ifdef CONFIG_LINUX_IO_URING
> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
> +#endif
> +};
>  
>  struct FuseExport {
>      BlockExport common;
> @@ -133,7 +179,7 @@ struct FuseExport {
>       */
>      bool halted;
>  
> -    int num_queues;
> +    size_t num_queues;

I'm not sure why this change is needed. If it is, can it be a separate
patch before this one, with a commit message describing the reason?

>      FuseQueue *queues;
>      /*
>       * True if this export should follow the generic export's AioContext.
> @@ -149,6 +195,12 @@ struct FuseExport {
>      /* Whether allow_other was used as a mount option or not */
>      bool allow_other;
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    bool is_uring;
> +    size_t ring_queue_depth;
> +    FuseRingQueueManager *ring_queue_manager;
> +#endif
> +
>      mode_t st_mode;
>      uid_t st_uid;
>      gid_t st_gid;
> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>          return;
>      }
>  
> -    for (int i = 0; i < exp->num_queues; i++) {
> +    for (size_t i = 0; i < exp->num_queues; i++) {
>          aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>                             read_from_fuse_fd, NULL, NULL, NULL,
>                             &exp->queues[i]);
> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>      .drained_poll  = fuse_export_drained_poll,
>  };
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> +                    const unsigned int rqid,
> +                    const unsigned int commit_id)

Indentation is off here. There are two accepted styles for indentation
after breaking a long line in QEMU (see docs/devel/style.rst):

1. Indent the next line by exactly four spaces:

    do_something(x, y,
        z);

2. Align the next line with the first character after the opening
   parenthesis:

    do_something(x, y,
                 z);

The second one is the preferred one. The first one is generally only
used when the parenthesis is already too far right and we can't do much
about it.

> +{
> +    req->qid = rqid;
> +    req->commit_id = commit_id;
> +    req->flags = 0;
> +}
> +
> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
> +               __u32 cmd_op)

Indentation.

Another option here is to keep everything before the function name on a
separate line, like this:

static void
fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, __u32 cmd_op)

This would allow the second line to stay under 80 characters.

> +{
> +    sqe->opcode = IORING_OP_URING_CMD;
> +
> +    sqe->fd = q->fuse_fd;
> +    sqe->rw_flags = 0;
> +    sqe->ioprio = 0;
> +    sqe->off = 0;
> +
> +    sqe->cmd_op = cmd_op;
> +    sqe->__pad1 = 0;
> +}
> +
> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> +
> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
> +
> +    sqe->addr = (uint64_t)(ent->iov);
> +    sqe->len = 2;
> +
> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
> +}
> +
> +static void fuse_uring_submit_register(void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +

Extra empty line.

> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));

The parentheses around ent->fuse_cqe_handler are unnecessary.

> +}
> +
> +/**
> + * Distribute ring queues across FUSE queues using round-robin algorithm.

Hm, if this function distributes (u)ring queues, then what is
fuse_distribute_ring_queues() doing? Is the term overloaded with two
meanings?

> + * This ensures even distribution of kernel ring queues across user-specified
> + * FUSE queues.
> + */
> +static
> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
> +                                                    size_t ring_queue_depth,
> +                                                    size_t bufsize)

The right style here would be something like:

static FuseRingQueueManager *
fuse_ring_queue_manager_create(int num_fuse_queues,
                               size_t ring_queue_depth,
                               size_t bufsize)

Given that I said that there is no reason to call the set of all queues
a manager, or to even have it separate from FuseExport, this probably
becomes fuse_uring_setup_queues() or something.

> +{
> +    int num_ring_queues = get_nprocs();

This could use a comment saying that this is a kernel requirement at the
moment.

> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
> +
> +    if (!manager) {
> +        return NULL;
> +    }

g_new() never returns NULL, it aborts on error instead, so no reason to
have a NULL check here.

> +
> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
> +    manager->num_ring_queues = num_ring_queues;
> +    manager->num_fuse_queues = num_fuse_queues;
> +
> +    if (!manager->ring_queues) {
> +        g_free(manager);
> +        return NULL;
> +    }

This check is unnecessary for the same reason.

> +
> +    for (int i = 0; i < num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +        rq->rqid = i;
> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
> +
> +        if (!rq->ent) {
> +            for (int j = 0; j < i; j++) {
> +                g_free(manager->ring_queues[j].ent);
> +            }
> +            g_free(manager->ring_queues);
> +            g_free(manager);
> +            return NULL;
> +        }

This one, too.

> +
> +        for (size_t j = 0; j < ring_queue_depth; j++) {
> +            FuseRingEnt *ent = &rq->ent[j];
> +            ent->rq = rq;
> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
> +
> +            if (!ent->op_payload) {
> +                for (size_t k = 0; k < j; k++) {
> +                    g_free(rq->ent[k].op_payload);
> +                }
> +                g_free(rq->ent);
> +                for (int k = 0; k < i; k++) {
> +                    g_free(manager->ring_queues[k].ent);
> +                }
> +                g_free(manager->ring_queues);
> +                g_free(manager);
> +                return NULL;
> +            }

And this one.

Removing all of them will make the function a lot more readable.
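
With the checks dropped and the queue set folded into FuseExport, the
whole thing could look roughly like this (a sketch, using the
uring_queues/num_uring_queues names suggested above):

    static void fuse_uring_setup_queues(FuseExport *exp, size_t bufsize)
    {
        /* one ring queue per CPU, as the current kernel requires */
        exp->num_uring_queues = get_nprocs();
        exp->uring_queues = g_new(FuseRingQueue, exp->num_uring_queues);

        for (int i = 0; i < exp->num_uring_queues; i++) {
            FuseRingQueue *rq = &exp->uring_queues[i];

            rq->rqid = i;
            rq->ent = g_new(FuseRingEnt, exp->ring_queue_depth);

            for (size_t j = 0; j < exp->ring_queue_depth; j++) {
                FuseRingEnt *ent = &rq->ent[j];

                ent->rq = rq;
                ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
                ent->op_payload = g_malloc0(ent->req_payload_sz);
                ent->iov[0] = (struct iovec) {
                    &ent->req_header, sizeof(struct fuse_uring_req_header)
                };
                ent->iov[1] = (struct iovec) {
                    ent->op_payload, ent->req_payload_sz
                };
                ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
            }
        }
    }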

> +
> +            ent->iov[0] = (struct iovec) {
> +                &(ent->req_header),

Unnecessary parentheses.

> +                sizeof(struct fuse_uring_req_header)
> +            };
> +            ent->iov[1] = (struct iovec) {
> +                ent->op_payload,
> +                ent->req_payload_sz
> +            };
> +
> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
> +        }
> +    }
> +
> +    return manager;
> +}
> +
> +static
> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
> +{
> +    int queue_index = 0;
> +
> +    for (int i = 0; i < manager->num_ring_queues; i++) {
> +        FuseRingQueue *rq = &manager->ring_queues[i];
> +
> +        rq->q = &exp->queues[queue_index];
> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
> +
> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
> +    }
> +}

Ok, no overloaded meaning of distributing queues, but this function
should probably be merged with the one above. It's part of setting up
the queues.

You don't need a separate queue_index counter, you can just directly use
exp->queues[i % manager->num_fuse_queues].
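
i.e. the distribution loop could collapse to something like (a sketch,
with the queues stored in FuseExport):

    for (int i = 0; i < exp->num_uring_queues; i++) {
        FuseRingQueue *rq = &exp->uring_queues[i];

        rq->q = &exp->queues[i % exp->num_queues];
        QLIST_INSERT_HEAD(&rq->q->ring_queue_list, rq, next);
    }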

> +static
> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
> +                                            FuseRingQueueManager *manager)

Again the formatting. If you split the line before the function name, it
should be "static void" on the first line.

> +{
> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
> +        FuseQueue *q = &exp->queues[i];
> +        FuseRingQueue *rq;
> +
> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
> +                                        &(rq->ent[j]));
> +            }
> +        }
> +    }
> +}

Why one BH per queue entry? This adds up quickly. All entries of the
same queue need to be processed in the same AioContext, so wouldn't it
make more sense to have a BH per (FUSE) queue and handle all of its
uring queues and their entries in a single BH?
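
A sketch of that (fuse_uring_register_queue_bh is a made-up name):

    static void fuse_uring_register_queue_bh(void *opaque)
    {
        FuseQueue *q = opaque;
        FuseRingQueue *rq;

        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
            for (size_t j = 0; j < q->exp->ring_queue_depth; j++) {
                fuse_uring_submit_register(&rq->ent[j]);
            }
        }
    }

    /* one oneshot BH per FUSE queue instead of one per ring entry */
    for (size_t i = 0; i < exp->num_queues; i++) {
        aio_bh_schedule_oneshot(exp->queues[i].ctx,
                                fuse_uring_register_queue_bh,
                                &exp->queues[i]);
    }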

> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> +{
> +    /*
> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> +     * the kernel by default. Also, max_write should not exceed
> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> +     */
> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> +
> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> +                         + FUSE_BUFFER_HEADER_SIZE;
> +    }
> +
> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
> +        exp->num_queues, exp->ring_queue_depth, bufsize);
> +
> +    if (!exp->ring_queue_manager) {
> +        error_report("Failed to create ring queue manager");
> +        return;
> +    }
> +
> +    /* Distribute ring queues across FUSE queues using round-robin */
> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
> +
> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
> +}
> +#endif
> +
>  static int fuse_export_create(BlockExport *blk_exp,
>                                BlockExportOptions *blk_exp_args,
>                                AioContext *const *multithread,
> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>  
>      assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    exp->is_uring = args->io_uring;
> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
> +#endif
> +
>      if (multithread) {
>          /* Guaranteed by common export code */
>          assert(mt_count >= 1);
> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>                  .exp = exp,
>                  .ctx = multithread[i],
>                  .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +                .ring_queue_list =
> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
> +#endif
>              };
>          }
>      } else {
> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>              .exp = exp,
>              .ctx = exp->common.ctx,
>              .fuse_fd = -1,
> +#ifdef CONFIG_LINUX_IO_URING
> +            .ring_queue_list =
> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
> +#endif
>          };
>      }
>  
> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>   */
>  static ssize_t coroutine_fn
>  fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> -             uint32_t max_readahead, uint32_t flags)
> +             uint32_t max_readahead, const struct fuse_init_in *in)
>  {
> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
> +                                     | FUSE_INIT_EXT;
> +    uint64_t outargflags = 0;
> +    uint64_t inargflags = in->flags;
> +
> +    ssize_t ret = 0;
> +
> +    if (inargflags & FUSE_INIT_EXT) {
> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
> +    }
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (exp->is_uring) {
> +        if (inargflags & FUSE_OVER_IO_URING) {
> +            supported_flags |= FUSE_OVER_IO_URING;
> +        } else {
> +            exp->is_uring = false;
> +            ret = -ENODEV;

Add a 'goto out' here...

> +        }
> +    }
> +#endif
> +
> +    outargflags = inargflags & supported_flags;
>  
>      *out = (struct fuse_init_out) {
>          .major = FUSE_KERNEL_VERSION,
>          .minor = FUSE_KERNEL_MINOR_VERSION,
>          .max_readahead = max_readahead,
>          .max_write = FUSE_MAX_WRITE_BYTES,
> -        .flags = flags & supported_flags,
> -        .flags2 = 0,
> +        .flags = outargflags,
> +        .flags2 = outargflags >> 32,
>  
>          /* libfuse maximum: 2^16 - 1 */
>          .max_background = UINT16_MAX,
> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>          .map_alignment = 0,
>      };
> -    return sizeof(*out);
> +    return ret < 0 ? ret : sizeof(*out);

...and make this:

    ret = sizeof(*out);
out:
    return ret;

>  }
>  
>  /**
> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>          fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>                                  out_data_buffer, ret);
>          qemu_vfree(out_data_buffer);
> +#ifdef CONFIG_LINUX_IO_URING
> +    /* Handle FUSE-over-io_uring initialization */
> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
> +        struct fuse_init_out *out =
> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
> +        fuse_uring_start(exp, out);
> +    }
> +#endif

A level of indentation was lost here.

>      } else {
>          fuse_write_response(q->fuse_fd, req_id, out_hdr,
>                              ret < 0 ? ret : 0,
> diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
> index 35ab2d7807..c5076101e0 100644
> --- a/docs/tools/qemu-storage-daemon.rst
> +++ b/docs/tools/qemu-storage-daemon.rst
> @@ -78,7 +78,7 @@ Standard options:
>  .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
>    --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
> -  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
> +  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off]
>    --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
>  
>    is a block export definition. ``node-name`` is the block node that should be
> @@ -111,10 +111,11 @@ Standard options:
>    that enabling this option as a non-root user requires enabling the
>    user_allow_other option in the global fuse.conf configuration file.  Setting
>    ``allow-other`` to auto (the default) will try enabling this option, and on
> -  error fall back to disabling it.
> -
> -  The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
> -  to create the VDUSE device.
> +  error fall back to disabling it. Once ``io-uring`` is enabled (off by default),
> +  the FUSE-over-io_uring-related settings will be initialized to bypass the
> +  traditional /dev/fuse communication mechanism and instead use io_uring to
> +  handle FUSE operations. The ``vduse-blk`` export type takes a ``name``
> +  (must be unique across the host) to create the VDUSE device.
>    ``num-queues`` sets the number of virtqueues (the default is 1).
>    ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
>  
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index 9ae703ad01..37f2fc47e2 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -184,12 +184,15 @@
>  #     mount the export with allow_other, and if that fails, try again
>  #     without.  (since 6.1; default: auto)
>  #
> +# @io-uring: Use FUSE-over-io-uring.  (since 10.2; default: false)
> +#
>  # Since: 6.0
>  ##
>  { 'struct': 'BlockExportOptionsFuse',
>    'data': { 'mountpoint': 'str',
>              '*growable': 'bool',
> -            '*allow-other': 'FuseExportAllowOther' },
> +            '*allow-other': 'FuseExportAllowOther',
> +            '*io-uring': 'bool' },
>    'if': 'CONFIG_FUSE' }
>  
>  ##
> diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
> index eb72561358..0cd4cd2b58 100644
> --- a/storage-daemon/qemu-storage-daemon.c
> +++ b/storage-daemon/qemu-storage-daemon.c
> @@ -107,6 +107,7 @@ static void help(void)
>  #ifdef CONFIG_FUSE
>  "  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n"
>  "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
> +"           [,io-uring=on|off]"
>  "                         export the specified block node over FUSE\n"
>  "\n"
>  #endif /* CONFIG_FUSE */
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index d2433d1d99..68d3fe8e01 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
>  void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>  {
>      int ret;
> +    int flags;
>  
>      ctx->io_uring_fd_tag = NULL;
> +    flags = IORING_SETUP_SQE128;
>  
> -    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
> +    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
> +                            &ctx->fdmon_io_uring, flags);

The indentation is off here.

>      if (ret != 0) {
>          error_setg_errno(errp, -ret, "Failed to initialize io_uring");
>          return;

The change to fdmon-io_uring.c should be a separate patch. It's a
prerequisite for, but not directly part of io_uring support in FUSE.

Kevin



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-15  5:43     ` Brian Song
@ 2025-09-17 13:01       ` Hanna Czenczek
  2025-09-17 22:06         ` Brian Song
  0 siblings, 1 reply; 38+ messages in thread
From: Hanna Czenczek @ 2025-09-17 13:01 UTC (permalink / raw)
  To: Brian Song, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, armbru, bernd, fam, kwolf

On 15.09.25 07:43, Brian Song wrote:
> Hi Hanna,

Hi Brian!

(Thanks for your heads-up!)

> Stefan raised the above issue and proposed a preliminary solution: keep
> closing the file descriptor in the delete section, but perform
> umount separately for FUSE uring and traditional FUSE in the shutdown
> and delete sections respectively. This approach avoids the race
> condition on the file descriptor.
>
> In the case of FUSE uring, umount must be performed in the shutdown
> section. The reason is that the kernel currently lacks an interface to
> explicitly cancel submitted SQEs. Performing umount forces the kernel to
> flush all pending SQEs and return their CQEs. Without this step, CQEs
> may arrive after the export has already been deleted, and invoking the
> CQE handler at that point would dereference freed memory and trigger a
> segmentation fault.

The commit message says that incrementing the BB reference would be 
enough to solve the problem (i.e. deleting is delayed until all requests 
are done).  Why isn’t it?

> I’m curious about traditional FUSE: is it strictly necessary to perform
> umount in the delete section, or could it also be done in shutdown?

Looking into libfuse, fuse_session_unmount() (in fuse_kern_unmount()) 
closes the FUSE FD.  I can imagine that might result in the potential 
problems Stefan described.

> Additionally, what is the correct ordering between close(fd) and
> umount, does one need to precede the other?

fuse_kern_unmount() closes the (queue 0) FD first before actually 
unmounting, with a comment: “Need to close file descriptor, otherwise 
synchronous umount would recurse into filesystem, and deadlock.”

Given that, I assume the FDs should all be closed before unmounting.
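
Translated to the export code, I'd thus expect roughly this ordering
(just a sketch, glossing over that queue 0's FD belongs to the FUSE
session; umount2() with MNT_DETACH stands in for whatever unmount path
is actually used):

    for (size_t i = 0; i < exp->num_queues; i++) {
        if (exp->queues[i].fuse_fd >= 0) {
            close(exp->queues[i].fuse_fd);  /* all FDs first... */
            exp->queues[i].fuse_fd = -1;
        }
    }
    umount2(exp->mountpoint, MNT_DETACH);   /* ...then unmount */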

(Though to be fair, before looking into it now, I don’t think I’ve ever 
given it much thought…)

Hanna

> Thanks,
> Brian
>
> On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
>   > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
>   >> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport
> *blk_exp)
>   >>            */
>   >>           g_hash_table_remove(exports, exp->mountpoint);
>   >>       }
>   >> -}
>   >> -
>   >> -static void fuse_export_delete(BlockExport *blk_exp)
>   >> -{
>   >> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
>   >>
>   >> -    for (int i = 0; i < exp->num_queues; i++) {
>   >> +    for (size_t i = 0; i < exp->num_queues; i++) {
>   >>           FuseQueue *q = &exp->queues[i];
>   >>
>   >>           /* Queue 0's FD belongs to the FUSE session */
>   >>           if (i > 0 && q->fuse_fd >= 0) {
>   >>               close(q->fuse_fd);
>   >
>   > This changes the behavior of the non-io_uring code. Now all fuse fds and
>   > fuse_session are closed while requests are potentially still being
>   > processed.
>   >
>   > There is a race condition: if an IOThread is processing a request here
>   > then it may invoke a system call on q->fuse_fd just after it has been
>   > closed but not set to -1. If another thread has also opened a new file
>   > then the fd could be reused, resulting in an accidental write(2) to the
>   > new file. I'm not sure whether there is a way to trigger this in
>   > practice, but it looks like a problem waiting to happen.
>   >
>   > Simply setting q->fuse_fd to -1 here doesn't fix the race. It would be
>   > necessary to stop processing fuse_fd in the thread before closing it
>   > here or to schedule a BH in each thread so that fuse_fd can be closed
>   > in the thread that uses the fd.
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-16 19:08   ` Kevin Wolf
@ 2025-09-17 19:47     ` Brian Song
  2025-09-19 14:13       ` Kevin Wolf
  0 siblings, 1 reply; 38+ messages in thread
From: Brian Song @ 2025-09-17 19:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, stefanha



On 9/16/25 3:08 PM, Kevin Wolf wrote:
> Am 30.08.2025 um 04:50 hat Brian Song geschrieben:
>> This patch adds a new export option for storage-export-daemon to enable
> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
>> It also implements the protocol handshake with the Linux kernel
>> during the FUSE-over-io_uring initialization phase.
>>
>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
>>
>> The kernel documentation describes in detail how FUSE-over-io_uring
> works. This patch implements the Initial SQE stage shown in the diagram:
>> it initializes one queue per IOThread, each currently supporting a
>> single submission queue entry (SQE). When the FUSE driver sends the
>> first FUSE request (FUSE_INIT), storage-export-daemon calls
>> fuse_uring_start() to complete initialization, ultimately submitting
>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
>> successful initialization with the kernel.
>>
>> We also added support for multiple IOThreads. The current Linux kernel
> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
>> To let users customize the number of FUSE Queues (i.e., IOThreads),
>> we first create nproc Ring Queues as required by the kernel, then
>> distribute them in a round-robin manner to the FUSE Queues for
>> registration. In addition, to support multiple in-flight requests,
>> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
>> entries/requests.
>>
>> Suggested-by: Kevin Wolf <kwolf@redhat.com>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Brian Song <hibriansong@gmail.com>
>> ---
>>   block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
>>   docs/tools/qemu-storage-daemon.rst   |  11 +-
>>   qapi/block-export.json               |   5 +-
>>   storage-daemon/qemu-storage-daemon.c |   1 +
>>   util/fdmon-io_uring.c                |   5 +-
>>   5 files changed, 309 insertions(+), 23 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index c0ad4696ce..19bf9e5f74 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -48,6 +48,9 @@
>>   #include <linux/fs.h>
>>   #endif
>>   
>> +/* room needed in buffer to accommodate header */
>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
>> +
>>   /* Prevent overly long bounce buffer allocations */
>>   #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>>   /*
>> @@ -63,12 +66,59 @@
>>       (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>>   
>>   typedef struct FuseExport FuseExport;
>> +typedef struct FuseQueue FuseQueue;
>> +
>> +#ifdef CONFIG_LINUX_IO_URING
>> +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> 
> Maybe it would be a little clearer if the next few types had URing in
> their name instead of just Ring.
> 
>> +typedef struct FuseRingQueue FuseRingQueue;
>> +typedef struct FuseRingEnt {
>> +    /* back pointer */
>> +    FuseRingQueue *rq;
>> +
>> +    /* commit id of a fuse request */
>> +    uint64_t req_commit_id;
>> +
>> +    /* fuse request header and payload */
>> +    struct fuse_uring_req_header req_header;
>> +    void *op_payload;
>> +    size_t req_payload_sz;
>> +
>> +    /* The vector passed to the kernel */
>> +    struct iovec iov[2];
>> +
>> +    CqeHandler fuse_cqe_handler;
>> +} FuseRingEnt;
>> +
>> +struct FuseRingQueue {
> 
> It would be good to have a comment here that explains the difference
> between FuseQueue and FuseRingQueue.
> 
> Is this a distinction that should remain in the long run or would we
> always have a 1:1 mapping between FuseQueue and FuseRingQueue once the
> pending kernel changes are merged that allow a number of uring queues
> different from the number of CPUs?
> 

Stefan mentioned the issue, and I added some comments here. One thing to 
note is that FuseRingQueueManager and the distribution between FuseQueue 
and FuseRingQueue are just temporary measures until the kernel allows 
user-defined queues. Therefore, I don't think it's a good idea to remove 
FuseRingQueueManager at this stage.

If you look back at the v2 patch, we put the ring entries inside the 
FuseQueue. The result was that we had to define nproc IOThreads 
(FuseQueue) in order to make it work. That's why I decoupled the two 
queue counts here and made FuseRingQueue an independent abstraction: 
allocate nproc RingQueues and initialize their entries, then distribute 
them to FuseQueues in a round-robin manner. Once the kernel 
supports a user-defined number of queues, we can remove 
FuseRingQueueManager and the RR distribution.

Also, to keep the variable names consistent with those in the kernel and 
libfuse, I use Ring here instead of URing.

>> +    int rqid;
>> +
>> +    /* back pointer */
>> +    FuseQueue *q;
>> +    FuseRingEnt *ent;
>> +
>> +    /* List entry for ring_queues */
>> +    QLIST_ENTRY(FuseRingQueue) next;
>> +};
>> +
>> +/*
>> + * Round-robin distribution of ring queues across FUSE queues.
>> + * This structure manages the mapping between kernel ring queues and user
>> + * FUSE queues.
>> + */
>> +typedef struct FuseRingQueueManager {
>> +    FuseRingQueue *ring_queues;
>> +    int num_ring_queues;
>> +    int num_fuse_queues;
>> +} FuseRingQueueManager;
> 
> This isn't a manager, it's just the set of queues the export uses.
> 
> num_fuse_queues duplicates exp->num_queues, there is no reason for it to
> exist. All users also have access to the FuseExport itself.
> 
> The other two fields can just be merged directly into FuseExport,
> preferably renamed to uring_queues and num_uring_queues.
> 
>> +#endif
>>   
>>   /*
>>    * One FUSE "queue", representing one FUSE FD from which requests are fetched
>>    * and processed.  Each queue is tied to an AioContext.
>>    */
>> -typedef struct FuseQueue {
>> +struct FuseQueue {
>>       FuseExport *exp;
>>   
>>       AioContext *ctx;
>> @@ -109,15 +159,11 @@ typedef struct FuseQueue {
>>        * Free this buffer with qemu_vfree().
>>        */
>>       void *spillover_buf;
>> -} FuseQueue;
>>   
>> -/*
>> - * Verify that FuseQueue.request_buf plus the spill-over buffer together
>> - * are big enough to be accepted by the FUSE kernel driver.
>> - */
>> -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
>> -                  FUSE_SPILLOVER_BUF_SIZE <
>> -                  FUSE_MIN_READ_BUFFER);
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
>> +#endif
>> +};
>>   
>>   struct FuseExport {
>>       BlockExport common;
>> @@ -133,7 +179,7 @@ struct FuseExport {
>>        */
>>       bool halted;
>>   
>> -    int num_queues;
>> +    size_t num_queues;
> 
> I'm not sure why this change is needed. If it is, can it be a separate
> patch before this one, with a commit message describing the reason?
> 

I feel there's no reason to use a signed int here, since the number of 
queues cannot be negative.

>>       FuseQueue *queues;
>>       /*
>>        * True if this export should follow the generic export's AioContext.
>> @@ -149,6 +195,12 @@ struct FuseExport {
>>       /* Whether allow_other was used as a mount option or not */
>>       bool allow_other;
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    bool is_uring;
>> +    size_t ring_queue_depth;
>> +    FuseRingQueueManager *ring_queue_manager;
>> +#endif
>> +
>>       mode_t st_mode;
>>       uid_t st_uid;
>>       gid_t st_gid;
>> @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
>>           return;
>>       }
>>   
>> -    for (int i = 0; i < exp->num_queues; i++) {
>> +    for (size_t i = 0; i < exp->num_queues; i++) {
>>           aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>>                              read_from_fuse_fd, NULL, NULL, NULL,
>>                              &exp->queues[i]);
>> @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>       .drained_poll  = fuse_export_drained_poll,
>>   };
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
>> +                    const unsigned int rqid,
>> +                    const unsigned int commit_id)
> 
> Indentation is off here. There are two accepted styles for indentation
> after breaking a long line in QEMU (see docs/devel/style.rst):
> 
> 1. Indent the next line by exactly four spaces:
> 
>      do_something(x, y,
>          z);
> 
> 2. Align the next line with the first character after the opening
>     parenthesis:
> 
>      do_something(x, y,
>                   z);
> 
> The second one is the preferred one. The first one is generally only
> used when the parenthesis is already too far right and we can't do much
> about it.
> 
>> +{
>> +    req->qid = rqid;
>> +    req->commit_id = commit_id;
>> +    req->flags = 0;
>> +}
>> +
>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
>> +               __u32 cmd_op)
> 
> Indentation.
> 
> Another option here is to keep everything before the function name on a
> separate line, like this:
> 
> static void
> fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, __u32 cmd_op)
> 
> This would allow the second line to stay under 80 characters.
> 
>> +{
>> +    sqe->opcode = IORING_OP_URING_CMD;
>> +
>> +    sqe->fd = q->fuse_fd;
>> +    sqe->rw_flags = 0;
>> +    sqe->ioprio = 0;
>> +    sqe->off = 0;
>> +
>> +    sqe->cmd_op = cmd_op;
>> +    sqe->__pad1 = 0;
>> +}
>> +
>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
>> +{
>> +    FuseRingEnt *ent = opaque;
>> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
>> +
>> +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
>> +
>> +    sqe->addr = (uint64_t)(ent->iov);
>> +    sqe->len = 2;
>> +
>> +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
>> +}
>> +
>> +static void fuse_uring_submit_register(void *opaque)
>> +{
>> +    FuseRingEnt *ent = opaque;
>> +    FuseExport *exp = ent->rq->q->exp;
>> +
>> +
> 
> Extra empty line.
> 
>> +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
> 
> The parentheses around ent->fuse_cqe_handler are unnecessary.
> 
>> +}
>> +
>> +/**
>> + * Distribute ring queues across FUSE queues using round-robin algorithm.
> 
> Hm, if this function distributes (u)ring queues, then what is
> fuse_distribute_ring_queues() doing? Is the term overloaded with two
> meanings?
> 
>> + * This ensures even distribution of kernel ring queues across user-specified
>> + * FUSE queues.
>> + */
>> +static
>> +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
>> +                                                    size_t ring_queue_depth,
>> +                                                    size_t bufsize)
> 
> The right style here would be something like:
> 
> static FuseRingQueueManager *
> fuse_ring_queue_manager_create(int num_fuse_queues,
>                                 size_t ring_queue_depth,
>                                 size_t bufsize)
> 
> Given that I said that there is no reason to call the set of all queues
> a manager, or to even have it separate from FuseExport, this probably
> becomes fuse_uring_setup_queues() or something.
> 
>> +{
>> +    int num_ring_queues = get_nprocs();
> 
> This could use a comment saying that this is a kernel requirement at the
> moment.
> 
>> +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
>> +
>> +    if (!manager) {
>> +        return NULL;
>> +    }
> 
> g_new() never returns NULL, it aborts on error instead, so no reason to
> have a NULL check here.
> 
>> +
>> +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
>> +    manager->num_ring_queues = num_ring_queues;
>> +    manager->num_fuse_queues = num_fuse_queues;
>> +
>> +    if (!manager->ring_queues) {
>> +        g_free(manager);
>> +        return NULL;
>> +    }
> 
> This check is unnecessary for the same reason.
> 
>> +
>> +    for (int i = 0; i < num_ring_queues; i++) {
>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>> +        rq->rqid = i;
>> +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
>> +
>> +        if (!rq->ent) {
>> +            for (int j = 0; j < i; j++) {
>> +                g_free(manager->ring_queues[j].ent);
>> +            }
>> +            g_free(manager->ring_queues);
>> +            g_free(manager);
>> +            return NULL;
>> +        }
> 
> This one, too.
> 
>> +
>> +        for (size_t j = 0; j < ring_queue_depth; j++) {
>> +            FuseRingEnt *ent = &rq->ent[j];
>> +            ent->rq = rq;
>> +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
>> +            ent->op_payload = g_malloc0(ent->req_payload_sz);
>> +
>> +            if (!ent->op_payload) {
>> +                for (size_t k = 0; k < j; k++) {
>> +                    g_free(rq->ent[k].op_payload);
>> +                }
>> +                g_free(rq->ent);
>> +                for (int k = 0; k < i; k++) {
>> +                    g_free(manager->ring_queues[k].ent);
>> +                }
>> +                g_free(manager->ring_queues);
>> +                g_free(manager);
>> +                return NULL;
>> +            }
> 
> And this one.
> 
> Removing all of them will make the function a lot more readable.
> 
>> +
>> +            ent->iov[0] = (struct iovec) {
>> +                &(ent->req_header),
> 
> Unnecessary parentheses.
> 
>> +                sizeof(struct fuse_uring_req_header)
>> +            };
>> +            ent->iov[1] = (struct iovec) {
>> +                ent->op_payload,
>> +                ent->req_payload_sz
>> +            };
>> +
>> +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
>> +        }
>> +    }
>> +
>> +    return manager;
>> +}
>> +
>> +static
>> +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
>> +{
>> +    int queue_index = 0;
>> +
>> +    for (int i = 0; i < manager->num_ring_queues; i++) {
>> +        FuseRingQueue *rq = &manager->ring_queues[i];
>> +
>> +        rq->q = &exp->queues[queue_index];
>> +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
>> +
>> +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
>> +    }
>> +}
> 
> Ok, no overloaded meaning of distributing queues, but this function
> should probably be merged with the one above. It's part of setting up
> the queues.
> 
> You don't need a separate queue_index counter, you can just directly use
> exp->queues[i % manager->num_fuse_queues].
> 

There are two steps:

1. Create uring queues and allocate buffers for each entry's payload.

2. Distribute these uring queues to FUSE queues using a round-robin 
algorithm.

Given that this is only a temporary measure to allow users to define 
their own IOThreads/FUSE queues, we might later replace the second part 
of the logic. I believe it's better to separate these two pieces of 
logic rather than combining them.
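
For reference, step 2 in isolation is just this small loop (a sketch 
using the names from this series, with your modulo suggestion folded 
in):

    for (int i = 0; i < num_ring_queues; i++) {
        FuseRingQueue *rq = &ring_queues[i];

        /* round-robin: ring queue i goes to FUSE queue i % num_queues */
        rq->q = &exp->queues[i % exp->num_queues];
        QLIST_INSERT_HEAD(&rq->q->ring_queue_list, rq, next);
    }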


>> +static
>> +void fuse_schedule_ring_queue_registrations(FuseExport *exp,
>> +                                            FuseRingQueueManager *manager)
> 
> Again the formatting. If you split the line before the function name, it
> should be "static void" on the first line.
> 
>> +{
>> +    for (int i = 0; i < manager->num_fuse_queues; i++) {
>> +        FuseQueue *q = &exp->queues[i];
>> +        FuseRingQueue *rq;
>> +
>> +        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
>> +            for (int j = 0; j < exp->ring_queue_depth; j++) {
>> +                aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register,
>> +                                        &(rq->ent[j]));
>> +            }
>> +        }
>> +    }
>> +}
> 
> Why one BH per queue entry? This adds up quickly. All entries of the
> same queue need to be processed in the same AioContext, so wouldn't it
> make more sense to have a BH per (FUSE) queue and handle all of its
> uring queues and their entries in a single BH?
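
That makes sense. A single BH per FUSE queue could look roughly like 
this (untested sketch, reusing the names from this patch):

    static void fuse_uring_register_queue_bh(void *opaque)
    {
        FuseQueue *q = opaque;
        FuseExport *exp = q->exp;
        FuseRingQueue *rq;

        /* register every entry of every ring queue bound to this queue */
        QLIST_FOREACH(rq, &q->ring_queue_list, next) {
            for (size_t j = 0; j < exp->ring_queue_depth; j++) {
                aio_add_sqe(fuse_uring_prep_sqe_register, &rq->ent[j],
                            &rq->ent[j].fuse_cqe_handler);
            }
        }
    }

scheduled once per FUSE queue with aio_bh_schedule_oneshot(q->ctx, ...).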
> 
>> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
>> +{
>> +    /*
>> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
>> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
>> +     * the kernel by default. Also, max_write should not exceed
>> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
>> +     */
>> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
>> +
>> +    if (!(out->flags & FUSE_MAX_PAGES)) {
>> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
>> +                         + FUSE_BUFFER_HEADER_SIZE;
>> +    }
>> +
>> +    exp->ring_queue_manager = fuse_ring_queue_manager_create(
>> +        exp->num_queues, exp->ring_queue_depth, bufsize);
>> +
>> +    if (!exp->ring_queue_manager) {
>> +        error_report("Failed to create ring queue manager");
>> +        return;
>> +    }
>> +
>> +    /* Distribute ring queues across FUSE queues using round-robin */
>> +    fuse_distribute_ring_queues(exp, exp->ring_queue_manager);
>> +
>> +    fuse_schedule_ring_queue_registrations(exp, exp->ring_queue_manager);
>> +}
>> +#endif
>> +
>>   static int fuse_export_create(BlockExport *blk_exp,
>>                                 BlockExportOptions *blk_exp_args,
>>                                 AioContext *const *multithread,
>> @@ -270,6 +505,11 @@ static int fuse_export_create(BlockExport *blk_exp,
>>   
>>       assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>>   
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    exp->is_uring = args->io_uring;
>> +    exp->ring_queue_depth = FUSE_DEFAULT_RING_QUEUE_DEPTH;
>> +#endif
>> +
>>       if (multithread) {
>>           /* Guaranteed by common export code */
>>           assert(mt_count >= 1);
>> @@ -283,6 +523,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>                   .exp = exp,
>>                   .ctx = multithread[i],
>>                   .fuse_fd = -1,
>> +#ifdef CONFIG_LINUX_IO_URING
>> +                .ring_queue_list =
>> +                    QLIST_HEAD_INITIALIZER(exp->queues[i].ring_queue_list),
>> +#endif
>>               };
>>           }
>>       } else {
>> @@ -296,6 +540,10 @@ static int fuse_export_create(BlockExport *blk_exp,
>>               .exp = exp,
>>               .ctx = exp->common.ctx,
>>               .fuse_fd = -1,
>> +#ifdef CONFIG_LINUX_IO_URING
>> +            .ring_queue_list =
>> +                QLIST_HEAD_INITIALIZER(exp->queues[0].ring_queue_list),
>> +#endif
>>           };
>>       }
>>   
>> @@ -685,17 +933,39 @@ static bool is_regular_file(const char *path, Error **errp)
>>    */
>>   static ssize_t coroutine_fn
>>   fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>> -             uint32_t max_readahead, uint32_t flags)
>> +             uint32_t max_readahead, const struct fuse_init_in *in)
>>   {
>> -    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>> +    uint64_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
>> +                                     | FUSE_INIT_EXT;
>> +    uint64_t outargflags = 0;
>> +    uint64_t inargflags = in->flags;
>> +
>> +    ssize_t ret = 0;
>> +
>> +    if (inargflags & FUSE_INIT_EXT) {
>> +        inargflags = inargflags | (uint64_t) in->flags2 << 32;
>> +    }
>> +
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    if (exp->is_uring) {
>> +        if (inargflags & FUSE_OVER_IO_URING) {
>> +            supported_flags |= FUSE_OVER_IO_URING;
>> +        } else {
>> +            exp->is_uring = false;
>> +            ret = -ENODEV;
> 
> Add a 'goto out' here...
> 
>> +        }
>> +    }
>> +#endif
>> +
>> +    outargflags = inargflags & supported_flags;
>>   
>>       *out = (struct fuse_init_out) {
>>           .major = FUSE_KERNEL_VERSION,
>>           .minor = FUSE_KERNEL_MINOR_VERSION,
>>           .max_readahead = max_readahead,
>>           .max_write = FUSE_MAX_WRITE_BYTES,
>> -        .flags = flags & supported_flags,
>> -        .flags2 = 0,
>> +        .flags = outargflags,
>> +        .flags2 = outargflags >> 32,
>>   
>>           /* libfuse maximum: 2^16 - 1 */
>>           .max_background = UINT16_MAX,
>> @@ -717,7 +987,7 @@ fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>           .map_alignment = 0,
>>       };
>> -    return sizeof(*out);
>> +    return ret < 0 ? ret : sizeof(*out);
> 
> ...and make this:
> 
>      ret = sizeof(*out);
> out:
>      return ret;
> 
>>   }
>>   
>>   /**
>> @@ -1506,6 +1776,14 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
>>           fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
>>                                   out_data_buffer, ret);
>>           qemu_vfree(out_data_buffer);
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    /* Handle FUSE-over-io_uring initialization */
>> +    if (unlikely(opcode == FUSE_INIT && exp->is_uring)) {
>> +        struct fuse_init_out *out =
>> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT(out_buf);
>> +        fuse_uring_start(exp, out);
>> +    }
>> +#endif
> 
> A level of indentation was lost here.
> 
>>       } else {
>>           fuse_write_response(q->fuse_fd, req_id, out_hdr,
>>                               ret < 0 ? ret : 0,
>> diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
>> index 35ab2d7807..c5076101e0 100644
>> --- a/docs/tools/qemu-storage-daemon.rst
>> +++ b/docs/tools/qemu-storage-daemon.rst
>> @@ -78,7 +78,7 @@ Standard options:
>>   .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
>>     --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
>>     --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
>> -  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
>> +  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off]
>>     --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
>>   
>>     is a block export definition. ``node-name`` is the block node that should be
>> @@ -111,10 +111,11 @@ Standard options:
>>     that enabling this option as a non-root user requires enabling the
>>     user_allow_other option in the global fuse.conf configuration file.  Setting
>>     ``allow-other`` to auto (the default) will try enabling this option, and on
>> -  error fall back to disabling it.
>> -
>> -  The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
>> -  to create the VDUSE device.
>> +  error fall back to disabling it. Once ``io-uring`` is enabled (off by default),
>> +  the FUSE-over-io_uring-related settings will be initialized to bypass the
>> +  traditional /dev/fuse communication mechanism and instead use io_uring to
>> +  handle FUSE operations. The ``vduse-blk`` export type takes a ``name``
>> +  (must be unique across the host) to create the VDUSE device.
>>     ``num-queues`` sets the number of virtqueues (the default is 1).
>>     ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
>>   
>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>> index 9ae703ad01..37f2fc47e2 100644
>> --- a/qapi/block-export.json
>> +++ b/qapi/block-export.json
>> @@ -184,12 +184,15 @@
>>   #     mount the export with allow_other, and if that fails, try again
>>   #     without.  (since 6.1; default: auto)
>>   #
>> +# @io-uring: Use FUSE-over-io-uring.  (since 10.2; default: false)
>> +#
>>   # Since: 6.0
>>   ##
>>   { 'struct': 'BlockExportOptionsFuse',
>>     'data': { 'mountpoint': 'str',
>>               '*growable': 'bool',
>> -            '*allow-other': 'FuseExportAllowOther' },
>> +            '*allow-other': 'FuseExportAllowOther',
>> +            '*io-uring': 'bool' },
>>     'if': 'CONFIG_FUSE' }
>>   
>>   ##
>> diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
>> index eb72561358..0cd4cd2b58 100644
>> --- a/storage-daemon/qemu-storage-daemon.c
>> +++ b/storage-daemon/qemu-storage-daemon.c
>> @@ -107,6 +107,7 @@ static void help(void)
>>   #ifdef CONFIG_FUSE
>>   "  --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n"
>>   "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
>> +"           [,io-uring=on|off]"
>>   "                         export the specified block node over FUSE\n"
>>   "\n"
>>   #endif /* CONFIG_FUSE */
>> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
>> index d2433d1d99..68d3fe8e01 100644
>> --- a/util/fdmon-io_uring.c
>> +++ b/util/fdmon-io_uring.c
>> @@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
>>   void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
>>   {
>>       int ret;
>> +    int flags;
>>   
>>       ctx->io_uring_fd_tag = NULL;
>> +    flags = IORING_SETUP_SQE128;
>>   
>> -    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
>> +    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
>> +                            &ctx->fdmon_io_uring, flags);
> 
> The indentation is off here.
> 
>>       if (ret != 0) {
>>           error_setg_errno(errp, -ret, "Failed to initialize io_uring");
>>           return;
> 
> The change to fdmon-io_uring.c should be a separate patch. It's a
> prerequisite for, but not directly part of io_uring support in FUSE.
> 
> Kevin
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-17 13:01       ` Hanna Czenczek
@ 2025-09-17 22:06         ` Brian Song
  2025-09-22 17:41           ` Stefan Hajnoczi
  2025-09-22 17:51           ` Stefan Hajnoczi
  0 siblings, 2 replies; 38+ messages in thread
From: Brian Song @ 2025-09-17 22:06 UTC (permalink / raw)
  To: Hanna Czenczek, Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, armbru, bernd, fam, kwolf



On 9/17/25 9:01 AM, Hanna Czenczek wrote:
> On 15.09.25 07:43, Brian Song wrote:
>> Hi Hanna,
> 
> Hi Brian!
> 
> (Thanks for your heads-up!)
> 
>> Stefan raised the above issue and proposed a preliminary solution: keep
>> closing the file descriptor in the delete section, but perform
>> umount separately for FUSE uring and traditional FUSE in the shutdown
>> and delete sections respectively. This approach avoids the race
>> condition on the file descriptor.
>>
>> In the case of FUSE uring, umount must be performed in the shutdown
>> section. The reason is that the kernel currently lacks an interface to
>> explicitly cancel submitted SQEs. Performing umount forces the kernel to
>> flush all pending SQEs and return their CQEs. Without this step, CQEs
>> may arrive after the export has already been deleted, and invoking the
>> CQE handler at that point would dereference freed memory and trigger a
>> segmentation fault.
> 
> The commit message says that incrementing the BB reference would be 
> enough to solve the problem (i.e. deleting is delayed until all requests 
> are done).  Why isn’t it?

Hanna:

If we place umount in the delete section instead of the shutdown 
section, the kernel FUSE driver will continue waiting for user FUSE 
requests and therefore won't return CQEs to userspace. As a result, the 
BB reference remains held (since the reference is acquired during 
registration and submission and only released once the CQE returns), 
preventing the delete operation from running (it only runs once the 
reference count drops to 0). This is why umount must be placed in the 
shutdown section.

> 
>> I’m curious about traditional FUSE: is it strictly necessary to perform
>> umount in the delete section, or could it also be done in shutdown?
> 
> Looking into libfuse, fuse_session_unmount() (in fuse_kern_unmount()) 
> closes the FUSE FD.  I can imagine that might result in the potential 
> problems Stefan described.
> 
>> Additionally, what is the correct ordering between close(fd) and
>> umount, does one need to precede the other?
> 
> fuse_kern_unmount() closes the (queue 0) FD first before actually 
> unmounting, with a comment: “Need to close file descriptor, otherwise 
> synchronous umount would recurse into filesystem, and deadlock.”
> 
> Given that, I assume the FDs should all be closed before unmounting.
> 
> (Though to be fair, before looking into it now, I don’t think I’ve ever 
> given it much thought…)
> 
> Hanna
>
Stefan:

I roughly went through the umount and close system calls:

umount:
fuse_kill_sb_anon -> fuse_sb_destroy -> fuse_abort_conn

close:
__fput -> file->f_op->release(inode, file) -> fuse_dev_release -> 
fuse_abort_conn
(this only runs after all /dev/fuse FDs have been closed).

And as Hanna mentioned, libfuse points out: “Need to close file 
descriptor, otherwise synchronous umount would recurse into filesystem, 
and deadlock.”

So ideally, we should close each queue FD first, then call umount at the 
end — even though calling umount directly also works. The root issue is 
that the kernel doesn't provide an interface to cancel already submitted 
SQEs.

You mentioned that in FUSE-over-io_uring mode we perform close in the 
shutdown path, but at that point the server may still be processing 
requests, and a write to the just-closed FD could land in an unrelated 
file if the FD number has been reused. I'm not sure how this can be 
triggered, since in FUSE uring mode all FUSE requests are handled and 
completed via io_uring. After shutdown 
closes the FD, it may call fuse_abort_conn, which terminates all request 
processing in the kernel. There’s also locking in place to protect the 
termination of requests and the subsequent uring cleanup.

That’s why I think the best approach for now is:

in shutdown, handle close and umount for FUSE-over-io_uring;

in delete, handle close and umount for traditional FUSE.
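
In pseudocode (ignoring the halted checks and locking; the helper names 
are made up):

    static void fuse_export_shutdown(BlockExport *blk_exp)
    {
        FuseExport *exp = container_of(blk_exp, FuseExport, common);

        if (exp->is_uring) {
            fuse_export_close_queue_fds(exp);  /* hypothetical helper */
            /* forces the kernel to flush pending SQEs as CQEs */
            fuse_export_umount(exp);           /* hypothetical helper */
        }
    }

    static void fuse_export_delete(BlockExport *blk_exp)
    {
        FuseExport *exp = container_of(blk_exp, FuseExport, common);

        if (!exp->is_uring) {
            fuse_export_close_queue_fds(exp);
            fuse_export_umount(exp);
        }
        /* free queues, ring entries, payload buffers, ... */
    }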

>> Thanks,
>> Brian
>>
>> On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
>>   > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
>>   >> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport
>> *blk_exp)
>>   >>            */
>>   >>           g_hash_table_remove(exports, exp->mountpoint);
>>   >>       }
>>   >> -}
>>   >> -
>>   >> -static void fuse_export_delete(BlockExport *blk_exp)
>>   >> -{
>>   >> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
>>   >>
>>   >> -    for (int i = 0; i < exp->num_queues; i++) {
>>   >> +    for (size_t i = 0; i < exp->num_queues; i++) {
>>   >>           FuseQueue *q = &exp->queues[i];
>>   >>
>>   >>           /* Queue 0's FD belongs to the FUSE session */
>>   >>           if (i > 0 && q->fuse_fd >= 0) {
>>   >>               close(q->fuse_fd);
>>   >
>>   > This changes the behavior of the non-io_uring code. Now all fuse 
>> fds and
>>   > fuse_session are closed while requests are potentially still being
>>   > processed.
>>   >
>>   > There is a race condition: if an IOThread is processing a request 
>> here
>>   > then it may invoke a system call on q->fuse_fd just after it has been
>>   > closed but not set to -1. If another thread has also opened a new 
>> file
>>   > then the fd could be reused, resulting in an accidental write(2) 
>> to the
>>   > new file. I'm not sure whether there is a way to trigger this in
>>   > practice, but it looks like a problem waiting to happen.
>>   >
>>   > Simply setting q->fuse_fd to -1 here doesn't fix the race. It 
>> would be
>>   > necessary to stop processing fuse_fd in the thread before closing it
>>   > here or to schedule a BH in each thread so that fuse_fd can be closed
>>   > in the thread that uses the fd.
>>
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests
  2025-08-30  2:50 ` [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests Brian Song
  2025-09-03 11:51   ` Stefan Hajnoczi
@ 2025-09-19 13:54   ` Kevin Wolf
  1 sibling, 0 replies; 38+ messages in thread
From: Kevin Wolf @ 2025-09-19 13:54 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, stefanha

Am 30.08.2025 um 04:50 hat Brian Song geschrieben:
> https://docs.kernel.org/filesystems/fuse-io-uring.html
> 
> As described in the kernel documentation, after FUSE-over-io_uring
> initialization and handshake, FUSE interacts with the kernel using
> SQE/CQE to send requests and receive responses. This corresponds to
> the "Sending requests with CQEs" section in the docs.
> 
> This patch implements three key parts: registering the CQE handler
> (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_
> process_request), and sending response results (fuse_uring_send_
> response). It also merges the traditional /dev/fuse request handling
> with the FUSE-over-io_uring handling functions.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Brian Song <hibriansong@gmail.com>

A general remark first: I think this would be easier to review if it
were split into multiple patches. For example, at first sight it
looks to me like I'd split at least:

- Factor out fuse_co_process_request_common() from
  fuse_co_process_request(). This would be a pure code movement patch
  with no intention to change the behaviour (i.e. it doesn't add any
  io_uring code yet). It is very common to have such refactoring commits
  in preparation for the addition of a new feature later.

- Change fuse_co_write() to allow a NULL in_place_buf

- Add io_uring request processing

All three are logically independent changes and can be reviewed on their
own. Maybe further splitting is possible that would only become obvious
when looking at the smaller patches.

>  block/export/fuse.c | 457 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 309 insertions(+), 148 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 19bf9e5f74..07f74fc8ec 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -310,6 +310,47 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>  };
>  
>  #ifdef CONFIG_LINUX_IO_URING
> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> +
> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> +{
> +    FuseRingEnt *ent = opaque;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +    /* Going to process requests */
> +    fuse_inc_in_flight(exp);

I think this can be too late. The in_flight counter must be increased
before we start processing something that must be waited for in a drain.
Can't it happen here that a drain in the main thread already returns
while the CQE is still pending in an iothread, but nothing stops it from
being processed and starting new requests even though we're supposedly
in a drained section now?
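
i.e. I would expect something like this (sketch), with the matching
fuse_dec_in_flight() staying at the end of the coroutine:

    static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
    {
        ...
        /* before the coroutine is created, not inside it */
        fuse_inc_in_flight(exp);
        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
        qemu_coroutine_enter(co);
    }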

> +    /* A ring entry returned */
> +    fuse_uring_co_process_request(ent);
> +
> +    /* Finished processing requests */
> +    fuse_dec_in_flight(exp);
> +}
> +
> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> +{
> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> +    Coroutine *co;
> +    FuseExport *exp = ent->rq->q->exp;
> +
> +    if (unlikely(exp->halted)) {
> +        return;
> +    }
> +
> +    int err = cqe_handler->cqe.res;
> +
> +    if (err != 0) {
> +        /* -ENOTCONN is ok on umount  */
> +        if (err != -EINTR && err != -EAGAIN &&
> +            err != -ENOTCONN) {

This fits on a single line (but I think the result was that you'll
remove some error codes anyway).

> +            fuse_export_halt(exp);
> +        }
> +    } else {
> +        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
> +        qemu_coroutine_enter(co);
> +    }
> +}
> +
>  static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
>                      const unsigned int rqid,
>                      const unsigned int commit_id)
> @@ -1213,6 +1254,9 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
>   * Data in @in_place_buf is assumed to be overwritten after yielding, so will
>   * be copied to a bounce buffer beforehand.  @spillover_buf in contrast is
>   * assumed to be exclusively owned and will be used as-is.
> + * In FUSE-over-io_uring mode, the actual op_payload content is stored in
> + * @spillover_buf. To ensure this buffer is used for writing, @in_place_buf
> + * is explicitly set to NULL.
>   * Return the number of bytes written to *out on success, and -errno on error.
>   */
>  static ssize_t coroutine_fn
> @@ -1220,8 +1264,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
>                uint64_t offset, uint32_t size,
>                const void *in_place_buf, const void *spillover_buf)
>  {
> -    size_t in_place_size;
> -    void *copied;
> +    size_t in_place_size = 0;
> +    void *copied = NULL;
>      int64_t blk_len;
>      int ret;
>      struct iovec iov[2];
> @@ -1236,10 +1280,12 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
>          return -EACCES;
>      }
>  
> -    /* Must copy to bounce buffer before potentially yielding */
> -    in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> -    copied = blk_blockalign(exp->common.blk, in_place_size);
> -    memcpy(copied, in_place_buf, in_place_size);
> +    if (in_place_buf) {
> +        /* Must copy to bounce buffer before potentially yielding */
> +        in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> +        copied = blk_blockalign(exp->common.blk, in_place_size);
> +        memcpy(copied, in_place_buf, in_place_size);
> +    }
>  
>      /**
>       * Clients will expect short writes at EOF, so we have to limit
> @@ -1263,26 +1309,38 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
>          }
>      }
>  
> -    iov[0] = (struct iovec) {
> -        .iov_base = copied,
> -        .iov_len = in_place_size,
> -    };
> -    if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> -        assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> -        iov[1] = (struct iovec) {
> -            .iov_base = (void *)spillover_buf,
> -            .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> +    if (in_place_buf) {
> +        iov[0] = (struct iovec) {
> +            .iov_base = copied,
> +            .iov_len = in_place_size,
>          };
> -        qemu_iovec_init_external(&qiov, iov, 2);
> +        if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> +            assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> +            iov[1] = (struct iovec) {
> +                .iov_base = (void *)spillover_buf,
> +                .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> +            };
> +            qemu_iovec_init_external(&qiov, iov, 2);
> +        } else {
> +            qemu_iovec_init_external(&qiov, iov, 1);
> +        }
>      } else {
> +        /* fuse over io_uring */
> +        iov[0] = (struct iovec) {
> +            .iov_base = (void *)spillover_buf,
> +            .iov_len = size,
> +        };
>          qemu_iovec_init_external(&qiov, iov, 1);
>      }
> +
>      ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
>      if (ret < 0) {
>          goto fail_free_buffer;
>      }
>  
> -    qemu_vfree(copied);
> +    if (in_place_buf) {
> +        qemu_vfree(copied);
> +    }
>  
>      *out = (struct fuse_write_out) {
>          .size = size,
> @@ -1290,7 +1348,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
>      return sizeof(*out);
>  
>  fail_free_buffer:
> -    qemu_vfree(copied);
> +    if (in_place_buf) {
> +        qemu_vfree(copied);
> +    }
>      return ret;
>  }
>  
> @@ -1578,173 +1638,151 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
>      }
>  }
>  
> -/*
> - * For use in fuse_co_process_request():
> - * Returns a pointer to the parameter object for the given operation (inside of
> - * queue->request_buf, which is assumed to hold a fuse_in_header first).
> - * Verifies that the object is complete (queue->request_buf is large enough to
> - * hold it in one piece, and the request length includes the whole object).
> - *
> - * Note that queue->request_buf may be overwritten after yielding, so the
> - * returned pointer must not be used across a function that may yield!
> - */
> -#define FUSE_IN_OP_STRUCT(op_name, queue) \
> +#define FUSE_IN_OP_STRUCT_LEGACY(in_buf) \
>      ({ \
> -        const struct fuse_in_header *__in_hdr = \
> -            (const struct fuse_in_header *)(queue)->request_buf; \
> -        const struct fuse_##op_name##_in *__in = \
> -            (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
> -        const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
> -        uint32_t __req_len; \
> -        \
> -        QEMU_BUILD_BUG_ON(sizeof((queue)->request_buf) < __param_len); \
> -        \
> -        __req_len = __in_hdr->len; \
> -        if (__req_len < __param_len) { \
> -            warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \
> -                        __req_len, __param_len); \
> -            ret = -EINVAL; \
> -            break; \
> -        } \

This check made sure that we don't access in_buf out of bounds. What is
the replacement for it?

> -        __in; \
> +        (void *)(((struct fuse_in_header *)in_buf) + 1); \
>      })
>  
> -/*
> - * For use in fuse_co_process_request():
> - * Returns a pointer to the return object for the given operation (inside of
> - * out_buf, which is assumed to hold a fuse_out_header first).
> - * Verifies that out_buf is large enough to hold the whole object.
> - *
> - * (out_buf should be a char[] array.)
> - */
> -#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
> +#define FUSE_OUT_OP_STRUCT_LEGACY(out_buf) \
>      ({ \
> -        struct fuse_out_header *__out_hdr = \
> -            (struct fuse_out_header *)(out_buf); \
> -        struct fuse_##op_name##_out *__out = \
> -            (struct fuse_##op_name##_out *)(__out_hdr + 1); \
> -        \
> -        QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
> -                          sizeof(out_buf)); \
> -        \
> -        __out; \
> +        (void *)(((struct fuse_out_header *)out_buf) + 1); \
>      })
>  
> -/**
> - * Process a FUSE request, incl. writing the response.
> - *
> - * Note that yielding in any request-processing function can overwrite the
> - * contents of q->request_buf.  Anything that takes a buffer needs to take
> - * care that the content is copied before yielding.
> - *
> - * @spillover_buf can contain the tail of a write request too large to fit into
> - * q->request_buf.  This function takes ownership of it (i.e. will free it),
> - * which assumes that its contents will not be overwritten by concurrent
> - * requests (as opposed to q->request_buf).
> +
> +/*
> + * Shared helper for FUSE request processing. Handles both legacy and io_uring
> + * paths.
>   */
> -static void coroutine_fn
> -fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> +static void coroutine_fn fuse_co_process_request_common(
> +    FuseExport *exp,
> +    uint32_t opcode,
> +    uint64_t req_id,
> +    void *in_buf,
> +    void *spillover_buf,
> +    void *out_buf,
> +    int fd, /* -1 for uring */
> +    void (*send_response)(void *opaque, uint32_t req_id, ssize_t ret,
> +                         const void *buf, void *out_buf),
> +    void *opaque /* FuseQueue* or FuseRingEnt* */)
>  {
> -    FuseExport *exp = q->exp;
> -    uint32_t opcode;
> -    uint64_t req_id;
> -    /*
> -     * Return buffer.  Must be large enough to hold all return headers, but does
> -     * not include space for data returned by read requests.
> -     * (FUSE_IN_OP_STRUCT() verifies at compile time that out_buf is indeed
> -     * large enough.)
> -     */
> -    char out_buf[sizeof(struct fuse_out_header) +
> -                 MAX_CONST(sizeof(struct fuse_init_out),
> -                 MAX_CONST(sizeof(struct fuse_open_out),
> -                 MAX_CONST(sizeof(struct fuse_attr_out),
> -                 MAX_CONST(sizeof(struct fuse_write_out),
> -                           sizeof(struct fuse_lseek_out)))))];
> -    struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
> -    /* For read requests: Data to be returned */
>      void *out_data_buffer = NULL;
> -    ssize_t ret;
> +    ssize_t ret = 0;
>  
> -    /* Limit scope to ensure pointer is no longer used after yielding */
> -    {
> -        const struct fuse_in_header *in_hdr =
> -            (const struct fuse_in_header *)q->request_buf;
> +    void *op_in_buf = (void *)FUSE_IN_OP_STRUCT_LEGACY(in_buf);
> +    void *op_out_buf = (void *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf);
>  
> -        opcode = in_hdr->opcode;
> -        req_id = in_hdr->unique;
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (opcode != FUSE_INIT && exp->is_uring) {

Maybe add a comment explaining that FUSE_INIT is always delivered
through /dev/fuse, even if we want to enable io_uring?
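
Something like:

    /*
     * FUSE_INIT always arrives through /dev/fuse; the io_uring queues
     * are only registered in response to it, so it can never come in
     * through the uring path itself.
     */
    if (opcode != FUSE_INIT && exp->is_uring) {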

> +        op_in_buf = (void *)in_buf;
> +        op_out_buf = (void *)out_buf;
>      }
> +#endif
>  
>      switch (opcode) {
>      case FUSE_INIT: {
> -        const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
> -        ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
> -                           in->max_readahead, in->flags);
> +        const struct fuse_init_in *in =
> +            (const struct fuse_init_in *)FUSE_IN_OP_STRUCT_LEGACY(in_buf);
> +
> +        struct fuse_init_out *out =
> +            (struct fuse_init_out *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf);

FUSE_IN_OP_STRUCT_LEGACY() returns a void *, so the explicit casts are
unnecessary. This applies to all of the commands below, too.

> +
> +        ret = fuse_co_init(exp, out, in->max_readahead, in);
>          break;
>      }
>  
> -    case FUSE_OPEN:
> -        ret = fuse_co_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
> +    case FUSE_OPEN: {
> +        struct fuse_open_out *out =
> +            (struct fuse_open_out *)op_out_buf;
> +
> +        ret = fuse_co_open(exp, out);
>          break;
> +    }
>  
>      case FUSE_RELEASE:
>          ret = 0;
>          break;
>  
>      case FUSE_LOOKUP:
> -        ret = -ENOENT; /* There is no node but the root node */
> +        ret = -ENOENT;
>          break;

Why are you removing the comment?

>  
> -    case FUSE_GETATTR:
> -        ret = fuse_co_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
> +    case FUSE_GETATTR: {
> +        struct fuse_attr_out *out =
> +            (struct fuse_attr_out *)op_out_buf;
> +
> +        ret = fuse_co_getattr(exp, out);
>          break;
> +    }
>  
>      case FUSE_SETATTR: {
> -        const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
> -        ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
> -                              in->valid, in->size, in->mode, in->uid, in->gid);
> +        const struct fuse_setattr_in *in =
> +            (const struct fuse_setattr_in *)op_in_buf;
> +
> +        struct fuse_attr_out *out =
> +            (struct fuse_attr_out *)op_out_buf;
> +
> +        ret = fuse_co_setattr(exp, out, in->valid, in->size, in->mode,
> +                              in->uid, in->gid);
>          break;
>      }
>  
>      case FUSE_READ: {
> -        const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
> +        const struct fuse_read_in *in =
> +            (const struct fuse_read_in *)op_in_buf;
> +
>          ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
>          break;
>      }
>  
>      case FUSE_WRITE: {
> -        const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
> -        uint32_t req_len;
> -
> -        req_len = ((const struct fuse_in_header *)q->request_buf)->len;
> -        if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
> -                               in->size)) {
> -            warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
> -                        req_len - sizeof(struct fuse_in_header) - sizeof(*in),
> -                        in->size);
> -            ret = -EINVAL;
> -            break;
> +        const struct fuse_write_in *in =
> +            (const struct fuse_write_in *)op_in_buf;
> +
> +        struct fuse_write_out *out =
> +            (struct fuse_write_out *)op_out_buf;
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +        if (!exp->is_uring) {
> +#endif

I wonder if it wouldn't be better to have exp->is_uring available even
without CONFIG_LINUX_IO_URING, it would just always be false. It would
be nice to avoid #ifdefs in the middle of the function if they aren't
strictly necessary.
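
That is, declare the field unconditionally (sketch):

    struct FuseExport {
        ...
        /* always false when built without CONFIG_LINUX_IO_URING */
        bool is_uring;
        ...
    };

so the request processing code can just test it without any #ifdef.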

> +            uint32_t req_len = ((const struct fuse_in_header *)in_buf)->len;
> +
> +            if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
> +                        in->size)) {
> +                warn_report("FUSE WRITE truncated; received %zu bytes of %"
> +                    PRIu32,
> +                    req_len - sizeof(struct fuse_in_header) - sizeof(*in),
> +                    in->size);
> +                ret = -EINVAL;
> +                break;
> +            }
> +#ifdef CONFIG_LINUX_IO_URING
> +        } else {
> +            assert(in->size <=
> +                ((FuseRingEnt *)opaque)->req_header.ring_ent_in_out.payload_sz);
>          }
> +#endif
>  
> -        /*
> -         * poll_fuse_fd() has checked that in_hdr->len matches the number of
> -         * bytes read, which cannot exceed the max_write value we set
> -         * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
> -         * in_hdr->len >= in->size + X, so this assertion must hold.
> -         */
>          assert(in->size <= FUSE_MAX_WRITE_BYTES);

Instead of deleting the comment explaining why this is true, can you
just add a second paragraph explaining why it's true for io_uring, too?

> -        /*
> -         * Passing a pointer to `in` (i.e. the request buffer) is fine because
> -         * fuse_co_write() takes care to copy its contents before potentially
> -         * yielding.
> -         */

Why did you delete this comment? It's still true.

> -        ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> -                            in->offset, in->size, in + 1, spillover_buf);
> +        const void *in_place_buf = in + 1;
> +        const void *spill_buf = spillover_buf;
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +        if (exp->is_uring) {
> +            in_place_buf = NULL;
> +            spill_buf = out_buf;
> +        }
> +#endif
> +
> +        ret = fuse_co_write(exp, out, in->offset, in->size,
> +                            in_place_buf, spill_buf);
>          break;
>      }
>  
>      case FUSE_FALLOCATE: {
> -        const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
> +        const struct fuse_fallocate_in *in =
> +            (const struct fuse_fallocate_in *)op_in_buf;
> +
>          ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
>          break;
>      }

Kevin



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring
  2025-09-17 19:47     ` Brian Song
@ 2025-09-19 14:13       ` Kevin Wolf
  0 siblings, 0 replies; 38+ messages in thread
From: Kevin Wolf @ 2025-09-19 14:13 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, stefanha

Am 17.09.2025 um 21:47 hat Brian Song geschrieben:
> 
> 
> On 9/16/25 3:08 PM, Kevin Wolf wrote:
> > Am 30.08.2025 um 04:50 hat Brian Song geschrieben:
> > > This patch adds a new export option for storage-export-daemon to enable
> > > FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
> > > It also implements the protocol handshake with the Linux kernel
> > > during the FUSE-over-io_uring initialization phase.
> > > 
> > > See: https://docs.kernel.org/filesystems/fuse-io-uring.html
> > > 
> > > The kernel documentation describes in detail how FUSE-over-io_uring
> > > works. This patch implements the Initial SQE stage shown in the diagram:
> > > it initializes one queue per IOThread, each currently supporting a
> > > single submission queue entry (SQE). When the FUSE driver sends the
> > > first FUSE request (FUSE_INIT), storage-export-daemon calls
> > > fuse_uring_start() to complete initialization, ultimately submitting
> > > the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
> > > successful initialization with the kernel.
> > > 
> > > We also added support for multiple IOThreads. The current Linux kernel
> > > requires registering $(nproc) queues when setting up FUSE-over-io_uring.
> > > To let users customize the number of FUSE Queues (i.e., IOThreads),
> > > we first create nproc Ring Queues as required by the kernel, then
> > > distribute them in a round-robin manner to the FUSE Queues for
> > > registration. In addition, to support multiple in-flight requests,
> > > we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
> > > entries/requests.
> > > 
> > > Suggested-by: Kevin Wolf <kwolf@redhat.com>
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Brian Song <hibriansong@gmail.com>
> > > ---
> > >   block/export/fuse.c                  | 310 +++++++++++++++++++++++++--
> > >   docs/tools/qemu-storage-daemon.rst   |  11 +-
> > >   qapi/block-export.json               |   5 +-
> > >   storage-daemon/qemu-storage-daemon.c |   1 +
> > >   util/fdmon-io_uring.c                |   5 +-
> > >   5 files changed, 309 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > > index c0ad4696ce..19bf9e5f74 100644
> > > --- a/block/export/fuse.c
> > > +++ b/block/export/fuse.c
> > > @@ -48,6 +48,9 @@
> > >   #include <linux/fs.h>
> > >   #endif
> > > +/* room needed in buffer to accommodate header */
> > > +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> > > +
> > >   /* Prevent overly long bounce buffer allocations */
> > >   #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> > >   /*
> > > @@ -63,12 +66,59 @@
> > >       (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
> > >   typedef struct FuseExport FuseExport;
> > > +typedef struct FuseQueue FuseQueue;
> > > +
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +#define FUSE_DEFAULT_RING_QUEUE_DEPTH 64
> > > +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> > 
> > Maybe it would be a little clearer if the next few types has URing in
> > their name instead of just Ring.
> > 
> > > +typedef struct FuseRingQueue FuseRingQueue;
> > > +typedef struct FuseRingEnt {
> > > +    /* back pointer */
> > > +    FuseRingQueue *rq;
> > > +
> > > +    /* commit id of a fuse request */
> > > +    uint64_t req_commit_id;
> > > +
> > > +    /* fuse request header and payload */
> > > +    struct fuse_uring_req_header req_header;
> > > +    void *op_payload;
> > > +    size_t req_payload_sz;
> > > +
> > > +    /* The vector passed to the kernel */
> > > +    struct iovec iov[2];
> > > +
> > > +    CqeHandler fuse_cqe_handler;
> > > +} FuseRingEnt;
> > > +
> > > +struct FuseRingQueue {
> > 
> > It would be good to have a comment here that explains the difference
> > between FuseQueue and FuseRingQueue.
> > 
> > Is this a distinction that should remain in the long run or would we
> > always have a 1:1 mapping between FuseQueue and FuseRingQueue once the
> > pending kernel changes are merged that allow a number of uring queues
> > different from the number of CPUs?
> 
> Stefan mentioned the issue, and I added some comments here. One thing to
> note is that FuseRingQueueManager and the distribution between FuseQueue and
> FuseRingQueue are just temporary measures until the kernel allows
> user-defined queues. Therefore, I don't think it's a good idea to remove
> FuseRingQueueManager at this stage.

I don't think that simplifying the code now will make it harder to make
these changes in the future, so I'd really prefer to keep e.g. all of
the queue setup in a single place even if we expect part of it to go
away in the long run.

> If you look back at the v2 patch, we put the ring entries inside the
> FuseQueue. The result was that we had to define nproc IOThreads (FuseQueue)
> in order to make it work. That's why here I separated the numbers of the two
> types of queues and RingQueue into independent abstractions: allocate nproc
> RingQueues and initialize the entries, then distribute them to FuseQueues in
> a round-robin manner. Once the kernel supports a user-defined number of
> queues, we can remove FuseRingQueueManager and the RR distribution.

Right, I'm not requesting that you change the mechanism per se. I'd just
like to see it more integrated with the rest. Additional functions and
structs can be helpful if they allow you to separate out self-contained
logic, but that's not the case here. Here it's just one additional
moving part that you have to understand when reading the code, which
makes it a little more complex and harder to read than necessary.

> Also, to keep the variable names consistent with those in the kernel and
> libfuse, I use Ring here instead of URing.

Yes, I can see that. The difference is that there, the types are
contained in a separate source file that handles only io_uring, so the
context is clear.

In QEMU's FUSE export, we're still mixing /dev/fuse code and io_uring
code in a single file, so it's a bit more confusing which name refers to
which.

But alternatively, we can also split the source file in QEMU. At almost
2000 lines of code, that might be a good idea anyway.

> > > +    int rqid;
> > > +
> > > +    /* back pointer */
> > > +    FuseQueue *q;
> > > +    FuseRingEnt *ent;
> > > +
> > > +    /* List entry for ring_queues */
> > > +    QLIST_ENTRY(FuseRingQueue) next;
> > > +};
> > > +
> > > +/*
> > > + * Round-robin distribution of ring queues across FUSE queues.
> > > + * This structure manages the mapping between kernel ring queues and user
> > > + * FUSE queues.
> > > + */
> > > +typedef struct FuseRingQueueManager {
> > > +    FuseRingQueue *ring_queues;
> > > +    int num_ring_queues;
> > > +    int num_fuse_queues;
> > > +} FuseRingQueueManager;
> > 
> > This isn't a manager, it's just the set of queues the export uses.
> > 
> > num_fuse_queues duplicates exp->num_queues, there is no reason for it to
> > exist. All users also have access to the FuseExport itself.
> > 
> > The other two fields can just be merged directly into FuseExport,
> > preferably renamed to uring_queues and num_uring_queues.
> > >> +#endif
> > >   /*
> > >    * One FUSE "queue", representing one FUSE FD from which requests are fetched
> > >    * and processed.  Each queue is tied to an AioContext.
> > >    */
> > > -typedef struct FuseQueue {
> > > +struct FuseQueue {
> > >       FuseExport *exp;
> > >       AioContext *ctx;
> > > @@ -109,15 +159,11 @@ typedef struct FuseQueue {
> > >        * Free this buffer with qemu_vfree().
> > >        */
> > >       void *spillover_buf;
> > > -} FuseQueue;
> > > -/*
> > > - * Verify that FuseQueue.request_buf plus the spill-over buffer together
> > > - * are big enough to be accepted by the FUSE kernel driver.
> > > - */
> > > -QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> > > -                  FUSE_SPILLOVER_BUF_SIZE <
> > > -                  FUSE_MIN_READ_BUFFER);
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    QLIST_HEAD(, FuseRingQueue) ring_queue_list;
> > > +#endif
> > > +};
> > >   struct FuseExport {
> > >       BlockExport common;
> > > @@ -133,7 +179,7 @@ struct FuseExport {
> > >        */
> > >       bool halted;
> > > -    int num_queues;
> > > +    size_t num_queues;
> > 
> > I'm not sure why this change is needed. If it is, can it be a separate
> > patch before this one, with a commit message describing the reason?
> 
> I feel there's no reason to use a signed int here, since the number of
> queues cannot be negative.

So it's unrelated to what the commit message promises, right? ("add opt
to enable FUSE-over-io_uring"). You can make it a separate cleanup patch
then.

> > >       FuseQueue *queues;
> > >       /*
> > >        * True if this export should follow the generic export's AioContext.
> > > @@ -149,6 +195,12 @@ struct FuseExport {
> > >       /* Whether allow_other was used as a mount option or not */
> > >       bool allow_other;
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +    bool is_uring;
> > > +    size_t ring_queue_depth;
> > > +    FuseRingQueueManager *ring_queue_manager;
> > > +#endif
> > > +
> > >       mode_t st_mode;
> > >       uid_t st_uid;
> > >       gid_t st_gid;
> > > @@ -205,7 +257,7 @@ static void fuse_attach_handlers(FuseExport *exp)
> > >           return;
> > >       }
> > > -    for (int i = 0; i < exp->num_queues; i++) {
> > > +    for (size_t i = 0; i < exp->num_queues; i++) {
> > >           aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> > >                              read_from_fuse_fd, NULL, NULL, NULL,
> > >                              &exp->queues[i]);
> > > @@ -257,6 +309,189 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> > >       .drained_poll  = fuse_export_drained_poll,
> > >   };
> > > +#ifdef CONFIG_LINUX_IO_URING
> > > +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> > > +                    const unsigned int rqid,
> > > +                    const unsigned int commit_id)
> > 
> > Indentation is off here. There are two accepted styles for indentation
> > after breaking a long line in QEMU (see docs/devel/style.rst):
> > 
> > 1. Indent the next line by exactly four spaces:
> > 
> >      do_something(x, y,
> >          z);
> > 
> > 2. Align the next line with the first character after the opening
> >     parenthesis:
> > 
> >      do_something(x, y,
> >                   z);
> > 
> > The second one is the preferred one. The first one is generally only
> > used when the parenthesis is already too far right and we can't do much
> > about it.
> > 
> > > +{
> > > +    req->qid = rqid;
> > > +    req->commit_id = commit_id;
> > > +    req->flags = 0;
> > > +}
> > > +
> > > +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q,
> > > +               __u32 cmd_op)
> > 
> > Indentation.
> > 
> > Another option here is to keep everything before the function name on a
> > separate line, like this:
> > 
> > static void
> > fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, __u32 cmd_op)
> > 
> > This would allow the second line to stay under 80 characters.
> > 
> > > +{
> > > +    sqe->opcode = IORING_OP_URING_CMD;
> > > +
> > > +    sqe->fd = q->fuse_fd;
> > > +    sqe->rw_flags = 0;
> > > +    sqe->ioprio = 0;
> > > +    sqe->off = 0;
> > > +
> > > +    sqe->cmd_op = cmd_op;
> > > +    sqe->__pad1 = 0;
> > > +}
> > > +
> > > +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> > > +{
> > > +    FuseRingEnt *ent = opaque;
> > > +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> > > +
> > > +    fuse_uring_sqe_prepare(sqe, ent->rq->q, FUSE_IO_URING_CMD_REGISTER);
> > > +
> > > +    sqe->addr = (uint64_t)(ent->iov);
> > > +    sqe->len = 2;
> > > +
> > > +    fuse_uring_sqe_set_req_data(req, ent->rq->rqid, 0);
> > > +}
> > > +
> > > +static void fuse_uring_submit_register(void *opaque)
> > > +{
> > > +    FuseRingEnt *ent = opaque;
> > > +    FuseExport *exp = ent->rq->q->exp;
> > > +
> > > +
> > 
> > Extra empty line.
> > 
> > > +    aio_add_sqe(fuse_uring_prep_sqe_register, ent, &(ent->fuse_cqe_handler));
> > 
> > The parentheses around ent->fuse_cqe_handler are unnecessary.
> > 
> > > +}
> > > +
> > > +/**
> > > + * Distribute ring queues across FUSE queues using round-robin algorithm.
> > 
> > Hm, if this function distributes (u)ring queues, then what is
> > fuse_distribute_ring_queues() doing? Is the term overloaded with two
> > meanings?
> > 
> > > + * This ensures even distribution of kernel ring queues across user-specified
> > > + * FUSE queues.
> > > + */
> > > +static
> > > +FuseRingQueueManager *fuse_ring_queue_manager_create(int num_fuse_queues,
> > > +                                                    size_t ring_queue_depth,
> > > +                                                    size_t bufsize)
> > 
> > The right style here would be something like:
> > 
> > static FuseRingQueueManager *
> > fuse_ring_queue_manager_create(int num_fuse_queues,
> >                                 size_t ring_queue_depth,
> >                                 size_t bufsize)
> > 
> > Given that I said that there is no reason to call the set of all queues
> > a manager, or to even have it separate from FuseExport, this probably
> > becomes fuse_uring_setup_queues() or something.
> > 
> > > +{
> > > +    int num_ring_queues = get_nprocs();
> > 
> > This could use a comment saying that this is a kernel requirement at the
> > moment.
> > 
> > > +    FuseRingQueueManager *manager = g_new(FuseRingQueueManager, 1);
> > > +
> > > +    if (!manager) {
> > > +        return NULL;
> > > +    }
> > 
> > g_new() never returns NULL, it aborts on error instead, so no reason to
> > have a NULL check here.
> > 
> > > +
> > > +    manager->ring_queues = g_new(FuseRingQueue, num_ring_queues);
> > > +    manager->num_ring_queues = num_ring_queues;
> > > +    manager->num_fuse_queues = num_fuse_queues;
> > > +
> > > +    if (!manager->ring_queues) {
> > > +        g_free(manager);
> > > +        return NULL;
> > > +    }
> > 
> > This check is unnecessary for the same reason.
> > 
> > > +
> > > +    for (int i = 0; i < num_ring_queues; i++) {
> > > +        FuseRingQueue *rq = &manager->ring_queues[i];
> > > +        rq->rqid = i;
> > > +        rq->ent = g_new(FuseRingEnt, ring_queue_depth);
> > > +
> > > +        if (!rq->ent) {
> > > +            for (int j = 0; j < i; j++) {
> > > +                g_free(manager->ring_queues[j].ent);
> > > +            }
> > > +            g_free(manager->ring_queues);
> > > +            g_free(manager);
> > > +            return NULL;
> > > +        }
> > 
> > This one, too.
> > 
> > > +
> > > +        for (size_t j = 0; j < ring_queue_depth; j++) {
> > > +            FuseRingEnt *ent = &rq->ent[j];
> > > +            ent->rq = rq;
> > > +            ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
> > > +            ent->op_payload = g_malloc0(ent->req_payload_sz);
> > > +
> > > +            if (!ent->op_payload) {
> > > +                for (size_t k = 0; k < j; k++) {
> > > +                    g_free(rq->ent[k].op_payload);
> > > +                }
> > > +                g_free(rq->ent);
> > > +                for (int k = 0; k < i; k++) {
> > > +                    g_free(manager->ring_queues[k].ent);
> > > +                }
> > > +                g_free(manager->ring_queues);
> > > +                g_free(manager);
> > > +                return NULL;
> > > +            }
> > 
> > And this one.
> > 
> > Removing all of them will make the function a lot more readable.
> > 
> > > +
> > > +            ent->iov[0] = (struct iovec) {
> > > +                &(ent->req_header),
> > 
> > Unnecessary parentheses.
> > 
> > > +                sizeof(struct fuse_uring_req_header)
> > > +            };
> > > +            ent->iov[1] = (struct iovec) {
> > > +                ent->op_payload,
> > > +                ent->req_payload_sz
> > > +            };
> > > +
> > > +            ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
> > > +        }
> > > +    }
> > > +
> > > +    return manager;
> > > +}
> > > +
> > > +static
> > > +void fuse_distribute_ring_queues(FuseExport *exp, FuseRingQueueManager *manager)
> > > +{
> > > +    int queue_index = 0;
> > > +
> > > +    for (int i = 0; i < manager->num_ring_queues; i++) {
> > > +        FuseRingQueue *rq = &manager->ring_queues[i];
> > > +
> > > +        rq->q = &exp->queues[queue_index];
> > > +        QLIST_INSERT_HEAD(&(rq->q->ring_queue_list), rq, next);
> > > +
> > > +        queue_index = (queue_index + 1) % manager->num_fuse_queues;
> > > +    }
> > > +}
> > 
> > Ok, no overloaded meaning of distributing queues, but this function
> > should probably be merged with the one above. It's part of setting up
> > the queues.
> > 
> > You don't need a separate queue_index counter, you can just directly use
> > exp->queues[i % manager->num_fuse_queues].
> > 
> 
> There are two steps:
> 
> 1. Create uring queues and allocate buffers for each entry's payload.
> 
> 2. Distribute these uring queues to FUSE queues using a round-robin
> algorithm.
> 
> Given that this is only a temporary measure to allow users to define their
> own IOThreads/FUSE queues, we might later replace the second part of the
> logic. I believe it's better to separate these two pieces of logic rather
> than combining them.

But one of them doesn't make sense without the other currently. Looping
twice over all queues and doing half of their setup is harder to
understand than having a single loop and doing all of the setup.

You're right that we hope that the second half goes away eventually, but
we don't know if or when someone will actually do this. We shouldn't
structure our code so that it may make sense some time in the future if
someone extends it in the way we envision now, but so that it makes
sense and is easy to understand now.
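
For illustration, a merged version might look something like this
(untested sketch; fuse_uring_setup_queues is a made-up name, and
uring_queues/num_uring_queues live directly in FuseExport as suggested
earlier):

    static void fuse_uring_setup_queues(FuseExport *exp, size_t depth,
                                        size_t bufsize)
    {
        /* the kernel currently requires one uring queue per CPU */
        exp->num_uring_queues = get_nprocs();
        exp->uring_queues = g_new(FuseRingQueue, exp->num_uring_queues);

        for (int i = 0; i < exp->num_uring_queues; i++) {
            FuseRingQueue *rq = &exp->uring_queues[i];

            rq->rqid = i;
            rq->ent = g_new(FuseRingEnt, depth);

            /* round-robin assignment to the user-specified FUSE queues */
            rq->q = &exp->queues[i % exp->num_queues];
            QLIST_INSERT_HEAD(&rq->q->ring_queue_list, rq, next);

            for (size_t j = 0; j < depth; j++) {
                FuseRingEnt *ent = &rq->ent[j];

                ent->rq = rq;
                ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
                ent->op_payload = g_malloc0(ent->req_payload_sz);
                ent->iov[0] = (struct iovec) {
                    &ent->req_header, sizeof(ent->req_header)
                };
                ent->iov[1] = (struct iovec) {
                    ent->op_payload, ent->req_payload_sz
                };
                ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler;
            }
        }
    }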

Kevin



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-17 22:06         ` Brian Song
@ 2025-09-22 17:41           ` Stefan Hajnoczi
  2025-09-22 17:51           ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-22 17:41 UTC (permalink / raw)
  To: bernd
  Cc: Hanna Czenczek, qemu-block, qemu-devel, armbru, fam, kwolf,
	hibriansong

On Wed, Sep 17, 2025 at 06:06:55PM -0400, Brian Song wrote:
> 
> 
> On 9/17/25 9:01 AM, Hanna Czenczek wrote:
> > On 15.09.25 07:43, Brian Song wrote:
> > > Hi Hanna,
> > 
> > Hi Brian!
> > 
> > (Thanks for your heads-up!)
> > 
> > > Stefan raised the above issue and proposed a preliminary solution: keep
> > > closing the file descriptor in the delete section, but perform
> > > umount separately for FUSE uring and traditional FUSE in the shutdown
> > > and delete sections respectively. This approach avoids the race
> > > condition on the file descriptor.
> > > 
> > > In the case of FUSE uring, umount must be performed in the shutdown
> > > section. The reason is that the kernel currently lacks an interface to
> > > explicitly cancel submitted SQEs. Performing umount forces the kernel to
> > > flush all pending SQEs and return their CQEs. Without this step, CQEs
> > > may arrive after the export has already been deleted, and invoking the
> > > CQE handler at that point would dereference freed memory and trigger a
> > > segmentation fault.
> > 
> > The commit message says that incrementing the BB reference would be
> > enough to solve the problem (i.e. deleting is delayed until all requests
> > are done).  Why isn’t it?
> 
> Hanna:
> 
> If we place umount in the delete section instead of the shutdown section,
> the kernel FUSE driver will continue waiting for user FUSE requests and
> therefore won't return CQEs to userspace. As a result, the BB reference
> remains held (since the reference is acquired during registration and
> submission and only released once the CQE returns), preventing the delete
> operation from being invoked (invoked once the reference is decreased to 0).
> This is why umount must be placed in the shutdown section.
> 
> > 
> > > I’m curious about traditional FUSE: is it strictly necessary to perform
> > > umount in the delete section, or could it also be done in shutdown?
> > 
> > Looking into libfuse, fuse_session_unmount() (in fuse_kern_unmount())
> > closes the FUSE FD.  I can imagine that might result in the potential
> > problems Stefan described.
> > 
> > > Additionally, what is the correct ordering between close(fd) and
> > > umount, does one need to precede the other?
> > 
> > fuse_kern_unmount() closes the (queue 0) FD first before actually
> > unmounting, with a comment: “Need to close file descriptor, otherwise
> > synchronous umount would recurse into filesystem, and deadlock.”
> > 
> > Given that, I assume the FDs should all be closed before unmounting.
> > 
> > (Though to be fair, before looking into it now, I don’t think I’ve ever
> > given it much thought…)
> > 
> > Hanna
> > 
> Stefan:
> 
> I roughly went through the umount and close system calls:
> 
> umount:
> fuse_kill_sb_anon -> fuse_sb_destroy -> fuse_abort_conn
> 
> close:
> __fput -> file->f_op->release(inode, file) -> fuse_dev_release ->
> fuse_abort_conn
> (this only runs after all /dev/fuse FDs have been closed).
> 
> And as Hanna mentioned, libfuse points out: “Need to close file descriptor,
> otherwise synchronous umount would recurse into filesystem, and deadlock.”
> 
> So ideally, we should close each queue FD first, then call umount at the end
> — even though calling umount directly also works. The root issue is that the
> kernel doesn't provide an interface to cancel already submitted SQEs.

Hi Bernd,
I wanted to check with you to see if you have thought more about
ASYNC_CANCEL support for FUSE-over-io_uring SQEs?

If you don't have time to implement it, maybe you could share your
thoughts on how one would go about doing this? That would be a nice
starting point if someone else wants to try it out.

Thanks,
Stefan

> 
> You mentioned that in fuse over io_uring mode we perform close in the
> shutdown path, but at that point the server may still be processing
> requests. While handling requests, it may still write to the FD, but that FD
> might not be /dev/fuse. I’m not sure how this gets triggered, since in fuse
> uring mode all FUSE requests are handled by io_uring, and our FUSE requests
> should be completed via io_uring. After shutdown closes the FD, it may call
> fuse_abort_conn, which terminates all request processing in the kernel.
> There’s also locking in place to protect the termination of requests and the
> subsequent uring cleanup.
> 
> That’s why I think the best approach for now is:
> 
> in shutdown, handle close and umount for fuse over io_uring;
> 
> in delete, handle close and umount for traditional FUSE.
> 
> > > Thanks,
> > > Brian
> > > 
> > > On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
> > >   > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
> > >   >> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport
> > > *blk_exp)
> > >   >>            */
> > >   >>           g_hash_table_remove(exports, exp->mountpoint);
> > >   >>       }
> > >   >> -}
> > >   >> -
> > >   >> -static void fuse_export_delete(BlockExport *blk_exp)
> > >   >> -{
> > >   >> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
> > >   >>
> > >   >> -    for (int i = 0; i < exp->num_queues; i++) {
> > >   >> +    for (size_t i = 0; i < exp->num_queues; i++) {
> > >   >>           FuseQueue *q = &exp->queues[i];
> > >   >>
> > >   >>           /* Queue 0's FD belongs to the FUSE session */
> > >   >>           if (i > 0 && q->fuse_fd >= 0) {
> > >   >>               close(q->fuse_fd);
> > >   >
> > >   > This changes the behavior of the non-io_uring code. Now all fuse
> > > fds and
> > >   > fuse_session are closed while requests are potentially still being
> > >   > processed.
> > >   >
> > >   > There is a race condition: if an IOThread is processing a
> > > request here
> > >   > then it may invoke a system call on q->fuse_fd just after it has been
> > >   > closed but not set to -1. If another thread has also opened a
> > > new file
> > >   > then the fd could be reused, resulting in an accidental write(2)
> > > to the
> > >   > new file. I'm not sure whether there is a way to trigger this in
> > >   > practice, but it looks like a problem waiting to happen.
> > >   >
> > >   > Simply setting q->fuse_fd to -1 here doesn't fix the race. It
> > > would be
> > >   > necessary to stop processing fuse_fd in the thread before closing it
> > >   > here or to schedule a BH in each thread so that fuse_fd can be closed
> > >   > in the thread that uses the fd.
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] export/fuse: Safe termination for FUSE-uring
  2025-09-17 22:06         ` Brian Song
  2025-09-22 17:41           ` Stefan Hajnoczi
@ 2025-09-22 17:51           ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2025-09-22 17:51 UTC (permalink / raw)
  To: Brian Song
  Cc: Hanna Czenczek, qemu-block, qemu-devel, armbru, bernd, fam, kwolf

On Wed, Sep 17, 2025 at 06:06:55PM -0400, Brian Song wrote:
> 
> 
> On 9/17/25 9:01 AM, Hanna Czenczek wrote:
> > On 15.09.25 07:43, Brian Song wrote:
> > > Hi Hanna,
> > 
> > Hi Brian!
> > 
> > (Thanks for your heads-up!)
> > 
> > > Stefan raised the above issue and proposed a preliminary solution: keep
> > > closing the file descriptor in the delete section, but perform
> > > umount separately for FUSE uring and traditional FUSE in the shutdown
> > > and delete sections respectively. This approach avoids the race
> > > condition on the file descriptor.
> > > 
> > > In the case of FUSE uring, umount must be performed in the shutdown
> > > section. The reason is that the kernel currently lacks an interface to
> > > explicitly cancel submitted SQEs. Performing umount forces the kernel to
> > > flush all pending SQEs and return their CQEs. Without this step, CQEs
> > > may arrive after the export has already been deleted, and invoking the
> > > CQE handler at that point would dereference freed memory and trigger a
> > > segmentation fault.
> > 
> > The commit message says that incrementing the BB reference would be
> > enough to solve the problem (i.e. deleting is delayed until all requests
> > are done).  Why isn’t it?
> 
> Hanna:
> 
> If we place umount in the delete section instead of the shutdown section,
> the kernel FUSE driver will continue waiting for user FUSE requests and
> therefore won't return CQEs to userspace. As a result, the BB reference
> remains held (since the reference is acquired during registration and
> submission and only released once the CQE returns), preventing the delete
> operation from being invoked (invoked once the reference is decreased to 0).
> This is why umount must be placed in the shutdown section.
> 
> > 
> > > I’m curious about traditional FUSE: is it strictly necessary to perform
> > > umount in the delete section, or could it also be done in shutdown?
> > 
> > Looking into libfuse, fuse_session_unmount() (in fuse_kern_unmount())
> > closes the FUSE FD.  I can imagine that might result in the potential
> > problems Stefan described.
> > 
> > > Additionally, what is the correct ordering between close(fd) and
> > > umount, does one need to precede the other?
> > 
> > fuse_kern_unmount() closes the (queue 0) FD first before actually
> > unmounting, with a comment: “Need to close file descriptor, otherwise
> > synchronous umount would recurse into filesystem, and deadlock.”
> > 
> > Given that, I assume the FDs should all be closed before unmounting.
> > 
> > (Though to be fair, before looking into it now, I don’t think I’ve ever
> > given it much thought…)
> > 
> > Hanna
> > 
> Stefan:
> 
> I roughly went through the umount and close system calls:
> 
> umount:
> fuse_kill_sb_anon -> fuse_sb_destroy -> fuse_abort_conn
> 
> close:
> __fput -> file->f_op->release(inode, file) -> fuse_dev_release ->
> fuse_abort_conn
> (this only runs after all /dev/fuse FDs have been closed).
> 
> And as Hanna mentioned, libfuse points out: “Need to close file descriptor,
> otherwise synchronous umount would recurse into filesystem, and deadlock.”
> 
> So ideally, we should close each queue FD first, then call umount at the end
> — even though calling umount directly also works. The root issue is that the
> kernel doesn't provide an interface to cancel already submitted SQEs.
> 
> You mentioned that in fuse over io_uring mode we perform close in the
> shutdown path, but at that point the server may still be processing
> requests. While handling requests, it may still write to the FD, but that FD
> might not be /dev/fuse. I’m not sure how this gets triggered, since in fuse
> uring mode all FUSE requests are handled by io_uring, and our FUSE requests
> should be completed via io_uring. After shutdown closes the FD, it may call
> fuse_abort_conn, which terminates all request processing in the kernel.

If another thread opens a new file descriptor, the kernel will hand out
the lowest numbered available file descriptor. That fd could be the
FUSE-over-io_uring fd that was just closed by the main loop thread while
the IOThread is still waiting for CQEs or in the middle of processing a
FUSE-over-io_uring request. An IOThread must not use the stale fd (e.g.
as part of an io_uring SQE) thinking it is a FUSE fd.

> There’s also locking in place to protect the termination of requests and the
> subsequent uring cleanup.
> 
> That’s why I think the best approach for now is:
> 
> in shutdown, handle close and umount for fuse over io_uring;
> 
> in delete, handle close and umount for traditional FUSE.

Yes. I would refine the FUSE-over-io_uring part like this:

I remember we discussed scheduling a BH in the IOThreads so they can
call close(2). That way there's no race between the IOThreads, which are
still using the fds, and the main loop thread, which is in shutdown().

It sounds like the main loop thread should only umount once all
IOThreads have closed their fds. The IOThreads will need to notify the
main loop thread when they are done. An async callback in the main loop
thread will invoke umount and drop the reference to the export. Then
delete() will finally be called.

If someone can think of a way to achieve the same thing with less
synchronization, that would be simpler. But if not, then I think we need
this for correctness (to avoid the race with IOThreads still using the
fd).
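
Something like this, maybe (sketch only; fuse_queue_close_bh,
fuse_export_finish_shutdown, and the queues_to_close counter are
invented names, and the special handling of queue 0's session fd is
glossed over):

    static void fuse_queue_close_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /*
         * Runs in the queue's IOThread, so it cannot race with request
         * processing on this fd.
         */
        close(q->fuse_fd);
        q->fuse_fd = -1;

        if (qatomic_fetch_dec(&q->exp->queues_to_close) == 1) {
            /* last fd closed; finish up in the main loop thread */
            aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                    fuse_export_finish_shutdown, q->exp);
        }
    }

    /* in fuse_export_shutdown(), main loop thread */
    qatomic_set(&exp->queues_to_close, exp->num_queues);
    for (size_t i = 0; i < exp->num_queues; i++) {
        aio_bh_schedule_oneshot(exp->queues[i].ctx,
                                fuse_queue_close_bh, &exp->queues[i]);
    }

fuse_export_finish_shutdown() would then umount and drop the export
reference so that delete() can finally run.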

Stefan

> 
> > > Thanks,
> > > Brian
> > > 
> > > On 9/9/25 3:33 PM, Stefan Hajnoczi wrote:
> > >   > On Fri, Aug 29, 2025 at 10:50:24PM -0400, Brian Song wrote:
> > >   >> @@ -901,24 +941,15 @@ static void fuse_export_shutdown(BlockExport
> > > *blk_exp)
> > >   >>            */
> > >   >>           g_hash_table_remove(exports, exp->mountpoint);
> > >   >>       }
> > >   >> -}
> > >   >> -
> > >   >> -static void fuse_export_delete(BlockExport *blk_exp)
> > >   >> -{
> > >   >> -    FuseExport *exp = container_of(blk_exp, FuseExport, common);
> > >   >>
> > >   >> -    for (int i = 0; i < exp->num_queues; i++) {
> > >   >> +    for (size_t i = 0; i < exp->num_queues; i++) {
> > >   >>           FuseQueue *q = &exp->queues[i];
> > >   >>
> > >   >>           /* Queue 0's FD belongs to the FUSE session */
> > >   >>           if (i > 0 && q->fuse_fd >= 0) {
> > >   >>               close(q->fuse_fd);
> > >   >
> > >   > This changes the behavior of the non-io_uring code. Now all fuse
> > > fds and
> > >   > fuse_session are closed while requests are potentially still being
> > >   > processed.
> > >   >
> > >   > There is a race condition: if an IOThread is processing a
> > > request here
> > >   > then it may invoke a system call on q->fuse_fd just after it has been
> > >   > closed but not set to -1. If another thread has also opened a
> > > new file
> > >   > then the fd could be reused, resulting in an accidental write(2)
> > > to the
> > >   > new file. I'm not sure whether there is a way to trigger this in
> > >   > practice, but it looks like a problem waiting to happen.
> > >   >
> > >   > Simply setting q->fuse_fd to -1 here doesn't fix the race. It
> > > would be
> > >   > necessary to stop processing fuse_fd in the thread before closing it
> > >   > here or to schedule a BH in each thread so that fuse_fd can be closed
> > >   > in the thread that uses the fd.
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-09-22 17:51 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-30  2:50 [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
2025-08-30  2:50 ` [PATCH 1/4] export/fuse: add opt to enable FUSE-over-io_uring Brian Song
2025-09-03 10:53   ` Stefan Hajnoczi
2025-09-03 18:00     ` Brian Song
2025-09-09 14:48       ` Stefan Hajnoczi
2025-09-09 17:46         ` Brian Song
2025-09-09 18:05           ` Bernd Schubert
2025-09-03 11:26   ` Stefan Hajnoczi
2025-09-16 19:08   ` Kevin Wolf
2025-09-17 19:47     ` Brian Song
2025-09-19 14:13       ` Kevin Wolf
2025-08-30  2:50 ` [PATCH 2/4] export/fuse: process FUSE-over-io_uring requests Brian Song
2025-09-03 11:51   ` Stefan Hajnoczi
2025-09-08 19:09     ` Brian Song
2025-09-08 19:45       ` Bernd Schubert
2025-09-09  1:10         ` Brian Song
2025-09-09 15:26       ` Stefan Hajnoczi
2025-09-19 13:54   ` Kevin Wolf
2025-08-30  2:50 ` [PATCH 3/4] export/fuse: Safe termination for FUSE-uring Brian Song
2025-09-09 19:33   ` Stefan Hajnoczi
2025-09-09 20:51     ` Brian Song
2025-09-10 13:17       ` Stefan Hajnoczi
2025-09-15  5:43     ` Brian Song
2025-09-17 13:01       ` Hanna Czenczek
2025-09-17 22:06         ` Brian Song
2025-09-22 17:41           ` Stefan Hajnoczi
2025-09-22 17:51           ` Stefan Hajnoczi
2025-08-30  2:50 ` [PATCH 4/4] iotests: add tests for FUSE-over-io_uring Brian Song
2025-09-09 19:38   ` Stefan Hajnoczi
2025-09-09 20:51     ` Brian Song
2025-09-10 13:14       ` Stefan Hajnoczi
2025-09-12  2:22         ` Brian Song
2025-09-15 17:41           ` Stefan Hajnoczi
2025-08-30 12:00 ` [PATCH 0/4] export/fuse: Add FUSE-over-io_uring for Storage Exports Brian Song
2025-09-03  9:49   ` Stefan Hajnoczi
2025-09-03 18:11     ` Brian Song
2025-09-16 12:18       ` Kevin Wolf
2025-09-04 19:32   ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).