* [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports
@ 2025-08-15 3:46 Zhi Song
From: Zhi Song @ 2025-08-15 3:46 UTC
To: qemu-block
Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf,
stefanha
From: Brian Song <hibriansong@gmail.com>
Hello,
This is a GSoC project. You can find more details here:
https://wiki.qemu.org/Google_Summer_of_Code_2025#FUSE-over-io_uring_exports
This series:
- Merges the request processing functions for both traditional FUSE
and FUSE-over-io_uring modes
- Implements multi-threading (Multi-IOThread)
- Improves FUSE-over-io_uring termination handling
Due to kernel limitations, when the FUSE-over-io_uring option is enabled,
you must create and assign nr_cpu IOThreads. For example:
qemu-storage-daemon \
    --object iothread,id=iothread1 \
    --object iothread,id=iothread2 \
    --blockdev node-name=prot-node,driver=file,filename=img.qcow2 \
    --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
    --export type=fuse,id=exp0,node-name=fmt-node,mountpoint=mount-point,writable=on,iothread.0=iothread1,iothread.1=iothread2
More details can be found in the v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00280.html
Brian Song (3):
fuse: add FUSE-over-io_uring enable opt and init
fuse: Handle FUSE-uring requests
fuse: Safe termination for FUSE-uring
block/export/fuse.c | 632 ++++++++++++++++++++-------
docs/tools/qemu-storage-daemon.rst | 11 +-
qapi/block-export.json | 5 +-
storage-daemon/qemu-storage-daemon.c | 1 +
util/fdmon-io_uring.c | 5 +-
5 files changed, 486 insertions(+), 168 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-15 3:46 [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Zhi Song @ 2025-08-15 3:46 ` Zhi Song 2025-08-16 23:13 ` Brian Song 2025-08-15 3:46 ` [PATCH 2/3] fuse: Handle FUSE-uring requests Zhi Song ` (2 subsequent siblings) 3 siblings, 1 reply; 15+ messages in thread From: Zhi Song @ 2025-08-15 3:46 UTC (permalink / raw) To: qemu-block Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf, stefanha From: Brian Song <hibriansong@gmail.com> This patch adds a new export option for storage-export-daemon to enable or disable FUSE-over-io_uring via the switch io-uring=on|off (disable by default). It also implements the protocol handshake with the Linux kernel during the FUSE-over-io_uring initialization phase. See: https://docs.kernel.org/filesystems/fuse-io-uring.html The kernel documentation describes in detail how FUSE-over-io_uring works. This patch implements the Initial SQE stage shown in thediagram: it initializes one queue per IOThread, each currently supporting a single submission queue entry (SQE). When the FUSE driver sends the first FUSE request (FUSE_INIT), storage-export-daemon calls fuse_uring_start() to complete initialization, ultimately submitting the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm successful initialization with the kernel. Suggested-by: Kevin Wolf <kwolf@redhat.com> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Brian Song <hibriansong@gmail.com> --- block/export/fuse.c | 161 ++++++++++++++++++++++++--- docs/tools/qemu-storage-daemon.rst | 11 +- qapi/block-export.json | 5 +- storage-daemon/qemu-storage-daemon.c | 1 + util/fdmon-io_uring.c | 5 +- 5 files changed, 159 insertions(+), 24 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index c0ad4696ce..59fa79f486 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -48,6 +48,11 @@ #include <linux/fs.h> #endif +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 + +/* room needed in buffer to accommodate header */ +#define FUSE_BUFFER_HEADER_SIZE 0x1000 + /* Prevent overly long bounce buffer allocations */ #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) /* @@ -63,12 +68,31 @@ (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) typedef struct FuseExport FuseExport; +typedef struct FuseQueue FuseQueue; + +typedef struct FuseRingEnt { + /* back pointer */ + FuseQueue *q; + + /* commit id of a fuse request */ + uint64_t req_commit_id; + + /* fuse request header and payload */ + struct fuse_uring_req_header req_header; + void *op_payload; + size_t req_payload_sz; + + /* The vector passed to the kernel */ + struct iovec iov[2]; + + CqeHandler fuse_cqe_handler; +} FuseRingEnt; /* * One FUSE "queue", representing one FUSE FD from which requests are fetched * and processed. Each queue is tied to an AioContext. */ -typedef struct FuseQueue { +struct FuseQueue { FuseExport *exp; AioContext *ctx; @@ -109,7 +133,12 @@ typedef struct FuseQueue { * Free this buffer with qemu_vfree(). 
*/ void *spillover_buf; -} FuseQueue; + +#ifdef CONFIG_LINUX_IO_URING + int qid; + FuseRingEnt ent; +#endif +}; /* * Verify that FuseQueue.request_buf plus the spill-over buffer together @@ -148,6 +177,7 @@ struct FuseExport { bool growable; /* Whether allow_other was used as a mount option or not */ bool allow_other; + bool is_uring; mode_t st_mode; uid_t st_uid; @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { .drained_poll = fuse_export_drained_poll, }; +#ifdef CONFIG_LINUX_IO_URING + +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, + const unsigned int qid, + const unsigned int commit_id) +{ + req->qid = qid; + req->commit_id = commit_id; + req->flags = 0; +} + +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, + __u32 cmd_op) +{ + sqe->opcode = IORING_OP_URING_CMD; + + sqe->fd = q->fuse_fd; + sqe->rw_flags = 0; + sqe->ioprio = 0; + sqe->off = 0; + + sqe->cmd_op = cmd_op; + sqe->__pad1 = 0; +} + +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) +{ + FuseQueue *q = opaque; + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; + + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); + + sqe->addr = (uint64_t)(q->ent.iov); + sqe->len = 2; + + fuse_uring_sqe_set_req_data(req, q->qid, 0); +} + +static void fuse_uring_submit_register(void *opaque) +{ + FuseQueue *q = opaque; + FuseExport *exp = q->exp; + + + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); +} + +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out) +{ + /* + * Since we didn't enable the FUSE_MAX_PAGES feature, the value of + * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by + * the kernel by default. Also, max_write should not exceed + * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE. + */ + size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE; + + if (!(out->flags & FUSE_MAX_PAGES)) { + bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size() + + FUSE_BUFFER_HEADER_SIZE; + } + + for (int i = 0; i < exp->num_queues; i++) { + FuseQueue *q = &exp->queues[i]; + FuseRingEnt *ent = &q->ent; + + ent->q = q; + + ent->req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE; + ent->op_payload = g_malloc0(ent->req_payload_sz); + + ent->iov[0] = (struct iovec) { + &(ent->req_header), + sizeof(struct fuse_uring_req_header) + }; + ent->iov[1] = (struct iovec) { + ent->op_payload, + ent->req_payload_sz + }; + + ent->fuse_cqe_handler.cb = fuse_uring_cqe_handler; + + aio_bh_schedule_oneshot(q->ctx, fuse_uring_submit_register, q); + } +} +#endif + static int fuse_export_create(BlockExport *blk_exp, BlockExportOptions *blk_exp_args, AioContext *const *multithread, @@ -280,6 +397,9 @@ static int fuse_export_create(BlockExport *blk_exp, for (size_t i = 0; i < mt_count; i++) { exp->queues[i] = (FuseQueue) { +#ifdef CONFIG_LINUX_IO_URING + .qid = i, +#endif .exp = exp, .ctx = multithread[i], .fuse_fd = -1, @@ -293,6 +413,9 @@ static int fuse_export_create(BlockExport *blk_exp, exp->num_queues = 1; exp->queues = g_new(FuseQueue, 1); exp->queues[0] = (FuseQueue) { +#ifdef CONFIG_LINUX_IO_URING + .qid = 0, +#endif .exp = exp, .ctx = exp->common.ctx, .fuse_fd = -1, @@ -312,6 +435,8 @@ static int fuse_export_create(BlockExport *blk_exp, } } + exp->is_uring = args->io_uring ? 
true : false; + blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp); /* @@ -687,15 +812,22 @@ static ssize_t coroutine_fn fuse_co_init(FuseExport *exp, struct fuse_init_out *out, uint32_t max_readahead, uint32_t flags) { - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO; + const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO + | FUSE_INIT_EXT; + uint64_t outargflags = flags; + +#ifdef CONFIG_LINUX_IO_URING + if (exp->is_uring) + outargflags |= FUSE_OVER_IO_URING; +#endif *out = (struct fuse_init_out) { .major = FUSE_KERNEL_VERSION, .minor = FUSE_KERNEL_MINOR_VERSION, .max_readahead = max_readahead, .max_write = FUSE_MAX_WRITE_BYTES, - .flags = flags & supported_flags, - .flags2 = 0, + .flags = outargflags & supported_flags, + .flags2 = outargflags >> 32, /* libfuse maximum: 2^16 - 1 */ .max_background = UINT16_MAX, @@ -1393,22 +1525,17 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf) struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf; /* For read requests: Data to be returned */ void *out_data_buffer = NULL; - ssize_t ret; - /* Limit scope to ensure pointer is no longer used after yielding */ - { - const struct fuse_in_header *in_hdr = - (const struct fuse_in_header *)q->request_buf; - - opcode = in_hdr->opcode; - req_id = in_hdr->unique; - } + bool is_uring = exp->is_uring; switch (opcode) { case FUSE_INIT: { - const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q); - ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf), - in->max_readahead, in->flags); +#ifdef CONFIG_LINUX_IO_URING + /* FUSE-over-io_uring enabled && start from the tradition path */ + if (is_uring) { + fuse_uring_start(exp, out); + } +#endif break; } diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst index 35ab2d7807..c5076101e0 100644 --- a/docs/tools/qemu-storage-daemon.rst +++ b/docs/tools/qemu-storage-daemon.rst @@ -78,7 +78,7 @@ Standard options: .. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>] --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>] --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>] - --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto] + --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,io-uring=on|off] --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>] is a block export definition. ``node-name`` is the block node that should be @@ -111,10 +111,11 @@ Standard options: that enabling this option as a non-root user requires enabling the user_allow_other option in the global fuse.conf configuration file. Setting ``allow-other`` to auto (the default) will try enabling this option, and on - error fall back to disabling it. - - The ``vduse-blk`` export type takes a ``name`` (must be unique across the host) - to create the VDUSE device. + error fall back to disabling it. 
Once ``io-uring`` is enabled (off by default), + the FUSE-over-io_uring-related settings will be initialized to bypass the + traditional /dev/fuse communication mechanism and instead use io_uring to + handle FUSE operations. The ``vduse-blk`` export type takes a ``name`` + (must be unique across the host) to create the VDUSE device. ``num-queues`` sets the number of virtqueues (the default is 1). ``queue-size`` sets the virtqueue descriptor table size (the default is 256). diff --git a/qapi/block-export.json b/qapi/block-export.json index 9ae703ad01..37f2fc47e2 100644 --- a/qapi/block-export.json +++ b/qapi/block-export.json @@ -184,12 +184,15 @@ # mount the export with allow_other, and if that fails, try again # without. (since 6.1; default: auto) # +# @io-uring: Use FUSE-over-io-uring. (since 10.2; default: false) +# # Since: 6.0 ## { 'struct': 'BlockExportOptionsFuse', 'data': { 'mountpoint': 'str', '*growable': 'bool', - '*allow-other': 'FuseExportAllowOther' }, + '*allow-other': 'FuseExportAllowOther', + '*io-uring': 'bool' }, 'if': 'CONFIG_FUSE' } ## diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c index eb72561358..0cd4cd2b58 100644 --- a/storage-daemon/qemu-storage-daemon.c +++ b/storage-daemon/qemu-storage-daemon.c @@ -107,6 +107,7 @@ static void help(void) #ifdef CONFIG_FUSE " --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>\n" " [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n" +" [,io-uring=on|off]" " export the specified block node over FUSE\n" "\n" #endif /* CONFIG_FUSE */ diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c index d2433d1d99..68d3fe8e01 100644 --- a/util/fdmon-io_uring.c +++ b/util/fdmon-io_uring.c @@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = { void fdmon_io_uring_setup(AioContext *ctx, Error **errp) { int ret; + int flags; ctx->io_uring_fd_tag = NULL; + flags = IORING_SETUP_SQE128; - ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0); + ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, + &ctx->fdmon_io_uring, flags); if (ret != 0) { error_setg_errno(errp, -ret, "Failed to initialize io_uring"); return; -- 2.45.2 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-15 3:46 ` [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init Zhi Song @ 2025-08-16 23:13 ` Brian Song 2025-08-17 13:42 ` Stefan Hajnoczi 2025-08-18 23:04 ` Bernd Schubert 0 siblings, 2 replies; 15+ messages in thread From: Brian Song @ 2025-08-16 23:13 UTC (permalink / raw) To: qemu-block; +Cc: qemu-devel, armbru, bernd, fam, hreitz, kwolf, stefanha On 8/14/25 11:46 PM, Brian Song wrote: > From: Brian Song <hibriansong@gmail.com> > > This patch adds a new export option for storage-export-daemon to enable > or disable FUSE-over-io_uring via the switch io-uring=on|off (disable > by default). It also implements the protocol handshake with the Linux > kernel during the FUSE-over-io_uring initialization phase. > > See: https://docs.kernel.org/filesystems/fuse-io-uring.html > > The kernel documentation describes in detail how FUSE-over-io_uring > works. This patch implements the Initial SQE stage shown in thediagram: > it initializes one queue per IOThread, each currently supporting a > single submission queue entry (SQE). When the FUSE driver sends the > first FUSE request (FUSE_INIT), storage-export-daemon calls > fuse_uring_start() to complete initialization, ultimately submitting > the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm > successful initialization with the kernel. > > Suggested-by: Kevin Wolf <kwolf@redhat.com> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Brian Song <hibriansong@gmail.com> > --- > block/export/fuse.c | 161 ++++++++++++++++++++++++--- > docs/tools/qemu-storage-daemon.rst | 11 +- > qapi/block-export.json | 5 +- > storage-daemon/qemu-storage-daemon.c | 1 + > util/fdmon-io_uring.c | 5 +- > 5 files changed, 159 insertions(+), 24 deletions(-) > > diff --git a/block/export/fuse.c b/block/export/fuse.c > index c0ad4696ce..59fa79f486 100644 > --- a/block/export/fuse.c > +++ b/block/export/fuse.c > @@ -48,6 +48,11 @@ > #include <linux/fs.h> > #endif > > +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 > + > +/* room needed in buffer to accommodate header */ > +#define FUSE_BUFFER_HEADER_SIZE 0x1000 > + > /* Prevent overly long bounce buffer allocations */ > #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) > /* > @@ -63,12 +68,31 @@ > (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) > > typedef struct FuseExport FuseExport; > +typedef struct FuseQueue FuseQueue; > + > +typedef struct FuseRingEnt { > + /* back pointer */ > + FuseQueue *q; > + > + /* commit id of a fuse request */ > + uint64_t req_commit_id; > + > + /* fuse request header and payload */ > + struct fuse_uring_req_header req_header; > + void *op_payload; > + size_t req_payload_sz; > + > + /* The vector passed to the kernel */ > + struct iovec iov[2]; > + > + CqeHandler fuse_cqe_handler; > +} FuseRingEnt; > > /* > * One FUSE "queue", representing one FUSE FD from which requests are fetched > * and processed. Each queue is tied to an AioContext. > */ > -typedef struct FuseQueue { > +struct FuseQueue { > FuseExport *exp; > > AioContext *ctx; > @@ -109,7 +133,12 @@ typedef struct FuseQueue { > * Free this buffer with qemu_vfree(). 
> */ > void *spillover_buf; > -} FuseQueue; > + > +#ifdef CONFIG_LINUX_IO_URING > + int qid; > + FuseRingEnt ent; > +#endif > +}; > > /* > * Verify that FuseQueue.request_buf plus the spill-over buffer together > @@ -148,6 +177,7 @@ struct FuseExport { > bool growable; > /* Whether allow_other was used as a mount option or not */ > bool allow_other; > + bool is_uring; > > mode_t st_mode; > uid_t st_uid; > @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { > .drained_poll = fuse_export_drained_poll, > }; > > +#ifdef CONFIG_LINUX_IO_URING > + > +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, > + const unsigned int qid, > + const unsigned int commit_id) > +{ > + req->qid = qid; > + req->commit_id = commit_id; > + req->flags = 0; > +} > + > +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, > + __u32 cmd_op) > +{ > + sqe->opcode = IORING_OP_URING_CMD; > + > + sqe->fd = q->fuse_fd; > + sqe->rw_flags = 0; > + sqe->ioprio = 0; > + sqe->off = 0; > + > + sqe->cmd_op = cmd_op; > + sqe->__pad1 = 0; > +} > + > +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) > +{ > + FuseQueue *q = opaque; > + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; > + > + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); > + > + sqe->addr = (uint64_t)(q->ent.iov); > + sqe->len = 2; > + > + fuse_uring_sqe_set_req_data(req, q->qid, 0); > +} > + > +static void fuse_uring_submit_register(void *opaque) > +{ > + FuseQueue *q = opaque; > + FuseExport *exp = q->exp; > + > + > + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); I think there might be a tricky issue with the io_uring integration in QEMU. Currently, when the number of IOThreads goes above ~6 or 7, there’s a pretty high chance of a hang. I added some debug logging in the kernel’s fuse_uring_cmd() registration part, and noticed that the number of register calls is less than the total number of entries in the queue. In theory, we should be registering each entry for each queue. On the userspace side, everything seems normal, the number of aio_add_sqe() calls matches the number of IOThreads. But here’s the weird part: if I add a printf inside the while loop in fdmon-io_uring.c::fdmon_io_uring_wait(), suddenly everything works fine, and the kernel receives registration requests for all entries as expected. do { ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr); fprintf(stderr, "io_uring_submit_and_wait ret: %d\n", ret); } while (ret == -EINTR); My guess is that printf is just slowing down the loop, or maybe there’s some implicit memory barrier happening. Obviously, the right fix isn’t to sprinkle fprintfs around. I suspect there might be a subtle synchronization/race issue here. Brian ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-16 23:13 ` Brian Song @ 2025-08-17 13:42 ` Stefan Hajnoczi 2025-08-18 23:04 ` Bernd Schubert 1 sibling, 0 replies; 15+ messages in thread From: Stefan Hajnoczi @ 2025-08-17 13:42 UTC (permalink / raw) To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf [-- Attachment #1: Type: text/plain, Size: 7269 bytes --] On Sat, Aug 16, 2025 at 07:13:53PM -0400, Brian Song wrote: > > > On 8/14/25 11:46 PM, Brian Song wrote: > > From: Brian Song <hibriansong@gmail.com> > > > > This patch adds a new export option for storage-export-daemon to enable > > or disable FUSE-over-io_uring via the switch io-uring=on|off (disable > > by default). It also implements the protocol handshake with the Linux > > kernel during the FUSE-over-io_uring initialization phase. > > > > See: https://docs.kernel.org/filesystems/fuse-io-uring.html > > > > The kernel documentation describes in detail how FUSE-over-io_uring > > works. This patch implements the Initial SQE stage shown in thediagram: > > it initializes one queue per IOThread, each currently supporting a > > single submission queue entry (SQE). When the FUSE driver sends the > > first FUSE request (FUSE_INIT), storage-export-daemon calls > > fuse_uring_start() to complete initialization, ultimately submitting > > the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm > > successful initialization with the kernel. > > > > Suggested-by: Kevin Wolf <kwolf@redhat.com> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > Signed-off-by: Brian Song <hibriansong@gmail.com> > > --- > > block/export/fuse.c | 161 ++++++++++++++++++++++++--- > > docs/tools/qemu-storage-daemon.rst | 11 +- > > qapi/block-export.json | 5 +- > > storage-daemon/qemu-storage-daemon.c | 1 + > > util/fdmon-io_uring.c | 5 +- > > 5 files changed, 159 insertions(+), 24 deletions(-) > > > > diff --git a/block/export/fuse.c b/block/export/fuse.c > > index c0ad4696ce..59fa79f486 100644 > > --- a/block/export/fuse.c > > +++ b/block/export/fuse.c > > @@ -48,6 +48,11 @@ > > #include <linux/fs.h> > > #endif > > > > +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 > > + > > +/* room needed in buffer to accommodate header */ > > +#define FUSE_BUFFER_HEADER_SIZE 0x1000 > > + > > /* Prevent overly long bounce buffer allocations */ > > #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) > > /* > > @@ -63,12 +68,31 @@ > > (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) > > > > typedef struct FuseExport FuseExport; > > +typedef struct FuseQueue FuseQueue; > > + > > +typedef struct FuseRingEnt { > > + /* back pointer */ > > + FuseQueue *q; > > + > > + /* commit id of a fuse request */ > > + uint64_t req_commit_id; > > + > > + /* fuse request header and payload */ > > + struct fuse_uring_req_header req_header; > > + void *op_payload; > > + size_t req_payload_sz; > > + > > + /* The vector passed to the kernel */ > > + struct iovec iov[2]; > > + > > + CqeHandler fuse_cqe_handler; > > +} FuseRingEnt; > > > > /* > > * One FUSE "queue", representing one FUSE FD from which requests are fetched > > * and processed. Each queue is tied to an AioContext. > > */ > > -typedef struct FuseQueue { > > +struct FuseQueue { > > FuseExport *exp; > > > > AioContext *ctx; > > @@ -109,7 +133,12 @@ typedef struct FuseQueue { > > * Free this buffer with qemu_vfree(). 
> > */ > > void *spillover_buf; > > -} FuseQueue; > > + > > +#ifdef CONFIG_LINUX_IO_URING > > + int qid; > > + FuseRingEnt ent; > > +#endif > > +}; > > > > /* > > * Verify that FuseQueue.request_buf plus the spill-over buffer together > > @@ -148,6 +177,7 @@ struct FuseExport { > > bool growable; > > /* Whether allow_other was used as a mount option or not */ > > bool allow_other; > > + bool is_uring; > > > > mode_t st_mode; > > uid_t st_uid; > > @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { > > .drained_poll = fuse_export_drained_poll, > > }; > > > > +#ifdef CONFIG_LINUX_IO_URING > > + > > +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, > > + const unsigned int qid, > > + const unsigned int commit_id) > > +{ > > + req->qid = qid; > > + req->commit_id = commit_id; > > + req->flags = 0; > > +} > > + > > +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, > > + __u32 cmd_op) > > +{ > > + sqe->opcode = IORING_OP_URING_CMD; > > + > > + sqe->fd = q->fuse_fd; > > + sqe->rw_flags = 0; > > + sqe->ioprio = 0; > > + sqe->off = 0; > > + > > + sqe->cmd_op = cmd_op; > > + sqe->__pad1 = 0; > > +} > > + > > +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) > > +{ > > + FuseQueue *q = opaque; > > + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; > > + > > + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); > > + > > + sqe->addr = (uint64_t)(q->ent.iov); > > + sqe->len = 2; > > + > > + fuse_uring_sqe_set_req_data(req, q->qid, 0); > > +} > > + > > +static void fuse_uring_submit_register(void *opaque) > > +{ > > + FuseQueue *q = opaque; > > + FuseExport *exp = q->exp; > > + > > + > > + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); > > I think there might be a tricky issue with the io_uring integration in QEMU. > Currently, when the number of IOThreads goes above ~6 or 7, there’s a pretty > high chance of a hang. I added some debug logging in the kernel’s > fuse_uring_cmd() registration part, and noticed that the number of register > calls is less than the total number of entries in the queue. In theory, we > should be registering each entry for each queue. > > On the userspace side, everything seems normal, the number of aio_add_sqe() > calls matches the number of IOThreads. But here’s the weird part: if I add a > printf inside the while loop in fdmon-io_uring.c::fdmon_io_uring_wait(), > suddenly everything works fine, and the kernel receives registration > requests for all entries as expected. > > do { > ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr); > fprintf(stderr, "io_uring_submit_and_wait ret: %d\n", ret); > } while (ret == -EINTR); > > My guess is that printf is just slowing down the loop, or maybe there’s some > implicit memory barrier happening. Obviously, the right fix isn’t to > sprinkle fprintfs around. I suspect there might be a subtle > synchronization/race issue here. Strange, your fprintf(3) is after io_uring_submit_and_wait(3). I'm not sure how that would influence timing because there should be num_cpus IOThreads independently submitting 1 REGISTER uring_cmd. Debugging ideas: - When QEMU hangs, cat /proc/<pid>/fdinfo/<fd> for each IOThread's io_uring file descriptor. That shows you what the kernel sees, including the state of the SQ/CQ rings. If userspace has filled in the SQE then the output should reflect that. - Replace the REGISTER uring_cmd SQE with a IORING_OP_NOP SQE. 
This way you eliminate FUSE and can focus purely on testing io_uring. If the CQE is still missing then there is probably a bug in QEMU's aio_add_sqe() API. Stefan > > Brian > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
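A minimal sketch of the IORING_OP_NOP substitution Stefan suggests, written against the aio_add_sqe()/CqeHandler API this series builds on; the handler layout and the helper names below are assumptions for illustration, not code from the patches:

static void fuse_uring_prep_sqe_nop(struct io_uring_sqe *sqe, void *opaque)
{
    /* A NOP completes immediately; fd/addr/len are ignored by the kernel */
    sqe->opcode = IORING_OP_NOP;
    sqe->fd = -1;
    sqe->off = 0;
    sqe->addr = 0;
    sqe->len = 0;
    sqe->rw_flags = 0;
}

static void fuse_uring_nop_cqe_handler(CqeHandler *cqe_handler)
{
    /* Expect exactly one CQE per queue; a missing one points at aio_add_sqe() */
    fprintf(stderr, "NOP CQE arrived\n");
}

Then, in fuse_uring_submit_register(), temporarily submit the NOP instead of the REGISTER uring_cmd:

    q->ent.fuse_cqe_handler.cb = fuse_uring_nop_cqe_handler;
    aio_add_sqe(fuse_uring_prep_sqe_nop, q, &q->ent.fuse_cqe_handler);

If every queue's NOP completes but the REGISTER does not, the problem is on the FUSE side rather than in the io_uring plumbing.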
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-16 23:13 ` Brian Song 2025-08-17 13:42 ` Stefan Hajnoczi @ 2025-08-18 23:04 ` Bernd Schubert 2025-08-19 1:15 ` Brian Song 1 sibling, 1 reply; 15+ messages in thread From: Bernd Schubert @ 2025-08-18 23:04 UTC (permalink / raw) To: Brian Song, qemu-block; +Cc: qemu-devel, armbru, fam, hreitz, kwolf, stefanha On 8/17/25 01:13, Brian Song wrote: > > > On 8/14/25 11:46 PM, Brian Song wrote: >> From: Brian Song <hibriansong@gmail.com> >> >> This patch adds a new export option for storage-export-daemon to enable >> or disable FUSE-over-io_uring via the switch io-uring=on|off (disable >> by default). It also implements the protocol handshake with the Linux >> kernel during the FUSE-over-io_uring initialization phase. >> >> See: https://docs.kernel.org/filesystems/fuse-io-uring.html >> >> The kernel documentation describes in detail how FUSE-over-io_uring >> works. This patch implements the Initial SQE stage shown in thediagram: >> it initializes one queue per IOThread, each currently supporting a >> single submission queue entry (SQE). When the FUSE driver sends the >> first FUSE request (FUSE_INIT), storage-export-daemon calls >> fuse_uring_start() to complete initialization, ultimately submitting >> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm >> successful initialization with the kernel. >> >> Suggested-by: Kevin Wolf <kwolf@redhat.com> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >> Signed-off-by: Brian Song <hibriansong@gmail.com> >> --- >> block/export/fuse.c | 161 ++++++++++++++++++++++++--- >> docs/tools/qemu-storage-daemon.rst | 11 +- >> qapi/block-export.json | 5 +- >> storage-daemon/qemu-storage-daemon.c | 1 + >> util/fdmon-io_uring.c | 5 +- >> 5 files changed, 159 insertions(+), 24 deletions(-) >> >> diff --git a/block/export/fuse.c b/block/export/fuse.c >> index c0ad4696ce..59fa79f486 100644 >> --- a/block/export/fuse.c >> +++ b/block/export/fuse.c >> @@ -48,6 +48,11 @@ >> #include <linux/fs.h> >> #endif >> >> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 >> + >> +/* room needed in buffer to accommodate header */ >> +#define FUSE_BUFFER_HEADER_SIZE 0x1000 >> + >> /* Prevent overly long bounce buffer allocations */ >> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) >> /* >> @@ -63,12 +68,31 @@ >> (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) >> >> typedef struct FuseExport FuseExport; >> +typedef struct FuseQueue FuseQueue; >> + >> +typedef struct FuseRingEnt { >> + /* back pointer */ >> + FuseQueue *q; >> + >> + /* commit id of a fuse request */ >> + uint64_t req_commit_id; >> + >> + /* fuse request header and payload */ >> + struct fuse_uring_req_header req_header; >> + void *op_payload; >> + size_t req_payload_sz; >> + >> + /* The vector passed to the kernel */ >> + struct iovec iov[2]; >> + >> + CqeHandler fuse_cqe_handler; >> +} FuseRingEnt; >> >> /* >> * One FUSE "queue", representing one FUSE FD from which requests are fetched >> * and processed. Each queue is tied to an AioContext. >> */ >> -typedef struct FuseQueue { >> +struct FuseQueue { >> FuseExport *exp; >> >> AioContext *ctx; >> @@ -109,7 +133,12 @@ typedef struct FuseQueue { >> * Free this buffer with qemu_vfree(). 
>> */ >> void *spillover_buf; >> -} FuseQueue; >> + >> +#ifdef CONFIG_LINUX_IO_URING >> + int qid; >> + FuseRingEnt ent; >> +#endif >> +}; >> >> /* >> * Verify that FuseQueue.request_buf plus the spill-over buffer together >> @@ -148,6 +177,7 @@ struct FuseExport { >> bool growable; >> /* Whether allow_other was used as a mount option or not */ >> bool allow_other; >> + bool is_uring; >> >> mode_t st_mode; >> uid_t st_uid; >> @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { >> .drained_poll = fuse_export_drained_poll, >> }; >> >> +#ifdef CONFIG_LINUX_IO_URING >> + >> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, >> + const unsigned int qid, >> + const unsigned int commit_id) >> +{ >> + req->qid = qid; >> + req->commit_id = commit_id; >> + req->flags = 0; >> +} >> + >> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, >> + __u32 cmd_op) >> +{ >> + sqe->opcode = IORING_OP_URING_CMD; >> + >> + sqe->fd = q->fuse_fd; >> + sqe->rw_flags = 0; >> + sqe->ioprio = 0; >> + sqe->off = 0; >> + >> + sqe->cmd_op = cmd_op; >> + sqe->__pad1 = 0; >> +} >> + >> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) >> +{ >> + FuseQueue *q = opaque; >> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; >> + >> + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); >> + >> + sqe->addr = (uint64_t)(q->ent.iov); >> + sqe->len = 2; >> + >> + fuse_uring_sqe_set_req_data(req, q->qid, 0); >> +} >> + >> +static void fuse_uring_submit_register(void *opaque) >> +{ >> + FuseQueue *q = opaque; >> + FuseExport *exp = q->exp; >> + >> + >> + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); > > I think there might be a tricky issue with the io_uring integration in > QEMU. Currently, when the number of IOThreads goes above ~6 or 7, > there’s a pretty high chance of a hang. I added some debug logging in > the kernel’s fuse_uring_cmd() registration part, and noticed that the > number of register calls is less than the total number of entries in the > queue. In theory, we should be registering each entry for each queue. Did you also try to add logging at the top of fuse_uring_cmd()? I wonder if there is a start up race and if initial commands are just getting refused. I had run into issues you are describing in some versions of the -rfc patches, but thought that everything was fixed for that. I.e. not excluded that there is still a kernel issue left. Thanks, Bernd ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-18 23:04 ` Bernd Schubert @ 2025-08-19 1:15 ` Brian Song 2025-08-19 22:26 ` Bernd Schubert 0 siblings, 1 reply; 15+ messages in thread From: Brian Song @ 2025-08-19 1:15 UTC (permalink / raw) To: Bernd Schubert, qemu-block Cc: qemu-devel, armbru, fam, hreitz, kwolf, stefanha On 8/18/25 7:04 PM, Bernd Schubert wrote: > > > On 8/17/25 01:13, Brian Song wrote: >> >> >> On 8/14/25 11:46 PM, Brian Song wrote: >>> From: Brian Song <hibriansong@gmail.com> >>> >>> This patch adds a new export option for storage-export-daemon to enable >>> or disable FUSE-over-io_uring via the switch io-uring=on|off (disable >>> by default). It also implements the protocol handshake with the Linux >>> kernel during the FUSE-over-io_uring initialization phase. >>> >>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html >>> >>> The kernel documentation describes in detail how FUSE-over-io_uring >>> works. This patch implements the Initial SQE stage shown in thediagram: >>> it initializes one queue per IOThread, each currently supporting a >>> single submission queue entry (SQE). When the FUSE driver sends the >>> first FUSE request (FUSE_INIT), storage-export-daemon calls >>> fuse_uring_start() to complete initialization, ultimately submitting >>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm >>> successful initialization with the kernel. >>> >>> Suggested-by: Kevin Wolf <kwolf@redhat.com> >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>> Signed-off-by: Brian Song <hibriansong@gmail.com> >>> --- >>> block/export/fuse.c | 161 ++++++++++++++++++++++++--- >>> docs/tools/qemu-storage-daemon.rst | 11 +- >>> qapi/block-export.json | 5 +- >>> storage-daemon/qemu-storage-daemon.c | 1 + >>> util/fdmon-io_uring.c | 5 +- >>> 5 files changed, 159 insertions(+), 24 deletions(-) >>> >>> diff --git a/block/export/fuse.c b/block/export/fuse.c >>> index c0ad4696ce..59fa79f486 100644 >>> --- a/block/export/fuse.c >>> +++ b/block/export/fuse.c >>> @@ -48,6 +48,11 @@ >>> #include <linux/fs.h> >>> #endif >>> >>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 >>> + >>> +/* room needed in buffer to accommodate header */ >>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000 >>> + >>> /* Prevent overly long bounce buffer allocations */ >>> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) >>> /* >>> @@ -63,12 +68,31 @@ >>> (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) >>> >>> typedef struct FuseExport FuseExport; >>> +typedef struct FuseQueue FuseQueue; >>> + >>> +typedef struct FuseRingEnt { >>> + /* back pointer */ >>> + FuseQueue *q; >>> + >>> + /* commit id of a fuse request */ >>> + uint64_t req_commit_id; >>> + >>> + /* fuse request header and payload */ >>> + struct fuse_uring_req_header req_header; >>> + void *op_payload; >>> + size_t req_payload_sz; >>> + >>> + /* The vector passed to the kernel */ >>> + struct iovec iov[2]; >>> + >>> + CqeHandler fuse_cqe_handler; >>> +} FuseRingEnt; >>> >>> /* >>> * One FUSE "queue", representing one FUSE FD from which requests are fetched >>> * and processed. Each queue is tied to an AioContext. >>> */ >>> -typedef struct FuseQueue { >>> +struct FuseQueue { >>> FuseExport *exp; >>> >>> AioContext *ctx; >>> @@ -109,7 +133,12 @@ typedef struct FuseQueue { >>> * Free this buffer with qemu_vfree(). 
>>> */ >>> void *spillover_buf; >>> -} FuseQueue; >>> + >>> +#ifdef CONFIG_LINUX_IO_URING >>> + int qid; >>> + FuseRingEnt ent; >>> +#endif >>> +}; >>> >>> /* >>> * Verify that FuseQueue.request_buf plus the spill-over buffer together >>> @@ -148,6 +177,7 @@ struct FuseExport { >>> bool growable; >>> /* Whether allow_other was used as a mount option or not */ >>> bool allow_other; >>> + bool is_uring; >>> >>> mode_t st_mode; >>> uid_t st_uid; >>> @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { >>> .drained_poll = fuse_export_drained_poll, >>> }; >>> >>> +#ifdef CONFIG_LINUX_IO_URING >>> + >>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, >>> + const unsigned int qid, >>> + const unsigned int commit_id) >>> +{ >>> + req->qid = qid; >>> + req->commit_id = commit_id; >>> + req->flags = 0; >>> +} >>> + >>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, >>> + __u32 cmd_op) >>> +{ >>> + sqe->opcode = IORING_OP_URING_CMD; >>> + >>> + sqe->fd = q->fuse_fd; >>> + sqe->rw_flags = 0; >>> + sqe->ioprio = 0; >>> + sqe->off = 0; >>> + >>> + sqe->cmd_op = cmd_op; >>> + sqe->__pad1 = 0; >>> +} >>> + >>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) >>> +{ >>> + FuseQueue *q = opaque; >>> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; >>> + >>> + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); >>> + >>> + sqe->addr = (uint64_t)(q->ent.iov); >>> + sqe->len = 2; >>> + >>> + fuse_uring_sqe_set_req_data(req, q->qid, 0); >>> +} >>> + >>> +static void fuse_uring_submit_register(void *opaque) >>> +{ >>> + FuseQueue *q = opaque; >>> + FuseExport *exp = q->exp; >>> + >>> + >>> + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); >> >> I think there might be a tricky issue with the io_uring integration in >> QEMU. Currently, when the number of IOThreads goes above ~6 or 7, >> there’s a pretty high chance of a hang. I added some debug logging in >> the kernel’s fuse_uring_cmd() registration part, and noticed that the >> number of register calls is less than the total number of entries in the >> queue. In theory, we should be registering each entry for each queue. > > Did you also try to add logging at the top of fuse_uring_cmd()? I wonder > if there is a start up race and if initial commands are just getting > refused. I had run into issues you are describing in some versions of > the -rfc patches, but thought that everything was fixed for that. > I.e. not excluded that there is still a kernel issue left. > > Thanks, > Bernd > > Yes. I added a printk at the beginning of fuse_uring_cmd(), another at the beginning of fuse_uring_register(), and one more at the end of fuse_uring_do_register(). Then I created and registered 20 queues, each with a single ring entry. It printed 37 times(diff every time) with opcode FUSE_IO_URING_CMD_REGISTER (would expect 20), and only 6 queues were registered successfully. The rest of fuse_uring_cmd (x31) exited inside the if (!fc->initialized) branch in fuse_uring_cmd() dmesg: https://gist.github.com/hibriansong/4eda6e7e92601df497282dcd56fd5470 Thanks, Brian ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-19 1:15 ` Brian Song @ 2025-08-19 22:26 ` Bernd Schubert 2025-08-19 23:23 ` Brian Song 0 siblings, 1 reply; 15+ messages in thread From: Bernd Schubert @ 2025-08-19 22:26 UTC (permalink / raw) To: Brian Song, qemu-block; +Cc: qemu-devel, armbru, fam, hreitz, kwolf, stefanha On 8/19/25 03:15, Brian Song wrote: > > > On 8/18/25 7:04 PM, Bernd Schubert wrote: >> >> >> On 8/17/25 01:13, Brian Song wrote: >>> >>> >>> On 8/14/25 11:46 PM, Brian Song wrote: >>>> From: Brian Song <hibriansong@gmail.com> >>>> >>>> This patch adds a new export option for storage-export-daemon to enable >>>> or disable FUSE-over-io_uring via the switch io-uring=on|off (disable >>>> by default). It also implements the protocol handshake with the Linux >>>> kernel during the FUSE-over-io_uring initialization phase. >>>> >>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html >>>> >>>> The kernel documentation describes in detail how FUSE-over-io_uring >>>> works. This patch implements the Initial SQE stage shown in thediagram: >>>> it initializes one queue per IOThread, each currently supporting a >>>> single submission queue entry (SQE). When the FUSE driver sends the >>>> first FUSE request (FUSE_INIT), storage-export-daemon calls >>>> fuse_uring_start() to complete initialization, ultimately submitting >>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm >>>> successful initialization with the kernel. >>>> >>>> Suggested-by: Kevin Wolf <kwolf@redhat.com> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>> Signed-off-by: Brian Song <hibriansong@gmail.com> >>>> --- >>>> block/export/fuse.c | 161 ++++++++++++++++++++++++--- >>>> docs/tools/qemu-storage-daemon.rst | 11 +- >>>> qapi/block-export.json | 5 +- >>>> storage-daemon/qemu-storage-daemon.c | 1 + >>>> util/fdmon-io_uring.c | 5 +- >>>> 5 files changed, 159 insertions(+), 24 deletions(-) >>>> >>>> diff --git a/block/export/fuse.c b/block/export/fuse.c >>>> index c0ad4696ce..59fa79f486 100644 >>>> --- a/block/export/fuse.c >>>> +++ b/block/export/fuse.c >>>> @@ -48,6 +48,11 @@ >>>> #include <linux/fs.h> >>>> #endif >>>> >>>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 >>>> + >>>> +/* room needed in buffer to accommodate header */ >>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000 >>>> + >>>> /* Prevent overly long bounce buffer allocations */ >>>> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) >>>> /* >>>> @@ -63,12 +68,31 @@ >>>> (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) >>>> >>>> typedef struct FuseExport FuseExport; >>>> +typedef struct FuseQueue FuseQueue; >>>> + >>>> +typedef struct FuseRingEnt { >>>> + /* back pointer */ >>>> + FuseQueue *q; >>>> + >>>> + /* commit id of a fuse request */ >>>> + uint64_t req_commit_id; >>>> + >>>> + /* fuse request header and payload */ >>>> + struct fuse_uring_req_header req_header; >>>> + void *op_payload; >>>> + size_t req_payload_sz; >>>> + >>>> + /* The vector passed to the kernel */ >>>> + struct iovec iov[2]; >>>> + >>>> + CqeHandler fuse_cqe_handler; >>>> +} FuseRingEnt; >>>> >>>> /* >>>> * One FUSE "queue", representing one FUSE FD from which requests are fetched >>>> * and processed. Each queue is tied to an AioContext. >>>> */ >>>> -typedef struct FuseQueue { >>>> +struct FuseQueue { >>>> FuseExport *exp; >>>> >>>> AioContext *ctx; >>>> @@ -109,7 +133,12 @@ typedef struct FuseQueue { >>>> * Free this buffer with qemu_vfree(). 
>>>> */ >>>> void *spillover_buf; >>>> -} FuseQueue; >>>> + >>>> +#ifdef CONFIG_LINUX_IO_URING >>>> + int qid; >>>> + FuseRingEnt ent; >>>> +#endif >>>> +}; >>>> >>>> /* >>>> * Verify that FuseQueue.request_buf plus the spill-over buffer together >>>> @@ -148,6 +177,7 @@ struct FuseExport { >>>> bool growable; >>>> /* Whether allow_other was used as a mount option or not */ >>>> bool allow_other; >>>> + bool is_uring; >>>> >>>> mode_t st_mode; >>>> uid_t st_uid; >>>> @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { >>>> .drained_poll = fuse_export_drained_poll, >>>> }; >>>> >>>> +#ifdef CONFIG_LINUX_IO_URING >>>> + >>>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, >>>> + const unsigned int qid, >>>> + const unsigned int commit_id) >>>> +{ >>>> + req->qid = qid; >>>> + req->commit_id = commit_id; >>>> + req->flags = 0; >>>> +} >>>> + >>>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, >>>> + __u32 cmd_op) >>>> +{ >>>> + sqe->opcode = IORING_OP_URING_CMD; >>>> + >>>> + sqe->fd = q->fuse_fd; >>>> + sqe->rw_flags = 0; >>>> + sqe->ioprio = 0; >>>> + sqe->off = 0; >>>> + >>>> + sqe->cmd_op = cmd_op; >>>> + sqe->__pad1 = 0; >>>> +} >>>> + >>>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) >>>> +{ >>>> + FuseQueue *q = opaque; >>>> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; >>>> + >>>> + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); >>>> + >>>> + sqe->addr = (uint64_t)(q->ent.iov); >>>> + sqe->len = 2; >>>> + >>>> + fuse_uring_sqe_set_req_data(req, q->qid, 0); >>>> +} >>>> + >>>> +static void fuse_uring_submit_register(void *opaque) >>>> +{ >>>> + FuseQueue *q = opaque; >>>> + FuseExport *exp = q->exp; >>>> + >>>> + >>>> + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); >>> >>> I think there might be a tricky issue with the io_uring integration in >>> QEMU. Currently, when the number of IOThreads goes above ~6 or 7, >>> there’s a pretty high chance of a hang. I added some debug logging in >>> the kernel’s fuse_uring_cmd() registration part, and noticed that the >>> number of register calls is less than the total number of entries in the >>> queue. In theory, we should be registering each entry for each queue. >> >> Did you also try to add logging at the top of fuse_uring_cmd()? I wonder >> if there is a start up race and if initial commands are just getting >> refused. I had run into issues you are describing in some versions of >> the -rfc patches, but thought that everything was fixed for that. >> I.e. not excluded that there is still a kernel issue left. >> >> Thanks, >> Bernd >> >> > > Yes. I added a printk at the beginning of fuse_uring_cmd(), another at > the beginning of fuse_uring_register(), and one more at the end of > fuse_uring_do_register(). Then I created and registered 20 queues, each > with a single ring entry. It printed 37 times(diff every time) with > opcode FUSE_IO_URING_CMD_REGISTER (would expect 20), and only 6 queues > were registered successfully. The rest of fuse_uring_cmd (x31) exited > inside the if (!fc->initialized) branch in fuse_uring_cmd() > > dmesg: https://gist.github.com/hibriansong/4eda6e7e92601df497282dcd56fd5470 Thank you for the logs, could you try this? 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c index 2aa20707f40b..cea57ad5d3ab 100644 --- a/fs/fuse/dev_uring.c +++ b/fs/fuse/dev_uring.c @@ -1324,6 +1324,9 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) if (!fc->connected) return -ENOTCONN; + /* Matches smp_wmb() in fuse_set_initialized() */ + smp_rmb(); + /* * fuse_uring_register() needs the ring to be initialized, * we need to know the max payload size Thanks, Bernd ^ permalink raw reply related [flat|nested] 15+ messages in thread
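For context, the write side that Bernd's smp_rmb() pairs with is fuse_set_initialized() in fs/fuse/dev.c, which looks roughly like this (quoted approximately):

static void fuse_set_initialized(struct fuse_conn *fc)
{
    /* Make sure stores before this are seen on another CPU */
    smp_wmb();
    fc->initialized = 1;
}

The smp_rmb() on the uring_cmd path mirrors this: a task that observes fc->initialized == 1 is then also guaranteed to see the connection state that was published before the flag was set.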
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-19 22:26 ` Bernd Schubert @ 2025-08-19 23:23 ` Brian Song 2025-08-20 3:31 ` Brian Song 0 siblings, 1 reply; 15+ messages in thread From: Brian Song @ 2025-08-19 23:23 UTC (permalink / raw) To: Bernd Schubert, qemu-block Cc: qemu-devel, armbru, fam, hreitz, kwolf, stefanha On 8/19/25 6:26 PM, Bernd Schubert wrote: > > > On 8/19/25 03:15, Brian Song wrote: >> >> >> On 8/18/25 7:04 PM, Bernd Schubert wrote: >>> >>> >>> On 8/17/25 01:13, Brian Song wrote: >>>> >>>> >>>> On 8/14/25 11:46 PM, Brian Song wrote: >>>>> From: Brian Song <hibriansong@gmail.com> >>>>> >>>>> This patch adds a new export option for storage-export-daemon to enable >>>>> or disable FUSE-over-io_uring via the switch io-uring=on|off (disable >>>>> by default). It also implements the protocol handshake with the Linux >>>>> kernel during the FUSE-over-io_uring initialization phase. >>>>> >>>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html >>>>> >>>>> The kernel documentation describes in detail how FUSE-over-io_uring >>>>> works. This patch implements the Initial SQE stage shown in thediagram: >>>>> it initializes one queue per IOThread, each currently supporting a >>>>> single submission queue entry (SQE). When the FUSE driver sends the >>>>> first FUSE request (FUSE_INIT), storage-export-daemon calls >>>>> fuse_uring_start() to complete initialization, ultimately submitting >>>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm >>>>> successful initialization with the kernel. >>>>> >>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com> >>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>> Signed-off-by: Brian Song <hibriansong@gmail.com> >>>>> --- >>>>> block/export/fuse.c | 161 ++++++++++++++++++++++++--- >>>>> docs/tools/qemu-storage-daemon.rst | 11 +- >>>>> qapi/block-export.json | 5 +- >>>>> storage-daemon/qemu-storage-daemon.c | 1 + >>>>> util/fdmon-io_uring.c | 5 +- >>>>> 5 files changed, 159 insertions(+), 24 deletions(-) >>>>> >>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c >>>>> index c0ad4696ce..59fa79f486 100644 >>>>> --- a/block/export/fuse.c >>>>> +++ b/block/export/fuse.c >>>>> @@ -48,6 +48,11 @@ >>>>> #include <linux/fs.h> >>>>> #endif >>>>> >>>>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 >>>>> + >>>>> +/* room needed in buffer to accommodate header */ >>>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000 >>>>> + >>>>> /* Prevent overly long bounce buffer allocations */ >>>>> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024)) >>>>> /* >>>>> @@ -63,12 +68,31 @@ >>>>> (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) >>>>> >>>>> typedef struct FuseExport FuseExport; >>>>> +typedef struct FuseQueue FuseQueue; >>>>> + >>>>> +typedef struct FuseRingEnt { >>>>> + /* back pointer */ >>>>> + FuseQueue *q; >>>>> + >>>>> + /* commit id of a fuse request */ >>>>> + uint64_t req_commit_id; >>>>> + >>>>> + /* fuse request header and payload */ >>>>> + struct fuse_uring_req_header req_header; >>>>> + void *op_payload; >>>>> + size_t req_payload_sz; >>>>> + >>>>> + /* The vector passed to the kernel */ >>>>> + struct iovec iov[2]; >>>>> + >>>>> + CqeHandler fuse_cqe_handler; >>>>> +} FuseRingEnt; >>>>> >>>>> /* >>>>> * One FUSE "queue", representing one FUSE FD from which requests are fetched >>>>> * and processed. Each queue is tied to an AioContext. 
>>>>> */ >>>>> -typedef struct FuseQueue { >>>>> +struct FuseQueue { >>>>> FuseExport *exp; >>>>> >>>>> AioContext *ctx; >>>>> @@ -109,7 +133,12 @@ typedef struct FuseQueue { >>>>> * Free this buffer with qemu_vfree(). >>>>> */ >>>>> void *spillover_buf; >>>>> -} FuseQueue; >>>>> + >>>>> +#ifdef CONFIG_LINUX_IO_URING >>>>> + int qid; >>>>> + FuseRingEnt ent; >>>>> +#endif >>>>> +}; >>>>> >>>>> /* >>>>> * Verify that FuseQueue.request_buf plus the spill-over buffer together >>>>> @@ -148,6 +177,7 @@ struct FuseExport { >>>>> bool growable; >>>>> /* Whether allow_other was used as a mount option or not */ >>>>> bool allow_other; >>>>> + bool is_uring; >>>>> >>>>> mode_t st_mode; >>>>> uid_t st_uid; >>>>> @@ -257,6 +287,93 @@ static const BlockDevOps fuse_export_blk_dev_ops = { >>>>> .drained_poll = fuse_export_drained_poll, >>>>> }; >>>>> >>>>> +#ifdef CONFIG_LINUX_IO_URING >>>>> + >>>>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, >>>>> + const unsigned int qid, >>>>> + const unsigned int commit_id) >>>>> +{ >>>>> + req->qid = qid; >>>>> + req->commit_id = commit_id; >>>>> + req->flags = 0; >>>>> +} >>>>> + >>>>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseQueue *q, >>>>> + __u32 cmd_op) >>>>> +{ >>>>> + sqe->opcode = IORING_OP_URING_CMD; >>>>> + >>>>> + sqe->fd = q->fuse_fd; >>>>> + sqe->rw_flags = 0; >>>>> + sqe->ioprio = 0; >>>>> + sqe->off = 0; >>>>> + >>>>> + sqe->cmd_op = cmd_op; >>>>> + sqe->__pad1 = 0; >>>>> +} >>>>> + >>>>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque) >>>>> +{ >>>>> + FuseQueue *q = opaque; >>>>> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; >>>>> + >>>>> + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); >>>>> + >>>>> + sqe->addr = (uint64_t)(q->ent.iov); >>>>> + sqe->len = 2; >>>>> + >>>>> + fuse_uring_sqe_set_req_data(req, q->qid, 0); >>>>> +} >>>>> + >>>>> +static void fuse_uring_submit_register(void *opaque) >>>>> +{ >>>>> + FuseQueue *q = opaque; >>>>> + FuseExport *exp = q->exp; >>>>> + >>>>> + >>>>> + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); >>>> >>>> I think there might be a tricky issue with the io_uring integration in >>>> QEMU. Currently, when the number of IOThreads goes above ~6 or 7, >>>> there’s a pretty high chance of a hang. I added some debug logging in >>>> the kernel’s fuse_uring_cmd() registration part, and noticed that the >>>> number of register calls is less than the total number of entries in the >>>> queue. In theory, we should be registering each entry for each queue. >>> >>> Did you also try to add logging at the top of fuse_uring_cmd()? I wonder >>> if there is a start up race and if initial commands are just getting >>> refused. I had run into issues you are describing in some versions of >>> the -rfc patches, but thought that everything was fixed for that. >>> I.e. not excluded that there is still a kernel issue left. >>> >>> Thanks, >>> Bernd >>> >>> >> >> Yes. I added a printk at the beginning of fuse_uring_cmd(), another at >> the beginning of fuse_uring_register(), and one more at the end of >> fuse_uring_do_register(). Then I created and registered 20 queues, each >> with a single ring entry. It printed 37 times(diff every time) with >> opcode FUSE_IO_URING_CMD_REGISTER (would expect 20), and only 6 queues >> were registered successfully. 
The rest of fuse_uring_cmd (x31) exited >> inside the if (!fc->initialized) branch in fuse_uring_cmd() >> >> dmesg: https://gist.github.com/hibriansong/4eda6e7e92601df497282dcd56fd5470 > > Thank you for the logs, could you try this? > > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c > index 2aa20707f40b..cea57ad5d3ab 100644 > --- a/fs/fuse/dev_uring.c > +++ b/fs/fuse/dev_uring.c > @@ -1324,6 +1324,9 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) > if (!fc->connected) > return -ENOTCONN; > > + /* Matches smp_wmb() in fuse_set_initialized() */ > + smp_rmb(); > + > /* > * fuse_uring_register() needs the ring to be initialized, > * we need to know the max payload size > > > > Thanks, > Bernd I realized the issue actually comes from QEMU handling the FUSE_INIT request. After I processed outargs, I didn't send the response back to the kernel before starting the fuse-over-io_uring initialization. So it's possible that the 20 registration requests submitted via io_uring_cmd() reach the kernel before process_init_reply() has run and set fc->initialized = 1, which causes fuse_uring_cmd to bail out repeatedly. I also noticed that in libfuse, they first send the init request response, then allocate queues and submit the register SQEs. But even there, during the fuse-over-io_uring init after sending the response, if the kernel hasn't finished process_init_reply() and set fc->initialized = 1, wouldn't they run into a similar issue fuse_uring_cmd repeatedly bailing on register requests because fc->initialized isn't set yet? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init 2025-08-19 23:23 ` Brian Song @ 2025-08-20 3:31 ` Brian Song 0 siblings, 0 replies; 15+ messages in thread From: Brian Song @ 2025-08-20 3:31 UTC (permalink / raw) To: Bernd Schubert, qemu-block Cc: qemu-devel, armbru, fam, hreitz, kwolf, stefanha On 8/19/25 7:23 PM, Brian Song wrote: > > > On 8/19/25 6:26 PM, Bernd Schubert wrote: >> >> >> On 8/19/25 03:15, Brian Song wrote: >>> >>> >>> On 8/18/25 7:04 PM, Bernd Schubert wrote: >>>> >>>> >>>> On 8/17/25 01:13, Brian Song wrote: >>>>> >>>>> >>>>> On 8/14/25 11:46 PM, Brian Song wrote: >>>>>> From: Brian Song <hibriansong@gmail.com> >>>>>> >>>>>> This patch adds a new export option for storage-export-daemon to >>>>>> enable >>>>>> or disable FUSE-over-io_uring via the switch io-uring=on|off (disable >>>>>> by default). It also implements the protocol handshake with the Linux >>>>>> kernel during the FUSE-over-io_uring initialization phase. >>>>>> >>>>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html >>>>>> >>>>>> The kernel documentation describes in detail how FUSE-over-io_uring >>>>>> works. This patch implements the Initial SQE stage shown in >>>>>> thediagram: >>>>>> it initializes one queue per IOThread, each currently supporting a >>>>>> single submission queue entry (SQE). When the FUSE driver sends the >>>>>> first FUSE request (FUSE_INIT), storage-export-daemon calls >>>>>> fuse_uring_start() to complete initialization, ultimately submitting >>>>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm >>>>>> successful initialization with the kernel. >>>>>> >>>>>> Suggested-by: Kevin Wolf <kwolf@redhat.com> >>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>> Signed-off-by: Brian Song <hibriansong@gmail.com> >>>>>> --- >>>>>> block/export/fuse.c | 161 +++++++++++++++++++ >>>>>> +++++--- >>>>>> docs/tools/qemu-storage-daemon.rst | 11 +- >>>>>> qapi/block-export.json | 5 +- >>>>>> storage-daemon/qemu-storage-daemon.c | 1 + >>>>>> util/fdmon-io_uring.c | 5 +- >>>>>> 5 files changed, 159 insertions(+), 24 deletions(-) >>>>>> >>>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c >>>>>> index c0ad4696ce..59fa79f486 100644 >>>>>> --- a/block/export/fuse.c >>>>>> +++ b/block/export/fuse.c >>>>>> @@ -48,6 +48,11 @@ >>>>>> #include <linux/fs.h> >>>>>> #endif >>>>>> >>>>>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 >>>>>> + >>>>>> +/* room needed in buffer to accommodate header */ >>>>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000 >>>>>> + >>>>>> /* Prevent overly long bounce buffer allocations */ >>>>>> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * >>>>>> 1024 * 1024)) >>>>>> /* >>>>>> @@ -63,12 +68,31 @@ >>>>>> (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES) >>>>>> >>>>>> typedef struct FuseExport FuseExport; >>>>>> +typedef struct FuseQueue FuseQueue; >>>>>> + >>>>>> +typedef struct FuseRingEnt { >>>>>> + /* back pointer */ >>>>>> + FuseQueue *q; >>>>>> + >>>>>> + /* commit id of a fuse request */ >>>>>> + uint64_t req_commit_id; >>>>>> + >>>>>> + /* fuse request header and payload */ >>>>>> + struct fuse_uring_req_header req_header; >>>>>> + void *op_payload; >>>>>> + size_t req_payload_sz; >>>>>> + >>>>>> + /* The vector passed to the kernel */ >>>>>> + struct iovec iov[2]; >>>>>> + >>>>>> + CqeHandler fuse_cqe_handler; >>>>>> +} FuseRingEnt; >>>>>> >>>>>> /* >>>>>> * One FUSE "queue", representing one FUSE FD from which >>>>>> requests are fetched >>>>>> * and processed. Each queue is tied to an AioContext. 
>>>>>> */ >>>>>> -typedef struct FuseQueue { >>>>>> +struct FuseQueue { >>>>>> FuseExport *exp; >>>>>> >>>>>> AioContext *ctx; >>>>>> @@ -109,7 +133,12 @@ typedef struct FuseQueue { >>>>>> * Free this buffer with qemu_vfree(). >>>>>> */ >>>>>> void *spillover_buf; >>>>>> -} FuseQueue; >>>>>> + >>>>>> +#ifdef CONFIG_LINUX_IO_URING >>>>>> + int qid; >>>>>> + FuseRingEnt ent; >>>>>> +#endif >>>>>> +}; >>>>>> >>>>>> /* >>>>>> * Verify that FuseQueue.request_buf plus the spill-over >>>>>> buffer together >>>>>> @@ -148,6 +177,7 @@ struct FuseExport { >>>>>> bool growable; >>>>>> /* Whether allow_other was used as a mount option or not */ >>>>>> bool allow_other; >>>>>> + bool is_uring; >>>>>> >>>>>> mode_t st_mode; >>>>>> uid_t st_uid; >>>>>> @@ -257,6 +287,93 @@ static const BlockDevOps >>>>>> fuse_export_blk_dev_ops = { >>>>>> .drained_poll = fuse_export_drained_poll, >>>>>> }; >>>>>> >>>>>> +#ifdef CONFIG_LINUX_IO_URING >>>>>> + >>>>>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req >>>>>> *req, >>>>>> + const unsigned int qid, >>>>>> + const unsigned int commit_id) >>>>>> +{ >>>>>> + req->qid = qid; >>>>>> + req->commit_id = commit_id; >>>>>> + req->flags = 0; >>>>>> +} >>>>>> + >>>>>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, >>>>>> FuseQueue *q, >>>>>> + __u32 cmd_op) >>>>>> +{ >>>>>> + sqe->opcode = IORING_OP_URING_CMD; >>>>>> + >>>>>> + sqe->fd = q->fuse_fd; >>>>>> + sqe->rw_flags = 0; >>>>>> + sqe->ioprio = 0; >>>>>> + sqe->off = 0; >>>>>> + >>>>>> + sqe->cmd_op = cmd_op; >>>>>> + sqe->__pad1 = 0; >>>>>> +} >>>>>> + >>>>>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe >>>>>> *sqe, void *opaque) >>>>>> +{ >>>>>> + FuseQueue *q = opaque; >>>>>> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; >>>>>> + >>>>>> + fuse_uring_sqe_prepare(sqe, q, FUSE_IO_URING_CMD_REGISTER); >>>>>> + >>>>>> + sqe->addr = (uint64_t)(q->ent.iov); >>>>>> + sqe->len = 2; >>>>>> + >>>>>> + fuse_uring_sqe_set_req_data(req, q->qid, 0); >>>>>> +} >>>>>> + >>>>>> +static void fuse_uring_submit_register(void *opaque) >>>>>> +{ >>>>>> + FuseQueue *q = opaque; >>>>>> + FuseExport *exp = q->exp; >>>>>> + >>>>>> + >>>>>> + aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q- >>>>>> >ent.fuse_cqe_handler)); >>>>> >>>>> I think there might be a tricky issue with the io_uring integration in >>>>> QEMU. Currently, when the number of IOThreads goes above ~6 or 7, >>>>> there’s a pretty high chance of a hang. I added some debug logging in >>>>> the kernel’s fuse_uring_cmd() registration part, and noticed that the >>>>> number of register calls is less than the total number of entries >>>>> in the >>>>> queue. In theory, we should be registering each entry for each queue. >>>> >>>> Did you also try to add logging at the top of fuse_uring_cmd()? I >>>> wonder >>>> if there is a start up race and if initial commands are just getting >>>> refused. I had run into issues you are describing in some versions of >>>> the -rfc patches, but thought that everything was fixed for that. >>>> I.e. not excluded that there is still a kernel issue left. >>>> >>>> Thanks, >>>> Bernd >>>> >>>> >>> >>> Yes. I added a printk at the beginning of fuse_uring_cmd(), another at >>> the beginning of fuse_uring_register(), and one more at the end of >>> fuse_uring_do_register(). Then I created and registered 20 queues, each >>> with a single ring entry. 
It printed 37 times(diff every time) with >>> opcode FUSE_IO_URING_CMD_REGISTER (would expect 20), and only 6 queues >>> were registered successfully. The rest of fuse_uring_cmd (x31) exited >>> inside the if (!fc->initialized) branch in fuse_uring_cmd() >>> >>> dmesg: https://gist.github.com/ >>> hibriansong/4eda6e7e92601df497282dcd56fd5470 >> >> Thank you for the logs, could you try this? >> >> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c >> index 2aa20707f40b..cea57ad5d3ab 100644 >> --- a/fs/fuse/dev_uring.c >> +++ b/fs/fuse/dev_uring.c >> @@ -1324,6 +1324,9 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, >> unsigned int issue_flags) >> if (!fc->connected) >> return -ENOTCONN; >> + /* Matches smp_wmb() in fuse_set_initialized() */ >> + smp_rmb(); >> + >> /* >> * fuse_uring_register() needs the ring to be initialized, >> * we need to know the max payload size >> >> >> >> Thanks, >> Bernd > > I realized the issue actually comes from QEMU handling the FUSE_INIT > request. After I processed outargs, I didn't send the response back to > the kernel before starting the fuse-over-io_uring initialization. So > it's possible that the 20 registration requests submitted via > io_uring_cmd() reach the kernel before process_init_reply() has run and > set fc->initialized = 1, which causes fuse_uring_cmd to bail out > repeatedly. > > I also noticed that in libfuse, they first send the init request > response, then allocate queues and submit the register SQEs. But even > there, during the fuse-over-io_uring init after sending the response, if > the kernel hasn't finished process_init_reply() and set fc->initialized > = 1, wouldn't they run into a similar issue fuse_uring_cmd repeatedly > bailing on register requests because fc->initialized isn't set yet? Hi Bernd, Nvm, I think writing to /dev/fuse fd is blocking. Thanks so much for your feedback! Best, Brian ^ permalink raw reply [flat|nested] 15+ messages in thread
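The upshot of this sub-thread is that QEMU has to write the FUSE_INIT reply
to /dev/fuse before submitting any FUSE_IO_URING_CMD_REGISTER SQEs. A minimal
sketch of that ordering follows; the wrapper name is hypothetical and not
code from the series, but fuse_write_response() and fuse_uring_start() are
the interfaces named in the patches:

    /*
     * Sketch only: reply to FUSE_INIT on the traditional /dev/fuse path
     * first. The write to /dev/fuse only returns once the kernel has
     * processed the reply (process_init_reply() sets fc->initialized), so
     * REGISTER SQEs submitted afterwards cannot hit the !fc->initialized
     * bail-out in fuse_uring_cmd().
     */
    static void fuse_init_reply_then_start_uring(FuseQueue *q, uint64_t req_id,
                                                 struct fuse_out_header *out_hdr,
                                                 struct fuse_init_out *out,
                                                 ssize_t out_len)
    {
        FuseExport *exp = q->exp;

        /* 1. Send the FUSE_INIT response back to the kernel */
        fuse_write_response(q->fuse_fd, req_id, out_hdr, 0, out_len);

        /* 2. Only now register the FUSE-over-io_uring queues */
        if (exp->is_uring) {
            fuse_uring_start(exp, out);
        }
    }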
* [PATCH 2/3] fuse: Handle FUSE-uring requests 2025-08-15 3:46 [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Zhi Song 2025-08-15 3:46 ` [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init Zhi Song @ 2025-08-15 3:46 ` Zhi Song 2025-08-15 3:46 ` [PATCH 3/3] fuse: Safe termination for FUSE-uring Zhi Song 2025-08-17 13:45 ` [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Stefan Hajnoczi 3 siblings, 0 replies; 15+ messages in thread From: Zhi Song @ 2025-08-15 3:46 UTC (permalink / raw) To: qemu-block Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf, stefanha From: Brian Song <hibriansong@gmail.com> https://docs.kernel.org/filesystems/fuse-io-uring.html As described in the kernel documentation, after FUSE-over-io_uring initialization and handshake, FUSE interacts with the kernel using SQE/CQE to send requests and receive responses. This corresponds to the "Sending requests with CQEs" section in the docs. This patch implements three key parts: registering the CQE handler (fuse_uring_cqe_handler), processing FUSE requests (fuse_uring_co_ process_request), and sending response results (fuse_uring_send_ response). It also merges the traditional /dev/fuse request handling with the FUSE-over-io_uring handling functions. Suggested-by: Kevin Wolf <kwolf@redhat.com> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Brian Song <hibriansong@gmail.com> --- block/export/fuse.c | 425 ++++++++++++++++++++++++++++++-------------- 1 file changed, 289 insertions(+), 136 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 59fa79f486..7540f8f5a3 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -288,6 +288,46 @@ static const BlockDevOps fuse_export_blk_dev_ops = { }; #ifdef CONFIG_LINUX_IO_URING +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent); + +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque) +{ + FuseRingEnt *ent = opaque; + FuseExport *exp = ent->q->exp; + + /* Going to process requests */ + fuse_inc_in_flight(exp); + + + fuse_uring_co_process_request(ent); + + /* Finished processing requests */ + fuse_dec_in_flight(exp); +} + +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler) +{ + FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler); + Coroutine *co; + FuseExport *exp = ent->q->exp; + + if (unlikely(exp->halted)) { + return; + } + + int err = cqe_handler->cqe.res; + + if (err != 0) { + /* -ENOTCONN is ok on umount */ + if (err != -EINTR && err != -EAGAIN && + err != -ENOTCONN) { + fuse_export_halt(exp); + } + } else { + co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent); + qemu_coroutine_enter(co); + } +} static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req, const unsigned int qid, @@ -1075,6 +1115,9 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size) * Data in @in_place_buf is assumed to be overwritten after yielding, so will * be copied to a bounce buffer beforehand. @spillover_buf in contrast is * assumed to be exclusively owned and will be used as-is. + * In FUSE-over-io_uring mode, the actual op_payload content is stored in + * @spillover_buf. To ensure this buffer is used for writing, @in_place_buf + * is explicitly set to NULL. * Return the number of bytes written to *out on success, and -errno on error. 
*/ static ssize_t coroutine_fn @@ -1082,8 +1125,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out, uint64_t offset, uint32_t size, const void *in_place_buf, const void *spillover_buf) { - size_t in_place_size; - void *copied; + size_t in_place_size = 0; + void *copied = NULL; int64_t blk_len; int ret; struct iovec iov[2]; @@ -1098,10 +1141,12 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out, return -EACCES; } - /* Must copy to bounce buffer before potentially yielding */ - in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES); - copied = blk_blockalign(exp->common.blk, in_place_size); - memcpy(copied, in_place_buf, in_place_size); + if (in_place_buf) { + /* Must copy to bounce buffer before potentially yielding */ + in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES); + copied = blk_blockalign(exp->common.blk, in_place_size); + memcpy(copied, in_place_buf, in_place_size); + } /** * Clients will expect short writes at EOF, so we have to limit @@ -1125,26 +1170,37 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out, } } - iov[0] = (struct iovec) { - .iov_base = copied, - .iov_len = in_place_size, - }; - if (size > FUSE_IN_PLACE_WRITE_BYTES) { - assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE); - iov[1] = (struct iovec) { - .iov_base = (void *)spillover_buf, - .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES, + if (in_place_buf) { + iov[0] = (struct iovec) { + .iov_base = copied, + .iov_len = in_place_size, }; - qemu_iovec_init_external(&qiov, iov, 2); + if (size > FUSE_IN_PLACE_WRITE_BYTES) { + assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE); + iov[1] = (struct iovec) { + .iov_base = (void *)spillover_buf, + .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES, + }; + qemu_iovec_init_external(&qiov, iov, 2); + } else { + qemu_iovec_init_external(&qiov, iov, 1); + } } else { + /* fuse over io_uring */ + iov[0] = (struct iovec) { + .iov_base = (void *)spillover_buf, + .iov_len = size, + }; qemu_iovec_init_external(&qiov, iov, 1); } + ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0); if (ret < 0) { goto fail_free_buffer; } - qemu_vfree(copied); + if (in_place_buf) + qemu_vfree(copied); *out = (struct fuse_write_out) { .size = size, @@ -1152,7 +1208,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out, return sizeof(*out); fail_free_buffer: - qemu_vfree(copied); + if (in_place_buf) { + qemu_vfree(copied); + } return ret; } @@ -1440,168 +1498,144 @@ static int fuse_write_buf_response(int fd, uint32_t req_id, } } -/* - * For use in fuse_co_process_request(): - * Returns a pointer to the parameter object for the given operation (inside of - * queue->request_buf, which is assumed to hold a fuse_in_header first). - * Verifies that the object is complete (queue->request_buf is large enough to - * hold it in one piece, and the request length includes the whole object). - * - * Note that queue->request_buf may be overwritten after yielding, so the - * returned pointer must not be used across a function that may yield! 
- */ -#define FUSE_IN_OP_STRUCT(op_name, queue) \ +#define FUSE_IN_OP_STRUCT_LEGACY(in_buf) \ ({ \ - const struct fuse_in_header *__in_hdr = \ - (const struct fuse_in_header *)(queue)->request_buf; \ - const struct fuse_##op_name##_in *__in = \ - (const struct fuse_##op_name##_in *)(__in_hdr + 1); \ - const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \ - uint32_t __req_len; \ - \ - QEMU_BUILD_BUG_ON(sizeof((queue)->request_buf) < __param_len); \ - \ - __req_len = __in_hdr->len; \ - if (__req_len < __param_len) { \ - warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \ - __req_len, __param_len); \ - ret = -EINVAL; \ - break; \ - } \ - __in; \ + (void *)(((struct fuse_in_header *)in_buf) + 1); \ }) -/* - * For use in fuse_co_process_request(): - * Returns a pointer to the return object for the given operation (inside of - * out_buf, which is assumed to hold a fuse_out_header first). - * Verifies that out_buf is large enough to hold the whole object. - * - * (out_buf should be a char[] array.) - */ -#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \ +#define FUSE_OUT_OP_STRUCT_LEGACY(out_buf) \ ({ \ - struct fuse_out_header *__out_hdr = \ - (struct fuse_out_header *)(out_buf); \ - struct fuse_##op_name##_out *__out = \ - (struct fuse_##op_name##_out *)(__out_hdr + 1); \ - \ - QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \ - sizeof(out_buf)); \ - \ - __out; \ + (void *)(((struct fuse_out_header *)out_buf) + 1); \ }) -/** - * Process a FUSE request, incl. writing the response. - * - * Note that yielding in any request-processing function can overwrite the - * contents of q->request_buf. Anything that takes a buffer needs to take - * care that the content is copied before yielding. - * - * @spillover_buf can contain the tail of a write request too large to fit into - * q->request_buf. This function takes ownership of it (i.e. will free it), - * which assumes that its contents will not be overwritten by concurrent - * requests (as opposed to q->request_buf). + +/* + * Shared helper for FUSE request processing. Handles both legacy and io_uring + * paths. */ -static void coroutine_fn -fuse_co_process_request(FuseQueue *q, void *spillover_buf) +static void coroutine_fn fuse_co_process_request_common( + FuseExport *exp, + uint32_t opcode, + uint64_t req_id, + void *in_buf, + void *spillover_buf, + void *out_buf, + int fd, /* -1 for uring */ + void (*send_response)(void *opaque, uint32_t req_id, ssize_t ret, + const void *buf, void *out_buf), + void *opaque /* FuseQueue* or FuseRingEnt* */) { - FuseExport *exp = q->exp; - uint32_t opcode; - uint64_t req_id; - /* - * Return buffer. Must be large enough to hold all return headers, but does - * not include space for data returned by read requests. - * (FUSE_IN_OP_STRUCT() verifies at compile time that out_buf is indeed - * large enough.) - */ - char out_buf[sizeof(struct fuse_out_header) + - MAX_CONST(sizeof(struct fuse_init_out), - MAX_CONST(sizeof(struct fuse_open_out), - MAX_CONST(sizeof(struct fuse_attr_out), - MAX_CONST(sizeof(struct fuse_write_out), - sizeof(struct fuse_lseek_out)))))]; - struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf; - /* For read requests: Data to be returned */ void *out_data_buffer = NULL; + ssize_t ret = 0; bool is_uring = exp->is_uring; + void *op_in_buf = (is_uring && opcode != FUSE_INIT) ? + (void *)in_buf : (void *)FUSE_IN_OP_STRUCT_LEGACY(in_buf); + + void *op_out_buf = (is_uring && opcode != FUSE_INIT) ? 
+ (void *)out_buf : (void *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf); + switch (opcode) { case FUSE_INIT: { + const struct fuse_init_in *in = + (const struct fuse_init_in *)FUSE_IN_OP_STRUCT_LEGACY(in_buf); + + struct fuse_init_out *out = + (struct fuse_init_out *)FUSE_OUT_OP_STRUCT_LEGACY(out_buf); + + ret = fuse_co_init(exp, out, in->max_readahead, in->flags); #ifdef CONFIG_LINUX_IO_URING /* FUSE-over-io_uring enabled && start from the tradition path */ - if (is_uring) { + if (is_uring && fd != -1) { fuse_uring_start(exp, out); } #endif break; } - case FUSE_OPEN: - ret = fuse_co_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf)); + case FUSE_OPEN: { + struct fuse_open_out *out = + (struct fuse_open_out *)op_out_buf; + + ret = fuse_co_open(exp, out); break; + } case FUSE_RELEASE: ret = 0; break; case FUSE_LOOKUP: - ret = -ENOENT; /* There is no node but the root node */ + ret = -ENOENT; break; - case FUSE_GETATTR: - ret = fuse_co_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf)); + case FUSE_GETATTR: { + struct fuse_attr_out *out = + (struct fuse_attr_out *)op_out_buf; + + ret = fuse_co_getattr(exp, out); break; + } case FUSE_SETATTR: { - const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q); - ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf), - in->valid, in->size, in->mode, in->uid, in->gid); + const struct fuse_setattr_in *in = + (const struct fuse_setattr_in *)op_in_buf; + + struct fuse_attr_out *out = + (struct fuse_attr_out *)op_out_buf; + + ret = fuse_co_setattr(exp, out, in->valid, in->size, in->mode, + in->uid, in->gid); break; } case FUSE_READ: { - const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q); + const struct fuse_read_in *in = + (const struct fuse_read_in *)op_in_buf; + ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size); break; } case FUSE_WRITE: { - const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q); - uint32_t req_len; - - req_len = ((const struct fuse_in_header *)q->request_buf)->len; - if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) + - in->size)) { - warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32, - req_len - sizeof(struct fuse_in_header) - sizeof(*in), - in->size); - ret = -EINVAL; - break; + const struct fuse_write_in *in = + (const struct fuse_write_in *)op_in_buf; + + struct fuse_write_out *out = + (struct fuse_write_out *)op_out_buf; + + if (!is_uring) { + uint32_t req_len = ((const struct fuse_in_header *)in_buf)->len; + + if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) + + in->size)) { + warn_report("FUSE WRITE truncated; received %zu bytes of %" + PRIu32, + req_len - sizeof(struct fuse_in_header) - sizeof(*in), + in->size); + ret = -EINVAL; + break; + } + } else { + assert(in->size <= + ((FuseRingEnt *)opaque)->req_header.ring_ent_in_out.payload_sz); } - /* - * poll_fuse_fd() has checked that in_hdr->len matches the number of - * bytes read, which cannot exceed the max_write value we set - * (FUSE_MAX_WRITE_BYTES). So we know that FUSE_MAX_WRITE_BYTES >= - * in_hdr->len >= in->size + X, so this assertion must hold. - */ assert(in->size <= FUSE_MAX_WRITE_BYTES); - /* - * Passing a pointer to `in` (i.e. the request buffer) is fine because - * fuse_co_write() takes care to copy its contents before potentially - * yielding. - */ - ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf), - in->offset, in->size, in + 1, spillover_buf); + const void *in_place_buf = is_uring ? NULL : (in + 1); + const void *spill_buf = is_uring ? 
out_buf : spillover_buf; + + ret = fuse_co_write(exp, out, in->offset, in->size, + in_place_buf, spill_buf); break; } case FUSE_FALLOCATE: { - const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q); + const struct fuse_fallocate_in *in = + (const struct fuse_fallocate_in *)op_in_buf; + ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode); break; } @@ -1616,9 +1650,13 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf) #ifdef CONFIG_FUSE_LSEEK case FUSE_LSEEK: { - const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q); - ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf), - in->offset, in->whence); + const struct fuse_lseek_in *in = + (const struct fuse_lseek_in *)op_in_buf; + + struct fuse_lseek_out *out = + (struct fuse_lseek_out *)op_out_buf; + + ret = fuse_co_lseek(exp, out, in->offset, in->whence); break; } #endif @@ -1627,20 +1665,135 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf) ret = -ENOSYS; } - /* Ignore errors from fuse_write*(), nothing we can do anyway */ + send_response(opaque, req_id, ret, out_data_buffer, out_buf); + if (out_data_buffer) { - assert(ret >= 0); - fuse_write_buf_response(q->fuse_fd, req_id, out_hdr, - out_data_buffer, ret); qemu_vfree(out_data_buffer); + } + + if (fd != -1) { + qemu_vfree(spillover_buf); + } +} + +/* Helper to send response for legacy */ +static void send_response_legacy(void *opaque, uint32_t req_id, ssize_t ret, + const void *buf, void *out_buf) +{ + FuseQueue *q = (FuseQueue *)opaque; + struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf; + if (buf) { + assert(ret >= 0); + fuse_write_buf_response(q->fuse_fd, req_id, out_hdr, buf, ret); } else { fuse_write_response(q->fuse_fd, req_id, out_hdr, ret < 0 ? ret : 0, ret < 0 ? 0 : ret); } +} + +static void coroutine_fn +fuse_co_process_request(FuseQueue *q, void *spillover_buf) +{ + FuseExport *exp = q->exp; + uint32_t opcode; + uint64_t req_id; + + /* + * Return buffer. Must be large enough to hold all return headers, but does + * not include space for data returned by read requests. 
+ */ + char out_buf[sizeof(struct fuse_out_header) + + MAX_CONST(sizeof(struct fuse_init_out), + MAX_CONST(sizeof(struct fuse_open_out), + MAX_CONST(sizeof(struct fuse_attr_out), + MAX_CONST(sizeof(struct fuse_write_out), + sizeof(struct fuse_lseek_out)))))] = {0}; + + /* Limit scope to ensure pointer is no longer used after yielding */ + { + const struct fuse_in_header *in_hdr = + (const struct fuse_in_header *)q->request_buf; + + opcode = in_hdr->opcode; + req_id = in_hdr->unique; + } + + fuse_co_process_request_common(exp, opcode, req_id, q->request_buf, + spillover_buf, out_buf, q->fuse_fd, send_response_legacy, q); +} + +#ifdef CONFIG_LINUX_IO_URING +static void fuse_uring_prep_sqe_commit(struct io_uring_sqe *sqe, void *opaque) +{ + FuseRingEnt *ent = opaque; + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0]; + + fuse_uring_sqe_prepare(sqe, ent->q, FUSE_IO_URING_CMD_COMMIT_AND_FETCH); + fuse_uring_sqe_set_req_data(req, ent->q->qid, + ent->req_commit_id); +} + +static void +fuse_uring_send_response(FuseRingEnt *ent, uint32_t req_id, ssize_t ret, + const void *out_data_buffer) +{ + FuseExport *exp = ent->q->exp; + + struct fuse_uring_req_header *rrh = &ent->req_header; + struct fuse_out_header *out_header = (struct fuse_out_header *)&rrh->in_out; + struct fuse_uring_ent_in_out *ent_in_out = + (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out; + + /* FUSE_READ */ + if (out_data_buffer && ret > 0) { + memcpy(ent->op_payload, out_data_buffer, ret); + } + + out_header->error = ret < 0 ? ret : 0; + out_header->unique = req_id; + /* out_header->len = ret > 0 ? ret : 0; */ + ent_in_out->payload_sz = ret > 0 ? ret : 0; + qemu_vfree(spillover_buf); + aio_add_sqe(fuse_uring_prep_sqe_commit, ent, + &ent->fuse_cqe_handler); +} + +/* Helper to send response for uring */ +static void send_response_uring(void *opaque, uint32_t req_id, ssize_t ret, + const void *out_data_buffer, void *payload) +{ + FuseRingEnt *ent = (FuseRingEnt *)opaque; + + fuse_uring_send_response(ent, req_id, ret, out_data_buffer); +} + +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent) +{ + FuseQueue *q = ent->q; + FuseExport *exp = q->exp; + struct fuse_uring_req_header *rrh = &ent->req_header; + struct fuse_uring_ent_in_out *ent_in_out = + (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out; + struct fuse_in_header *in_hdr = + (struct fuse_in_header *)&rrh->in_out; + uint32_t opcode = in_hdr->opcode; + uint64_t req_id = in_hdr->unique; + ent->req_commit_id = ent_in_out->commit_id; + + if (unlikely(ent->req_commit_id == 0)) { + error_report("If this happens kernel will not find the response - " + "it will be stuck forever - better to abort immediately."); + fuse_export_halt(exp); + return; + } + + fuse_co_process_request_common(exp, opcode, req_id, &rrh->op_in, + NULL, ent->op_payload, -1, send_response_uring, ent); } +#endif const BlockExportDriver blk_exp_fuse = { .type = BLOCK_EXPORT_TYPE_FUSE, -- 2.45.2 ^ permalink raw reply related [flat|nested] 15+ messages in thread
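For readers working through the diff above, the FUSE-over-io_uring request
path it adds follows the commit-and-fetch cycle from the kernel
documentation. A rough sketch of the control flow, using the function names
from this patch (simplified; error and halt handling omitted):

    kernel posts a CQE for a ring entry
      -> fuse_uring_cqe_handler()                  (CqeHandler registered via aio_add_sqe())
         -> co_fuse_uring_queue_handle_cqes() coroutine, bracketed by
            fuse_inc_in_flight()/fuse_dec_in_flight()
            -> fuse_uring_co_process_request()     (reads opcode/unique/commit_id
                                                     from ent->req_header)
               -> fuse_co_process_request_common() (shared with the /dev/fuse path)
                  -> send_response_uring() -> fuse_uring_send_response()
                     -> aio_add_sqe(fuse_uring_prep_sqe_commit, ...)
                        (FUSE_IO_URING_CMD_COMMIT_AND_FETCH: returns the reply
                         and re-arms the entry for the next request)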
* [PATCH 3/3] fuse: Safe termination for FUSE-uring 2025-08-15 3:46 [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Zhi Song 2025-08-15 3:46 ` [PATCH 1/3] fuse: add FUSE-over-io_uring enable opt and init Zhi Song 2025-08-15 3:46 ` [PATCH 2/3] fuse: Handle FUSE-uring requests Zhi Song @ 2025-08-15 3:46 ` Zhi Song 2025-08-17 13:45 ` [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Stefan Hajnoczi 3 siblings, 0 replies; 15+ messages in thread From: Zhi Song @ 2025-08-15 3:46 UTC (permalink / raw) To: qemu-block Cc: qemu-devel, armbru, bernd, fam, hibriansong, hreitz, kwolf, stefanha From: Brian Song <hibriansong@gmail.com> When the user sends a termination signal, storage-export-daemon stops the export, exits the main loop (main_loop_wait), and begins cleaning up associated resources. At this point, some SQEs submitted via FUSE_IO _URING_CMD_COMMIT_AND_FETCH may still be pending in the kernel, waiting for incoming FUSE requests, which can trigger CQE handlers in user space. Currently, there is no way to manually cancel these pending CQEs in the kernel. As a result, after export termination, the related data structures might be deleted before the pending CQEs return, causing the CQE handler to be invoked after it has been freed, which may lead to a segfault. As a workaround, when submitting an SQE to the kernel, we increment the block reference (blk_exp_ref) to prevent the CQE handler from being deleted during export termination. Once the CQE is received, we decrement the reference (blk_exp_unref). Suggested-by: Kevin Wolf <kwolf@redhat.com> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Brian Song <hibriansong@gmail.com> --- block/export/fuse.c | 52 +++++++++++++++++++++++++++++++++++---------- 1 file changed, 41 insertions(+), 11 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 7540f8f5a3..ddd83c50e2 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -298,6 +298,8 @@ static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque) /* Going to process requests */ fuse_inc_in_flight(exp); + /* A ring entry returned */ + blk_exp_unref(&exp->common); fuse_uring_co_process_request(ent); @@ -323,6 +325,9 @@ static void fuse_uring_cqe_handler(CqeHandler *cqe_handler) err != -ENOTCONN) { fuse_export_halt(exp); } + + /* A ring entry returned */ + blk_exp_unref(&exp->common); } else { co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent); qemu_coroutine_enter(co); @@ -370,6 +375,8 @@ static void fuse_uring_submit_register(void *opaque) FuseQueue *q = opaque; FuseExport *exp = q->exp; + /* Commit and fetch a ring entry */ + blk_exp_ref(&exp->common); aio_add_sqe(fuse_uring_prep_sqe_register, q, &(q->ent.fuse_cqe_handler)); } @@ -762,6 +769,17 @@ static void read_from_fuse_fd(void *opaque) qemu_coroutine_enter(co); } +#ifdef CONFIG_LINUX_IO_URING +static void fuse_export_delete_uring(FuseExport *exp) +{ + exp->is_uring = false; + + for (size_t qid = 0; qid < exp->num_queues; qid++) { + g_free(exp->queues[qid].ent.op_payload); + } +} +#endif + static void fuse_export_shutdown(BlockExport *blk_exp) { FuseExport *exp = container_of(blk_exp, FuseExport, common); @@ -777,11 +795,6 @@ static void fuse_export_shutdown(BlockExport *blk_exp) */ g_hash_table_remove(exports, exp->mountpoint); } -} - -static void fuse_export_delete(BlockExport *blk_exp) -{ - FuseExport *exp = container_of(blk_exp, FuseExport, common); for (int i = 0; i < exp->num_queues; i++) { FuseQueue *q = &exp->queues[i]; @@ 
-790,11 +803,7 @@ static void fuse_export_delete(BlockExport *blk_exp) if (i > 0 && q->fuse_fd >= 0) { close(q->fuse_fd); } - if (q->spillover_buf) { - qemu_vfree(q->spillover_buf); - } } - g_free(exp->queues); if (exp->fuse_session) { if (exp->mounted) { @@ -803,8 +812,29 @@ static void fuse_export_delete(BlockExport *blk_exp) fuse_session_destroy(exp->fuse_session); } +} + +static void fuse_export_delete(BlockExport *blk_exp) +{ + FuseExport *exp = container_of(blk_exp, FuseExport, common); + + for (int i = 0; i < exp->num_queues; i++) { + FuseQueue *q = &exp->queues[i]; + + if (q->spillover_buf) { + qemu_vfree(q->spillover_buf); + } + } g_free(exp->mountpoint); + +#ifdef CONFIG_LINUX_IO_URING + if (exp->is_uring) { + fuse_export_delete_uring(exp); + } +#endif + + g_free(exp->queues); } /** @@ -1755,8 +1785,8 @@ fuse_uring_send_response(FuseRingEnt *ent, uint32_t req_id, ssize_t ret, /* out_header->len = ret > 0 ? ret : 0; */ ent_in_out->payload_sz = ret > 0 ? ret : 0; - - qemu_vfree(spillover_buf); + /* Commit and fetch a ring entry */ + blk_exp_ref(&exp->common); aio_add_sqe(fuse_uring_prep_sqe_commit, ent, &ent->fuse_cqe_handler); } -- 2.45.2 ^ permalink raw reply related [flat|nested] 15+ messages in thread
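The refcounting rule this patch describes can be summarised in a few lines.
The helper below is hypothetical (the series open-codes the calls at each
submission site), but blk_exp_ref()/blk_exp_unref() and aio_add_sqe() are the
real interfaces it uses:

    /*
     * Sketch of the lifetime rule (simplified): every SQE submitted on
     * behalf of the export pins the export with one reference until its CQE
     * has been handled, so export teardown cannot free the CqeHandler while
     * the kernel still holds a pending SQE.
     */
    static void fuse_uring_submit_pinned(FuseExport *exp, FuseRingEnt *ent,
                                         void (*prep)(struct io_uring_sqe *sqe,
                                                      void *opaque))
    {
        blk_exp_ref(&exp->common);    /* taken before submission */
        aio_add_sqe(prep, ent, &ent->fuse_cqe_handler);
    }

    /*
     * ...and fuse_uring_cqe_handler() drops that reference again with
     * blk_exp_unref(&exp->common), on both the error path and the normal
     * processing path, once the CQE has come back.
     */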
* Re: [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports
  2025-08-15  3:46 [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Zhi Song
                   ` (2 preceding siblings ...)
  2025-08-15  3:46 ` [PATCH 3/3] fuse: Safe termination for FUSE-uring Zhi Song
@ 2025-08-17 13:45 ` Stefan Hajnoczi
  2025-08-18 22:54   ` Bernd Schubert
  2025-08-21  1:32   ` Brian Song
  3 siblings, 2 replies; 15+ messages in thread

From: Stefan Hajnoczi @ 2025-08-17 13:45 UTC (permalink / raw)
  To: Zhi Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

On Thu, Aug 14, 2025 at 11:46:16PM -0400, Zhi Song wrote:
> Due to kernel limitations, when the FUSE-over-io_uring option is
> enabled, you must create and assign nr_cpu IOThreads. For example:

While it would be nice for the kernel to support a more flexible queue
mapping policy, userspace can work around this.

I think Kevin suggested creating the number of FUSE queues required by
the kernel and configuring them across the user's IOThreads. That way
the number of IOThreads can be smaller than the number of FUSE queues.

Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports
  2025-08-17 13:45 ` [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Stefan Hajnoczi
@ 2025-08-18 22:54   ` Bernd Schubert
  2025-08-21  1:32   ` Brian Song
  1 sibling, 0 replies; 15+ messages in thread

From: Bernd Schubert @ 2025-08-18 22:54 UTC (permalink / raw)
  To: Stefan Hajnoczi, Zhi Song
  Cc: qemu-block, qemu-devel, armbru, fam, hreitz, kwolf

On 8/17/25 15:45, Stefan Hajnoczi wrote:
> On Thu, Aug 14, 2025 at 11:46:16PM -0400, Zhi Song wrote:
>> Due to kernel limitations, when the FUSE-over-io_uring option is
>> enabled, you must create and assign nr_cpu IOThreads. For example:
>
> While it would be nice for the kernel to support a more flexible queue
> mapping policy, userspace can work around this.
>
> I think Kevin suggested creating the number of FUSE queues required by
> the kernel and configuring them across the user's IOThreads. That way
> the number of IOThreads can be smaller than the number of FUSE queues.

Sorry, I was off again last week and am only slowly catching up.

Regarding more flexible queues, see here:

https://lore.kernel.org/r/20250722-reduced-nr-ring-queues_3-v1-0-aa8e37ae97e6@ddn.com

I also forgot to mention the corresponding libfuse branch for that:

https://github.com/bsbernd/libfuse/tree/uring-reduce-nr-queues

Bernd

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports
  2025-08-17 13:45 ` [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports Stefan Hajnoczi
  2025-08-18 22:54   ` Bernd Schubert
@ 2025-08-21  1:32   ` Brian Song
  2025-08-21 14:20     ` Stefan Hajnoczi
  1 sibling, 1 reply; 15+ messages in thread

From: Brian Song @ 2025-08-21 1:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

On 8/17/25 9:45 AM, Stefan Hajnoczi wrote:
> On Thu, Aug 14, 2025 at 11:46:16PM -0400, Zhi Song wrote:
>> Due to kernel limitations, when the FUSE-over-io_uring option is
>> enabled, you must create and assign nr_cpu IOThreads. For example:
>
> While it would be nice for the kernel to support a more flexible queue
> mapping policy, userspace can work around this.
>
> I think Kevin suggested creating the number of FUSE queues required by
> the kernel and configuring them across the user's IOThreads. That way
> the number of IOThreads can be smaller than the number of FUSE queues.
>
> Stefan

If we are mapping user specified IOThreads to nr_cpu queues Q, when we
register entries, we need to think about how many entries in each Q[i]
go to different IOThreads, and bind the qid when submitting. Once a CQE
comes back, the corresponding IOThread handles it. Looks like we don't
really need a round robin for dispatching. The actual question is how
to split entries in each queue across IOThreads.

For example, if we split entries evenly:

USER: define 2 IOThreads to submit and recv ring entries
NR_CPU: 4

Q = malloc(sizeof(entry) * 32 * nr_cpu);

IOThread-1:
Q[0]  Q[1]  Q[2]  Q[3]
 16    16    16    16

IOThread-2:
Q[0]  Q[1]  Q[2]  Q[3]
 16    16    16    16

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/3] block/export: Add FUSE-over-io_uring for Storage Exports
  2025-08-21  1:32 ` Brian Song
@ 2025-08-21 14:20   ` Stefan Hajnoczi
  0 siblings, 0 replies; 15+ messages in thread

From: Stefan Hajnoczi @ 2025-08-21 14:20 UTC (permalink / raw)
  To: Brian Song; +Cc: qemu-block, qemu-devel, armbru, bernd, fam, hreitz, kwolf

On Wed, Aug 20, 2025 at 09:32:44PM -0400, Brian Song wrote:
> On 8/17/25 9:45 AM, Stefan Hajnoczi wrote:
> > On Thu, Aug 14, 2025 at 11:46:16PM -0400, Zhi Song wrote:
> >> Due to kernel limitations, when the FUSE-over-io_uring option is
> >> enabled, you must create and assign nr_cpu IOThreads. For example:
> >
> > While it would be nice for the kernel to support a more flexible queue
> > mapping policy, userspace can work around this.
> >
> > I think Kevin suggested creating the number of FUSE queues required by
> > the kernel and configuring them across the user's IOThreads. That way
> > the number of IOThreads can be smaller than the number of FUSE queues.
> >
> > Stefan
>
> If we are mapping user specified IOThreads to nr_cpu queues Q, when we
> register entries, we need to think about how many entries in each Q[i]
> go to different IOThreads, and bind the qid when submitting. Once a CQE
> comes back, the corresponding IOThread handles it. Looks like we don't
> really need a round robin for dispatching. The actual question is how

Round-robin is needed for the qid -> IOThread mapping, not for dispatching
individual requests. The kernel currently dispatches requests based on a
1:1 CPU:Queue mapping.

> to split entries in each queue across IOThreads.
>
> For example, if we split entries evenly:
>
> USER: define 2 IOThreads to submit and recv ring entries
> NR_CPU: 4
>
> Q = malloc(sizeof(entry) * 32 * nr_cpu);
>
> IOThread-1:
> Q[0]  Q[1]  Q[2]  Q[3]
>  16    16    16    16
>
> IOThread-2:
> Q[0]  Q[1]  Q[2]  Q[3]
>  16    16    16    16

There is no need to have nr_cpus queues in each IOThread. The constraint is
that the total number of queues across all IOThreads must equal nr_cpus.

The malloc in your example implies that each FuseQueue will have 32 entries
(REGISTER uring_cmds). nr_cpu is 4, so the mapping should look like this:

IOThread-1:
Q[0]  Q[2]
 32    32

IOThread-2:
Q[1]  Q[3]
 32    32

Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread
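The qid-to-IOThread assignment described here amounts to a simple
round-robin. A standalone sketch (not code from the series; nr_cpus and
num_iothreads are assumed inputs) that reproduces the table above:

    #include <stdio.h>

    int main(void)
    {
        const int nr_cpus = 4;        /* number of FUSE queues the kernel expects */
        const int num_iothreads = 2;  /* what the user configured */

        /*
         * Each queue keeps all of its ring entries; only the queue itself
         * is assigned round-robin to an IOThread, which then registers and
         * services that qid.
         */
        for (int qid = 0; qid < nr_cpus; qid++) {
            printf("Q[%d] -> IOThread-%d\n", qid, (qid % num_iothreads) + 1);
        }
        return 0;
    }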