From: Stefan Hajnoczi <stefanha@redhat.com>
To: Bernd Schubert <bschubert@ddn.com>
Cc: Brian Song <hibriansong@gmail.com>,
"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"armbru@redhat.com" <armbru@redhat.com>,
"fam@euphon.net" <fam@euphon.net>,
"hreitz@redhat.com" <hreitz@redhat.com>,
"kwolf@redhat.com" <kwolf@redhat.com>
Subject: Re: [PATCH RFC 1/1] block/export: FUSE-over-io_uring Support for QEMU FUSE Exports
Date: Tue, 22 Jul 2025 11:43:57 -0400
Message-ID: <20250722154357.GA21935@fedora>
In-Reply-To: <e8210399-9d04-46c1-8a23-6e8cdb472c6f@ddn.com>
On Tue, Jul 22, 2025 at 02:06:04PM +0000, Bernd Schubert wrote:
> On 7/21/25 02:53, Stefan Hajnoczi wrote:
> > On Wed, Jul 16, 2025 at 02:38:24PM -0400, Brian Song wrote:
> >> This work provides an initial implementation of fuse-over-io_uring
> >> support for QEMU exports. According to the fuse-over-io_uring protocol
> >> specification, the userspace side must create the same number of queues
> >> as the number of CPUs (nr_cpu), just like the kernel. Currently, each
> >> queue contains only a single SQE entry, which is used to validate the
> >> correctness of the fuse-over-io_uring functionality.
> >>
> >> All FUSE read and write operations interact with the kernel via io
> >> vectors embedded in the SQE entry during submission and CQE fetching.
> >> The req_header and op_payload members of each entry are included as
> >> parts of the io vector: req_header carries the FUSE operation header,
> >> and op_payload carries the data payload, such as file attributes in a
> >> getattr reply, file content in a read reply, or file content being
> >> written to the FUSE client in a write operation.
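
(For clarity, the per-entry vector described above ends up roughly as
follows, using the field names from the patch - a sketch, not the
patch's literal code:)

    ent->iov[0] = (struct iovec) {
        .iov_base = ent->req_header,
        .iov_len  = sizeof(*ent->req_header),
    };
    ent->iov[1] = (struct iovec) {
        .iov_base = ent->op_payload,
        .iov_len  = ent->req_payload_sz,
    };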
> >>
> >> At present, multi-threading support is still incomplete. In addition,
> >> handling connection termination and managing the "drained" state of a
> >> FUSE block export in QEMU remain as pending work.
> >>
> >> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Brian Song <hibriansong@gmail.com>
> >>
> >> ---
> >> block/export/fuse.c | 423 +++++++++++++++++++++++++--
> >> docs/tools/qemu-storage-daemon.rst | 10 +-
> >> qapi/block-export.json | 6 +-
> >> storage-daemon/qemu-storage-daemon.c | 1 +
> >> util/fdmon-io_uring.c | 5 +-
> >> 5 files changed, 420 insertions(+), 25 deletions(-)
> >
> > Here is feedback from a first pass over this patch.
> >
> >>
> >> diff --git a/block/export/fuse.c b/block/export/fuse.c
> >> index c0ad4696ce..637d36186a 100644
> >> --- a/block/export/fuse.c
> >> +++ b/block/export/fuse.c
> >> @@ -48,6 +48,11 @@
> >> #include <linux/fs.h>
> >> #endif
> >>
> >> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> >> +
> >> +/* room needed in buffer to accommodate header */
> >> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> >> +
> >> /* Prevent overly long bounce buffer allocations */
> >> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> >> /*
> >> @@ -64,6 +69,26 @@
> >>
> >> typedef struct FuseExport FuseExport;
> >>
> >> +struct FuseQueue;
> >
> > Use "typedef struct FuseQueue FuseQueue;" here...
> >
> >> +
> >> +typedef struct FuseRingEnt {
> >> + /* back pointer */
> >> + struct FuseQueue *q;
> >
> > ...and then this can be "FuseQueue *q;" so that QEMU coding style is
> > followed.
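
i.e.:

    typedef struct FuseQueue FuseQueue;

    typedef struct FuseRingEnt {
        /* back pointer */
        FuseQueue *q;
        /* ... rest of the fields unchanged ... */
    } FuseRingEnt;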
> >
> >> +
> >> + /* commit id of a fuse request */
> >> + uint64_t req_commit_id;
> >> +
> >> + /* fuse request header and payload */
> >> + struct fuse_uring_req_header *req_header;
> >> + void *op_payload;
> >> + size_t req_payload_sz;
> >> +
> >> + /* The vector passed to the kernel */
> >> + struct iovec iov[2];
> >> +
> >> + CqeHandler fuse_cqe_handler;
> >> +} FuseRingEnt;
> >> +
> >> /*
> >> * One FUSE "queue", representing one FUSE FD from which requests are fetched
> >> * and processed. Each queue is tied to an AioContext.
> >> @@ -73,6 +98,7 @@ typedef struct FuseQueue {
> >>
> >> AioContext *ctx;
> >> int fuse_fd;
> >> + int qid;
> >
> > Could this go inside #ifdef CONFIG_LINUX_IO_URING? It seems to be
> > specific to FUSE-over-io_uring.
> >
> >>
> >> /*
> >> * The request buffer must be able to hold a full write, and/or at least
> >> @@ -109,6 +135,17 @@ typedef struct FuseQueue {
> >> * Free this buffer with qemu_vfree().
> >> */
> >> void *spillover_buf;
> >> +
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> + FuseRingEnt ent;
> >> +
> >> + /*
> >> + * TODO
> >> + * Support multi-threaded FUSE over io_uring by using eventfd and allocating
> >> + * an extra SQE for each thread to be notified when the connection
> >> + * shuts down.
> >> + */
> >
> > eventfd and the extra SQE won't be necessary because
> > aio_bh_schedule_oneshot() can be used to cause threads to execute a
> > function.
> >
> > (I think this comment effectively says that connection shutdown still
> > needs to be implemented. The implementation details don't matter at this
> > point.)
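
(For reference, a minimal sketch of that approach - the handler name
here is made up:)

    /* runs in the queue's AioContext */
    static void fuse_uring_queue_shutdown_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /* per-queue shutdown work goes here */
    }

    /* from whichever thread initiates the shutdown: */
    aio_bh_schedule_oneshot(q->ctx, fuse_uring_queue_shutdown_bh, q);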
> >
> >> +#endif
> >> } FuseQueue;
> >>
> >> /*
> >> @@ -148,6 +185,7 @@ struct FuseExport {
> >> bool growable;
> >> /* Whether allow_other was used as a mount option or not */
> >> bool allow_other;
> >> + bool is_uring;
> >>
> >> mode_t st_mode;
> >> uid_t st_uid;
> >> @@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> >> .drained_poll = fuse_export_drained_poll,
> >> };
> >>
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> >> +
> >> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> >> +{
> >> + CqeHandler *cqe_handler = opaque;
> >> + FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> >
> > Passing ent in opaque instead of cqe_handler would simplify this.
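
i.e. (sketch):

    co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);

and then the coroutine can start with "FuseRingEnt *ent = opaque;"
instead of the container_of() dance.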
> >
> >> + FuseExport *exp = ent->q->exp;
> >> +
> >> + fuse_uring_co_process_request(ent);
> >> +
> >> + fuse_dec_in_flight(exp);
> >> +}
> >> +
> >> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> >> +{
> >> + FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> >> + FuseQueue *q = ent->q;
> >> + Coroutine *co;
> >> + FuseExport *exp = ent->q->exp;
> >> +
> >> + int err = cqe_handler->cqe.res;
> >> + if (err != 0) {
> >> + /* TODO end_conn support */
> >> +
> >> + /* -ENOTCONN is ok on umount */
> >> + if (err != -EINTR && err != -EOPNOTSUPP &&
> >> + err != -EAGAIN && err != -ENOTCONN) {
> >> + fuse_export_halt(exp);
> >> + }
> >> + } else {
> >> + co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
> >> + cqe_handler);
> >> + /* Decremented by co_fuse_uring_queue_handle_cqes() */
> >> + fuse_inc_in_flight(q->exp);
> >
> > Can this be moved inside co_fuse_uring_queue_handle_cqes() to avoid
> > calling inc/dec from different functions? That would make the code
> > easier to understand and more robust against future bugs.
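
(Untested sketch combining this with the opaque suggestion above;
qemu_coroutine_enter() runs the coroutine up to its first yield, so the
increment would still happen before fuse_uring_cqe_handler() returns:)

    static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
    {
        FuseRingEnt *ent = opaque;
        FuseExport *exp = ent->q->exp;

        fuse_inc_in_flight(exp);
        fuse_uring_co_process_request(ent);
        fuse_dec_in_flight(exp);
    }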
> >
> >> + qemu_coroutine_enter(co);
> >> + }
> >> +}
> >> +
> >> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> >> + const unsigned int qid,
> >> + const unsigned int commit_id)
> >> +{
> >> + req->qid = qid;
> >> + req->commit_id = commit_id;
> >> + req->flags = 0;
> >> +}
> >> +
> >> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
> >> + __u32 cmd_op)
> >> +{
> >> + sqe->opcode = IORING_OP_URING_CMD;
> >> +
> >> + sqe->fd = ent->q->fuse_fd;
> >> + sqe->rw_flags = 0;
> >> + sqe->ioprio = 0;
> >> + sqe->off = 0;
> >> +
> >> + sqe->cmd_op = cmd_op;
> >> + sqe->__pad1 = 0;
> >> +}
> >> +
> >> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> >> +{
> >> + FuseQueue *q = opaque;
> >> + struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> >> +
> >> + fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
> >> +
> >> + sqe->addr = (uint64_t)(q->ent.iov);
> >> + sqe->len = 2;
> >> +
> >> + fuse_uring_sqe_set_req_data(req, q->qid, 0);
> >> +}
> >> +
> >> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> >> +{
> >> + /*
> >> + * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> >> + * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> >> + * the kernel by default. Also, max_write should not exceed
> >> + * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> >> + */
> >> + size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> >> +
> >> + if (!(out->flags & FUSE_MAX_PAGES)) {
> >> + /*
> >> + * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
> >> + * qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
> >> + */
> >> + bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> >> + + FUSE_BUFFER_HEADER_SIZE;
> >> + }
> >> +
> >> + for (int i = 0; i < exp->num_queues; i++) {
> >> + FuseQueue *q = &exp->queues[i];
> >> +
> >> + q->ent.q = q;
> >> +
> >> + q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
> >
> > It's probably easier to embed the header as a FuseRingEnt field instead
> > of heap allocating it.
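
(i.e., roughly - a sketch of the suggestion, not tested:)

    typedef struct FuseRingEnt {
        FuseQueue *q;
        uint64_t req_commit_id;

        /* embedded instead of heap-allocated */
        struct fuse_uring_req_header req_header;

        void *op_payload;
        size_t req_payload_sz;

        struct iovec iov[2];
        CqeHandler fuse_cqe_handler;
    } FuseRingEnt;

with the registration then using &q->ent.req_header and
sizeof(q->ent.req_header) for iov[0].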
>
> Hmm well. We have two additional patches in the DDN branch that I
> haven't had time to upstream yet. These patches allow pinning these
> buffers/pages, so that the application doing I/O can write directly
> into the buffer - that saves context switches. The initial RFC kernel
> patches used mmapped buffers, and when I had to switch to userspace
> buffers, performance dropped badly. I didn't run real benchmarks, just
> xfstests - with mmapped buffers it ran about 3 times faster than
> legacy fuse, and that advantage got lost with normal buffers that get
> mapped per request. Switching to pinned buffers brought back the fast
> xfstest runs.
> The issue is that the buffer needs to be page aligned - which is why
> libfuse does an extra allocation here.
> In libfuse I should probably make this optional, as pinned buffers
> will mostly only work for root (needs locked memory).
>
> In principle I need to document these details somewhere; I should
> probably write a blog post or so.

That sounds like something we'd like to try out, although we try to
minimize the privileges needed for qemu-storage-daemon.
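
If we do experiment with it, I assume the payload buffer would need a
page-aligned allocation along these lines (sketch; qemu_memalign()
pairs with qemu_vfree()):

    /* page alignment is a prerequisite for pinning, per the above */
    q->ent.op_payload = qemu_memalign(qemu_real_host_page_size(),
                                      q->ent.req_payload_sz);
    ...
    qemu_vfree(q->ent.op_payload);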

A question about pinning and resource limits: qemu-storage-daemon uses
O_DIRECT for I/O as non-root without hitting mlock resource limits. Is
there a difference between pin_user_pages(FOLL_PIN) (for direct I/O) and
pin_user_pages(FOLL_PIN | FOLL_LONGTERM) (possibly what
FUSE-over-io_uring pinning would use) in terms of resource limit
behavior?

Thanks,
Stefan