From: Stefan Hajnoczi <stefanha@redhat.com>
To: Bernd Schubert <bschubert@ddn.com>
Cc: Brian Song <hibriansong@gmail.com>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"armbru@redhat.com" <armbru@redhat.com>,
	"fam@euphon.net" <fam@euphon.net>,
	"hreitz@redhat.com" <hreitz@redhat.com>,
	"kwolf@redhat.com" <kwolf@redhat.com>
Subject: Re: [PATCH RFC 1/1] block/export: FUSE-over-io_uring Support for QEMU FUSE Exports
Date: Tue, 22 Jul 2025 11:43:57 -0400
Message-ID: <20250722154357.GA21935@fedora>
In-Reply-To: <e8210399-9d04-46c1-8a23-6e8cdb472c6f@ddn.com>

On Tue, Jul 22, 2025 at 02:06:04PM +0000, Bernd Schubert wrote:
> On 7/21/25 02:53, Stefan Hajnoczi wrote:
> > On Wed, Jul 16, 2025 at 02:38:24PM -0400, Brian Song wrote:
> >> This work provides an initial implementation of fuse-over-io_uring
> >> support for QEMU export. According to the fuse-over-io_uring protocol
> >> specification, the userspace side must create the same number of queues
> >> as the number of CPUs (nr_cpu), just like the kernel. Currently, each
> >> queue contains only a single SQE entry, which is used to validate the
> >> correctness of the fuse-over-io_uring functionality.
> >>
> >> All FUSE read and write operations interact with the kernel via io
> >> vectors embedded in the SQE entry during submission and CQE fetching.
> >> The req_header and op_payload members of each entry are included as
> >> parts of the io vector: req_header carries the FUSE operation header,
> >> and op_payload carries the data payload, such as file attributes in a
> >> getattr reply, file content in a read reply, or file content being
> >> written to the FUSE client in a write operation.
> >>
> >> At present, multi-threading support is still incomplete. In addition,
> >> handling connection termination and managing the "drained" state of a
> >> FUSE block export in QEMU remain as pending work.
> >>
> >> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Brian Song <hibriansong@gmail.com>
> >>
> >> ---
> >>  block/export/fuse.c                  | 423 +++++++++++++++++++++++++--
> >>  docs/tools/qemu-storage-daemon.rst   |  10 +-
> >>  qapi/block-export.json               |   6 +-
> >>  storage-daemon/qemu-storage-daemon.c |   1 +
> >>  util/fdmon-io_uring.c                |   5 +-
> >>  5 files changed, 420 insertions(+), 25 deletions(-)
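
(For anyone following along: the "io vectors embedded in the SQE entry"
described above map onto the FuseRingEnt fields introduced below roughly like
this - an illustrative sketch, not code copied from the patch:)

    ent->iov[0].iov_base = ent->req_header;                       /* FUSE request header */
    ent->iov[0].iov_len  = sizeof(struct fuse_uring_req_header);
    ent->iov[1].iov_base = ent->op_payload;                       /* data payload (attrs, read/write data) */
    ent->iov[1].iov_len  = ent->req_payload_sz;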
> > 
> > Here is feedback from a first pass over this patch.
> > 
> >>
> >> diff --git a/block/export/fuse.c b/block/export/fuse.c
> >> index c0ad4696ce..637d36186a 100644
> >> --- a/block/export/fuse.c
> >> +++ b/block/export/fuse.c
> >> @@ -48,6 +48,11 @@
> >>  #include <linux/fs.h>
> >>  #endif
> >>  
> >> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> >> +
> >> +/* room needed in buffer to accommodate header */
> >> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> >> +
> >>  /* Prevent overly long bounce buffer allocations */
> >>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> >>  /*
> >> @@ -64,6 +69,26 @@
> >>  
> >>  typedef struct FuseExport FuseExport;
> >>  
> >> +struct FuseQueue;
> > 
> > Use "typedef struct FuseQueue FuseQueue;" here...
> > 
> >> +
> >> +typedef struct FuseRingEnt {
> >> +    /* back pointer */
> >> +    struct FuseQueue *q;
> > 
> > ...and then this can be "FuseQueue *q;" so that QEMU coding style is
> > followed.
> > 
> >> +
> >> +    /* commit id of a fuse request */
> >> +    uint64_t req_commit_id;
> >> +
> >> +    /* fuse request header and payload */
> >> +    struct fuse_uring_req_header *req_header;
> >> +    void *op_payload;
> >> +    size_t req_payload_sz;
> >> +
> >> +    /* The vector passed to the kernel */
> >> +    struct iovec iov[2];
> >> +
> >> +    CqeHandler fuse_cqe_handler;
> >> +} FuseRingEnt;
> >> +
> >>  /*
> >>   * One FUSE "queue", representing one FUSE FD from which requests are fetched
> >>   * and processed.  Each queue is tied to an AioContext.
> >> @@ -73,6 +98,7 @@ typedef struct FuseQueue {
> >>  
> >>      AioContext *ctx;
> >>      int fuse_fd;
> >> +    int qid;
> > 
> > Could this go inside #ifdef CONFIG_LINUX_IO_URING? It seems to be
> > specific to FUSE-over-io_uring.
> > 
> >>  
> >>      /*
> >>       * The request buffer must be able to hold a full write, and/or at least
> >> @@ -109,6 +135,17 @@ typedef struct FuseQueue {
> >>       * Free this buffer with qemu_vfree().
> >>       */
> >>      void *spillover_buf;
> >> +
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +    FuseRingEnt ent;
> >> +
> >> +    /*
> >> +     * TODO
> >> +     * Support multi-threaded FUSE over io_uring by using eventfd and allocating
> >> +     * an extra SQE for each thread to be notified when the connection
> >> +     * shuts down.
> >> +     */
> > 
> > eventfd and the extra SQE won't be necessary because
> > aio_bh_schedule_oneshot() can be used to cause threads to execute a
> > function.
> > 
> > (I think this comment effectively says that connection shutdown still
> > needs to be implemented. The implementation details don't matter at this
> > point.)
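
For what it's worth, that alternative could look roughly like the following
sketch (fuse_uring_queue_shutdown_bh and fuse_uring_shutdown_queues are
made-up names, not anything in this patch):

    /* Sketch: run a teardown function in each queue's AioContext */
    static void fuse_uring_queue_shutdown_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /* runs in q->ctx; cancel this queue's pending SQE and clean up here */
    }

    static void fuse_uring_shutdown_queues(FuseExport *exp)
    {
        for (int i = 0; i < exp->num_queues; i++) {
            FuseQueue *q = &exp->queues[i];

            aio_bh_schedule_oneshot(q->ctx, fuse_uring_queue_shutdown_bh, q);
        }
    }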
> > 
> >> +#endif
> >>  } FuseQueue;
> >>  
> >>  /*
> >> @@ -148,6 +185,7 @@ struct FuseExport {
> >>      bool growable;
> >>      /* Whether allow_other was used as a mount option or not */
> >>      bool allow_other;
> >> +    bool is_uring;
> >>  
> >>      mode_t st_mode;
> >>      uid_t st_uid;
> >> @@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> >>      .drained_poll  = fuse_export_drained_poll,
> >>  };
> >>  
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> >> +
> >> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> >> +{
> >> +    CqeHandler *cqe_handler = opaque;
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> > 
> > Passing ent in opaque instead of cqe_handler would simplify this.
> > 
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    fuse_uring_co_process_request(ent);
> >> +
> >> +    fuse_dec_in_flight(exp);
> >> +}
> >> +
> >> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> >> +{
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> >> +    FuseQueue *q = ent->q;
> >> +    Coroutine *co;
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    int err = cqe_handler->cqe.res;
> >> +    if (err != 0) {
> >> +        /* TODO end_conn support */
> >> +
> >> +        /* -ENOTCONN is ok on umount  */
> >> +        if (err != -EINTR && err != -EOPNOTSUPP &&
> >> +            err != -EAGAIN && err != -ENOTCONN) {
> >> +            fuse_export_halt(exp);
> >> +        }
> >> +    } else {
> >> +        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
> >> +                            cqe_handler);
> >> +        /* Decremented by co_fuse_uring_queue_handle_cqes() */
> >> +        fuse_inc_in_flight(q->exp);
> > 
> > Can this be moved inside co_fuse_uring_queue_handle_cqes() to avoid
> > calling inc/dec from different functions? That would make the code
> > easier to understand and more robust against future bugs.
> > 
> >> +        qemu_coroutine_enter(co);
> >> +    }
> >> +}
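
Putting those two suggestions together (pass ent as the opaque pointer, and
keep the in-flight inc/dec pair in one function), the shape might be roughly
this - an untested sketch:

    static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
    {
        FuseRingEnt *ent = opaque;
        FuseExport *exp = ent->q->exp;

        fuse_inc_in_flight(exp);    /* paired with the dec below */
        fuse_uring_co_process_request(ent);
        fuse_dec_in_flight(exp);
    }

    /* ...and in fuse_uring_cqe_handler(): */
    co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
    qemu_coroutine_enter(co);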
> >> +
> >> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> >> +                    const unsigned int qid,
> >> +                    const unsigned int commit_id)
> >> +{
> >> +    req->qid = qid;
> >> +    req->commit_id = commit_id;
> >> +    req->flags = 0;
> >> +}
> >> +
> >> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
> >> +               __u32 cmd_op)
> >> +{
> >> +    sqe->opcode = IORING_OP_URING_CMD;
> >> +
> >> +    sqe->fd = ent->q->fuse_fd;
> >> +    sqe->rw_flags = 0;
> >> +    sqe->ioprio = 0;
> >> +    sqe->off = 0;
> >> +
> >> +    sqe->cmd_op = cmd_op;
> >> +    sqe->__pad1 = 0;
> >> +}
> >> +
> >> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> >> +{
> >> +    FuseQueue *q = opaque;
> >> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> >> +
> >> +    fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
> >> +
> >> +    sqe->addr = (uint64_t)(q->ent.iov);
> >> +    sqe->len = 2;
> >> +
> >> +    fuse_uring_sqe_set_req_data(req, q->qid, 0);
> >> +}
> >> +
> >> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> >> +{
> >> +    /*
> >> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> >> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> >> +     * the kernel by default. Also, max_write should not exceed
> >> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> >> +     */
> >> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> >> +
> >> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> >> +        /*
> >> +         * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
> >> +         *       qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
> >> +         */
> >> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> >> +                         + FUSE_BUFFER_HEADER_SIZE;
> >> +    }
> >> +
> >> +    for (int i = 0; i < exp->num_queues; i++) {
> >> +        FuseQueue *q = &exp->queues[i];
> >> +
> >> +        q->ent.q = q;
> >> +
> >> +        q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
> > 
> > It's probably easier to embed the header as a FuseRingEnt field instead
> > of heap allocating it.
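
i.e. something along these lines (a sketch of the suggested change only):

    typedef struct FuseRingEnt {
        FuseQueue *q;

        uint64_t req_commit_id;

        /* embedded rather than allocated with g_malloc0() */
        struct fuse_uring_req_header req_header;
        void *op_payload;
        size_t req_payload_sz;

        struct iovec iov[2];

        CqeHandler fuse_cqe_handler;
    } FuseRingEnt;

with iov[0] then pointing at &q->ent.req_header.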
> 
> Hmm well. We have two additional patches in the DDN branch that I
> haven't had time to upstream yet. These patches allow pinning these
> buffers/pages, so the application doing I/O can write directly into
> the buffer, which saves context switches. The initial RFC kernel
> patches used mmapped buffers, and when I had to switch to userspace
> buffers, performance dropped badly. I didn't run real benchmarks, just
> xfstests - with mmapped buffers it ran about 3 times faster than
> legacy FUSE, and that advantage was lost with normal buffers that get
> mapped per request. Switching to pinned buffers brought back the fast
> xfstest runs.
> The issue is that the buffer needs to be page aligned, which is why
> libfuse does an extra allocation here.
> In libfuse I should probably make this optional, as pinned buffers
> will mostly only work for root (they need locked memory).
> 
> In principle I should document these details somewhere; I should
> probably write a blog post or similar.

That sounds like something we'd like to try out, although we try to
minimize privileges needed for qemu-storage-daemon.
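
If we do experiment with it, the page alignment requirement itself looks easy
to satisfy on our side - a sketch using existing QEMU helpers, assuming the
buffer is sized as in fuse_uring_start():

    /* Sketch: page-aligned payload buffer (pinning needs page alignment) */
    ent->op_payload = qemu_memalign(qemu_real_host_page_size(), bufsize);
    /* ...and later freed with qemu_vfree(ent->op_payload) */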

A question about pinning and resource limits: qemu-storage-daemon uses
O_DIRECT for I/O as non-root without hitting mlock resource limits. Is
there a difference between pin_user_pages(FOLL_PIN) (for direct I/O) and
pin_user_pages(FOLL_PIN | FOLL_LONGTERM) (possibly for
FUSE-over-io_uring pinning) in terms of resource limit behavior?

Thanks,
Stefan
