From: Stefan Hajnoczi <stefanha@redhat.com>
To: Brian <hibriansong@gmail.com>
Cc: Hanna Czenczek <hreitz@redhat.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Kevin Wolf <kwolf@redhat.com>,
Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer
Date: Wed, 11 Jun 2025 09:46:41 -0400
Message-ID: <20250611134641.GA160966@fedora>
In-Reply-To: <0eb2f89d-7be4-4480-a7e3-1b16344364db@gmail.com>
On Tue, Jun 10, 2025 at 07:37:34PM -0400, Brian wrote:
> Hi all,
>
> I'm currently working on the FUSE-over-io_uring feature for the QEMU
> FUSE export. When I submit SQEs to the /dev/fuse file descriptor
> during the register phase, the kernel returns an error (in
> fuse_uring_create_ring_ent()). It seems this is because the payload
> size (i.e. the spillover_buf size, which is FUSE_MAX_WRITE_BYTES (64k)
> minus FUSE_IN_PLACE_WRITE_BYTES (4k), so 60k) is smaller than
> ring->max_payload_sz (which is 32 pages * 4k = 128k).
>
> Do we need to increase the spillover_buf size, or is there any
> other workaround?
When you implement support for concurrent FUSE-over-io_uring requests,
you'll need to pre-allocate a buffer for each request. That
pre-allocation code will be separate from the request buffer handling
that this patch series defines. So I think this issue is specific to
FUSE-over-io_uring and is something that can be addressed in your
upcoming patches rather than something that affects this patch series.

How about going ahead and writing the FUSE-over-io_uring-specific code
for pre-allocating request buffers right away, instead of reusing
FuseQueue->request_buf[]?
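
For illustration, here's an untested sketch of what I mean. FuseRingEnt,
fuse_uring_prep_ent(), and where max_payload_sz comes from are
placeholder assumptions about your upcoming patches, not existing QEMU
code:

/* Hypothetical per-request entry for FUSE-over-io_uring */
typedef struct FuseRingEnt {
    void *payload;      /* pre-allocated payload buffer */
    size_t payload_sz;  /* matches the ring's max_payload_sz */
} FuseRingEnt;

static void fuse_uring_prep_ent(FuseExport *exp, FuseRingEnt *ent,
                                size_t max_payload_sz)
{
    /*
     * Allocate the full payload size the kernel advertises, once per
     * request slot, instead of reusing FuseQueue->request_buf[] plus
     * the spill-over buffer from this series (which together only add
     * up to FUSE_MAX_WRITE_BYTES).
     */
    ent->payload_sz = max_payload_sz;
    ent->payload = blk_blockalign(exp->common.blk, max_payload_sz);
}

Each registered SQE would then point at ent->payload (plus the request
header), so the kernel's size check in fuse_uring_create_ring_ent()
passes regardless of FUSE_SPILLOVER_BUF_SIZE.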
Stefan
>
>
> Brian
>
> On 6/4/25 9:28 AM, Hanna Czenczek wrote:
> > We probably want to support larger write sizes than just 4k; 64k seems
> > nice. However, we cannot read partial requests from the FUSE FD, we
> > always have to read requests in full; so our read buffer must be large
> > enough to accommodate potential 64k writes if we want to support that.
> >
> > Always allocating FuseRequest objects with 64k buffers in them seems
> > wasteful, though. But we can get around the issue by splitting the
> > buffer into two and using readv(): One part will hold all normal (up to
> > 4k) write requests and all other requests, and a second part (the
> > "spill-over buffer") will be used only for larger write requests. Each
> > FuseQueue has its own spill-over buffer, and only if we find it used
> > when reading a request will we move its ownership into the FuseRequest
> > object and allocate a new spill-over buffer for the queue.
> >
> > This way, we get to support "large" write sizes without having to
> > allocate big buffers when they aren't used.
> >
> > Also, this even reduces the size of the FuseRequest objects because the
> > read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> > the requests we support are not quite so large (except for >4k writes),
> > so until now, we basically had to have useless padding in there.
> >
> > With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> > is easily met and we can decrease the size of the buffer portion that is
> > right inside of FuseRequest.
> >
> > As for benchmarks, the benefit of this patch can be shown easily by
> > writing a 4G image (with qemu-img convert) to a FUSE export:
> > - Before this patch: Takes 25.6 s (14.4 s with -t none)
> > - After this patch: Takes 4.5 s (5.5 s with -t none)
> >
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > ---
> > block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 118 insertions(+), 19 deletions(-)
> >
> > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > index cdec31f2a8..908266d101 100644
> > --- a/block/export/fuse.c
> > +++ b/block/export/fuse.c
> > @@ -50,8 +50,17 @@
> > /* Prevent overly long bounce buffer allocations */
> > #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> > -/* Small enough to fit in the request buffer */
> > -#define FUSE_MAX_WRITE_BYTES (4 * 1024)
> > +/*
> > + * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
> > + * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
> > + * determine the split between the size of the in-place buffer in FuseRequest
> > + * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
> > + * detailed explanation.
> > + */
> > +#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
> > +#define FUSE_MAX_WRITE_BYTES (64 * 1024)
> > +#define FUSE_SPILLOVER_BUF_SIZE \
> > + (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
> > typedef struct FuseExport FuseExport;
> > @@ -67,15 +76,49 @@ typedef struct FuseQueue {
> > /*
> > * The request buffer must be able to hold a full write, and/or at least
> > - * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
> > + * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
> > + * This however is just the first part of the buffer; every read is given
> > + * a vector of this buffer (which should be enough for all normal requests,
> > + * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
> > + * spill-over buffer below.
> > + * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
> > + * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
> > + */
> > + char request_buf[sizeof(struct fuse_in_header) +
> > + sizeof(struct fuse_write_in) +
> > + FUSE_IN_PLACE_WRITE_BYTES];
> > +
> > + /*
> > + * When retrieving a FUSE request, the destination buffer must always be
> > + * sufficiently large for the whole request, i.e. with max_write=64k, we
> > + * must provide a buffer that fits the WRITE header and 64 kB of space for
> > + * data.
> > + * We do want to support 64k write requests without requiring them to be
> > + * split up, but at the same time, do not want to do such a large allocation
> > + * for every single request.
> > + * Therefore, the FuseRequest object provides an in-line buffer that is
> > + * enough for write requests up to 4k (and all other requests), and for
> > + * every request that is bigger, we provide a spill-over buffer here (for
> > + * the remaining 64k - 4k = 60k).
> > + * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
> > + * I/O vector, and then checks the return value (number of bytes read) to
> > + * find out whether the spill-over buffer was used. If so, it will move the
> > + * buffer to the request, and will allocate a new spill-over buffer for the
> > + * next request.
> > + *
> > + * Free this buffer with qemu_vfree().
> > */
> > - char request_buf[MAX_CONST(
> > - sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
> > - FUSE_MAX_WRITE_BYTES,
> > - FUSE_MIN_READ_BUFFER
> > - )];
> > + void *spillover_buf;
> > } FuseQueue;
> > +/*
> > + * Verify that FuseQueue.request_buf plus the spill-over buffer together
> > + * are big enough to be accepted by the FUSE kernel driver.
> > + */
> > +QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> > + FUSE_SPILLOVER_BUF_SIZE <
> > + FUSE_MIN_READ_BUFFER);
> > +
> > struct FuseExport {
> > BlockExport common;
> > @@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
> > static bool is_regular_file(const char *path, Error **errp);
> > static void read_from_fuse_fd(void *opaque);
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q);
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf);
> > static void fuse_inc_in_flight(FuseExport *exp)
> > {
> > @@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > FuseExport *exp = q->exp;
> > ssize_t ret;
> > const struct fuse_in_header *in_hdr;
> > + struct iovec iov[2];
> > + void *spillover_buf = NULL;
> > if (unlikely(exp->halted)) {
> > goto no_request;
> > }
> > - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> > + /*
> > + * If handling the last request consumed the spill-over buffer, allocate a
> > + * new one. Align it to the block device's alignment, which admittedly is
> > + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
> > + */
> > + if (unlikely(!q->spillover_buf)) {
> > + q->spillover_buf = blk_blockalign(exp->common.blk,
> > + FUSE_SPILLOVER_BUF_SIZE);
> > + }
> > + /* Construct the I/O vector to hold the FUSE request */
> > + iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
> > + iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
> > +
> > + ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
> > if (ret < 0 && errno == EAGAIN) {
> > /* No request available */
> > goto no_request;
> > @@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > goto no_request;
> > }
> > - fuse_co_process_request(q);
> > + if (unlikely(ret > sizeof(q->request_buf))) {
> > + /* Spillover buffer used, take ownership */
> > + spillover_buf = q->spillover_buf;
> > + q->spillover_buf = NULL;
> > + }
> > +
> > + fuse_co_process_request(q, spillover_buf);
> > no_request:
> > fuse_dec_in_flight(exp);
> > @@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
> > if (i > 0 && q->fuse_fd >= 0) {
> > close(q->fuse_fd);
> > }
> > + if (q->spillover_buf) {
> > + qemu_vfree(q->spillover_buf);
> > + }
> > }
> > g_free(exp->queues);
> > @@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> > }
> > /**
> > - * Handle client writes to the exported image. @buf has the data to be written
> > - * and will be copied to a bounce buffer before yielding for the first time.
> > + * Handle client writes to the exported image. @in_place_buf has the first
> > + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> > + * contains the rest (if any; NULL otherwise).
> > + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> > + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> > + * assumed to be exclusively owned and will be used as-is.
> > * Return the number of bytes written to *out on success, and -errno on error.
> > */
> > static ssize_t coroutine_fn
> > fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > - uint64_t offset, uint32_t size, const void *buf)
> > + uint64_t offset, uint32_t size,
> > + const void *in_place_buf, const void *spillover_buf)
> > {
> > + size_t in_place_size;
> > void *copied;
> > int64_t blk_len;
> > int ret;
> > + struct iovec iov[2];
> > + QEMUIOVector qiov;
> > /* Limited by max_write, should not happen */
> > if (size > BDRV_REQUEST_MAX_BYTES) {
> > @@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > /* Must copy to bounce buffer before potentially yielding */
> > - copied = blk_blockalign(exp->common.blk, size);
> > - memcpy(copied, buf, size);
> > + in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> > + copied = blk_blockalign(exp->common.blk, in_place_size);
> > + memcpy(copied, in_place_buf, in_place_size);
> > /**
> > * Clients will expect short writes at EOF, so we have to limit
> > @@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > }
> > - ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
> > + iov[0] = (struct iovec) {
> > + .iov_base = copied,
> > + .iov_len = in_place_size,
> > + };
> > + if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> > + assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> > + iov[1] = (struct iovec) {
> > + .iov_base = (void *)spillover_buf,
> > + .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> > + };
> > + qemu_iovec_init_external(&qiov, iov, 2);
> > + } else {
> > + qemu_iovec_init_external(&qiov, iov, 1);
> > + }
> > + ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
> > if (ret < 0) {
> > goto fail_free_buffer;
> > }
> > @@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> > * Note that yielding in any request-processing function can overwrite the
> > * contents of q->request_buf. Anything that takes a buffer needs to take
> > * care that the content is copied before yielding.
> > + *
> > + * @spillover_buf can contain the tail of a write request too large to fit into
> > + * q->request_buf. This function takes ownership of it (i.e. will free it),
> > + * which assumes that its contents will not be overwritten by concurrent
> > + * requests (as opposed to q->request_buf).
> > */
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> > {
> > FuseExport *exp = q->exp;
> > uint32_t opcode;
> > @@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > * yielding.
> > */
> > ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> > - in->offset, in->size, in + 1);
> > + in->offset, in->size, in + 1, spillover_buf);
> > break;
> > }
> > @@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > ret < 0 ? ret : 0,
> > ret < 0 ? 0 : ret);
> > }
> > +
> > + qemu_vfree(spillover_buf);
> > }
> > const BlockExportDriver blk_exp_fuse = {