From: Stefan Hajnoczi <stefanha@redhat.com>
To: Brian <hibriansong@gmail.com>
Cc: Hanna Czenczek <hreitz@redhat.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Kevin Wolf <kwolf@redhat.com>,
Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer
Date: Wed, 11 Jun 2025 09:46:41 -0400
Message-ID: <20250611134641.GA160966@fedora>
In-Reply-To: <0eb2f89d-7be4-4480-a7e3-1b16344364db@gmail.com>
On Tue, Jun 10, 2025 at 07:37:34PM -0400, Brian wrote:
> Hi all,
>
> I'm currently working on the FUSE-over-io_uring feature for the QEMU
> FUSE export. When I submit SQEs to the /dev/fuse file descriptor
> during the register phase, the kernel returns an error (in
> fuse_uring_create_ring_ent()). It seems this is because the payload
> size (i.e. the spillover_buf size, which is FUSE_MAX_WRITE_BYTES (64k)
> minus FUSE_IN_PLACE_WRITE_BYTES (4k), so 60k) is smaller than
> ring->max_payload_sz (which is 32 pages * 4k = 128k).
>
> Do we need to increase the spillover_buf size, or is there any
> other workaround?
When you implement support for concurrent FUSE-over-io_uring requests,
you'll need to pre-allocate a buffer for each request. That
pre-allocation code will be separate from the request buffer handling
that this patch series defines. So I think this issue is specific to
FUSE-over-io_uring and is something that can be addressed in your
upcoming patches rather than something that affects this patch series.

How about going ahead and writing the FUSE-over-io_uring-specific code
for pre-allocating request buffers right away, instead of reusing
FuseQueue->request_buf[]?
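
For illustration, here's an untested sketch of what I mean. FuseRingEnt,
fuse_uring_prep_ent(), and where max_payload_sz comes from are
placeholder assumptions about your upcoming patches, not existing QEMU
code:

/* Hypothetical per-request entry for FUSE-over-io_uring */
typedef struct FuseRingEnt {
    void *payload;      /* pre-allocated payload buffer */
    size_t payload_sz;  /* matches the ring's max_payload_sz */
} FuseRingEnt;

static void fuse_uring_prep_ent(FuseExport *exp, FuseRingEnt *ent,
                                size_t max_payload_sz)
{
    /*
     * Allocate the full payload size the kernel advertises, once per
     * request slot, instead of reusing FuseQueue->request_buf[] plus
     * the spill-over buffer from this series (which together only add
     * up to FUSE_MAX_WRITE_BYTES).
     */
    ent->payload_sz = max_payload_sz;
    ent->payload = blk_blockalign(exp->common.blk, max_payload_sz);
}

Each registered SQE would then point at ent->payload (plus the request
header), so the kernel's size check in fuse_uring_create_ring_ent()
passes regardless of FUSE_SPILLOVER_BUF_SIZE.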
Stefan
>
>
> Brian
>
> On 6/4/25 9:28 AM, Hanna Czenczek wrote:
> > We probably want to support larger write sizes than just 4k; 64k seems
> > nice. However, we cannot read partial requests from the FUSE FD, we
> > always have to read requests in full; so our read buffer must be large
> > enough to accommodate potential 64k writes if we want to support that.
> >
> > Always allocating FuseRequest objects with 64k buffers in them seems
> > wasteful, though. But we can get around the issue by splitting the
> > buffer into two and using readv(): One part will hold all normal (up to
> > 4k) write requests and all other requests, and a second part (the
> > "spill-over buffer") will be used only for larger write requests. Each
> > FuseQueue has its own spill-over buffer, and only if we find it used
> > when reading a request will we move its ownership into the FuseRequest
> > object and allocate a new spill-over buffer for the queue.
> >
> > This way, we get to support "large" write sizes without having to
> > allocate big buffers when they aren't used.
> >
> > Also, this even reduces the size of the FuseRequest objects because the
> > read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> > the requests we support are not quite so large (except for >4k writes),
> > so until now, we basically had to have useless padding in there.
> >
> > With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> > is easily met and we can decrease the size of the buffer portion that is
> > right inside of FuseRequest.
> >
> > As for benchmarks, the benefit of this patch can be shown easily by
> > writing a 4G image (with qemu-img convert) to a FUSE export:
> > - Before this patch: Takes 25.6 s (14.4 s with -t none)
> > - After this patch: Takes 4.5 s (5.5 s with -t none)
> >
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > ---
> > block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 118 insertions(+), 19 deletions(-)
> >
> > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > index cdec31f2a8..908266d101 100644
> > --- a/block/export/fuse.c
> > +++ b/block/export/fuse.c
> > @@ -50,8 +50,17 @@
> > /* Prevent overly long bounce buffer allocations */
> > #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> > -/* Small enough to fit in the request buffer */
> > -#define FUSE_MAX_WRITE_BYTES (4 * 1024)
> > +/*
> > + * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
> > + * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
> > + * determine the split between the size of the in-place buffer in FuseRequest
> > + * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
> > + * detailed explanation.
> > + */
> > +#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
> > +#define FUSE_MAX_WRITE_BYTES (64 * 1024)
> > +#define FUSE_SPILLOVER_BUF_SIZE \
> > + (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
> > typedef struct FuseExport FuseExport;
> > @@ -67,15 +76,49 @@ typedef struct FuseQueue {
> > /*
> > * The request buffer must be able to hold a full write, and/or at least
> > - * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
> > + * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
> > + * This however is just the first part of the buffer; every read is given
> > + * a vector of this buffer (which should be enough for all normal requests,
> > + * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
> > + * spill-over buffer below.
> > + * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
> > + * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
> > + */
> > + char request_buf[sizeof(struct fuse_in_header) +
> > + sizeof(struct fuse_write_in) +
> > + FUSE_IN_PLACE_WRITE_BYTES];
> > +
> > + /*
> > + * When retrieving a FUSE request, the destination buffer must always be
> > + * sufficiently large for the whole request, i.e. with max_write=64k, we
> > + * must provide a buffer that fits the WRITE header and 64 kB of space for
> > + * data.
> > + * We do want to support 64k write requests without requiring them to be
> > + * split up, but at the same time, do not want to do such a large allocation
> > + * for every single request.
> > + * Therefore, the FuseRequest object provides an in-line buffer that is
> > + * enough for write requests up to 4k (and all other requests), and for
> > + * every request that is bigger, we provide a spill-over buffer here (for
> > + * the remaining 64k - 4k = 60k).
> > + * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
> > + * I/O vector, and then checks the return value (number of bytes read) to
> > + * find out whether the spill-over buffer was used. If so, it will move the
> > + * buffer to the request, and will allocate a new spill-over buffer for the
> > + * next request.
> > + *
> > + * Free this buffer with qemu_vfree().
> > */
> > - char request_buf[MAX_CONST(
> > - sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
> > - FUSE_MAX_WRITE_BYTES,
> > - FUSE_MIN_READ_BUFFER
> > - )];
> > + void *spillover_buf;
> > } FuseQueue;
> > +/*
> > + * Verify that FuseQueue.request_buf plus the spill-over buffer together
> > + * are big enough to be accepted by the FUSE kernel driver.
> > + */
> > +QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> > + FUSE_SPILLOVER_BUF_SIZE <
> > + FUSE_MIN_READ_BUFFER);
> > +
> > struct FuseExport {
> > BlockExport common;
> > @@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
> > static bool is_regular_file(const char *path, Error **errp);
> > static void read_from_fuse_fd(void *opaque);
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q);
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf);
> > static void fuse_inc_in_flight(FuseExport *exp)
> > {
> > @@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > FuseExport *exp = q->exp;
> > ssize_t ret;
> > const struct fuse_in_header *in_hdr;
> > + struct iovec iov[2];
> > + void *spillover_buf = NULL;
> > if (unlikely(exp->halted)) {
> > goto no_request;
> > }
> > - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> > + /*
> > + * If handling the last request consumed the spill-over buffer, allocate a
> > + * new one. Align it to the block device's alignment, which admittedly is
> > + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
> > + */
> > + if (unlikely(!q->spillover_buf)) {
> > + q->spillover_buf = blk_blockalign(exp->common.blk,
> > + FUSE_SPILLOVER_BUF_SIZE);
> > + }
> > + /* Construct the I/O vector to hold the FUSE request */
> > + iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
> > + iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
> > +
> > + ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
> > if (ret < 0 && errno == EAGAIN) {
> > /* No request available */
> > goto no_request;
> > @@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > goto no_request;
> > }
> > - fuse_co_process_request(q);
> > + if (unlikely(ret > sizeof(q->request_buf))) {
> > + /* Spillover buffer used, take ownership */
> > + spillover_buf = q->spillover_buf;
> > + q->spillover_buf = NULL;
> > + }
> > +
> > + fuse_co_process_request(q, spillover_buf);
> > no_request:
> > fuse_dec_in_flight(exp);
> > @@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
> > if (i > 0 && q->fuse_fd >= 0) {
> > close(q->fuse_fd);
> > }
> > + if (q->spillover_buf) {
> > + qemu_vfree(q->spillover_buf);
> > + }
> > }
> > g_free(exp->queues);
> > @@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> > }
> > /**
> > - * Handle client writes to the exported image. @buf has the data to be written
> > - * and will be copied to a bounce buffer before yielding for the first time.
> > + * Handle client writes to the exported image. @in_place_buf has the first
> > + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> > + * contains the rest (if any; NULL otherwise).
> > + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> > + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> > + * assumed to be exclusively owned and will be used as-is.
> > * Return the number of bytes written to *out on success, and -errno on error.
> > */
> > static ssize_t coroutine_fn
> > fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > - uint64_t offset, uint32_t size, const void *buf)
> > + uint64_t offset, uint32_t size,
> > + const void *in_place_buf, const void *spillover_buf)
> > {
> > + size_t in_place_size;
> > void *copied;
> > int64_t blk_len;
> > int ret;
> > + struct iovec iov[2];
> > + QEMUIOVector qiov;
> > /* Limited by max_write, should not happen */
> > if (size > BDRV_REQUEST_MAX_BYTES) {
> > @@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > /* Must copy to bounce buffer before potentially yielding */
> > - copied = blk_blockalign(exp->common.blk, size);
> > - memcpy(copied, buf, size);
> > + in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> > + copied = blk_blockalign(exp->common.blk, in_place_size);
> > + memcpy(copied, in_place_buf, in_place_size);
> > /**
> > * Clients will expect short writes at EOF, so we have to limit
> > @@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > }
> > - ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
> > + iov[0] = (struct iovec) {
> > + .iov_base = copied,
> > + .iov_len = in_place_size,
> > + };
> > + if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> > + assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> > + iov[1] = (struct iovec) {
> > + .iov_base = (void *)spillover_buf,
> > + .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> > + };
> > + qemu_iovec_init_external(&qiov, iov, 2);
> > + } else {
> > + qemu_iovec_init_external(&qiov, iov, 1);
> > + }
> > + ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
> > if (ret < 0) {
> > goto fail_free_buffer;
> > }
> > @@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> > * Note that yielding in any request-processing function can overwrite the
> > * contents of q->request_buf. Anything that takes a buffer needs to take
> > * care that the content is copied before yielding.
> > + *
> > + * @spillover_buf can contain the tail of a write request too large to fit into
> > + * q->request_buf. This function takes ownership of it (i.e. will free it),
> > + * which assumes that its contents will not be overwritten by concurrent
> > + * requests (as opposed to q->request_buf).
> > */
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> > {
> > FuseExport *exp = q->exp;
> > uint32_t opcode;
> > @@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > * yielding.
> > */
> > ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> > - in->offset, in->size, in + 1);
> > + in->offset, in->size, in + 1, spillover_buf);
> > break;
> > }
> > @@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > ret < 0 ? ret : 0,
> > ret < 0 ? 0 : ret);
> > }
> > +
> > + qemu_vfree(spillover_buf);
> > }
> > const BlockExportDriver blk_exp_fuse = {