From: Bernd Schubert <bernd@bsbernd.com>
To: Joanne Koong <joannelkoong@gmail.com>, miklos@szeredi.hu
Cc: axboe@kernel.dk, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
Date: Wed, 6 May 2026 01:45:24 +0200 [thread overview]
Message-ID: <45e57cb2-6b0c-46b7-b614-a32eb9aa394c@bsbernd.com> (raw)
In-Reply-To: <20260402162840.2989717-14-joannelkoong@gmail.com>
On 4/2/26 18:28, Joanne Koong wrote:
> Implement zero-copy data transfer for fuse over io-uring, eliminating
> memory copies between userspace, the kernel, and the fuse server for
> page-backed read/write operations.
>
> When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> the kernel registers the client's underlying pages as a sparse buffer at
> the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> then perform io_uring read/write operations directly on these pages.
> Non-page-backed args (eg out headers) go through the payload buffer as
> normal.
>
> This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> buffers. Gating on pinned headers and buffers keeps the configuration
> space small and avoids partially-optimized modes that are unlikely to be
> useful in practice. Pages are unregistered when the request completes.
>
> The request flow for the zero-copy write path (client writes data,
> server reads it) is as follows:
> =======================================================================
> | Kernel | FUSE server
> | |
> | "write(fd, buf, 1MB)" |
> | |
> | >sys_write() |
> | >fuse_file_write_iter() |
> | >fuse_send_one() |
> | [req->args->in_pages = true] |
> | [folios hold client write data] |
> | |
> | >fuse_uring_copy_to_ring() |
> | >copy_header_to_ring(IN_OUT) |
> | [memcpy fuse_in_header to |
> | pinned headers buf via kaddr] |
> | >copy_header_to_ring(OP) |
> | [memcpy write_in header] |
> | |
> | >fuse_uring_args_to_ring() |
> | >setup_fuse_copy_state() |
> | [is_kaddr = true] |
> | [skip_folio_copy = true] |
> | |
> | >fuse_uring_set_up_zero_copy() |
> | [folio_get for each client folio] |
> | [build bio_vec array from folios] |
> | >io_buffer_register_bvec() |
> | [register pages at ent->id] |
Somehow I find ent->id really confusing here. How about ent->slot_idx, or
even ent->tag?
> | [ent->zero_copied = true] |
> | |
> | >fuse_copy_args() |
> | [skip_folio_copy => return 0 |
> | for page arg, skip data copy] |
> | |
> | >copy_header_to_ring(RING_ENT) |
> | [memcpy ent_in_out] |
> | >io_uring_cmd_done() |
> | |
> | | [CQE received]
> | |
> | | [issue io_uring READ at
> | | ent->id]
> | | [reads directly from
> | |client's pages (ZERO_COPY)]
> | |
> | | [write data to backing
> | | store]
> | | [submit COMMIT AND FETCH]
> | |
> | >fuse_uring_commit_fetch() |
> | >fuse_uring_commit() |
> | >fuse_uring_copy_from_ring() |
> | >fuse_uring_req_end() |
> | >io_buffer_unregister(ent->id) |
> | [unregister sparse buffer] |
> | >fuse_zero_copy_release() |
> | [folio_put for each folio] |
> | [ent->zero_copied = false] |
> | >fuse_request_end() |
> | [wake up client] |
>
> The zero-copy read path is analogous.
>
> Some requests may have both page-backed args and non-page-backed args.
> For these requests, the page-backed args are zero-copied while the
> non-page-backed args are copied to the buffer selected from the buffer
> ring:
> zero-copy: pages registered via io_buffer_register_bvec()
> non-page-backed: copied to payload buffer via fuse_copy_args()
>
> For a request whose payload is zero-copied, the
> registration/unregistration path looks like:
>
> register: fuse_uring_set_up_zero_copy()
> folio_get() for each folio
> io_buffer_register_bvec(ent->id)
>
> [server accesses pages via io_uring fixed buf at ent->id]
>
> unregister: fuse_uring_req_end()
> io_buffer_unregister(ent->id)
> -> fuse_zero_copy_release() callback
> folio_put() for each folio
>
> The throughput improvement from zero-copy depends on how much of the
> per-request latency is spent on data copying vs backing I/O. When
> backing I/O dominates, the saved memcpy is a negligible fraction of
> overall latency. Please also note that for the server to read/write
> into the zero-copied pages, the read/write must go through io-uring
> as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
> server's backing I/O is instantaneous (eg served from cache), the
> overhead of the additional io_uring operation may negate the savings
> from eliminating the memcpy.
>
> In benchmarks using passthrough_hp on a high-performance NVMe-backed
> system, zero-copy showed around a 35% throughput improvement for direct
> randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
> sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
> buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
> for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s).
>
> The benchmarks were run using:
> fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
> --size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev.c | 7 +-
> fs/fuse/dev_uring.c | 167 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 4 +
> fs/fuse/fuse_dev_i.h | 1 +
> include/uapi/linux/fuse.h | 5 ++
> 5 files changed, 160 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index a87939eaa103..cd326e61831b 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1233,10 +1233,13 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
>
> for (i = 0; !err && i < numargs; i++) {
> struct fuse_arg *arg = &args[i];
> - if (i == numargs - 1 && argpages)
> + if (i == numargs - 1 && argpages) {
> + if (cs->skip_folio_copy)
> + return 0;
> err = fuse_copy_folios(cs, arg->size, zeroing);
> - else
> + } else {
> err = fuse_copy_one(cs, arg->value, arg->size);
> + }
> }
> return err;
> }
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 06d3d8dc1c82..d9f1ee4beaf3 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,6 +31,11 @@ struct fuse_uring_pdu {
> struct fuse_ring_ent *ent;
> };
>
> +struct fuse_zero_copy_bvs {
> + unsigned int nr_bvs;
> + struct bio_vec bvs[];
> +};
> +
> static const struct fuse_iqueue_ops fuse_io_uring_ops;
>
> enum fuse_uring_header_type {
> @@ -57,6 +62,11 @@ static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
> return queue->bufring->use_pinned_buffers;
> }
>
> +static inline bool bufring_zero_copy(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_zero_copy;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -102,8 +112,18 @@ static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
> }
> }
>
> +static bool can_zero_copy_req(struct fuse_ring_ent *ent, struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + if (!bufring_enabled(ent->queue) || !bufring_zero_copy(ent->queue))
> + return false;
> +
> + return args->in_pages || args->out_pages;
> +}
> +
> static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
> - int error)
> + int error, unsigned int issue_flags)
> {
> struct fuse_ring_queue *queue = ent->queue;
> struct fuse_ring *ring = queue->ring;
> @@ -122,6 +142,11 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
>
> spin_unlock(&queue->lock);
>
> + if (ent->zero_copied) {
> + io_buffer_unregister(ent->cmd, ent->id, issue_flags);
> + ent->zero_copied = false;
> + }
> +
> if (error)
> req->out.h.error = error;
>
> @@ -485,6 +510,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> struct iovec iov[FUSE_URING_IOV_SEGS];
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
> void __user *payload, *headers;
> size_t headers_size, payload_size, ring_size;
> struct fuse_bufring *br;
> @@ -508,7 +534,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> return -EINVAL;
>
> - if (buf_size < queue->ring->max_payload_sz)
> + if (!zero_copy && buf_size < queue->ring->max_payload_sz)
> return -EINVAL;
>
> nr_bufs = payload_size / buf_size;
> @@ -521,6 +547,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (!br)
> return -ENOMEM;
>
> + br->use_zero_copy = zero_copy;
> br->queue_depth = queue_depth;
> if (pinned_headers) {
> err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
> @@ -580,6 +607,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
>
> if (bufring_enabled(queue) != bufring)
> return false;
> @@ -588,7 +616,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> return true;
>
> return bufring_pinned_headers(queue) == pinned_headers &&
> - bufring_pinned_buffers(queue) == pinned_bufs;
> + bufring_pinned_buffers(queue) == pinned_bufs &&
> + bufring_zero_copy(queue) == zero_copy;
> }
>
> static struct fuse_ring_queue *
> @@ -1063,6 +1092,7 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> cs->is_kaddr = true;
> cs->kaddr = (void *)ent->payload_buf.addr;
> cs->len = ent->payload_buf.len;
> + cs->skip_folio_copy = ent->zero_copied;
> }
>
> cs->is_uring = true;
> @@ -1095,11 +1125,70 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> return err;
> }
>
> +static void fuse_zero_copy_release(void *priv)
> +{
> + struct fuse_zero_copy_bvs *zc_bvs = priv;
> + unsigned int i;
> +
> + for (i = 0; i < zc_bvs->nr_bvs; i++)
> + folio_put(page_folio(zc_bvs->bvs[i].bv_page));
> +
> + kfree(zc_bvs);
> +}
> +
> +static int fuse_uring_set_up_zero_copy(struct fuse_ring_ent *ent,
> + struct fuse_req *req,
> + unsigned int issue_flags)
> +{
> + struct fuse_args_pages *ap;
> + int err, i, ddir = 0;
> + struct fuse_zero_copy_bvs *zc_bvs;
> + struct bio_vec *bvs;
> +
> + /* out_pages indicates a read, in_pages indicates a write */
> + if (req->args->out_pages)
> + ddir |= IO_BUF_DEST;
> + if (req->args->in_pages)
> + ddir |= IO_BUF_SOURCE;
> +
> + WARN_ON_ONCE(!ddir);
> +
> + ap = container_of(req->args, typeof(*ap), args);
> +
> + zc_bvs = kmalloc(struct_size(zc_bvs, bvs, ap->num_folios),
> + GFP_KERNEL_ACCOUNT);
> + if (!zc_bvs)
> + return -ENOMEM;
> +
> + zc_bvs->nr_bvs = ap->num_folios;
> + bvs = zc_bvs->bvs;
> + for (i = 0; i < ap->num_folios; i++) {
> + bvs[i].bv_page = folio_page(ap->folios[i], 0);
Hmm, I thought everything was prepared for huge folios? Shouldn't this
function be updated to handle that: iterate over all folios to add up the
total number of pages, then iterate over all folios and their pages?
> + bvs[i].bv_offset = ap->descs[i].offset;
> + bvs[i].bv_len = ap->descs[i].length;
> + folio_get(ap->folios[i]);
> + }
> +
Maybe a comment here like
/* ent->id is used in fuse-server with io_uring_prep_{write,read}_fixed */ ?
Thanks,
Bernd