* [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy
@ 2026-04-02 16:28 Joanne Koong
2026-04-02 16:28 ` [PATCH v2 01/14] fuse: separate next request fetching from sending logic Joanne Koong
` (14 more replies)
0 siblings, 15 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
This series adds buffer ring and zero-copy capabilities to fuse over io-uring.
Using buffer rings has advantages over the non-buffer-ring (iovec) path:
- Reduced memory usage: in the iovec path, each entry has its own
dedicated payload buffer, requiring N buffers for N entries where each
buffer must be large enough to accommodate the maximum possible
payload size. With buffer rings, payload buffers are pooled and
selected on demand. Entries only hold a buffer while actively
processing a request with payload data. When incremental buffer
consumption is added, this will allow non-overlapping regions of a
single buffer to be used simultaneously across multiple requests,
further reducing memory requirements.
- Foundation for pinned buffers: the buffer ring headers and payloads
are now each passed in as a contiguous memory allocation, which allows
fuse to easily pin and vmap the entire region in one operation during
queue setup. This will eliminate the per-request overhead of having to
pin/unpin user pages and translate virtual addresses and is a
prerequisite for future optimizations like performing data copies
outside of the server's task context.
This series adds the capability to pin the underlying header and payload
buffers by setting init flags at registration time, depending on the user's
mlock limit.
Zero-copy (only for privileged servers) is also opt-in by setting an init flag
at registration time. Zero-copy eliminates the memory copies between kernel and
userspace for read/write/payload-heavy operations by allowing the server to
directly operate on the client's underlying pages.
This series has a dependency on io-uring registered bvec buffers changes
in [1].
The throughput improvements from pinned buffers and zero-copy depends on how
much of the server's per-request latency is spent on data copying vs backing
I/O. When backing I/O dominates, the saved memcpy is a negligible fraction of
overall latency. Please also note that for the server to read/write
into the zero-copied pages, the read/write must go through io-uring
as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
server's backing I/O is instantaneous (eg served from cache), the
overhead of the additional io_uring operation may negate the savings
from eliminating the memcpy.
In benchmarks using passthrough_hp on a high-performance NVMe-backed
system, pinned headers and pinned payload buffers showed around a 10%
throughput improvement for direct randreads (~2150 MiB/s to ~2400
MiB/s), a 4% improvement for direct sequential reads (~2510 MiB/s to
~2620 MiB/s), a 8% improvement for buffered randreads (~2100 MiB/s to
~2280 MiB/s), and a 6% improvement for buffered sequential reads (~2500
MiB/s to ~2670 MiB/s).
Zero-copy showed around a 35% throughput improvement for direct
randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s). I didn't see
enough of a clear improvement for writes due to write latency being I/O
dominated.
The benchmarks were run using:
fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
--size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
To run the benchmark, please also add this patch [2].
The libfuse changes can be found in [3]. To test the server, run:
sudo ~/libfuse/build/example/passthrough_hp ~/src ~/mounts/tmp
--nopassthrough -o io_uring_zero_copy -o io_uring_q_depth=8
Once this series is merged, the libfuse changes will be tidied up and
submitted upstream.
Further optimizations for incremental buffer consumption, request
dispatching in current task context, and backing buffer integration with
IORING_OP_READ/IORING_OP_WRITE operations will be submitted as part of a
separate series.
Thanks,
Joanne
[1] https://lore.kernel.org/io-uring/20260402160929.2749744-1-joannelkoong@gmail.com/T/#t
[2] https://lore.kernel.org/linux-fsdevel/20260326215127.3857682-2-joannelkoong@gmail.com/
[3] https://github.com/joannekoong/libfuse/commits/zero_copy_v2/
Changelog
---------
v1: https://lore.kernel.org/linux-fsdevel/20260324224532.3733468-1-joannelkoong@gmail.com/
v1 -> v2:
* Drop kernel managed buffers from io-uring infrastructure and instead move
logic into fuse. To later use buffers with io-uring requests natively will
require fuse to place the backing buffer as a fixed buffer in a sparse slot
for the server, but that will be added as an optimization in a separate
series. This makes the io-uring code cleaner and accomodates for more flexible
fuse user configurations (eg mlock limits) and easier setup (me)
* Run more benchmarks and get more numbers (me)
* Add visual diagrams and more documentatoin to commit messages and
documentation patch (Bernd)
Joanne Koong (14):
fuse: separate next request fetching from sending logic
fuse: refactor io-uring header copying to ring
fuse: refactor io-uring header copying from ring
fuse: use enum types for header copying
fuse: refactor setting up copy state for payload copying
fuse: support buffer copying for kernel addresses
fuse: use named constants for io-uring iovec indices
fuse: move fuse_uring_abort() from header to dev_uring.c
fuse: rearrange io-uring iovec and ent allocation logic
fuse: add io-uring buffer rings
fuse: add pinned headers capability for io-uring buffer rings
fuse: add pinned payload buffers capability for io-uring buffer rings
fuse: add zero-copy over io-uring
docs: fuse: add io-uring bufring and zero-copy documentation
.../filesystems/fuse/fuse-io-uring.rst | 189 +++
fs/fuse/dev.c | 30 +-
fs/fuse/dev_uring.c | 1042 ++++++++++++++---
fs/fuse/dev_uring_i.h | 86 +-
fs/fuse/fuse_dev_i.h | 8 +-
include/uapi/linux/fuse.h | 36 +-
6 files changed, 1194 insertions(+), 197 deletions(-)
base-commit: 619fa72e875483dabf7683001496cc0ca4480aa6
--
2.52.0
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v2 01/14] fuse: separate next request fetching from sending logic
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-29 11:52 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 02/14] fuse: refactor io-uring header copying to ring Joanne Koong
` (13 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
Simplify the logic for fetching + sending off the next request.
This gets rid of fuse_uring_send_next_to_ring() which contained
duplicated logic from fuse_uring_send(). This decouples request fetching
from the send operation, which makes the control flow clearer and
reduces unnecessary parameter passing.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 78 ++++++++++++++++-----------------------------
1 file changed, 28 insertions(+), 50 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 3a38b61aac26..54436d3fda4d 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -714,34 +714,6 @@ static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
return err;
}
-/*
- * Write data to the ring buffer and send the request to userspace,
- * userspace will read it
- * This is comparable with classical read(/dev/fuse)
- */
-static int fuse_uring_send_next_to_ring(struct fuse_ring_ent *ent,
- struct fuse_req *req,
- unsigned int issue_flags)
-{
- struct fuse_ring_queue *queue = ent->queue;
- int err;
- struct io_uring_cmd *cmd;
-
- err = fuse_uring_prepare_send(ent, req);
- if (err)
- return err;
-
- spin_lock(&queue->lock);
- cmd = ent->cmd;
- ent->cmd = NULL;
- ent->state = FRRS_USERSPACE;
- list_move_tail(&ent->list, &queue->ent_in_userspace);
- spin_unlock(&queue->lock);
-
- io_uring_cmd_done(cmd, 0, issue_flags);
- return 0;
-}
-
/*
* Make a ring entry available for fuse_req assignment
*/
@@ -838,11 +810,13 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
}
/*
- * Get the next fuse req and send it
+ * Get the next fuse req.
+ *
+ * Returns true if the next fuse request has been assigned to the ent.
+ * Else, there is no next fuse request and this returns false.
*/
-static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
- struct fuse_ring_queue *queue,
- unsigned int issue_flags)
+static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
+ struct fuse_ring_queue *queue)
{
int err;
struct fuse_req *req;
@@ -854,10 +828,12 @@ static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
spin_unlock(&queue->lock);
if (req) {
- err = fuse_uring_send_next_to_ring(ent, req, issue_flags);
+ err = fuse_uring_prepare_send(ent, req);
if (err)
goto retry;
}
+
+ return req != NULL;
}
static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
@@ -875,6 +851,20 @@ static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
return 0;
}
+static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
+ ssize_t ret, unsigned int issue_flags)
+{
+ struct fuse_ring_queue *queue = ent->queue;
+
+ spin_lock(&queue->lock);
+ ent->state = FRRS_USERSPACE;
+ list_move_tail(&ent->list, &queue->ent_in_userspace);
+ ent->cmd = NULL;
+ spin_unlock(&queue->lock);
+
+ io_uring_cmd_done(cmd, ret, issue_flags);
+}
+
/* FUSE_URING_CMD_COMMIT_AND_FETCH handler */
static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
struct fuse_conn *fc)
@@ -947,7 +937,8 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
* and fetching is done in one step vs legacy fuse, which has separated
* read (fetch request) and write (commit result).
*/
- fuse_uring_next_fuse_req(ent, queue, issue_flags);
+ if (fuse_uring_get_next_fuse_req(ent, queue))
+ fuse_uring_send(ent, cmd, 0, issue_flags);
return 0;
}
@@ -1196,20 +1187,6 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
return -EIOCBQUEUED;
}
-static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
- ssize_t ret, unsigned int issue_flags)
-{
- struct fuse_ring_queue *queue = ent->queue;
-
- spin_lock(&queue->lock);
- ent->state = FRRS_USERSPACE;
- list_move_tail(&ent->list, &queue->ent_in_userspace);
- ent->cmd = NULL;
- spin_unlock(&queue->lock);
-
- io_uring_cmd_done(cmd, ret, issue_flags);
-}
-
/*
* This prepares and sends the ring request in fuse-uring task context.
* User buffers are not mapped yet - the application does not have permission
@@ -1226,8 +1203,9 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
if (!tw.cancel) {
err = fuse_uring_prepare_send(ent, ent->fuse_req);
if (err) {
- fuse_uring_next_fuse_req(ent, queue, issue_flags);
- return;
+ if (!fuse_uring_get_next_fuse_req(ent, queue))
+ return;
+ err = 0;
}
} else {
err = -ECANCELED;
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 02/14] fuse: refactor io-uring header copying to ring
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
2026-04-02 16:28 ` [PATCH v2 01/14] fuse: separate next request fetching from sending logic Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-29 12:05 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 03/14] fuse: refactor io-uring header copying from ring Joanne Koong
` (12 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
Move header copying to ring logic into a new copy_header_to_ring()
function. This makes the copy_to_user() logic more clear and centralizes
error handling / rate-limited logging.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 54436d3fda4d..5fc8ca330595 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -575,6 +575,18 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
return err;
}
+static __always_inline int copy_header_to_ring(void __user *ring,
+ const void *header,
+ size_t header_size)
+{
+ if (copy_to_user(ring, header, header_size)) {
+ pr_info_ratelimited("Copying header to ring failed.\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -637,13 +649,11 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
* Some op code have that as zero size.
*/
if (args->in_args[0].size > 0) {
- err = copy_to_user(&ent->headers->op_in, in_args->value,
- in_args->size);
- if (err) {
- pr_info_ratelimited(
- "Copying the header failed.\n");
- return -EFAULT;
- }
+ err = copy_header_to_ring(&ent->headers->op_in,
+ in_args->value,
+ in_args->size);
+ if (err)
+ return err;
}
in_args++;
num_args--;
@@ -659,9 +669,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
- err = copy_to_user(&ent->headers->ring_ent_in_out, &ent_in_out,
- sizeof(ent_in_out));
- return err ? -EFAULT : 0;
+ return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
+ sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
@@ -690,14 +699,8 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
}
/* copy fuse_in_header */
- err = copy_to_user(&ent->headers->in_out, &req->in.h,
- sizeof(req->in.h));
- if (err) {
- err = -EFAULT;
- return err;
- }
-
- return 0;
+ return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
+ sizeof(req->in.h));
}
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 03/14] fuse: refactor io-uring header copying from ring
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
2026-04-02 16:28 ` [PATCH v2 01/14] fuse: separate next request fetching from sending logic Joanne Koong
2026-04-02 16:28 ` [PATCH v2 02/14] fuse: refactor io-uring header copying to ring Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-29 12:06 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 04/14] fuse: use enum types for header copying Joanne Koong
` (11 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
Move header copying from ring logic into a new copy_header_from_ring()
function. This makes the copy_from_user() logic more clear and
centralizes error handling / rate-limited logging.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 5fc8ca330595..86f9bb94b45a 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -587,6 +587,18 @@ static __always_inline int copy_header_to_ring(void __user *ring,
return 0;
}
+static __always_inline int copy_header_from_ring(void *header,
+ const void __user *ring,
+ size_t header_size)
+{
+ if (copy_from_user(header, ring, header_size)) {
+ pr_info_ratelimited("Copying header from ring failed.\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -597,10 +609,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
int err;
struct fuse_uring_ent_in_out ring_in_out;
- err = copy_from_user(&ring_in_out, &ent->headers->ring_ent_in_out,
- sizeof(ring_in_out));
+ err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
+ sizeof(ring_in_out));
if (err)
- return -EFAULT;
+ return err;
err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
&iter);
@@ -794,10 +806,10 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
struct fuse_conn *fc = ring->fc;
ssize_t err = 0;
- err = copy_from_user(&req->out.h, &ent->headers->in_out,
- sizeof(req->out.h));
+ err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
+ sizeof(req->out.h));
if (err) {
- req->out.h.error = -EFAULT;
+ req->out.h.error = err;
goto out;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 04/14] fuse: use enum types for header copying
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (2 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 03/14] fuse: refactor io-uring header copying from ring Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-30 8:04 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying Joanne Koong
` (10 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
Use enum types to identify which part of the header needs to be copied.
This improves the interface and will simplify both kernel-space and
user-space header addresses copying when buffer rings are added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 66 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 53 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 86f9bb94b45a..cca795dd72e1 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -31,6 +31,15 @@ struct fuse_uring_pdu {
static const struct fuse_iqueue_ops fuse_io_uring_ops;
+enum fuse_uring_header_type {
+ /* struct fuse_in_header / struct fuse_out_header */
+ FUSE_URING_HEADER_IN_OUT,
+ /* per op code header */
+ FUSE_URING_HEADER_OP,
+ /* struct fuse_uring_ent_in_out header */
+ FUSE_URING_HEADER_RING_ENT,
+};
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -575,10 +584,33 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
return err;
}
-static __always_inline int copy_header_to_ring(void __user *ring,
- const void *header,
- size_t header_size)
+static int ring_header_type_offset(enum fuse_uring_header_type type)
{
+ switch (type) {
+ case FUSE_URING_HEADER_IN_OUT:
+ return 0;
+ case FUSE_URING_HEADER_OP:
+ return offsetof(struct fuse_uring_req_header, op_in);
+ case FUSE_URING_HEADER_RING_ENT:
+ return offsetof(struct fuse_uring_req_header, ring_ent_in_out);
+ default:
+ WARN_ONCE(1, "Invalid header type: %d\n", type);
+ return -EINVAL;
+ }
+}
+
+static int copy_header_to_ring(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type,
+ const void *header, size_t header_size)
+{
+ int offset = ring_header_type_offset(type);
+ void __user *ring;
+
+ if (offset < 0)
+ return offset;
+
+ ring = (void __user *)ent->headers + offset;
+
if (copy_to_user(ring, header, header_size)) {
pr_info_ratelimited("Copying header to ring failed.\n");
return -EFAULT;
@@ -587,10 +619,18 @@ static __always_inline int copy_header_to_ring(void __user *ring,
return 0;
}
-static __always_inline int copy_header_from_ring(void *header,
- const void __user *ring,
- size_t header_size)
+static int copy_header_from_ring(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type, void *header,
+ size_t header_size)
{
+ int offset = ring_header_type_offset(type);
+ const void __user *ring;
+
+ if (offset < 0)
+ return offset;
+
+ ring = (void __user *)ent->headers + offset;
+
if (copy_from_user(header, ring, header_size)) {
pr_info_ratelimited("Copying header from ring failed.\n");
return -EFAULT;
@@ -609,8 +649,8 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
int err;
struct fuse_uring_ent_in_out ring_in_out;
- err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
- sizeof(ring_in_out));
+ err = copy_header_from_ring(ent, FUSE_URING_HEADER_RING_ENT,
+ &ring_in_out, sizeof(ring_in_out));
if (err)
return err;
@@ -661,7 +701,7 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
* Some op code have that as zero size.
*/
if (args->in_args[0].size > 0) {
- err = copy_header_to_ring(&ent->headers->op_in,
+ err = copy_header_to_ring(ent, FUSE_URING_HEADER_OP,
in_args->value,
in_args->size);
if (err)
@@ -681,8 +721,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
- return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
- sizeof(ent_in_out));
+ return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
+ &ent_in_out, sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
@@ -711,7 +751,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
}
/* copy fuse_in_header */
- return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
+ return copy_header_to_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->in.h,
sizeof(req->in.h));
}
@@ -806,7 +846,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
struct fuse_conn *fc = ring->fc;
ssize_t err = 0;
- err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
+ err = copy_header_from_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->out.h,
sizeof(req->out.h));
if (err) {
req->out.h.error = err;
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (3 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 04/14] fuse: use enum types for header copying Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-30 8:06 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 06/14] fuse: support buffer copying for kernel addresses Joanne Koong
` (9 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
Add a new helper function setup_fuse_copy_state() to contain the logic
for setting up the copy state for payload copying.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index cca795dd72e1..045394a7ae41 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -639,6 +639,27 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
return 0;
}
+static int setup_fuse_copy_state(struct fuse_copy_state *cs,
+ struct fuse_ring *ring, struct fuse_req *req,
+ struct fuse_ring_ent *ent, int dir,
+ struct iov_iter *iter)
+{
+ int err;
+
+ err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
+ if (err) {
+ pr_info_ratelimited("fuse: Import of user buffer failed\n");
+ return err;
+ }
+
+ fuse_copy_init(cs, dir == ITER_DEST, iter);
+
+ cs->is_uring = true;
+ cs->req = req;
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -654,15 +675,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
if (err)
return err;
- err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
- &iter);
+ err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_SOURCE, &iter);
if (err)
return err;
- fuse_copy_init(&cs, false, &iter);
- cs.is_uring = true;
- cs.req = req;
-
err = fuse_copy_out_args(&cs, args, ring_in_out.payload_sz);
fuse_copy_finish(&cs);
return err;
@@ -685,15 +701,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
.commit_id = req->in.h.unique,
};
- err = import_ubuf(ITER_DEST, ent->payload, ring->max_payload_sz, &iter);
- if (err) {
- pr_info_ratelimited("fuse: Import of user buffer failed\n");
+ err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
+ if (err)
return err;
- }
-
- fuse_copy_init(&cs, true, &iter);
- cs.is_uring = true;
- cs.req = req;
if (num_args > 0) {
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 06/14] fuse: support buffer copying for kernel addresses
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (4 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-30 8:19 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices Joanne Koong
` (8 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
This is a preparatory patch needed to support pinned buffers in
fuse-over-io-uring. For pinned buffers, we get the vmapped address of
the buffer, which we can directly use with memcpy.
Currently, buffer copying in fuse only supports extracting underlying
pages from an iov iter and kmapping them. This commit allows buffer
copying to work directly on a kaddr.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
---
fs/fuse/dev.c | 23 +++++++++++++++++++----
fs/fuse/fuse_dev_i.h | 7 ++++++-
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0b0241f47170..a87939eaa103 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -848,6 +848,9 @@ void fuse_copy_init(struct fuse_copy_state *cs, bool write,
/* Unmap and put previous page of userspace buffer */
void fuse_copy_finish(struct fuse_copy_state *cs)
{
+ if (cs->is_kaddr)
+ return;
+
if (cs->currbuf) {
struct pipe_buffer *buf = cs->currbuf;
@@ -873,6 +876,12 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
struct page *page;
int err;
+ if (cs->is_kaddr) {
+ if (!cs->len)
+ return -ENOBUFS;
+ return 0;
+ }
+
err = unlock_request(cs->req);
if (err)
return err;
@@ -931,15 +940,21 @@ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
{
unsigned ncpy = min(*size, cs->len);
if (val) {
- void *pgaddr = kmap_local_page(cs->pg);
- void *buf = pgaddr + cs->offset;
+ void *pgaddr, *buf;
+
+ if (!cs->is_kaddr) {
+ pgaddr = kmap_local_page(cs->pg);
+ buf = pgaddr + cs->offset;
+ } else {
+ buf = cs->kaddr + cs->offset;
+ }
if (cs->write)
memcpy(buf, *val, ncpy);
else
memcpy(*val, buf, ncpy);
-
- kunmap_local(pgaddr);
+ if (!cs->is_kaddr)
+ kunmap_local(pgaddr);
*val += ncpy;
}
*size -= ncpy;
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 134bf44aff0d..aa1d25421054 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -28,12 +28,17 @@ struct fuse_copy_state {
struct pipe_buffer *currbuf;
struct pipe_inode_info *pipe;
unsigned long nr_segs;
- struct page *pg;
+ union {
+ struct page *pg;
+ void *kaddr;
+ };
unsigned int len;
unsigned int offset;
bool write:1;
bool move_folios:1;
bool is_uring:1;
+ /* if set, use kaddr; otherwise use pg */
+ bool is_kaddr:1;
struct {
unsigned int copied_sz; /* copied size into the user buffer */
} ring;
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (5 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 06/14] fuse: support buffer copying for kernel addresses Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-15 9:36 ` Bernd Schubert
2026-04-30 8:20 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c Joanne Koong
` (7 subsequent siblings)
14 siblings, 2 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Replace magic indices 0 and 1 for the iovec array with named constants
FUSE_URING_IOV_HEADERS and FUSE_URING_IOV_PAYLOAD. This makes the usages
self-documenting and prepares for buffer ring support which will also
reference these iovec slots by index.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 045394a7ae41..a85acd9c2b71 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -18,7 +18,8 @@ MODULE_PARM_DESC(enable_uring,
"Enable userspace communication through io-uring");
#define FUSE_URING_IOV_SEGS 2 /* header and payload */
-
+#define FUSE_URING_IOV_HEADERS 0
+#define FUSE_URING_IOV_PAYLOAD 1
bool fuse_uring_enabled(void)
{
@@ -1063,8 +1064,8 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
}
/*
- * sqe->addr is a ptr to an iovec array, iov[0] has the headers, iov[1]
- * the payload
+ * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
+ * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
*/
static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
struct iovec iov[FUSE_URING_IOV_SEGS])
@@ -1094,8 +1095,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
{
struct fuse_ring *ring = queue->ring;
struct fuse_ring_ent *ent;
- size_t payload_size;
struct iovec iov[FUSE_URING_IOV_SEGS];
+ struct iovec *headers, *payload;
int err;
err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
@@ -1106,15 +1107,16 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
}
err = -EINVAL;
- if (iov[0].iov_len < sizeof(struct fuse_uring_req_header)) {
- pr_info_ratelimited("Invalid header len %zu\n", iov[0].iov_len);
+ headers = &iov[FUSE_URING_IOV_HEADERS];
+ if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
+ pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
return ERR_PTR(err);
}
- payload_size = iov[1].iov_len;
- if (payload_size < ring->max_payload_sz) {
+ payload = &iov[FUSE_URING_IOV_PAYLOAD];
+ if (payload->iov_len < ring->max_payload_sz) {
pr_info_ratelimited("Invalid req payload len %zu\n",
- payload_size);
+ payload->iov_len);
return ERR_PTR(err);
}
@@ -1126,8 +1128,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
INIT_LIST_HEAD(&ent->list);
ent->queue = queue;
- ent->headers = iov[0].iov_base;
- ent->payload = iov[1].iov_base;
+ ent->headers = headers->iov_base;
+ ent->payload = payload->iov_base;
atomic_inc(&ring->queue_refs);
return ent;
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (6 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-15 9:40 ` Bernd Schubert
2026-04-30 8:21 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic Joanne Koong
` (6 subsequent siblings)
14 siblings, 2 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Move fuse_uring_abort() out of the inline header definition and into
dev_uring.c. This function calls several internal helpers (abort
requests, stop queues) that are all defined in dev_uring.c so inlining
it in the header unnecessarily exposes implementation details.
This will make the subsequent commit that adds pinning capabilties for
fuse buffers cleaner.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 17 +++++++++++++++--
fs/fuse/dev_uring_i.h | 16 +---------------
2 files changed, 16 insertions(+), 17 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index a85acd9c2b71..cce8994241b7 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -129,7 +129,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
fuse_dev_end_requests(&req_list);
}
-void fuse_uring_abort_end_requests(struct fuse_ring *ring)
+static void fuse_uring_abort_end_requests(struct fuse_ring *ring)
{
int qid;
struct fuse_ring_queue *queue;
@@ -477,7 +477,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
/*
* Stop the ring queues
*/
-void fuse_uring_stop_queues(struct fuse_ring *ring)
+static void fuse_uring_stop_queues(struct fuse_ring *ring)
{
int qid;
@@ -501,6 +501,19 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
}
}
+void fuse_uring_abort(struct fuse_conn *fc)
+{
+ struct fuse_ring *ring = fc->ring;
+
+ if (ring == NULL)
+ return;
+
+ if (atomic_read(&ring->queue_refs) > 0) {
+ fuse_uring_abort_end_requests(ring);
+ fuse_uring_stop_queues(ring);
+ }
+}
+
/*
* Handle IO_URING_F_CANCEL, typically should come on daemon termination.
*
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce1..349418db3374 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -137,27 +137,13 @@ struct fuse_ring {
bool fuse_uring_enabled(void);
void fuse_uring_destruct(struct fuse_conn *fc);
-void fuse_uring_stop_queues(struct fuse_ring *ring);
-void fuse_uring_abort_end_requests(struct fuse_ring *ring);
+void fuse_uring_abort(struct fuse_conn *fc);
int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
bool fuse_uring_queue_bq_req(struct fuse_req *req);
bool fuse_uring_remove_pending_req(struct fuse_req *req);
bool fuse_uring_request_expired(struct fuse_conn *fc);
-static inline void fuse_uring_abort(struct fuse_conn *fc)
-{
- struct fuse_ring *ring = fc->ring;
-
- if (ring == NULL)
- return;
-
- if (atomic_read(&ring->queue_refs) > 0) {
- fuse_uring_abort_end_requests(ring);
- fuse_uring_stop_queues(ring);
- }
-}
-
static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
{
struct fuse_ring *ring = fc->ring;
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (7 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-15 9:45 ` Bernd Schubert
2026-04-30 8:24 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
` (5 subsequent siblings)
14 siblings, 2 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Move fuse_uring_get_iovec_from_sqe() to earlier in the file and
move the allocation logic in fuse_uring_create_ring_ent() to the
beginning of the function.
There is no change in logic, this is done to make the subsequent commit
that adds buffer rings easier to review.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 78 ++++++++++++++++++++++++---------------------
1 file changed, 41 insertions(+), 37 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index cce8994241b7..a061f175b3fd 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -277,6 +277,32 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
return res;
}
+/*
+ * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
+ * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
+ */
+static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
+ struct iovec iov[FUSE_URING_IOV_SEGS])
+{
+ struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
+ struct iov_iter iter;
+ ssize_t ret;
+
+ if (sqe->len != FUSE_URING_IOV_SEGS)
+ return -EINVAL;
+
+ /*
+ * Direction for buffer access will actually be READ and WRITE,
+ * using write for the import should include READ access as well.
+ */
+ ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
+ FUSE_URING_IOV_SEGS, &iov, &iter);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
int qid)
{
@@ -1076,32 +1102,6 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
}
}
-/*
- * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
- * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
- */
-static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
- struct iovec iov[FUSE_URING_IOV_SEGS])
-{
- struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
- struct iov_iter iter;
- ssize_t ret;
-
- if (sqe->len != FUSE_URING_IOV_SEGS)
- return -EINVAL;
-
- /*
- * Direction for buffer access will actually be READ and WRITE,
- * using write for the import should include READ access as well.
- */
- ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
- FUSE_URING_IOV_SEGS, &iov, &iter);
- if (ret < 0)
- return ret;
-
- return 0;
-}
-
static struct fuse_ring_ent *
fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_queue *queue)
@@ -1112,40 +1112,44 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
struct iovec *headers, *payload;
int err;
+ ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
+ if (!ent)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&ent->list);
+
+ ent->queue = queue;
+
err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
if (err) {
pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
err);
- return ERR_PTR(err);
+ goto error;
}
err = -EINVAL;
headers = &iov[FUSE_URING_IOV_HEADERS];
if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
- return ERR_PTR(err);
+ goto error;
}
payload = &iov[FUSE_URING_IOV_PAYLOAD];
if (payload->iov_len < ring->max_payload_sz) {
pr_info_ratelimited("Invalid req payload len %zu\n",
payload->iov_len);
- return ERR_PTR(err);
+ goto error;
}
- err = -ENOMEM;
- ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
- if (!ent)
- return ERR_PTR(err);
-
- INIT_LIST_HEAD(&ent->list);
-
- ent->queue = queue;
ent->headers = headers->iov_base;
ent->payload = payload->iov_base;
atomic_inc(&ring->queue_refs);
return ent;
+
+error:
+ kfree(ent);
+ return ERR_PTR(err);
}
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (8 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-15 9:48 ` Bernd Schubert
` (2 more replies)
2026-04-02 16:28 ` [PATCH v2 11/14] fuse: add pinned headers capability for " Joanne Koong
` (4 subsequent siblings)
14 siblings, 3 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Add fuse buffer rings for servers communicating through the io-uring
interface. To use this, the server must set the FUSE_URING_BUFRING
flag and provide header and payload buffers via an iovec array in the
sqe during registration. The payload buffers are used to back the buffer
ring. The kernel manages buffer selection and recycling through a simple
internal ring.
This has the following advantages over the non-bufring (iovec) path:
- Reduced memory usage: in the iovec path, each entry has its own
dedicated payload buffer, requiring N buffers for N entries where each
buffer must be large enough to accommodate the maximum possible
payload size. With buffer rings, payload buffers are pooled and
selected on demand. Entries only hold a buffer while actively
processing a request with payload data. When incremental buffer
consumption is added, this will allow non-overlapping regions of a
single buffer to be used simultaneously across multiple requests,
further reducing memory requirements.
- Foundation for pinned buffers: the buffer ring headers and payloads
are now each passed in as a contiguous memory allocation, which allows
fuse to easily pin and vmap the entire region in one operation during
queue setup. This will eliminate the per-request overhead of having to
pin/unpin user pages and translate virtual addresses and is a
prerequisite for future optimizations like performing data copies
outside of the server's task context.
Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
header slot in the headers buffer. Payload buffers are selected from
the ring on demand and recycled after each request. Buffer ring usage is
set on a per-queue basis. All subsequent registration SQEs for the same
queue must use consistent flags.
The headers are laid out contiguously and provided via iov[0]. Each slot
maps to ent->id:
|<- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
+------------------------------+------------------------------+-----+
| struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
| [ent id=0] | [ent id=1] | |
+------------------------------+------------------------------+-----+
On the server side, the ent id is used to determine where in the headers
buffer the headers data for the ent resides. This is done by
calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
into the headers buffer.
The buffer ring is backed by the payload buffer, which is contiguous but
partitioned into individual bufs according to the buf_size passed in at
registration.
PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
|<-------------- payload_size ------------>|
+--------- --+-----------+-----------+-----+
| buf [0] | buf [1] | buf [2] | ... |
| buf_size | buf_size | buf_size | ... |
+--------- --+-----------+-----------+-----+
buffer ring state (struct fuse_bufring, kernel-internal):
bufs[]: [ used | used | FREE | FREE | FREE ]
^^^^^^^^^^^^^^^^^^^
available for selection
The buffer ring logic is as follows:
select: buf = bufs[head % nbufs]; head++
recycle: bufs[tail % nbufs] = buf; tail++
empty: tail == head (no buffers available)
full: tail - head >= nbufs
Buffer ring request flow
------------------------
| Kernel | FUSE daemon
| |
| [client request arrives] |
| >fuse_uring_send() |
| [select payload buf from ring] |
| >fuse_uring_select_buffer() |
| [copy headers to ent's header slot] |
| >copy_header_to_ring() |
| [copy payload to selected buf] |
| >fuse_uring_copy_to_ring() |
| [set buf_id in ent_in_out header] |
| >io_uring_cmd_done() |
| | [CQE received]
| | [read headers from header
| | slot]
| | [read payload from buf_id]
| | [process request]
| | [write reply to header
| | slot]
| | [write reply payload to
| | buf]
| | >io_uring_submit()
| | COMMIT_AND_FETCH
| >fuse_uring_commit_fetch() |
| >fuse_uring_commit() |
| [copy reply from ring] |
| >fuse_uring_recycle_buffer() |
| >fuse_uring_get_next_fuse_req() |
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
fs/fuse/dev_uring_i.h | 45 ++++-
include/uapi/linux/fuse.h | 27 ++-
3 files changed, 381 insertions(+), 54 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index a061f175b3fd..9f14a2bcde3f 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -41,6 +41,11 @@ enum fuse_uring_header_type {
FUSE_URING_HEADER_RING_ENT,
};
+static inline bool bufring_enabled(struct fuse_ring_queue *queue)
+{
+ return queue->bufring != NULL;
+}
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -222,6 +227,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
}
kfree(queue->fpq.processing);
+ kfree(queue->bufring);
kfree(queue);
ring->queues[qid] = NULL;
}
@@ -303,20 +309,102 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
return 0;
}
-static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
- int qid)
+static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
+ struct fuse_ring_queue *queue)
+{
+ const struct fuse_uring_cmd_req *cmd_req =
+ io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
+ u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
+ unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
+ struct iovec iov[FUSE_URING_IOV_SEGS];
+ void __user *payload, *headers;
+ size_t headers_size, payload_size, ring_size;
+ struct fuse_bufring *br;
+ unsigned int nr_bufs, i;
+ uintptr_t payload_addr;
+ int err;
+
+ if (!queue_depth || !buf_size)
+ return -EINVAL;
+
+ err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
+ if (err)
+ return err;
+
+ headers = iov[FUSE_URING_IOV_HEADERS].iov_base;
+ headers_size = iov[FUSE_URING_IOV_HEADERS].iov_len;
+ payload = iov[FUSE_URING_IOV_PAYLOAD].iov_base;
+ payload_size = iov[FUSE_URING_IOV_PAYLOAD].iov_len;
+
+ /* check if there's enough space for all the headers */
+ if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
+ return -EINVAL;
+
+ if (buf_size < queue->ring->max_payload_sz)
+ return -EINVAL;
+
+ nr_bufs = payload_size / buf_size;
+ if (!nr_bufs || nr_bufs > U16_MAX)
+ return -EINVAL;
+
+ /* create the ring buffer */
+ ring_size = struct_size(br, bufs, nr_bufs);
+ br = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
+ if (!br)
+ return -ENOMEM;
+
+ br->queue_depth = queue_depth;
+ br->headers = headers;
+
+ payload_addr = (uintptr_t)payload;
+
+ /* populate the ring buffer */
+ for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
+ struct fuse_bufring_buf *buf = &br->bufs[i];
+
+ buf->addr = payload_addr;
+ buf->len = buf_size;
+ buf->id = i;
+ }
+
+ br->nbufs = nr_bufs;
+ br->tail = nr_bufs;
+
+ queue->bufring = br;
+
+ return 0;
+}
+
+/*
+ * if the queue is already registered, check that the queue was initialized with
+ * the same init flags set for this FUSE_IO_URING_CMD_REGISTER cmd. all
+ * FUSE_IO_URING_CMD_REGISTER cmds should have the same init fields set on a
+ * per-queue basis.
+ */
+static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
+ u64 init_flags)
{
+ bool bufring = init_flags & FUSE_URING_BUFRING;
+
+ return bufring_enabled(queue) == bufring;
+}
+
+static struct fuse_ring_queue *
+fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
+ int qid, u64 init_flags)
+{
+ bool use_bufring = init_flags & FUSE_URING_BUFRING;
struct fuse_conn *fc = ring->fc;
struct fuse_ring_queue *queue;
struct list_head *pq;
queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
if (!queue)
- return NULL;
+ return ERR_PTR(-ENOMEM);
pq = kzalloc_objs(struct list_head, FUSE_PQ_HASH_SIZE);
if (!pq) {
kfree(queue);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
queue->qid = qid;
@@ -334,12 +422,29 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
queue->fpq.processing = pq;
fuse_pqueue_init(&queue->fpq);
+ if (use_bufring) {
+ int err = fuse_uring_bufring_setup(cmd, queue);
+
+ if (err) {
+ kfree(pq);
+ kfree(queue);
+ return ERR_PTR(err);
+ }
+ }
+
spin_lock(&fc->lock);
+ /* check if the queue creation raced with another thread */
if (ring->queues[qid]) {
spin_unlock(&fc->lock);
kfree(queue->fpq.processing);
+ if (use_bufring)
+ kfree(queue->bufring);
kfree(queue);
- return ring->queues[qid];
+
+ queue = ring->queues[qid];
+ if (!queue_init_flags_consistent(queue, init_flags))
+ return ERR_PTR(-EINVAL);
+ return queue;
}
/*
@@ -649,7 +754,14 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
if (offset < 0)
return offset;
- ring = (void __user *)ent->headers + offset;
+ if (bufring_enabled(ent->queue)) {
+ int buf_offset = offset +
+ sizeof(struct fuse_uring_req_header) * ent->id;
+
+ ring = ent->queue->bufring->headers + buf_offset;
+ } else {
+ ring = (void __user *)ent->headers + offset;
+ }
if (copy_to_user(ring, header, header_size)) {
pr_info_ratelimited("Copying header to ring failed.\n");
@@ -669,7 +781,14 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
if (offset < 0)
return offset;
- ring = (void __user *)ent->headers + offset;
+ if (bufring_enabled(ent->queue)) {
+ int buf_offset = offset +
+ sizeof(struct fuse_uring_req_header) * ent->id;
+
+ ring = ent->queue->bufring->headers + buf_offset;
+ } else {
+ ring = (void __user *)ent->headers + offset;
+ }
if (copy_from_user(header, ring, header_size)) {
pr_info_ratelimited("Copying header from ring failed.\n");
@@ -684,12 +803,20 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
struct fuse_ring_ent *ent, int dir,
struct iov_iter *iter)
{
+ void __user *payload;
int err;
- err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
- if (err) {
- pr_info_ratelimited("fuse: Import of user buffer failed\n");
- return err;
+ if (bufring_enabled(ent->queue))
+ payload = (void __user *)ent->payload_buf.addr;
+ else
+ payload = ent->payload;
+
+ if (payload) {
+ err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
+ if (err) {
+ pr_info_ratelimited("fuse: Import of user buffer failed\n");
+ return err;
+ }
}
fuse_copy_init(cs, dir == ITER_DEST, iter);
@@ -741,6 +868,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
.commit_id = req->in.h.unique,
};
+ if (bufring_enabled(ent->queue))
+ ent_in_out.buf_id = ent->payload_buf.id;
+
err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
if (err)
return err;
@@ -805,6 +935,96 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
sizeof(req->in.h));
}
+static bool fuse_uring_req_has_payload(struct fuse_req *req)
+{
+ struct fuse_args *args = req->args;
+
+ return args->in_numargs > 1 || args->out_numargs;
+}
+
+static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
+ __must_hold(&ent->queue->lock)
+{
+ struct fuse_ring_queue *queue = ent->queue;
+ struct fuse_bufring *br = queue->bufring;
+ struct fuse_bufring_buf *buf;
+ unsigned int tail = br->tail, head = br->head;
+
+ lockdep_assert_held(&queue->lock);
+
+ /* Get a buffer to use for the payload */
+ if (tail == head)
+ return -ENOBUFS;
+
+ buf = &br->bufs[head % br->nbufs];
+ br->head++;
+
+ ent->payload_buf = *buf;
+
+ return 0;
+}
+
+static void fuse_uring_recycle_buffer(struct fuse_ring_ent *ent)
+ __must_hold(&ent->queue->lock)
+{
+ struct fuse_bufring_buf *ent_payload = &ent->payload_buf;
+ struct fuse_ring_queue *queue = ent->queue;
+ struct fuse_bufring_buf *buf;
+ struct fuse_bufring *br;
+
+ lockdep_assert_held(&queue->lock);
+
+ if (!bufring_enabled(queue) || !ent_payload->addr)
+ return;
+
+ br = queue->bufring;
+
+ /* ring should never be full */
+ WARN_ON_ONCE(br->tail - br->head >= br->nbufs);
+
+ buf = &br->bufs[(br->tail) % br->nbufs];
+
+ *buf = *ent_payload;
+
+ br->tail++;
+
+ memset(ent_payload, 0, sizeof(*ent_payload));
+}
+
+static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
+ struct fuse_req *req)
+{
+ bool buffer_selected;
+ bool has_payload;
+
+ if (!bufring_enabled(ent->queue))
+ return 0;
+
+ buffer_selected = !!ent->payload_buf.addr;
+ has_payload = fuse_uring_req_has_payload(req);
+
+ if (has_payload && !buffer_selected)
+ return fuse_uring_select_buffer(ent);
+
+ if (!has_payload && buffer_selected)
+ fuse_uring_recycle_buffer(ent);
+
+ return 0;
+}
+
+static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
+ struct fuse_req *req)
+{
+ if (!bufring_enabled(ent->queue))
+ return 0;
+
+ /* no payload to copy, can skip selecting a buffer */
+ if (!fuse_uring_req_has_payload(req))
+ return 0;
+
+ return fuse_uring_select_buffer(ent);
+}
+
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
struct fuse_req *req)
{
@@ -878,10 +1098,21 @@ static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
/* get and assign the next entry while it is still holding the lock */
req = list_first_entry_or_null(req_queue, struct fuse_req, list);
- if (req)
- fuse_uring_add_req_to_ring_ent(ent, req);
+ if (req) {
+ int err = fuse_uring_next_req_update_buffer(ent, req);
- return req;
+ if (!err) {
+ fuse_uring_add_req_to_ring_ent(ent, req);
+ return req;
+ }
+ }
+
+ /*
+ * Buffer selection may fail if all the buffers are currently saturated.
+ * The request will be serviced when a buffer is freed up.
+ */
+ fuse_uring_recycle_buffer(ent);
+ return NULL;
}
/*
@@ -1041,6 +1272,12 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
* fuse requests would otherwise not get processed - committing
* and fetching is done in one step vs legacy fuse, which has separated
* read (fetch request) and write (commit result).
+ *
+ * If the server is using bufrings and has populated the ring with less
+ * payload buffers than ents, it is possible that there may not be an
+ * available buffer for the next request. If so, then the fetch is a
+ * no-op and the next request will be serviced when a buffer becomes
+ * available.
*/
if (fuse_uring_get_next_fuse_req(ent, queue))
fuse_uring_send(ent, cmd, 0, issue_flags);
@@ -1120,30 +1357,38 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
ent->queue = queue;
- err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
- if (err) {
- pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
- err);
- goto error;
- }
+ if (bufring_enabled(queue)) {
+ ent->id = READ_ONCE(cmd->sqe->buf_index);
+ if (ent->id >= queue->bufring->queue_depth) {
+ err = -EINVAL;
+ goto error;
+ }
+ } else {
+ err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
+ if (err) {
+ pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
+ err);
+ goto error;
+ }
- err = -EINVAL;
- headers = &iov[FUSE_URING_IOV_HEADERS];
- if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
- pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
- goto error;
- }
+ err = -EINVAL;
+ headers = &iov[FUSE_URING_IOV_HEADERS];
+ if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
+ pr_info_ratelimited("Invalid header len %zu\n",
+ headers->iov_len);
+ goto error;
+ }
- payload = &iov[FUSE_URING_IOV_PAYLOAD];
- if (payload->iov_len < ring->max_payload_sz) {
- pr_info_ratelimited("Invalid req payload len %zu\n",
- payload->iov_len);
- goto error;
+ payload = &iov[FUSE_URING_IOV_PAYLOAD];
+ if (payload->iov_len < ring->max_payload_sz) {
+ pr_info_ratelimited("Invalid req payload len %zu\n",
+ payload->iov_len);
+ goto error;
+ }
+ ent->headers = headers->iov_base;
+ ent->payload = payload->iov_base;
}
- ent->headers = headers->iov_base;
- ent->payload = payload->iov_base;
-
atomic_inc(&ring->queue_refs);
return ent;
@@ -1152,6 +1397,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
return ERR_PTR(err);
}
+static bool init_flags_valid(u64 init_flags)
+{
+ u64 valid_flags = FUSE_URING_BUFRING;
+
+ return !(init_flags & ~valid_flags);
+}
+
/*
* Register header and payload buffer with the kernel and puts the
* entry as "ready to get fuse requests" on the queue
@@ -1161,6 +1413,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
{
const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe128_cmd(cmd->sqe,
struct fuse_uring_cmd_req);
+ u64 init_flags = READ_ONCE(cmd_req->flags);
struct fuse_ring *ring = smp_load_acquire(&fc->ring);
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent;
@@ -1179,11 +1432,16 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
return -EINVAL;
}
+ if (!init_flags_valid(init_flags))
+ return -EINVAL;
+
queue = ring->queues[qid];
if (!queue) {
- queue = fuse_uring_create_queue(ring, qid);
- if (!queue)
- return err;
+ queue = fuse_uring_create_queue(cmd, ring, qid, init_flags);
+ if (IS_ERR(queue))
+ return PTR_ERR(queue);
+ } else if (!queue_init_flags_consistent(queue, init_flags)) {
+ return -EINVAL;
}
/*
@@ -1349,14 +1607,18 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
req->ring_queue = queue;
ent = list_first_entry_or_null(&queue->ent_avail_queue,
struct fuse_ring_ent, list);
- if (ent)
- fuse_uring_add_req_to_ring_ent(ent, req);
- else
- list_add_tail(&req->list, &queue->fuse_req_queue);
- spin_unlock(&queue->lock);
+ if (ent) {
+ err = fuse_uring_prep_buffer(ent, req);
+ if (!err) {
+ fuse_uring_add_req_to_ring_ent(ent, req);
+ spin_unlock(&queue->lock);
+ fuse_uring_dispatch_ent(ent);
+ return;
+ }
+ }
- if (ent)
- fuse_uring_dispatch_ent(ent);
+ list_add_tail(&req->list, &queue->fuse_req_queue);
+ spin_unlock(&queue->lock);
return;
@@ -1406,14 +1668,17 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
req = list_first_entry_or_null(&queue->fuse_req_queue, struct fuse_req,
list);
if (ent && req) {
- fuse_uring_add_req_to_ring_ent(ent, req);
- spin_unlock(&queue->lock);
+ int err = fuse_uring_prep_buffer(ent, req);
- fuse_uring_dispatch_ent(ent);
- } else {
- spin_unlock(&queue->lock);
+ if (!err) {
+ fuse_uring_add_req_to_ring_ent(ent, req);
+ spin_unlock(&queue->lock);
+ fuse_uring_dispatch_ent(ent);
+ return true;
+ }
}
+ spin_unlock(&queue->lock);
return true;
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 349418db3374..66d5d5f8dc3f 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -36,11 +36,47 @@ enum fuse_ring_req_state {
FRRS_RELEASED,
};
+struct fuse_bufring_buf {
+ uintptr_t addr;
+ unsigned int len;
+ unsigned int id;
+};
+
+struct fuse_bufring {
+ /* pointer to the headers buffer */
+ void __user *headers;
+
+ unsigned int queue_depth;
+
+ /* metadata tracking state of the bufring */
+ unsigned int nbufs;
+ unsigned int head;
+ unsigned int tail;
+
+ /* the buffers backing the ring */
+ __DECLARE_FLEX_ARRAY(struct fuse_bufring_buf, bufs);
+};
+
/** A fuse ring entry, part of the ring queue */
struct fuse_ring_ent {
- /* userspace buffer */
- struct fuse_uring_req_header __user *headers;
- void __user *payload;
+ union {
+ /* if bufrings are not used */
+ struct {
+ /* userspace buffers */
+ struct fuse_uring_req_header __user *headers;
+ void __user *payload;
+ };
+ /* if bufrings are used */
+ struct {
+ /*
+ * unique fixed id for the ent. used by kernel/server to
+ * locate where in the headers buffer the data for this
+ * ent resides
+ */
+ unsigned int id;
+ struct fuse_bufring_buf payload_buf;
+ };
+ };
/* the ring queue that owns the request */
struct fuse_ring_queue *queue;
@@ -99,6 +135,9 @@ struct fuse_ring_queue {
unsigned int active_background;
bool stopped;
+
+ /* only allocated if the server uses bufrings */
+ struct fuse_bufring *bufring;
};
/**
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12..8753de7eb189 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,10 @@
* - add FUSE_COPY_FILE_RANGE_64
* - add struct fuse_copy_file_range_out
* - add FUSE_NOTIFY_PRUNE
+ *
+ * 7.46
+ * - add FUSE_URING_BUFRING flag
+ * - add fuse_uring_cmd_req init struct
*/
#ifndef _LINUX_FUSE_H
@@ -1263,7 +1267,13 @@ struct fuse_uring_ent_in_out {
/* size of user payload buffer */
uint32_t payload_sz;
- uint32_t padding;
+
+ /*
+ * if using bufrings, this is the id of the selected buffer.
+ * the selected buffer holds the request payload
+ */
+ uint16_t buf_id;
+ uint16_t padding;
uint64_t reserved;
};
@@ -1294,6 +1304,9 @@ enum fuse_uring_cmd {
FUSE_IO_URING_CMD_COMMIT_AND_FETCH = 2,
};
+/* fuse_uring_cmd_req flags */
+#define FUSE_URING_BUFRING (1 << 0)
+
/**
* In the 80B command area of the SQE.
*/
@@ -1305,7 +1318,17 @@ struct fuse_uring_cmd_req {
/* queue the command is for (queue index) */
uint16_t qid;
- uint8_t padding[6];
+ uint16_t padding;
+
+ union {
+ struct {
+ /* size of the bufring's backing buffers */
+ uint32_t buf_size;
+ /* number of entries in the queue */
+ uint16_t queue_depth;
+ uint16_t padding;
+ } init;
+ };
};
#endif /* _LINUX_FUSE_H */
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 11/14] fuse: add pinned headers capability for io-uring buffer rings
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (9 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-14 12:47 ` Bernd Schubert
2026-04-30 11:22 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 12/14] fuse: add pinned payload buffers " Joanne Koong
` (3 subsequent siblings)
14 siblings, 2 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Allow fuse servers to pin their header buffers by setting the
FUSE_URING_PINNED_HEADERS flag alongside FUSE_URING_BUFRING on REGISTER
sqes. When set, the kernel pins the header pages, vmaps them for a
kernel virtual address, and uses direct memcpy for copying. This avoids
the per-request overhead of having to pin/unpin user pages and translate
virtual addresses.
Buffers must be page-aligned. The kernel accounts pinned pages against
RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and tracks mm->pinned_vm.
Unpinning is done in process context during connection abort, since vmap
cannot run in softirq (where final destruction occurs via RCU).
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 228 ++++++++++++++++++++++++++++++++++++--
fs/fuse/dev_uring_i.h | 23 +++-
include/uapi/linux/fuse.h | 2 +
3 files changed, 243 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 9f14a2bcde3f..79736b02cf9f 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/io_uring/cmd.h>
+#include <linux/vmalloc.h>
static bool __read_mostly enable_uring;
module_param(enable_uring, bool, 0644);
@@ -46,6 +47,11 @@ static inline bool bufring_enabled(struct fuse_ring_queue *queue)
return queue->bufring != NULL;
}
+static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
+{
+ return queue->bufring->use_pinned_headers;
+}
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -200,6 +206,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
return false;
}
+static void fuse_bufring_unpin_mem(struct fuse_bufring_pinned *mem)
+{
+ struct page **pages = mem->pages;
+ unsigned int nr_pages = mem->nr_pages;
+ struct user_struct *user = mem->user;
+ struct mm_struct *mm_account = mem->mm_account;
+
+ vunmap(mem->addr);
+ unpin_user_pages(pages, nr_pages);
+
+ if (user) {
+ atomic_long_sub(nr_pages, &user->locked_vm);
+ free_uid(user);
+ }
+
+ atomic64_sub(nr_pages, &mm_account->pinned_vm);
+ mmdrop(mm_account);
+
+ kvfree(mem->pages);
+}
+
+static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
+{
+ struct fuse_bufring *br = queue->bufring;
+
+ if (bufring_pinned_headers(queue)) {
+ fuse_bufring_unpin_mem(&br->pinned_headers);
+ br->use_pinned_headers = false;
+ }
+}
+
void fuse_uring_destruct(struct fuse_conn *fc)
{
struct fuse_ring *ring = fc->ring;
@@ -227,7 +264,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
}
kfree(queue->fpq.processing);
- kfree(queue->bufring);
+ if (bufring_enabled(queue)) {
+ fuse_uring_bufring_unpin(queue);
+ kfree(queue->bufring);
+ }
kfree(queue);
ring->queues[qid] = NULL;
}
@@ -309,14 +349,131 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
return 0;
}
+static struct page **fuse_uring_pin_user_pages(void __user *uaddr,
+ unsigned long len, int *npages)
+{
+ unsigned long addr = (unsigned long)uaddr;
+ unsigned long start, end, nr_pages;
+ struct page **pages;
+ int pinned;
+
+ if (check_add_overflow(addr, len, &end))
+ return ERR_PTR(-EOVERFLOW);
+ if (check_add_overflow(end, PAGE_SIZE - 1, &end))
+ return ERR_PTR(-EOVERFLOW);
+
+ end = end >> PAGE_SHIFT;
+ start = addr >> PAGE_SHIFT;
+ nr_pages = end - start;
+ if (WARN_ON_ONCE(!nr_pages))
+ return ERR_PTR(-EINVAL);
+ if (WARN_ON_ONCE(nr_pages > INT_MAX))
+ return ERR_PTR(-EOVERFLOW);
+
+ pages = kvmalloc_objs(struct page *, nr_pages, GFP_KERNEL_ACCOUNT);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ pinned = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
+ pages);
+ /* success, mapped all pages */
+ if (pinned == nr_pages) {
+ *npages = nr_pages;
+ return pages;
+ }
+
+ /* remove any partial pins */
+ if (pinned > 0)
+ unpin_user_pages(pages, pinned);
+
+ kvfree(pages);
+
+ return ERR_PTR(pinned < 0 ? pinned : -EFAULT);
+}
+
+static int account_pinned_pages(struct fuse_bufring_pinned *mem,
+ struct page **pages, unsigned int nr_pages)
+{
+ unsigned long page_limit, cur_pages, new_pages;
+ struct user_struct *user = current_user();
+
+ if (!nr_pages)
+ return 0;
+
+ if (!capable(CAP_IPC_LOCK)) {
+ /* Don't allow more pages than we can safely lock */
+ page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+ cur_pages = atomic_long_read(&user->locked_vm);
+ do {
+ new_pages = cur_pages + nr_pages;
+ if (new_pages > page_limit)
+ return -ENOMEM;
+ } while (!atomic_long_try_cmpxchg(&user->locked_vm,
+ &cur_pages, new_pages));
+
+ mem->user = get_uid(current_user());
+ }
+
+ atomic64_add(nr_pages, ¤t->mm->pinned_vm);
+ mmgrab(current->mm);
+ mem->mm_account = current->mm;
+
+ return 0;
+}
+
+static int fuse_bufring_pin_mem(struct fuse_bufring_pinned *mem,
+ void __user *addr, size_t len)
+{
+ struct page **pages = NULL;
+ int nr_pages;
+ int err;
+
+ if (!PAGE_ALIGNED(addr))
+ return -EINVAL;
+
+ pages = fuse_uring_pin_user_pages(addr, len, &nr_pages);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+
+ err = account_pinned_pages(mem, pages, nr_pages);
+ if (err)
+ goto unpin;
+
+ mem->addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+ if (!mem->addr) {
+ err = -ENOMEM;
+ goto unaccount;
+ }
+
+ mem->pages = pages;
+ mem->nr_pages = nr_pages;
+
+ return 0;
+
+unaccount:
+ if (mem->user) {
+ atomic_long_sub(nr_pages, &mem->user->locked_vm);
+ free_uid(mem->user);
+ }
+ atomic64_sub(nr_pages, ¤t->mm->pinned_vm);
+ mmdrop(mem->mm_account);
+unpin:
+ unpin_user_pages(pages, nr_pages);
+ kvfree(pages);
+ return err;
+}
+
static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
- struct fuse_ring_queue *queue)
+ struct fuse_ring_queue *queue,
+ u64 init_flags)
{
const struct fuse_uring_cmd_req *cmd_req =
io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
struct iovec iov[FUSE_URING_IOV_SEGS];
+ bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
void __user *payload, *headers;
size_t headers_size, payload_size, ring_size;
struct fuse_bufring *br;
@@ -354,7 +511,17 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
return -ENOMEM;
br->queue_depth = queue_depth;
- br->headers = headers;
+ if (pinned_headers) {
+ err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
+ headers_size);
+ if (err) {
+ kfree(br);
+ return err;
+ }
+ br->use_pinned_headers = true;
+ } else {
+ br->headers = headers;
+ }
payload_addr = (uintptr_t)payload;
@@ -385,8 +552,15 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
u64 init_flags)
{
bool bufring = init_flags & FUSE_URING_BUFRING;
+ bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
+
+ if (bufring_enabled(queue) != bufring)
+ return false;
+
+ if (!bufring)
+ return true;
- return bufring_enabled(queue) == bufring;
+ return bufring_pinned_headers(queue) == pinned_headers;
}
static struct fuse_ring_queue *
@@ -423,7 +597,7 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
fuse_pqueue_init(&queue->fpq);
if (use_bufring) {
- int err = fuse_uring_bufring_setup(cmd, queue);
+ int err = fuse_uring_bufring_setup(cmd, queue, init_flags);
if (err) {
kfree(pq);
@@ -437,8 +611,10 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
if (ring->queues[qid]) {
spin_unlock(&fc->lock);
kfree(queue->fpq.processing);
- if (use_bufring)
+ if (use_bufring) {
+ fuse_uring_bufring_unpin(queue);
kfree(queue->bufring);
+ }
kfree(queue);
queue = ring->queues[qid];
@@ -605,6 +781,25 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
}
}
+static void fuse_uring_unpin_queues(struct fuse_ring *ring)
+{
+ int qid;
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
+ struct fuse_bufring *br;
+
+ if (!queue)
+ continue;
+
+ br = queue->bufring;
+ if (!br)
+ continue;
+
+ fuse_uring_bufring_unpin(queue);
+ }
+}
+
/*
* Stop the ring queues
*/
@@ -643,6 +838,9 @@ void fuse_uring_abort(struct fuse_conn *fc)
fuse_uring_abort_end_requests(ring);
fuse_uring_stop_queues(ring);
}
+
+ /* unpin while in process context - can't do this in softirq */
+ fuse_uring_unpin_queues(ring);
}
/*
@@ -758,6 +956,11 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
int buf_offset = offset +
sizeof(struct fuse_uring_req_header) * ent->id;
+ if (bufring_pinned_headers(ent->queue)) {
+ memcpy(ent->queue->bufring->pinned_headers.addr + buf_offset,
+ header, header_size);
+ return 0;
+ }
ring = ent->queue->bufring->headers + buf_offset;
} else {
ring = (void __user *)ent->headers + offset;
@@ -785,6 +988,11 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
int buf_offset = offset +
sizeof(struct fuse_uring_req_header) * ent->id;
+ if (bufring_pinned_headers(ent->queue)) {
+ memcpy(header, ent->queue->bufring->pinned_headers.addr + buf_offset,
+ header_size);
+ return 0;
+ }
ring = ent->queue->bufring->headers + buf_offset;
} else {
ring = (void __user *)ent->headers + offset;
@@ -1399,7 +1607,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
static bool init_flags_valid(u64 init_flags)
{
- u64 valid_flags = FUSE_URING_BUFRING;
+ u64 valid_flags =
+ FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS;
+ bool bufring = init_flags & FUSE_URING_BUFRING;
+ bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
+
+ if (pinned_headers && !bufring)
+ return false;
return !(init_flags & ~valid_flags);
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 66d5d5f8dc3f..05c0f061a882 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -42,12 +42,29 @@ struct fuse_bufring_buf {
unsigned int id;
};
-struct fuse_bufring {
- /* pointer to the headers buffer */
- void __user *headers;
+struct fuse_bufring_pinned {
+ void *addr;
+ struct page **pages;
+ unsigned int nr_pages;
+
+ /*
+ * need to track this so we can unpin / unaccount pages during teardown
+ * when not running in the server's task context
+ */
+ struct user_struct *user;
+ struct mm_struct *mm_account;
+};
+struct fuse_bufring {
+ bool use_pinned_headers: 1;
unsigned int queue_depth;
+ union {
+ /* pointer to the headers buffer */
+ void __user *headers;
+ struct fuse_bufring_pinned pinned_headers;
+ };
+
/* metadata tracking state of the bufring */
unsigned int nbufs;
unsigned int head;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 8753de7eb189..e57244c03d42 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -244,6 +244,7 @@
* 7.46
* - add FUSE_URING_BUFRING flag
* - add fuse_uring_cmd_req init struct
+ * - add FUSE_URING_PINNED_HEADERS flag
*/
#ifndef _LINUX_FUSE_H
@@ -1306,6 +1307,7 @@ enum fuse_uring_cmd {
/* fuse_uring_cmd_req flags */
#define FUSE_URING_BUFRING (1 << 0)
+#define FUSE_URING_PINNED_HEADERS (1 << 1)
/**
* In the 80B command area of the SQE.
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 12/14] fuse: add pinned payload buffers capability for io-uring buffer rings
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (10 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 11/14] fuse: add pinned headers capability for " Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-30 11:29 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
` (2 subsequent siblings)
14 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Extend the buffer ring pinning capability to payload buffers via the
FUSE_URING_PINNED_BUFFERS flag. When set alongside FUSE_URING_BUFRING,
the kernel pins and vmaps the payload buffer region during queue setup.
With pinned payloads, the kernel uses direct memcpy for all payload
buffer copies, avoiding the per-request overhead of pinning/unpinning
user pages and translating virtual addresses. This is particularly
beneficial for large payload copies.
As with pinned headers, buffers must be page-aligned. Pinned pages are
accounted against RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and
unpinned in process context during connection abort.
In benchmarks using passthrough_hp on a high-performance NVMe-backed
system, pinned headers and pinned payload buffers showed around a 10%
throughput improvement for direct randreads (~2150 MiB/s to ~2400
MiB/s), a 4% improvement for direct sequential reads (~2510 MiB/s to
~2620 MiB/s), a 8% improvement for buffered randreads (~2100 MiB/s to
~2280 MiB/s), and a 6% improvement for buffered sequential reads (~2500
MiB/s to ~2670 MiB/s).
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 54 +++++++++++++++++++++++++++++++++------
fs/fuse/dev_uring_i.h | 4 +++
include/uapi/linux/fuse.h | 2 ++
3 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 79736b02cf9f..06d3d8dc1c82 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -52,6 +52,11 @@ static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
return queue->bufring->use_pinned_headers;
}
+static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
+{
+ return queue->bufring->use_pinned_buffers;
+}
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -235,6 +240,11 @@ static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
fuse_bufring_unpin_mem(&br->pinned_headers);
br->use_pinned_headers = false;
}
+
+ if (bufring_pinned_buffers(queue)) {
+ fuse_bufring_unpin_mem(&br->pinned_bufs);
+ br->use_pinned_buffers = false;
+ }
}
void fuse_uring_destruct(struct fuse_conn *fc)
@@ -474,6 +484,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
struct iovec iov[FUSE_URING_IOV_SEGS];
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
+ bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
void __user *payload, *headers;
size_t headers_size, payload_size, ring_size;
struct fuse_bufring *br;
@@ -523,7 +534,22 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
br->headers = headers;
}
- payload_addr = (uintptr_t)payload;
+ if (pinned_bufs) {
+ err = fuse_bufring_pin_mem(&br->pinned_bufs, payload,
+ payload_size);
+ if (err) {
+ if (pinned_headers)
+ fuse_bufring_unpin_mem(&br->pinned_headers);
+ kfree(br);
+ return err;
+ }
+ br->use_pinned_buffers = true;
+ }
+
+ if (pinned_bufs)
+ payload_addr = (uintptr_t)br->pinned_bufs.addr;
+ else
+ payload_addr = (uintptr_t)payload;
/* populate the ring buffer */
for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
@@ -553,6 +579,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
{
bool bufring = init_flags & FUSE_URING_BUFRING;
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
+ bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
if (bufring_enabled(queue) != bufring)
return false;
@@ -560,7 +587,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
if (!bufring)
return true;
- return bufring_pinned_headers(queue) == pinned_headers;
+ return bufring_pinned_headers(queue) == pinned_headers &&
+ bufring_pinned_buffers(queue) == pinned_bufs;
}
static struct fuse_ring_queue *
@@ -1011,13 +1039,15 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
struct fuse_ring_ent *ent, int dir,
struct iov_iter *iter)
{
- void __user *payload;
+ void __user *payload = NULL;
+ bool use_bufring = bufring_enabled(ent->queue);
+ bool pinned_buffers = use_bufring && bufring_pinned_buffers(ent->queue);
int err;
- if (bufring_enabled(ent->queue))
- payload = (void __user *)ent->payload_buf.addr;
- else
+ if (!use_bufring)
payload = ent->payload;
+ else if (!pinned_buffers)
+ payload = (void __user *)ent->payload_buf.addr;
if (payload) {
err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
@@ -1029,6 +1059,12 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
fuse_copy_init(cs, dir == ITER_DEST, iter);
+ if (pinned_buffers) {
+ cs->is_kaddr = true;
+ cs->kaddr = (void *)ent->payload_buf.addr;
+ cs->len = ent->payload_buf.len;
+ }
+
cs->is_uring = true;
cs->req = req;
@@ -1608,11 +1644,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
static bool init_flags_valid(u64 init_flags)
{
u64 valid_flags =
- FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS;
+ FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS |
+ FUSE_URING_PINNED_BUFFERS;
bool bufring = init_flags & FUSE_URING_BUFRING;
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
+ bool pinned_buffers = init_flags & FUSE_URING_PINNED_BUFFERS;
- if (pinned_headers && !bufring)
+ if (!bufring && (pinned_headers || pinned_buffers))
return false;
return !(init_flags & ~valid_flags);
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 05c0f061a882..859ee4e6ba03 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -57,6 +57,7 @@ struct fuse_bufring_pinned {
struct fuse_bufring {
bool use_pinned_headers: 1;
+ bool use_pinned_buffers: 1;
unsigned int queue_depth;
union {
@@ -65,6 +66,9 @@ struct fuse_bufring {
struct fuse_bufring_pinned pinned_headers;
};
+ /* only used if the buffers are pinned */
+ struct fuse_bufring_pinned pinned_bufs;
+
/* metadata tracking state of the bufring */
unsigned int nbufs;
unsigned int head;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index e57244c03d42..51ecb66dd6eb 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -245,6 +245,7 @@
* - add FUSE_URING_BUFRING flag
* - add fuse_uring_cmd_req init struct
* - add FUSE_URING_PINNED_HEADERS flag
+ * - add FUSE_URING_PINNED_BUFFERS flag
*/
#ifndef _LINUX_FUSE_H
@@ -1308,6 +1309,7 @@ enum fuse_uring_cmd {
/* fuse_uring_cmd_req flags */
#define FUSE_URING_BUFRING (1 << 0)
#define FUSE_URING_PINNED_HEADERS (1 << 1)
+#define FUSE_URING_PINNED_BUFFERS (1 << 2)
/**
* In the 80B command area of the SQE.
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (11 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 12/14] fuse: add pinned payload buffers " Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-30 11:42 ` Jeff Layton
` (2 more replies)
2026-04-02 16:28 ` [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
2026-04-30 12:59 ` [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Jeff Layton
14 siblings, 3 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Implement zero-copy data transfer for fuse over io-uring, eliminating
memory copies between userspace, the kernel, and the fuse server for
page-backed read/write operations.
When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
the kernel registers the client's underlying pages as a sparse buffer at
the entry's fixed id via io_buffer_register_bvec(). The fuse server can
then perform io_uring read/write operations directly on these pages.
Non-page-backed args (eg out headers) go through the payload buffer as
normal.
This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
buffers. Gating on pinned headers and buffers keeps the configuration
space small and avoids partially-optimized modes that are unlikely to be
useful in practice. Pages are unregistered when the request completes.
The request flow for the zero-copy write path (client writes data,
server reads it) is as follows:
=======================================================================
| Kernel | FUSE server
| |
| "write(fd, buf, 1MB)" |
| |
| >sys_write() |
| >fuse_file_write_iter() |
| >fuse_send_one() |
| [req->args->in_pages = true] |
| [folios hold client write data] |
| |
| >fuse_uring_copy_to_ring() |
| >copy_header_to_ring(IN_OUT) |
| [memcpy fuse_in_header to |
| pinned headers buf via kaddr] |
| >copy_header_to_ring(OP) |
| [memcpy write_in header] |
| |
| >fuse_uring_args_to_ring() |
| >setup_fuse_copy_state() |
| [is_kaddr = true] |
| [skip_folio_copy = true] |
| |
| >fuse_uring_set_up_zero_copy() |
| [folio_get for each client folio] |
| [build bio_vec array from folios] |
| >io_buffer_register_bvec() |
| [register pages at ent->id] |
| [ent->zero_copied = true] |
| |
| >fuse_copy_args() |
| [skip_folio_copy => return 0 |
| for page arg, skip data copy] |
| |
| >copy_header_to_ring(RING_ENT) |
| [memcpy ent_in_out] |
| >io_uring_cmd_done() |
| |
| | [CQE received]
| |
| | [issue io_uring READ at
| | ent->id]
| | [reads directly from
| |client's pages (ZERO_COPY)]
| |
| | [write data to backing
| | store]
| | [submit COMMIT AND FETCH]
| |
| >fuse_uring_commit_fetch() |
| >fuse_uring_commit() |
| >fuse_uring_copy_from_ring() |
| >fuse_uring_req_end() |
| >io_buffer_unregister(ent->id) |
| [unregister sparse buffer] |
| >fuse_zero_copy_release() |
| [folio_put for each folio] |
| [ent->zero_copied = false] |
| >fuse_request_end() |
| [wake up client] |
The zero-copy read path is analogous.
Some requests may have both page-backed args and non-page-backed args.
For these requests, the page-backed args are zero-copied while the
non-page-backed args are copied to the buffer selected from the buffer
ring:
zero-copy: pages registered via io_buffer_register_bvec()
non-page-backed: copied to payload buffer via fuse_copy_args()
For a request whose payload is zero-copied, the
registration/unregistration path looks like:
register: fuse_uring_set_up_zero_copy()
folio_get() for each folio
io_buffer_register_bvec(ent->id)
[server accesses pages via io_uring fixed buf at ent->id]
unregister: fuse_uring_req_end()
io_buffer_unregister(ent->id)
-> fuse_zero_copy_release() callback
folio_put() for each folio
The throughput improvement from zero-copy depends on how much of the
per-request latency is spent on data copying vs backing I/O. When
backing I/O dominates, the saved memcpy is a negligible fraction of
overall latency. Please also note that for the server to read/write
into the zero-copied pages, the read/write must go through io-uring
as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
server's backing I/O is instantaneous (eg served from cache), the
overhead of the additional io_uring operation may negate the savings
from eliminating the memcpy.
In benchmarks using passthrough_hp on a high-performance NVMe-backed
system, zero-copy showed around a 35% throughput improvement for direct
randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s).
The benchmarks were run using:
fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
--size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev.c | 7 +-
fs/fuse/dev_uring.c | 167 +++++++++++++++++++++++++++++++++-----
fs/fuse/dev_uring_i.h | 4 +
fs/fuse/fuse_dev_i.h | 1 +
include/uapi/linux/fuse.h | 5 ++
5 files changed, 160 insertions(+), 24 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a87939eaa103..cd326e61831b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1233,10 +1233,13 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
for (i = 0; !err && i < numargs; i++) {
struct fuse_arg *arg = &args[i];
- if (i == numargs - 1 && argpages)
+ if (i == numargs - 1 && argpages) {
+ if (cs->skip_folio_copy)
+ return 0;
err = fuse_copy_folios(cs, arg->size, zeroing);
- else
+ } else {
err = fuse_copy_one(cs, arg->value, arg->size);
+ }
}
return err;
}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 06d3d8dc1c82..d9f1ee4beaf3 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -31,6 +31,11 @@ struct fuse_uring_pdu {
struct fuse_ring_ent *ent;
};
+struct fuse_zero_copy_bvs {
+ unsigned int nr_bvs;
+ struct bio_vec bvs[];
+};
+
static const struct fuse_iqueue_ops fuse_io_uring_ops;
enum fuse_uring_header_type {
@@ -57,6 +62,11 @@ static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
return queue->bufring->use_pinned_buffers;
}
+static inline bool bufring_zero_copy(struct fuse_ring_queue *queue)
+{
+ return queue->bufring->use_zero_copy;
+}
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -102,8 +112,18 @@ static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
}
}
+static bool can_zero_copy_req(struct fuse_ring_ent *ent, struct fuse_req *req)
+{
+ struct fuse_args *args = req->args;
+
+ if (!bufring_enabled(ent->queue) || !bufring_zero_copy(ent->queue))
+ return false;
+
+ return args->in_pages || args->out_pages;
+}
+
static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
- int error)
+ int error, unsigned int issue_flags)
{
struct fuse_ring_queue *queue = ent->queue;
struct fuse_ring *ring = queue->ring;
@@ -122,6 +142,11 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
spin_unlock(&queue->lock);
+ if (ent->zero_copied) {
+ io_buffer_unregister(ent->cmd, ent->id, issue_flags);
+ ent->zero_copied = false;
+ }
+
if (error)
req->out.h.error = error;
@@ -485,6 +510,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
struct iovec iov[FUSE_URING_IOV_SEGS];
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
+ bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
void __user *payload, *headers;
size_t headers_size, payload_size, ring_size;
struct fuse_bufring *br;
@@ -508,7 +534,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
return -EINVAL;
- if (buf_size < queue->ring->max_payload_sz)
+ if (!zero_copy && buf_size < queue->ring->max_payload_sz)
return -EINVAL;
nr_bufs = payload_size / buf_size;
@@ -521,6 +547,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
if (!br)
return -ENOMEM;
+ br->use_zero_copy = zero_copy;
br->queue_depth = queue_depth;
if (pinned_headers) {
err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
@@ -580,6 +607,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
bool bufring = init_flags & FUSE_URING_BUFRING;
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
+ bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
if (bufring_enabled(queue) != bufring)
return false;
@@ -588,7 +616,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
return true;
return bufring_pinned_headers(queue) == pinned_headers &&
- bufring_pinned_buffers(queue) == pinned_bufs;
+ bufring_pinned_buffers(queue) == pinned_bufs &&
+ bufring_zero_copy(queue) == zero_copy;
}
static struct fuse_ring_queue *
@@ -1063,6 +1092,7 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
cs->is_kaddr = true;
cs->kaddr = (void *)ent->payload_buf.addr;
cs->len = ent->payload_buf.len;
+ cs->skip_folio_copy = ent->zero_copied;
}
cs->is_uring = true;
@@ -1095,11 +1125,70 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
return err;
}
+static void fuse_zero_copy_release(void *priv)
+{
+ struct fuse_zero_copy_bvs *zc_bvs = priv;
+ unsigned int i;
+
+ for (i = 0; i < zc_bvs->nr_bvs; i++)
+ folio_put(page_folio(zc_bvs->bvs[i].bv_page));
+
+ kfree(zc_bvs);
+}
+
+static int fuse_uring_set_up_zero_copy(struct fuse_ring_ent *ent,
+ struct fuse_req *req,
+ unsigned int issue_flags)
+{
+ struct fuse_args_pages *ap;
+ int err, i, ddir = 0;
+ struct fuse_zero_copy_bvs *zc_bvs;
+ struct bio_vec *bvs;
+
+ /* out_pages indicates a read, in_pages indicates a write */
+ if (req->args->out_pages)
+ ddir |= IO_BUF_DEST;
+ if (req->args->in_pages)
+ ddir |= IO_BUF_SOURCE;
+
+ WARN_ON_ONCE(!ddir);
+
+ ap = container_of(req->args, typeof(*ap), args);
+
+ zc_bvs = kmalloc(struct_size(zc_bvs, bvs, ap->num_folios),
+ GFP_KERNEL_ACCOUNT);
+ if (!zc_bvs)
+ return -ENOMEM;
+
+ zc_bvs->nr_bvs = ap->num_folios;
+ bvs = zc_bvs->bvs;
+ for (i = 0; i < ap->num_folios; i++) {
+ bvs[i].bv_page = folio_page(ap->folios[i], 0);
+ bvs[i].bv_offset = ap->descs[i].offset;
+ bvs[i].bv_len = ap->descs[i].length;
+ folio_get(ap->folios[i]);
+ }
+
+ err = io_buffer_register_bvec(ent->cmd, bvs, ap->num_folios,
+ fuse_zero_copy_release, zc_bvs,
+ ddir, ent->id,
+ issue_flags);
+ if (err) {
+ fuse_zero_copy_release(zc_bvs);
+ return err;
+ }
+
+ ent->zero_copied = true;
+
+ return 0;
+}
+
/*
* Copy data from the req to the ring buffer
*/
static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
- struct fuse_ring_ent *ent)
+ struct fuse_ring_ent *ent,
+ unsigned int issue_flags)
{
struct fuse_copy_state cs;
struct fuse_args *args = req->args;
@@ -1112,8 +1201,15 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
.commit_id = req->in.h.unique,
};
- if (bufring_enabled(ent->queue))
+ if (bufring_enabled(ent->queue)) {
ent_in_out.buf_id = ent->payload_buf.id;
+ if (can_zero_copy_req(ent, req)) {
+ ent_in_out.flags |= FUSE_URING_ENT_ZERO_COPY;
+ err = fuse_uring_set_up_zero_copy(ent, req, issue_flags);
+ if (err)
+ return err;
+ }
+ }
err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
if (err)
@@ -1145,12 +1241,17 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
+ if (cs.skip_folio_copy && args->in_pages)
+ ent_in_out.payload_sz +=
+ args->in_args[args->in_numargs - 1].size;
+
return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
&ent_in_out, sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
- struct fuse_req *req)
+ struct fuse_req *req,
+ unsigned int issue_flags)
{
struct fuse_ring_queue *queue = ent->queue;
struct fuse_ring *ring = queue->ring;
@@ -1168,7 +1269,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
return err;
/* copy the request */
- err = fuse_uring_args_to_ring(ring, req, ent);
+ err = fuse_uring_args_to_ring(ring, req, ent, issue_flags);
if (unlikely(err)) {
pr_info_ratelimited("Copy to ring failed: %d\n", err);
return err;
@@ -1179,11 +1280,25 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
sizeof(req->in.h));
}
-static bool fuse_uring_req_has_payload(struct fuse_req *req)
+static bool fuse_uring_req_has_copyable_payload(struct fuse_ring_ent *ent,
+ struct fuse_req *req)
{
struct fuse_args *args = req->args;
- return args->in_numargs > 1 || args->out_numargs;
+ if (!can_zero_copy_req(ent, req))
+ return args->in_numargs > 1 || args->out_numargs;
+
+ /*
+ * the asymmetry between in_numargs > 2 and out_numargs > 1 is because
+ * the per-op header is extracted before fuse_copy_args() for inargs but
+ * not for outargs
+ */
+ if ((args->in_numargs > 1) && (!args->in_pages || args->in_numargs > 2))
+ return true;
+ if (args->out_numargs && (!args->out_pages || args->out_numargs > 1))
+ return true;
+
+ return false;
}
static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
@@ -1245,7 +1360,7 @@ static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
return 0;
buffer_selected = !!ent->payload_buf.addr;
- has_payload = fuse_uring_req_has_payload(req);
+ has_payload = fuse_uring_req_has_copyable_payload(ent, req);
if (has_payload && !buffer_selected)
return fuse_uring_select_buffer(ent);
@@ -1263,22 +1378,23 @@ static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
return 0;
/* no payload to copy, can skip selecting a buffer */
- if (!fuse_uring_req_has_payload(req))
+ if (!fuse_uring_req_has_copyable_payload(ent, req))
return 0;
return fuse_uring_select_buffer(ent);
}
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
- struct fuse_req *req)
+ struct fuse_req *req,
+ unsigned int issue_flags)
{
int err;
- err = fuse_uring_copy_to_ring(ent, req);
+ err = fuse_uring_copy_to_ring(ent, req, issue_flags);
if (!err)
set_bit(FR_SENT, &req->flags);
else
- fuse_uring_req_end(ent, req, err);
+ fuse_uring_req_end(ent, req, err, issue_flags);
return err;
}
@@ -1386,7 +1502,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
err = fuse_uring_copy_from_ring(ring, req, ent);
out:
- fuse_uring_req_end(ent, req, err);
+ fuse_uring_req_end(ent, req, err, issue_flags);
}
/*
@@ -1396,7 +1512,8 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
* Else, there is no next fuse request and this returns false.
*/
static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
- struct fuse_ring_queue *queue)
+ struct fuse_ring_queue *queue,
+ unsigned int issue_flags)
{
int err;
struct fuse_req *req;
@@ -1408,7 +1525,7 @@ static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
spin_unlock(&queue->lock);
if (req) {
- err = fuse_uring_prepare_send(ent, req);
+ err = fuse_uring_prepare_send(ent, req, issue_flags);
if (err)
goto retry;
}
@@ -1523,7 +1640,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
* no-op and the next request will be serviced when a buffer becomes
* available.
*/
- if (fuse_uring_get_next_fuse_req(ent, queue))
+ if (fuse_uring_get_next_fuse_req(ent, queue, issue_flags))
fuse_uring_send(ent, cmd, 0, issue_flags);
return 0;
}
@@ -1645,12 +1762,17 @@ static bool init_flags_valid(u64 init_flags)
{
u64 valid_flags =
FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS |
- FUSE_URING_PINNED_BUFFERS;
+ FUSE_URING_PINNED_BUFFERS | FUSE_URING_ZERO_COPY;
bool bufring = init_flags & FUSE_URING_BUFRING;
bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
bool pinned_buffers = init_flags & FUSE_URING_PINNED_BUFFERS;
+ bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
+
+ if (!bufring && (pinned_headers || pinned_buffers || zero_copy))
+ return false;
- if (!bufring && (pinned_headers || pinned_buffers))
+ if (zero_copy &&
+ (!capable(CAP_SYS_ADMIN) || !pinned_headers || !pinned_buffers))
return false;
return !(init_flags & ~valid_flags);
@@ -1795,9 +1917,10 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
int err;
if (!tw.cancel) {
- err = fuse_uring_prepare_send(ent, ent->fuse_req);
+ err = fuse_uring_prepare_send(ent, ent->fuse_req, issue_flags);
if (err) {
- if (!fuse_uring_get_next_fuse_req(ent, queue))
+ if (!fuse_uring_get_next_fuse_req(ent, queue,
+ issue_flags))
return;
err = 0;
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 859ee4e6ba03..0546f719fc65 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -58,6 +58,8 @@ struct fuse_bufring_pinned {
struct fuse_bufring {
bool use_pinned_headers: 1;
bool use_pinned_buffers: 1;
+ /* this is only allowed on privileged servers */
+ bool use_zero_copy: 1;
unsigned int queue_depth;
union {
@@ -96,6 +98,8 @@ struct fuse_ring_ent {
*/
unsigned int id;
struct fuse_bufring_buf payload_buf;
+ /* true if the request's pages are being zero-copied */
+ bool zero_copied;
};
};
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index aa1d25421054..67b5bed451fe 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -39,6 +39,7 @@ struct fuse_copy_state {
bool is_uring:1;
/* if set, use kaddr; otherwise use pg */
bool is_kaddr:1;
+ bool skip_folio_copy:1;
struct {
unsigned int copied_sz; /* copied size into the user buffer */
} ring;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 51ecb66dd6eb..c2e53886cf06 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -246,6 +246,7 @@
* - add fuse_uring_cmd_req init struct
* - add FUSE_URING_PINNED_HEADERS flag
* - add FUSE_URING_PINNED_BUFFERS flag
+ * - add FUSE_URING_ZERO_COPY flag
*/
#ifndef _LINUX_FUSE_H
@@ -1257,6 +1258,9 @@ struct fuse_supp_groups {
#define FUSE_URING_IN_OUT_HEADER_SZ 128
#define FUSE_URING_OP_IN_OUT_SZ 128
+/* Set if the ent's payload is zero-copied */
+#define FUSE_URING_ENT_ZERO_COPY (1 << 0)
+
/* Used as part of the fuse_uring_req_header */
struct fuse_uring_ent_in_out {
uint64_t flags;
@@ -1310,6 +1314,7 @@ enum fuse_uring_cmd {
#define FUSE_URING_BUFRING (1 << 0)
#define FUSE_URING_PINNED_HEADERS (1 << 1)
#define FUSE_URING_PINNED_BUFFERS (1 << 2)
+#define FUSE_URING_ZERO_COPY (1 << 3)
/**
* In the 80B command area of the SQE.
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (12 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
@ 2026-04-02 16:28 ` Joanne Koong
2026-04-14 21:05 ` Bernd Schubert
2026-04-30 12:57 ` Jeff Layton
2026-04-30 12:59 ` [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Jeff Layton
14 siblings, 2 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-02 16:28 UTC (permalink / raw)
To: miklos; +Cc: bernd, axboe, linux-fsdevel
Add documentation for fuse over io-uring usage of buffer rings and
zero-copy.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
.../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
1 file changed, 189 insertions(+)
diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
index d73dd0dbd238..bc47686c023f 100644
--- a/Documentation/filesystems/fuse/fuse-io-uring.rst
+++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
@@ -95,5 +95,194 @@ Sending requests with CQEs
| <fuse_unlink() |
| <sys_unlink() |
+Buffer rings
+============
+Buffer rings have two main advantages:
+* Reduced memory usage: payload buffers are pooled and selected on demand
+ rather than dedicated per-entry, allowing fewer buffers than entries. This
+ infrastructure also allows for future optimizations like incremental buffer
+ consumption where non-overlapping parts of a buffer may be used across
+ concurrent requests.
+* Foundation for pinned buffers: contiguous buffer allocations allow the
+ kernel to pin and vmap the entire region, avoiding per-request page
+ resolution overhead
+
+At a high-level, this is how fuse uses buffer rings:
+
+* The first REGISTER SQE for a queue creates the queue and sets up the
+ buffer ring. The server provides two iovecs: one for headers and one for
+ payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
+ to a specific header slot.
+* When a client request arrives, the kernel selects a payload buffer from
+ the ring (if the request has copyable data), copies headers and payload
+ data, and completes the sqe.
+* The buf_id of the selected payload buffer is communicated to the server
+ via the fuse_uring_ent_in_out header. The server uses this to locate the
+ payload data in its buffer.
+* The server processes the request and sends a COMMIT_AND_FETCH SQE with
+ the reply. The kernel processes the reply and recycles the buffer.
+
+Visually, this looks like::
+
+ Headers buffer:
+ +-----------------------+-----------------------+-----+
+ | fuse_uring_req_header | fuse_uring_req_header | ... |
+ | [ent 0] | [ent 1] | |
+ +-----------------------+-----------------------+-----+
+ ^ ^
+ | |
+ ent 0 header slot ent 1 header slot
+ (sqe->buf_index=0) (sqe->buf_index=1)
+
+ Payload buffer pool:
+ +-----------+-----------+-----------+-----+
+ | buf 0 | buf 1 | buf 2 | ... |
+ | (buf_size)| (buf_size)| (buf_size)| |
+ +-----------+-----------+-----------+-----+
+ selected on demand, recycled after each request
+
+Buffer ring request flow
+------------------------::
+
+| Kernel | FUSE daemon
+| |
+| [client request arrives] |
+| >fuse_uring_send() |
+| [select payload buf from ring] |
+| >fuse_uring_select_buffer() |
+| [copy headers to ent's header slot] |
+| >copy_header_to_ring() |
+| [copy payload to selected buf] |
+| >fuse_uring_copy_to_ring() |
+| [set buf_id in ent_in_out header] |
+| >io_uring_cmd_done() |
+| | [CQE received]
+| | [read headers from header slot]
+| | [read payload from buf_id]
+| | [process request]
+| | [write reply to header slot]
+| | [write reply payload to buf]
+| | >io_uring_submit()
+| | COMMIT_AND_FETCH
+| >fuse_uring_commit_fetch() |
+| >fuse_uring_commit() |
+| [copy reply from ring] |
+| >fuse_uring_recycle_buffer() |
+| >fuse_uring_get_next_fuse_req() |
+
+Pinned buffers
+==============
+
+Servers can optionally pin their header and/or payload buffers by setting
+FUSE_URING_PINNED_HEADERS and/or FUSE_URING_PINNED_BUFFERS flags. When
+set, the kernel pins the user pages and vmaps them during queue setup,
+enabling memcpy to/from the kernel virtual address instead of
+copy_to_user/copy_from_user.
+
+This avoids the per-request cost of pinning/unpinning user pages and
+translating virtual addresses. Buffers must be page-aligned. The pinned pages
+are accounted against RLIMIT_MEMLOCK (bypassable with CAP_IPC_LOCK).
+
+Zero-copy
+=========
+
+Fuse io-uring zero-copy allows the server to directly read from / write to
+the client's pages, bypassing any intermediary buffer copies. This requires
+the FUSE_URING_ZERO_COPY flag, buffer rings with pinned headers and buffers,
+and CAP_SYS_ADMIN.
+
+The kernel registers the client's underlying pages as a sparse buffer at
+the entry's fixed id via io_buffer_register_bvec(). The fuse server can
+then perform io_uring read/write operations directly on these pages.
+Non-page-backed args (eg out headers) go through the payload buffer as
+normal. Pages are unregistered when the request completes.
+
+The request flow for the zero-copy write path (client writes data, server
+reads it) is as follows:
+
+Zero-copy write
+---------------::
+| Kernel | FUSE server
+| |
+| "write(fd, buf, 1MB)" |
+| |
+| >sys_write() |
+| >fuse_file_write_iter() |
+| >fuse_send_one() |
+| [req->args->in_pages = true] |
+| [folios hold client write data] |
+| |
+| >fuse_uring_copy_to_ring() |
+| >copy_header_to_ring(IN_OUT) |
+| [memcpy fuse_in_header to |
+| pinned headers buf via kaddr] |
+| >copy_header_to_ring(OP) |
+| [memcpy write_in header] |
+| |
+| >fuse_uring_args_to_ring() |
+| >setup_fuse_copy_state() |
+| [is_kaddr = true] |
+| [skip_folio_copy = true] |
+| |
+| >fuse_uring_set_up_zero_copy() |
+| [folio_get for each client folio] |
+| [build bio_vec array from folios] |
+| >io_buffer_register_bvec() |
+| [register pages at ent->id] |
+| [ent->zero_copied = true] |
+| |
+| >fuse_copy_args() |
+| [skip_folio_copy => return 0 |
+| for page arg, skip data copy] |
+| |
+| >copy_header_to_ring(RING_ENT) |
+| [memcpy ent_in_out] |
+| >io_uring_cmd_done() |
+| |
+| | [CQE received]
+| |
+| | [issue io_uring READ at
+| | ent->id]
+| | [reads directly from
+| | client's pages (ZERO_COPY)]
+| |
+| | [write data to backing
+| | store]
+| | [submit COMMIT AND FETCH]
+| |
+| >fuse_uring_commit_fetch() |
+| >fuse_uring_commit() |
+| >fuse_uring_copy_from_ring() |
+| >fuse_uring_req_end() |
+| >io_buffer_unregister(ent->id) |
+| [unregister sparse buffer] |
+| >fuse_zero_copy_release() |
+| [folio_put for each folio] |
+| [ent->zero_copied = false] |
+| >fuse_request_end() |
+| [wake up client] |
+
+The zero-copy read path is analogous.
+
+Some requests may have both page-backed args and non-page-backed args.
+For these requests, the page-backed args are zero-copied while the
+non-page-backed args are copied to the buffer selected from the buffer
+ring:
+ zero-copy: pages registered via io_buffer_register_bvec()
+ non-page-backed: copied to payload buffer via fuse_copy_args()
+
+For a request whose payload is zero-copied, the registration/unregistration
+path looks like:
+
+register: fuse_uring_set_up_zero_copy()
+ folio_get() for each folio
+ io_buffer_register_bvec(ent->id)
+
+[server accesses pages via io_uring fixed buf at ent->id]
+
+unregister: fuse_uring_req_end()
+ io_buffer_unregister(ent->id)
+ -> fuse_zero_copy_release() callback
+ folio_put() for each folio
--
2.52.0
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 11/14] fuse: add pinned headers capability for io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 11/14] fuse: add pinned headers capability for " Joanne Koong
@ 2026-04-14 12:47 ` Bernd Schubert
2026-04-15 0:48 ` Joanne Koong
2026-04-30 11:22 ` Jeff Layton
1 sibling, 1 reply; 49+ messages in thread
From: Bernd Schubert @ 2026-04-14 12:47 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Allow fuse servers to pin their header buffers by setting the
> FUSE_URING_PINNED_HEADERS flag alongside FUSE_URING_BUFRING on REGISTER
> sqes. When set, the kernel pins the header pages, vmaps them for a
> kernel virtual address, and uses direct memcpy for copying. This avoids
> the per-request overhead of having to pin/unpin user pages and translate
> virtual addresses.
>
> Buffers must be page-aligned. The kernel accounts pinned pages against
> RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and tracks mm->pinned_vm.
> Unpinning is done in process context during connection abort, since vmap
> cannot run in softirq (where final destruction occurs via RCU).
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 228 ++++++++++++++++++++++++++++++++++++--
> fs/fuse/dev_uring_i.h | 23 +++-
> include/uapi/linux/fuse.h | 2 +
> 3 files changed, 243 insertions(+), 10 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 9f14a2bcde3f..79736b02cf9f 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -11,6 +11,7 @@
>
> #include <linux/fs.h>
> #include <linux/io_uring/cmd.h>
> +#include <linux/vmalloc.h>
>
> static bool __read_mostly enable_uring;
> module_param(enable_uring, bool, 0644);
> @@ -46,6 +47,11 @@ static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> return queue->bufring != NULL;
> }
>
> +static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_pinned_headers;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -200,6 +206,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
> return false;
> }
>
> +static void fuse_bufring_unpin_mem(struct fuse_bufring_pinned *mem)
> +{
> + struct page **pages = mem->pages;
> + unsigned int nr_pages = mem->nr_pages;
> + struct user_struct *user = mem->user;
> + struct mm_struct *mm_account = mem->mm_account;
> +
> + vunmap(mem->addr);
> + unpin_user_pages(pages, nr_pages);
> +
> + if (user) {
> + atomic_long_sub(nr_pages, &user->locked_vm);
> + free_uid(user);
> + }
> +
> + atomic64_sub(nr_pages, &mm_account->pinned_vm);
> + mmdrop(mm_account);
> +
> + kvfree(mem->pages);
> +}
> +
> +static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
> +{
> + struct fuse_bufring *br = queue->bufring;
> +
> + if (bufring_pinned_headers(queue)) {
> + fuse_bufring_unpin_mem(&br->pinned_headers);
> + br->use_pinned_headers = false;
> + }
> +}
> +
> void fuse_uring_destruct(struct fuse_conn *fc)
> {
> struct fuse_ring *ring = fc->ring;
> @@ -227,7 +264,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> }
>
> kfree(queue->fpq.processing);
> - kfree(queue->bufring);
> + if (bufring_enabled(queue)) {
> + fuse_uring_bufring_unpin(queue);
> + kfree(queue->bufring);
> + }
> kfree(queue);
> ring->queues[qid] = NULL;
> }
> @@ -309,14 +349,131 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> return 0;
> }
>
> +static struct page **fuse_uring_pin_user_pages(void __user *uaddr,
> + unsigned long len, int *npages)
I think this is a duplicate of io_pin_pages(), can we just export that
and use here? I'm basically going to propose to use the same technique
in ublk - would be another duplicate.
(Not a complete a review yet, just something I just noticed).
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-02 16:28 ` [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
@ 2026-04-14 21:05 ` Bernd Schubert
2026-04-15 1:10 ` Joanne Koong
2026-04-30 12:57 ` Jeff Layton
1 sibling, 1 reply; 49+ messages in thread
From: Bernd Schubert @ 2026-04-14 21:05 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Add documentation for fuse over io-uring usage of buffer rings and
> zero-copy.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> .../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
> 1 file changed, 189 insertions(+)
>
> diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
> index d73dd0dbd238..bc47686c023f 100644
> --- a/Documentation/filesystems/fuse/fuse-io-uring.rst
> +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
> @@ -95,5 +95,194 @@ Sending requests with CQEs
> | <fuse_unlink() |
> | <sys_unlink() |
>
> +Buffer rings
> +============
>
> +Buffer rings have two main advantages:
>
> +* Reduced memory usage: payload buffers are pooled and selected on demand
> + rather than dedicated per-entry, allowing fewer buffers than entries. This
> + infrastructure also allows for future optimizations like incremental buffer
> + consumption where non-overlapping parts of a buffer may be used across
> + concurrent requests.
> +* Foundation for pinned buffers: contiguous buffer allocations allow the
> + kernel to pin and vmap the entire region, avoiding per-request page
> + resolution overhead
> +
> +At a high-level, this is how fuse uses buffer rings:
> +
> +* The first REGISTER SQE for a queue creates the queue and sets up the
> + buffer ring. The server provides two iovecs: one for headers and one for
> + payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
> + to a specific header slot.
Hi Joanne,
thanks a lot for this document! Could we discuss if we could just hook
in here and allow SQEs with different iovecs for the payload buffer?
Let's say fuse-server wants multiple IO sizes - it could easily do that
via different pBufs and just needs to specify the dedicated IO size per
pBuf. Those buffers could then get sorted into an array - we could
define either via FUSE init the number of buf sizes or use a fixed size
array. Fuse requests then would just need to pick the right array.
This is basically what I'm currently working on for ublk.
I think it would be good to agree on the design before it gets merged so
that uapi doesn't change.
Thanks,
Bernd
> +* When a client request arrives, the kernel selects a payload buffer from
> + the ring (if the request has copyable data), copies headers and payload
> + data, and completes the sqe.
> +* The buf_id of the selected payload buffer is communicated to the server
> + via the fuse_uring_ent_in_out header. The server uses this to locate the
> + payload data in its buffer.
> +* The server processes the request and sends a COMMIT_AND_FETCH SQE with
> + the reply. The kernel processes the reply and recycles the buffer.
> +
> +Visually, this looks like::
> +
> + Headers buffer:
> + +-----------------------+-----------------------+-----+
> + | fuse_uring_req_header | fuse_uring_req_header | ... |
> + | [ent 0] | [ent 1] | |
> + +-----------------------+-----------------------+-----+
> + ^ ^
> + | |
> + ent 0 header slot ent 1 header slot
> + (sqe->buf_index=0) (sqe->buf_index=1)
> +
> + Payload buffer pool:
> + +-----------+-----------+-----------+-----+
> + | buf 0 | buf 1 | buf 2 | ... |
> + | (buf_size)| (buf_size)| (buf_size)| |
> + +-----------+-----------+-----------+-----+
> + selected on demand, recycled after each request
> +
> +Buffer ring request flow
> +------------------------::
> +
> +| Kernel | FUSE daemon
> +| |
> +| [client request arrives] |
> +| >fuse_uring_send() |
> +| [select payload buf from ring] |
> +| >fuse_uring_select_buffer() |
> +| [copy headers to ent's header slot] |
> +| >copy_header_to_ring() |
> +| [copy payload to selected buf] |
> +| >fuse_uring_copy_to_ring() |
> +| [set buf_id in ent_in_out header] |
> +| >io_uring_cmd_done() |
> +| | [CQE received]
> +| | [read headers from header slot]
> +| | [read payload from buf_id]
> +| | [process request]
> +| | [write reply to header slot]
> +| | [write reply payload to buf]
> +| | >io_uring_submit()
> +| | COMMIT_AND_FETCH
> +| >fuse_uring_commit_fetch() |
> +| >fuse_uring_commit() |
> +| [copy reply from ring] |
> +| >fuse_uring_recycle_buffer() |
> +| >fuse_uring_get_next_fuse_req() |
> +
> +Pinned buffers
> +==============
> +
> +Servers can optionally pin their header and/or payload buffers by setting
> +FUSE_URING_PINNED_HEADERS and/or FUSE_URING_PINNED_BUFFERS flags. When
> +set, the kernel pins the user pages and vmaps them during queue setup,
> +enabling memcpy to/from the kernel virtual address instead of
> +copy_to_user/copy_from_user.
> +
> +This avoids the per-request cost of pinning/unpinning user pages and
> +translating virtual addresses. Buffers must be page-aligned. The pinned pages
> +are accounted against RLIMIT_MEMLOCK (bypassable with CAP_IPC_LOCK).
> +
> +Zero-copy
> +=========
> +
> +Fuse io-uring zero-copy allows the server to directly read from / write to
> +the client's pages, bypassing any intermediary buffer copies. This requires
> +the FUSE_URING_ZERO_COPY flag, buffer rings with pinned headers and buffers,
> +and CAP_SYS_ADMIN.
> +
> +The kernel registers the client's underlying pages as a sparse buffer at
> +the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> +then perform io_uring read/write operations directly on these pages.
> +Non-page-backed args (eg out headers) go through the payload buffer as
> +normal. Pages are unregistered when the request completes.
> +
> +The request flow for the zero-copy write path (client writes data, server
> +reads it) is as follows:
> +
> +Zero-copy write
> +---------------::
> +| Kernel | FUSE server
> +| |
> +| "write(fd, buf, 1MB)" |
> +| |
> +| >sys_write() |
> +| >fuse_file_write_iter() |
> +| >fuse_send_one() |
> +| [req->args->in_pages = true] |
> +| [folios hold client write data] |
> +| |
> +| >fuse_uring_copy_to_ring() |
> +| >copy_header_to_ring(IN_OUT) |
> +| [memcpy fuse_in_header to |
> +| pinned headers buf via kaddr] |
> +| >copy_header_to_ring(OP) |
> +| [memcpy write_in header] |
> +| |
> +| >fuse_uring_args_to_ring() |
> +| >setup_fuse_copy_state() |
> +| [is_kaddr = true] |
> +| [skip_folio_copy = true] |
> +| |
> +| >fuse_uring_set_up_zero_copy() |
> +| [folio_get for each client folio] |
> +| [build bio_vec array from folios] |
> +| >io_buffer_register_bvec() |
> +| [register pages at ent->id] |
> +| [ent->zero_copied = true] |
> +| |
> +| >fuse_copy_args() |
> +| [skip_folio_copy => return 0 |
> +| for page arg, skip data copy] |
> +| |
> +| >copy_header_to_ring(RING_ENT) |
> +| [memcpy ent_in_out] |
> +| >io_uring_cmd_done() |
> +| |
> +| | [CQE received]
> +| |
> +| | [issue io_uring READ at
> +| | ent->id]
> +| | [reads directly from
> +| | client's pages (ZERO_COPY)]
> +| |
> +| | [write data to backing
> +| | store]
> +| | [submit COMMIT AND FETCH]
> +| |
> +| >fuse_uring_commit_fetch() |
> +| >fuse_uring_commit() |
> +| >fuse_uring_copy_from_ring() |
> +| >fuse_uring_req_end() |
> +| >io_buffer_unregister(ent->id) |
> +| [unregister sparse buffer] |
> +| >fuse_zero_copy_release() |
> +| [folio_put for each folio] |
> +| [ent->zero_copied = false] |
> +| >fuse_request_end() |
> +| [wake up client] |
> +
> +The zero-copy read path is analogous.
> +
> +Some requests may have both page-backed args and non-page-backed args.
> +For these requests, the page-backed args are zero-copied while the
> +non-page-backed args are copied to the buffer selected from the buffer
> +ring:
> + zero-copy: pages registered via io_buffer_register_bvec()
> + non-page-backed: copied to payload buffer via fuse_copy_args()
> +
> +For a request whose payload is zero-copied, the registration/unregistration
> +path looks like:
> +
> +register: fuse_uring_set_up_zero_copy()
> + folio_get() for each folio
> + io_buffer_register_bvec(ent->id)
> +
> +[server accesses pages via io_uring fixed buf at ent->id]
> +
> +unregister: fuse_uring_req_end()
> + io_buffer_unregister(ent->id)
> + -> fuse_zero_copy_release() callback
> + folio_put() for each folio
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 11/14] fuse: add pinned headers capability for io-uring buffer rings
2026-04-14 12:47 ` Bernd Schubert
@ 2026-04-15 0:48 ` Joanne Koong
2026-05-05 22:51 ` Bernd Schubert
0 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-15 0:48 UTC (permalink / raw)
To: Bernd Schubert; +Cc: miklos, axboe, linux-fsdevel
On Tue, Apr 14, 2026 at 5:47 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
>
> On 4/2/26 18:28, Joanne Koong wrote:
> > Allow fuse servers to pin their header buffers by setting the
> > FUSE_URING_PINNED_HEADERS flag alongside FUSE_URING_BUFRING on REGISTER
> > sqes. When set, the kernel pins the header pages, vmaps them for a
> > kernel virtual address, and uses direct memcpy for copying. This avoids
> > the per-request overhead of having to pin/unpin user pages and translate
> > virtual addresses.
> >
> > Buffers must be page-aligned. The kernel accounts pinned pages against
> > RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and tracks mm->pinned_vm.
> > Unpinning is done in process context during connection abort, since vmap
> > cannot run in softirq (where final destruction occurs via RCU).
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > fs/fuse/dev_uring.c | 228 ++++++++++++++++++++++++++++++++++++--
> > fs/fuse/dev_uring_i.h | 23 +++-
> > include/uapi/linux/fuse.h | 2 +
> > 3 files changed, 243 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > index 9f14a2bcde3f..79736b02cf9f 100644
> > --- a/fs/fuse/dev_uring.c
> > +++ b/fs/fuse/dev_uring.c
> > @@ -11,6 +11,7 @@
> >
> > #include <linux/fs.h>
> > #include <linux/io_uring/cmd.h>
> > +#include <linux/vmalloc.h>
> >
> > static bool __read_mostly enable_uring;
> > module_param(enable_uring, bool, 0644);
> > @@ -46,6 +47,11 @@ static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> > return queue->bufring != NULL;
> > }
> >
> > +static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
> > +{
> > + return queue->bufring->use_pinned_headers;
> > +}
> > +
> > static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> > struct fuse_ring_ent *ring_ent)
> > {
> > @@ -200,6 +206,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
> > return false;
> > }
> >
> > +static void fuse_bufring_unpin_mem(struct fuse_bufring_pinned *mem)
> > +{
> > + struct page **pages = mem->pages;
> > + unsigned int nr_pages = mem->nr_pages;
> > + struct user_struct *user = mem->user;
> > + struct mm_struct *mm_account = mem->mm_account;
> > +
> > + vunmap(mem->addr);
> > + unpin_user_pages(pages, nr_pages);
> > +
> > + if (user) {
> > + atomic_long_sub(nr_pages, &user->locked_vm);
> > + free_uid(user);
> > + }
> > +
> > + atomic64_sub(nr_pages, &mm_account->pinned_vm);
> > + mmdrop(mm_account);
> > +
> > + kvfree(mem->pages);
> > +}
> > +
> > +static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
> > +{
> > + struct fuse_bufring *br = queue->bufring;
> > +
> > + if (bufring_pinned_headers(queue)) {
> > + fuse_bufring_unpin_mem(&br->pinned_headers);
> > + br->use_pinned_headers = false;
> > + }
> > +}
> > +
> > void fuse_uring_destruct(struct fuse_conn *fc)
> > {
> > struct fuse_ring *ring = fc->ring;
> > @@ -227,7 +264,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> > }
> >
> > kfree(queue->fpq.processing);
> > - kfree(queue->bufring);
> > + if (bufring_enabled(queue)) {
> > + fuse_uring_bufring_unpin(queue);
> > + kfree(queue->bufring);
> > + }
> > kfree(queue);
> > ring->queues[qid] = NULL;
> > }
> > @@ -309,14 +349,131 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> > return 0;
> > }
> >
> > +static struct page **fuse_uring_pin_user_pages(void __user *uaddr,
> > + unsigned long len, int *npages)
>
> I think this is a duplicate of io_pin_pages(), can we just export that
> and use here? I'm basically going to propose to use the same technique
> in ublk - would be another duplicate.
>
Tbh I think this is generic logic that makes more sense to live in the
mm layer instead of fuse calling this as an exported io-uring
function. The memory it's passing in is not related to io-uring, so
that was my hesitation. For your ublk use case, is the memory you're
passing into this user-allocated memory that's not part of io-uring?
If so, then maybe it's best to move io_pin_pages() out of io-uring and
into generic mm.
Thanks,
Joanne
>
> (Not a complete a review yet, just something I just noticed).
>
>
> Thanks,
> Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-14 21:05 ` Bernd Schubert
@ 2026-04-15 1:10 ` Joanne Koong
2026-04-15 10:55 ` Bernd Schubert
0 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-15 1:10 UTC (permalink / raw)
To: Bernd Schubert; +Cc: miklos, axboe, linux-fsdevel
On Tue, Apr 14, 2026 at 2:05 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> On 4/2/26 18:28, Joanne Koong wrote:
> > Add documentation for fuse over io-uring usage of buffer rings and
> > zero-copy.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > .../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
> > 1 file changed, 189 insertions(+)
> >
> > diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
> > index d73dd0dbd238..bc47686c023f 100644
> > --- a/Documentation/filesystems/fuse/fuse-io-uring.rst
> > +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
> > @@ -95,5 +95,194 @@ Sending requests with CQEs
> > | <fuse_unlink() |
> > | <sys_unlink() |
> >
> > +Buffer rings
> > +============
> >
> > +Buffer rings have two main advantages:
> >
> > +* Reduced memory usage: payload buffers are pooled and selected on demand
> > + rather than dedicated per-entry, allowing fewer buffers than entries. This
> > + infrastructure also allows for future optimizations like incremental buffer
> > + consumption where non-overlapping parts of a buffer may be used across
> > + concurrent requests.
> > +* Foundation for pinned buffers: contiguous buffer allocations allow the
> > + kernel to pin and vmap the entire region, avoiding per-request page
> > + resolution overhead
> > +
> > +At a high-level, this is how fuse uses buffer rings:
> > +
> > +* The first REGISTER SQE for a queue creates the queue and sets up the
> > + buffer ring. The server provides two iovecs: one for headers and one for
> > + payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
> > + to a specific header slot.
>
> Hi Joanne,
>
> thanks a lot for this document! Could we discuss if we could just hook
> in here and allow SQEs with different iovecs for the payload buffer?
> Let's say fuse-server wants multiple IO sizes - it could easily do that
> via different pBufs and just needs to specify the dedicated IO size per
> pBuf. Those buffers could then get sorted into an array - we could
> define either via FUSE init the number of buf sizes or use a fixed size
> array. Fuse requests then would just need to pick the right array.
> This is basically what I'm currently working on for ublk.
>
> I think it would be good to agree on the design before it gets merged so
> that uapi doesn't change.
Hi Bernd,
I'm not certain I fully see the use case for a fuse server preferring
a static preallocation of multiple IO sizes over using incremental
buffer consumption, but in my mind to support multiple IO size
payloads, I was thinking something like this might work best:
* iov[0] for the headers stays the same. no matter how many IO size
payloads there are, the ent always maps to a header and the headers
are a fixed size
* iov[1...x] are the payload buffers. From the uapi perspective, in
the fues_uring_cmd_req init struct, there would need to be an array of
uint32_t buf_sizes. Each index in the array would correspond to index
+ 1 in the iov[] payloads passed
* on the fuse side, each of the buffer pools has its own ring. I think
this makes managing the different buffers a lot easier and gets rid of
having to do any array sorting, and makes buffer selection/recycling
O(1).
From the uapi perspective, right now this patchset adds this to the
struct fuse_uring_cmd_req [1]:
union {
struct {
/* size of the bufring's backing buffers */
uint32_t buf_size;
/* number of entries in the queue */
uint16_t queue_depth;
uint16_t padding;
} init;
};
To accommodate multiple io size payloads, the buf_size should probably
be the last field in the struct so it can be extended. How many
different payload sizes do you envision needing?
Does this align with what you had in mind?
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20260402162840.2989717-11-joannelkoong@gmail.com/
>
> Thanks,
> Bernd
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices
2026-04-02 16:28 ` [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices Joanne Koong
@ 2026-04-15 9:36 ` Bernd Schubert
2026-04-30 8:20 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-04-15 9:36 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Replace magic indices 0 and 1 for the iovec array with named constants
> FUSE_URING_IOV_HEADERS and FUSE_URING_IOV_PAYLOAD. This makes the usages
> self-documenting and prepares for buffer ring support which will also
> reference these iovec slots by index.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 045394a7ae41..a85acd9c2b71 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -18,7 +18,8 @@ MODULE_PARM_DESC(enable_uring,
> "Enable userspace communication through io-uring");
>
> #define FUSE_URING_IOV_SEGS 2 /* header and payload */
> -
> +#define FUSE_URING_IOV_HEADERS 0
> +#define FUSE_URING_IOV_PAYLOAD 1
>
> bool fuse_uring_enabled(void)
> {
> @@ -1063,8 +1064,8 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> }
>
> /*
> - * sqe->addr is a ptr to an iovec array, iov[0] has the headers, iov[1]
> - * the payload
> + * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> + * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> */
> static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> struct iovec iov[FUSE_URING_IOV_SEGS])
> @@ -1094,8 +1095,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> {
> struct fuse_ring *ring = queue->ring;
> struct fuse_ring_ent *ent;
> - size_t payload_size;
> struct iovec iov[FUSE_URING_IOV_SEGS];
> + struct iovec *headers, *payload;
> int err;
>
> err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> @@ -1106,15 +1107,16 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> }
>
> err = -EINVAL;
> - if (iov[0].iov_len < sizeof(struct fuse_uring_req_header)) {
> - pr_info_ratelimited("Invalid header len %zu\n", iov[0].iov_len);
> + headers = &iov[FUSE_URING_IOV_HEADERS];
> + if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> + pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> return ERR_PTR(err);
> }
>
> - payload_size = iov[1].iov_len;
> - if (payload_size < ring->max_payload_sz) {
> + payload = &iov[FUSE_URING_IOV_PAYLOAD];
> + if (payload->iov_len < ring->max_payload_sz) {
> pr_info_ratelimited("Invalid req payload len %zu\n",
> - payload_size);
> + payload->iov_len);
> return ERR_PTR(err);
> }
>
> @@ -1126,8 +1128,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> INIT_LIST_HEAD(&ent->list);
>
> ent->queue = queue;
> - ent->headers = iov[0].iov_base;
> - ent->payload = iov[1].iov_base;
> + ent->headers = headers->iov_base;
> + ent->payload = payload->iov_base;
>
> atomic_inc(&ring->queue_refs);
> return ent;
LGTM, thanks, I should have done that from the beginning.
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c
2026-04-02 16:28 ` [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c Joanne Koong
@ 2026-04-15 9:40 ` Bernd Schubert
2026-04-30 8:21 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-04-15 9:40 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Move fuse_uring_abort() out of the inline header definition and into
> dev_uring.c. This function calls several internal helpers (abort
> requests, stop queues) that are all defined in dev_uring.c so inlining
> it in the header unnecessarily exposes implementation details.
>
> This will make the subsequent commit that adds pinning capabilties for
> fuse buffers cleaner.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 17 +++++++++++++++--
> fs/fuse/dev_uring_i.h | 16 +---------------
> 2 files changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index a85acd9c2b71..cce8994241b7 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -129,7 +129,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
> fuse_dev_end_requests(&req_list);
> }
>
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> +static void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> {
> int qid;
> struct fuse_ring_queue *queue;
> @@ -477,7 +477,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
> /*
> * Stop the ring queues
> */
> -void fuse_uring_stop_queues(struct fuse_ring *ring)
> +static void fuse_uring_stop_queues(struct fuse_ring *ring)
> {
> int qid;
>
> @@ -501,6 +501,19 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
> }
> }
>
> +void fuse_uring_abort(struct fuse_conn *fc)
> +{
> + struct fuse_ring *ring = fc->ring;
> +
> + if (ring == NULL)
> + return;
> +
> + if (atomic_read(&ring->queue_refs) > 0) {
> + fuse_uring_abort_end_requests(ring);
> + fuse_uring_stop_queues(ring);
> + }
> +}
> +
> /*
> * Handle IO_URING_F_CANCEL, typically should come on daemon termination.
> *
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 51a563922ce1..349418db3374 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -137,27 +137,13 @@ struct fuse_ring {
>
> bool fuse_uring_enabled(void);
> void fuse_uring_destruct(struct fuse_conn *fc);
> -void fuse_uring_stop_queues(struct fuse_ring *ring);
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring);
> +void fuse_uring_abort(struct fuse_conn *fc);
> int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
> bool fuse_uring_queue_bq_req(struct fuse_req *req);
> bool fuse_uring_remove_pending_req(struct fuse_req *req);
> bool fuse_uring_request_expired(struct fuse_conn *fc);
>
> -static inline void fuse_uring_abort(struct fuse_conn *fc)
> -{
> - struct fuse_ring *ring = fc->ring;
> -
> - if (ring == NULL)
> - return;
> -
> - if (atomic_read(&ring->queue_refs) > 0) {
> - fuse_uring_abort_end_requests(ring);
> - fuse_uring_stop_queues(ring);
> - }
> -}
> -
> static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
> {
> struct fuse_ring *ring = fc->ring;
I had it put in there because of
#else /* CONFIG_FUSE_IO_URING */
but sure, we can also move the real function into the c file.
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic
2026-04-02 16:28 ` [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic Joanne Koong
@ 2026-04-15 9:45 ` Bernd Schubert
2026-04-30 8:24 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-04-15 9:45 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Move fuse_uring_get_iovec_from_sqe() to earlier in the file and
> move the allocation logic in fuse_uring_create_ring_ent() to the
> beginning of the function.
>
> There is no change in logic, this is done to make the subsequent commit
> that adds buffer rings easier to review.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 78 ++++++++++++++++++++++++---------------------
> 1 file changed, 41 insertions(+), 37 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index cce8994241b7..a061f175b3fd 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -277,6 +277,32 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> return res;
> }
>
> +/*
> + * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> + * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> + */
> +static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> + struct iovec iov[FUSE_URING_IOV_SEGS])
> +{
> + struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
> + struct iov_iter iter;
> + ssize_t ret;
> +
> + if (sqe->len != FUSE_URING_IOV_SEGS)
> + return -EINVAL;
> +
> + /*
> + * Direction for buffer access will actually be READ and WRITE,
> + * using write for the import should include READ access as well.
> + */
> + ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
> + FUSE_URING_IOV_SEGS, &iov, &iter);
> + if (ret < 0)
> + return ret;
> +
> + return 0;
> +}
> +
> static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> int qid)
> {
> @@ -1076,32 +1102,6 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> }
> }
>
> -/*
> - * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> - * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> - */
> -static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> - struct iovec iov[FUSE_URING_IOV_SEGS])
> -{
> - struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
> - struct iov_iter iter;
> - ssize_t ret;
> -
> - if (sqe->len != FUSE_URING_IOV_SEGS)
> - return -EINVAL;
> -
> - /*
> - * Direction for buffer access will actually be READ and WRITE,
> - * using write for the import should include READ access as well.
> - */
> - ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
> - FUSE_URING_IOV_SEGS, &iov, &iter);
> - if (ret < 0)
> - return ret;
> -
> - return 0;
> -}
> -
> static struct fuse_ring_ent *
> fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_queue *queue)
> @@ -1112,40 +1112,44 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> struct iovec *headers, *payload;
> int err;
>
> + ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
> + if (!ent)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&ent->list);
> +
> + ent->queue = queue;
> +
> err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> if (err) {
> pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> err);
> - return ERR_PTR(err);
> + goto error;
> }
>
> err = -EINVAL;
> headers = &iov[FUSE_URING_IOV_HEADERS];
> if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> - return ERR_PTR(err);
> + goto error;
> }
>
> payload = &iov[FUSE_URING_IOV_PAYLOAD];
> if (payload->iov_len < ring->max_payload_sz) {
> pr_info_ratelimited("Invalid req payload len %zu\n",
> payload->iov_len);
> - return ERR_PTR(err);
> + goto error;
> }
>
> - err = -ENOMEM;
> - ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
> - if (!ent)
> - return ERR_PTR(err);
> -
> - INIT_LIST_HEAD(&ent->list);
> -
> - ent->queue = queue;
> ent->headers = headers->iov_base;
> ent->payload = payload->iov_base;
>
> atomic_inc(&ring->queue_refs);
> return ent;
> +
> +error:
> + kfree(ent);
> + return ERR_PTR(err);
> }
>
> /*
Hmm, goal was to first do the checks and then to allocate, but if the
change is useful for future commits, fine with me.
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
@ 2026-04-15 9:48 ` Bernd Schubert
2026-04-15 21:40 ` Joanne Koong
2026-04-30 11:08 ` Jeff Layton
2026-05-05 22:47 ` Bernd Schubert
2 siblings, 1 reply; 49+ messages in thread
From: Bernd Schubert @ 2026-04-15 9:48 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Add fuse buffer rings for servers communicating through the io-uring
> interface. To use this, the server must set the FUSE_URING_BUFRING
> flag and provide header and payload buffers via an iovec array in the
> sqe during registration. The payload buffers are used to back the buffer
> ring. The kernel manages buffer selection and recycling through a simple
> internal ring.
>
> This has the following advantages over the non-bufring (iovec) path:
> - Reduced memory usage: in the iovec path, each entry has its own
> dedicated payload buffer, requiring N buffers for N entries where each
> buffer must be large enough to accommodate the maximum possible
> payload size. With buffer rings, payload buffers are pooled and
> selected on demand. Entries only hold a buffer while actively
> processing a request with payload data. When incremental buffer
> consumption is added, this will allow non-overlapping regions of a
> single buffer to be used simultaneously across multiple requests,
> further reducing memory requirements.
> - Foundation for pinned buffers: the buffer ring headers and payloads
> are now each passed in as a contiguous memory allocation, which allows
> fuse to easily pin and vmap the entire region in one operation during
> queue setup. This will eliminate the per-request overhead of having to
> pin/unpin user pages and translate virtual addresses and is a
> prerequisite for future optimizations like performing data copies
> outside of the server's task context.
>
> Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
> header slot in the headers buffer. Payload buffers are selected from
> the ring on demand and recycled after each request. Buffer ring usage is
> set on a per-queue basis. All subsequent registration SQEs for the same
> queue must use consistent flags.
>
> The headers are laid out contiguously and provided via iov[0]. Each slot
> maps to ent->id:
>
> |<- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
> +------------------------------+------------------------------+-----+
> | struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
> | [ent id=0] | [ent id=1] | |
> +------------------------------+------------------------------+-----+
>
> On the server side, the ent id is used to determine where in the headers
> buffer the headers data for the ent resides. This is done by
> calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
> into the headers buffer.
>
> The buffer ring is backed by the payload buffer, which is contiguous but
> partitioned into individual bufs according to the buf_size passed in at
> registration.
>
> PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
> |<-------------- payload_size ------------>|
> +--------- --+-----------+-----------+-----+
> | buf [0] | buf [1] | buf [2] | ... |
> | buf_size | buf_size | buf_size | ... |
> +--------- --+-----------+-----------+-----+
>
> buffer ring state (struct fuse_bufring, kernel-internal):
> bufs[]: [ used | used | FREE | FREE | FREE ]
> ^^^^^^^^^^^^^^^^^^^
> available for selection
>
> The buffer ring logic is as follows:
> select: buf = bufs[head % nbufs]; head++
> recycle: bufs[tail % nbufs] = buf; tail++
> empty: tail == head (no buffers available)
> full: tail - head >= nbufs
>
> Buffer ring request flow
> ------------------------
> | Kernel | FUSE daemon
> | |
> | [client request arrives] |
> | >fuse_uring_send() |
> | [select payload buf from ring] |
> | >fuse_uring_select_buffer() |
> | [copy headers to ent's header slot] |
> | >copy_header_to_ring() |
> | [copy payload to selected buf] |
> | >fuse_uring_copy_to_ring() |
> | [set buf_id in ent_in_out header] |
> | >io_uring_cmd_done() |
> | | [CQE received]
> | | [read headers from header
> | | slot]
> | | [read payload from buf_id]
> | | [process request]
> | | [write reply to header
> | | slot]
> | | [write reply payload to
> | | buf]
> | | >io_uring_submit()
> | | COMMIT_AND_FETCH
> | >fuse_uring_commit_fetch() |
> | >fuse_uring_commit() |
> | [copy reply from ring] |
> | >fuse_uring_recycle_buffer() |
> | >fuse_uring_get_next_fuse_req() |
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 45 ++++-
> include/uapi/linux/fuse.h | 27 ++-
> 3 files changed, 381 insertions(+), 54 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index a061f175b3fd..9f14a2bcde3f 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -41,6 +41,11 @@ enum fuse_uring_header_type {
> FUSE_URING_HEADER_RING_ENT,
> };
>
> +static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring != NULL;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -222,6 +227,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> }
>
> kfree(queue->fpq.processing);
> + kfree(queue->bufring);
> kfree(queue);
> ring->queues[qid] = NULL;
> }
> @@ -303,20 +309,102 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> return 0;
> }
>
> -static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> - int qid)
> +static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> + struct fuse_ring_queue *queue)
> +{
> + const struct fuse_uring_cmd_req *cmd_req =
> + io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
> + u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
> + unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> + struct iovec iov[FUSE_URING_IOV_SEGS];
> + void __user *payload, *headers;
> + size_t headers_size, payload_size, ring_size;
> + struct fuse_bufring *br;
> + unsigned int nr_bufs, i;
> + uintptr_t payload_addr;
> + int err;
> +
> + if (!queue_depth || !buf_size)
> + return -EINVAL;
> +
> + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> + if (err)
> + return err;
> +
> + headers = iov[FUSE_URING_IOV_HEADERS].iov_base;
> + headers_size = iov[FUSE_URING_IOV_HEADERS].iov_len;
> + payload = iov[FUSE_URING_IOV_PAYLOAD].iov_base;
> + payload_size = iov[FUSE_URING_IOV_PAYLOAD].iov_len;
> +
> + /* check if there's enough space for all the headers */
> + if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> + return -EINVAL;
> +
> + if (buf_size < queue->ring->max_payload_sz)
> + return -EINVAL;
> +
> + nr_bufs = payload_size / buf_size;
> + if (!nr_bufs || nr_bufs > U16_MAX)
> + return -EINVAL;
> +
> + /* create the ring buffer */
> + ring_size = struct_size(br, bufs, nr_bufs);
> + br = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
> + if (!br)
> + return -ENOMEM;
> +
> + br->queue_depth = queue_depth;
> + br->headers = headers;
> +
> + payload_addr = (uintptr_t)payload;
> +
> + /* populate the ring buffer */
> + for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
> + struct fuse_bufring_buf *buf = &br->bufs[i];
> +
> + buf->addr = payload_addr;
> + buf->len = buf_size;
> + buf->id = i;
> + }
> +
> + br->nbufs = nr_bufs;
> + br->tail = nr_bufs;
> +
> + queue->bufring = br;
> +
> + return 0;
> +}
> +
> +/*
> + * if the queue is already registered, check that the queue was initialized with
> + * the same init flags set for this FUSE_IO_URING_CMD_REGISTER cmd. all
> + * FUSE_IO_URING_CMD_REGISTER cmds should have the same init fields set on a
> + * per-queue basis.
> + */
> +static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> + u64 init_flags)
> {
> + bool bufring = init_flags & FUSE_URING_BUFRING;
> +
> + return bufring_enabled(queue) == bufring;
> +}
> +
> +static struct fuse_ring_queue *
> +fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> + int qid, u64 init_flags)
> +{
> + bool use_bufring = init_flags & FUSE_URING_BUFRING;
> struct fuse_conn *fc = ring->fc;
> struct fuse_ring_queue *queue;
> struct list_head *pq;
>
> queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
> if (!queue)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> pq = kzalloc_objs(struct list_head, FUSE_PQ_HASH_SIZE);
> if (!pq) {
> kfree(queue);
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> }
>
> queue->qid = qid;
> @@ -334,12 +422,29 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> queue->fpq.processing = pq;
> fuse_pqueue_init(&queue->fpq);
>
> + if (use_bufring) {
> + int err = fuse_uring_bufring_setup(cmd, queue);
> +
> + if (err) {
> + kfree(pq);
> + kfree(queue);
> + return ERR_PTR(err);
> + }
> + }
> +
> spin_lock(&fc->lock);
> + /* check if the queue creation raced with another thread */
> if (ring->queues[qid]) {
> spin_unlock(&fc->lock);
> kfree(queue->fpq.processing);
> + if (use_bufring)
> + kfree(queue->bufring);
> kfree(queue);
> - return ring->queues[qid];
> +
> + queue = ring->queues[qid];
> + if (!queue_init_flags_consistent(queue, init_flags))
> + return ERR_PTR(-EINVAL);
> + return queue;
> }
>
> /*
> @@ -649,7 +754,14 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_to_user(ring, header, header_size)) {
> pr_info_ratelimited("Copying header to ring failed.\n");
> @@ -669,7 +781,14 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_from_user(header, ring, header_size)) {
> pr_info_ratelimited("Copying header from ring failed.\n");
> @@ -684,12 +803,20 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> struct fuse_ring_ent *ent, int dir,
> struct iov_iter *iter)
> {
> + void __user *payload;
> int err;
>
> - err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
> - if (err) {
> - pr_info_ratelimited("fuse: Import of user buffer failed\n");
> - return err;
> + if (bufring_enabled(ent->queue))
> + payload = (void __user *)ent->payload_buf.addr;
> + else
> + payload = ent->payload;
> +
> + if (payload) {
> + err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
> + if (err) {
> + pr_info_ratelimited("fuse: Import of user buffer failed\n");
> + return err;
> + }
> }
>
> fuse_copy_init(cs, dir == ITER_DEST, iter);
> @@ -741,6 +868,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> .commit_id = req->in.h.unique,
> };
>
> + if (bufring_enabled(ent->queue))
> + ent_in_out.buf_id = ent->payload_buf.id;
> +
> err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> if (err)
> return err;
> @@ -805,6 +935,96 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> sizeof(req->in.h));
> }
>
> +static bool fuse_uring_req_has_payload(struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + return args->in_numargs > 1 || args->out_numargs;
> +}
> +
> +static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
> + __must_hold(&ent->queue->lock)
> +{
> + struct fuse_ring_queue *queue = ent->queue;
> + struct fuse_bufring *br = queue->bufring;
> + struct fuse_bufring_buf *buf;
> + unsigned int tail = br->tail, head = br->head;
> +
> + lockdep_assert_held(&queue->lock);
> +
> + /* Get a buffer to use for the payload */
> + if (tail == head)
> + return -ENOBUFS;
> +
> + buf = &br->bufs[head % br->nbufs];
> + br->head++;
> +
> + ent->payload_buf = *buf;
> +
> + return 0;
> +}
> +
> +static void fuse_uring_recycle_buffer(struct fuse_ring_ent *ent)
> + __must_hold(&ent->queue->lock)
> +{
> + struct fuse_bufring_buf *ent_payload = &ent->payload_buf;
> + struct fuse_ring_queue *queue = ent->queue;
> + struct fuse_bufring_buf *buf;
> + struct fuse_bufring *br;
> +
> + lockdep_assert_held(&queue->lock);
> +
> + if (!bufring_enabled(queue) || !ent_payload->addr)
> + return;
> +
> + br = queue->bufring;
> +
> + /* ring should never be full */
> + WARN_ON_ONCE(br->tail - br->head >= br->nbufs);
> +
> + buf = &br->bufs[(br->tail) % br->nbufs];
> +
> + *buf = *ent_payload;
> +
> + br->tail++;
> +
> + memset(ent_payload, 0, sizeof(*ent_payload));
> +}
> +
> +static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
> + struct fuse_req *req)
> +{
> + bool buffer_selected;
> + bool has_payload;
> +
> + if (!bufring_enabled(ent->queue))
> + return 0;
> +
> + buffer_selected = !!ent->payload_buf.addr;
> + has_payload = fuse_uring_req_has_payload(req);
> +
> + if (has_payload && !buffer_selected)
> + return fuse_uring_select_buffer(ent);
> +
> + if (!has_payload && buffer_selected)
> + fuse_uring_recycle_buffer(ent);
> +
> + return 0;
> +}
> +
> +static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
> + struct fuse_req *req)
> +{
> + if (!bufring_enabled(ent->queue))
> + return 0;
> +
> + /* no payload to copy, can skip selecting a buffer */
> + if (!fuse_uring_req_has_payload(req))
> + return 0;
> +
> + return fuse_uring_select_buffer(ent);
> +}
> +
> static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
> struct fuse_req *req)
> {
> @@ -878,10 +1098,21 @@ static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
>
> /* get and assign the next entry while it is still holding the lock */
> req = list_first_entry_or_null(req_queue, struct fuse_req, list);
> - if (req)
> - fuse_uring_add_req_to_ring_ent(ent, req);
> + if (req) {
> + int err = fuse_uring_next_req_update_buffer(ent, req);
>
> - return req;
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + return req;
> + }
> + }
> +
> + /*
> + * Buffer selection may fail if all the buffers are currently saturated.
> + * The request will be serviced when a buffer is freed up.
> + */
> + fuse_uring_recycle_buffer(ent);
> + return NULL;
> }
>
> /*
> @@ -1041,6 +1272,12 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> * fuse requests would otherwise not get processed - committing
> * and fetching is done in one step vs legacy fuse, which has separated
> * read (fetch request) and write (commit result).
> + *
> + * If the server is using bufrings and has populated the ring with less
> + * payload buffers than ents, it is possible that there may not be an
> + * available buffer for the next request. If so, then the fetch is a
> + * no-op and the next request will be serviced when a buffer becomes
> + * available.
> */
> if (fuse_uring_get_next_fuse_req(ent, queue))
> fuse_uring_send(ent, cmd, 0, issue_flags);
> @@ -1120,30 +1357,38 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
>
> ent->queue = queue;
>
> - err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> - if (err) {
> - pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> - err);
> - goto error;
> - }
> + if (bufring_enabled(queue)) {
> + ent->id = READ_ONCE(cmd->sqe->buf_index);
> + if (ent->id >= queue->bufring->queue_depth) {
> + err = -EINVAL;
> + goto error;
> + }
> + } else {
> + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> + if (err) {
> + pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> + err);
> + goto error;
> + }
>
> - err = -EINVAL;
> - headers = &iov[FUSE_URING_IOV_HEADERS];
> - if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> - pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> - goto error;
> - }
> + err = -EINVAL;
> + headers = &iov[FUSE_URING_IOV_HEADERS];
> + if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> + pr_info_ratelimited("Invalid header len %zu\n",
> + headers->iov_len);
> + goto error;
> + }
>
> - payload = &iov[FUSE_URING_IOV_PAYLOAD];
> - if (payload->iov_len < ring->max_payload_sz) {
> - pr_info_ratelimited("Invalid req payload len %zu\n",
> - payload->iov_len);
> - goto error;
> + payload = &iov[FUSE_URING_IOV_PAYLOAD];
> + if (payload->iov_len < ring->max_payload_sz) {
> + pr_info_ratelimited("Invalid req payload len %zu\n",
> + payload->iov_len);
> + goto error;
> + }
> + ent->headers = headers->iov_base;
> + ent->payload = payload->iov_base;
> }
>
> - ent->headers = headers->iov_base;
> - ent->payload = payload->iov_base;
> -
> atomic_inc(&ring->queue_refs);
> return ent;
>
> @@ -1152,6 +1397,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> return ERR_PTR(err);
> }
>
> +static bool init_flags_valid(u64 init_flags)
> +{
> + u64 valid_flags = FUSE_URING_BUFRING;
> +
> + return !(init_flags & ~valid_flags);
> +}
> +
> /*
> * Register header and payload buffer with the kernel and puts the
> * entry as "ready to get fuse requests" on the queue
> @@ -1161,6 +1413,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> {
> const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe128_cmd(cmd->sqe,
> struct fuse_uring_cmd_req);
> + u64 init_flags = READ_ONCE(cmd_req->flags);
> struct fuse_ring *ring = smp_load_acquire(&fc->ring);
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent;
> @@ -1179,11 +1432,16 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> return -EINVAL;
> }
>
> + if (!init_flags_valid(init_flags))
> + return -EINVAL;
> +
> queue = ring->queues[qid];
> if (!queue) {
> - queue = fuse_uring_create_queue(ring, qid);
> - if (!queue)
> - return err;
> + queue = fuse_uring_create_queue(cmd, ring, qid, init_flags);
> + if (IS_ERR(queue))
> + return PTR_ERR(queue);
> + } else if (!queue_init_flags_consistent(queue, init_flags)) {
> + return -EINVAL;
> }
>
> /*
> @@ -1349,14 +1607,18 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> req->ring_queue = queue;
> ent = list_first_entry_or_null(&queue->ent_avail_queue,
> struct fuse_ring_ent, list);
> - if (ent)
> - fuse_uring_add_req_to_ring_ent(ent, req);
> - else
> - list_add_tail(&req->list, &queue->fuse_req_queue);
> - spin_unlock(&queue->lock);
> + if (ent) {
> + err = fuse_uring_prep_buffer(ent, req);
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + spin_unlock(&queue->lock);
> + fuse_uring_dispatch_ent(ent);
> + return;
> + }
> + }
>
> - if (ent)
> - fuse_uring_dispatch_ent(ent);
> + list_add_tail(&req->list, &queue->fuse_req_queue);
> + spin_unlock(&queue->lock);
>
> return;
>
> @@ -1406,14 +1668,17 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> req = list_first_entry_or_null(&queue->fuse_req_queue, struct fuse_req,
> list);
> if (ent && req) {
> - fuse_uring_add_req_to_ring_ent(ent, req);
> - spin_unlock(&queue->lock);
> + int err = fuse_uring_prep_buffer(ent, req);
>
> - fuse_uring_dispatch_ent(ent);
> - } else {
> - spin_unlock(&queue->lock);
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + spin_unlock(&queue->lock);
> + fuse_uring_dispatch_ent(ent);
> + return true;
> + }
> }
>
> + spin_unlock(&queue->lock);
> return true;
> }
>
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 349418db3374..66d5d5f8dc3f 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -36,11 +36,47 @@ enum fuse_ring_req_state {
> FRRS_RELEASED,
> };
>
> +struct fuse_bufring_buf {
> + uintptr_t addr;
> + unsigned int len;
> + unsigned int id;
> +};
> +
> +struct fuse_bufring {
> + /* pointer to the headers buffer */
> + void __user *headers;
> +
> + unsigned int queue_depth;
Could we call this 'max_queue_depth'? I still think that it might be
useful to register ring entries dynamically when needed at some point.
And then this would become a 'max' value and not the actual value.
> +
> + /* metadata tracking state of the bufring */
> + unsigned int nbufs;
> + unsigned int head;
> + unsigned int tail;
> +
> + /* the buffers backing the ring */
> + __DECLARE_FLEX_ARRAY(struct fuse_bufring_buf, bufs);
> +};
> +
> /** A fuse ring entry, part of the ring queue */
> struct fuse_ring_ent {
> - /* userspace buffer */
> - struct fuse_uring_req_header __user *headers;
> - void __user *payload;
> + union {
> + /* if bufrings are not used */
> + struct {
> + /* userspace buffers */
> + struct fuse_uring_req_header __user *headers;
> + void __user *payload;
> + };
> + /* if bufrings are used */
> + struct {
> + /*
> + * unique fixed id for the ent. used by kernel/server to
> + * locate where in the headers buffer the data for this
> + * ent resides
> + */
> + unsigned int id;
> + struct fuse_bufring_buf payload_buf;
> + };
> + };
>
> /* the ring queue that owns the request */
> struct fuse_ring_queue *queue;
> @@ -99,6 +135,9 @@ struct fuse_ring_queue {
> unsigned int active_background;
>
> bool stopped;
> +
> + /* only allocated if the server uses bufrings */
> + struct fuse_bufring *bufring;
> };
>
> /**
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index c13e1f9a2f12..8753de7eb189 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -240,6 +240,10 @@
> * - add FUSE_COPY_FILE_RANGE_64
> * - add struct fuse_copy_file_range_out
> * - add FUSE_NOTIFY_PRUNE
> + *
> + * 7.46
> + * - add FUSE_URING_BUFRING flag
> + * - add fuse_uring_cmd_req init struct
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -1263,7 +1267,13 @@ struct fuse_uring_ent_in_out {
>
> /* size of user payload buffer */
> uint32_t payload_sz;
> - uint32_t padding;
> +
> + /*
> + * if using bufrings, this is the id of the selected buffer.
> + * the selected buffer holds the request payload
> + */
> + uint16_t buf_id;
> + uint16_t padding;
>
> uint64_t reserved;
> };
> @@ -1294,6 +1304,9 @@ enum fuse_uring_cmd {
> FUSE_IO_URING_CMD_COMMIT_AND_FETCH = 2,
> };
>
> +/* fuse_uring_cmd_req flags */
> +#define FUSE_URING_BUFRING (1 << 0)
> +
> /**
> * In the 80B command area of the SQE.
> */
> @@ -1305,7 +1318,17 @@ struct fuse_uring_cmd_req {
>
> /* queue the command is for (queue index) */
> uint16_t qid;
> - uint8_t padding[6];
> + uint16_t padding;
> +
> + union {
> + struct {
> + /* size of the bufring's backing buffers */
> + uint32_t buf_size;
> + /* number of entries in the queue */
> + uint16_t queue_depth;
If you agree to change, also needs to be changed here
> + uint16_t padding;
> + } init;
> + };
> };
>
> #endif /* _LINUX_FUSE_H */
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-15 1:10 ` Joanne Koong
@ 2026-04-15 10:55 ` Bernd Schubert
2026-04-15 22:40 ` Joanne Koong
0 siblings, 1 reply; 49+ messages in thread
From: Bernd Schubert @ 2026-04-15 10:55 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, axboe, linux-fsdevel
On 4/15/26 03:10, Joanne Koong wrote:
> On Tue, Apr 14, 2026 at 2:05 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 4/2/26 18:28, Joanne Koong wrote:
>>> Add documentation for fuse over io-uring usage of buffer rings and
>>> zero-copy.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>> .../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
>>> 1 file changed, 189 insertions(+)
>>>
>>> diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
>>> index d73dd0dbd238..bc47686c023f 100644
>>> --- a/Documentation/filesystems/fuse/fuse-io-uring.rst
>>> +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
>>> @@ -95,5 +95,194 @@ Sending requests with CQEs
>>> | <fuse_unlink() |
>>> | <sys_unlink() |
>>>
>>> +Buffer rings
>>> +============
>>>
>>> +Buffer rings have two main advantages:
>>>
>>> +* Reduced memory usage: payload buffers are pooled and selected on demand
>>> + rather than dedicated per-entry, allowing fewer buffers than entries. This
Then don't register that many entries? An entry is useless if it cannot
carry data - why do you need to register that many entries then?
>>> + infrastructure also allows for future optimizations like incremental buffer
>>> + consumption where non-overlapping parts of a buffer may be used across
>>> + concurrent requests.
>>> +* Foundation for pinned buffers: contiguous buffer allocations allow the
>>> + kernel to pin and vmap the entire region, avoiding per-request page
>>> + resolution overhead
Pinning can be done per buffer as well. The part that is harder is
pinning of the headers - this is why libfuse currently allocates 4K for
every header - prepares the pinning. From my point of view, we _should_
make use of that and set at registration time that the header is
allocated as 4K and then small requests can be inlined into the
remaining part of those 4K. With that ring bufs become useful, because
most metadata do not need a new payload buffer anymore.
However, I think in your current design headers are mapped into a large
region and there is no way to use extra space. I think that is fine, as
long as we have the capability to have multi size buf pools.
Contiguous buffer allocation can be done for entries as well - userspace
just needs to assign it to buffers that way. It becomes it a bit harder
with dynamic entry registration - entries buffers should then allocat in
sizes of system huge pages.
In fact I initially had that in libfuse and had allocated all userspace
buffers as one big memory. Then 'temporarily' removed it because I had
development stability issues - the single buffer needs to be marked with
ASAN areas in order to catch issues. For initial development that was
just overkill, but could be added now, in combination with ASAN buf marking.
For pools it would be good think about ASAN as well.
>>> +
>>> +At a high-level, this is how fuse uses buffer rings:
>>> +
>>> +* The first REGISTER SQE for a queue creates the queue and sets up the
>>> + buffer ring. The server provides two iovecs: one for headers and one for
>>> + payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
>>> + to a specific header slot.
>>
>> Hi Joanne,
>>
>> thanks a lot for this document! Could we discuss if we could just hook
>> in here and allow SQEs with different iovecs for the payload buffer?
>> Let's say fuse-server wants multiple IO sizes - it could easily do that
>> via different pBufs and just needs to specify the dedicated IO size per
>> pBuf. Those buffers could then get sorted into an array - we could
>> define either via FUSE init the number of buf sizes or use a fixed size
>> array. Fuse requests then would just need to pick the right array.
>> This is basically what I'm currently working on for ublk.
>>
>> I think it would be good to agree on the design before it gets merged so
>> that uapi doesn't change.
>
> Hi Bernd,
>
> I'm not certain I fully see the use case for a fuse server preferring
> a static preallocation of multiple IO sizes over using incremental
> buffer consumption, but in my mind to support multiple IO size
I have to admit that I don't see why we need pbuf for dynamic
allocation. While the io-uring ring has a fixed number of SQEs/CQEs and
while libfuse currently strongly couples these to fuse buffers, there is
no technical reason. Initially it was, because I had taken the 'tags'
from ublk design, but then Miklos asked to make it lists that just get
appended whenever a FUSE_IO_URING_CMD_REGISTER is send. Which means
libfuse _could_ add new entries any time. You could start with 1 entry
per queue, additionally with the reduce-nr-queue patches you could even
start with a single queue and a single entry - and then extend that at
any time to what libfuse or the application believes is needed.
I.e. except of io-uring setup, adding or even removing ring entries and
their buffers is mainly a missing userspace issue. In order to remove
idle entries, we could add another notification type like
FUSE_NOTIFY_WAKE_RING_ENTRIES and it would then wake a given amount per
queue and maybe send via a new op code like FUSE_NOOP. All of that seems
to be easy.
> payloads, I was thinking something like this might work best:
>
> * iov[0] for the headers stays the same. no matter how many IO size
> payloads there are, the ent always maps to a header and the headers
> are a fixed size
> * iov[1...x] are the payload buffers. From the uapi perspective, in
> the fues_uring_cmd_req init struct, there would need to be an array of
> uint32_t buf_sizes. Each index in the array would correspond to index
> + 1 in the iov[] payloads passed
> * on the fuse side, each of the buffer pools has its own ring. I think
> this makes managing the different buffers a lot easier and gets rid of
> having to do any array sorting, and makes buffer selection/recycling
> O(1).
Let's say we would have per queue
struct fuse_bufring {
bool use_pinned_headers: 1;
bool use_zero_copy: 1;
unsigned int max_queue_depth; /* headers buffer capacity; frozen
at first REGISTER */
union {
void __user *headers;
struct fuse_bufring_pinned pinned_headers;
};
unsigned int nr_pools;
struct fuse_bufring_pool *pools[FUSE_URING_MAX_POOLS];
/* lookup: order (req size) pool */
struct fuse_bufring_pool *order_map[FUSE_URING_NR_ORDERS];
};
Order map is then dynamically created at buf pool registration time, and
then we would eventually get to
struct fuse_bufring_pool *pool = order_map[get_order(fuse_len_args())];
(obviously the final code needs to get a check that we don't exceed max
payload size.)
The looked up pool can be stored into ring_ent for buf recycling.
And then
struct fuse_uring_cmd_req {
...
union {
struct {
__u32 max_queue_depth; /* (renamed from queue_depth) */
__u32 buf_size;
__u8 pool_idx;
__u8 _pad[3];
} init;
...
};
};
I think pool_idx is needed one way or the other, because the io-uring
ring owner might have other pools for its own purposes.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-15 9:48 ` Bernd Schubert
@ 2026-04-15 21:40 ` Joanne Koong
0 siblings, 0 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-15 21:40 UTC (permalink / raw)
To: Bernd Schubert; +Cc: miklos, axboe, linux-fsdevel
On Wed, Apr 15, 2026 at 2:48 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
> On 4/2/26 18:28, Joanne Koong wrote:
> > Add fuse buffer rings for servers communicating through the io-uring
> > interface. To use this, the server must set the FUSE_URING_BUFRING
> > flag and provide header and payload buffers via an iovec array in the
> > sqe during registration. The payload buffers are used to back the buffer
> > ring. The kernel manages buffer selection and recycling through a simple
> > internal ring.
> >
> > This has the following advantages over the non-bufring (iovec) path:
> > - Reduced memory usage: in the iovec path, each entry has its own
> > dedicated payload buffer, requiring N buffers for N entries where each
> > buffer must be large enough to accommodate the maximum possible
> > payload size. With buffer rings, payload buffers are pooled and
> > selected on demand. Entries only hold a buffer while actively
> > processing a request with payload data. When incremental buffer
> > consumption is added, this will allow non-overlapping regions of a
> > single buffer to be used simultaneously across multiple requests,
> > further reducing memory requirements.
> > - Foundation for pinned buffers: the buffer ring headers and payloads
> > are now each passed in as a contiguous memory allocation, which allows
> > fuse to easily pin and vmap the entire region in one operation during
> > queue setup. This will eliminate the per-request overhead of having to
> > pin/unpin user pages and translate virtual addresses and is a
> > prerequisite for future optimizations like performing data copies
> > outside of the server's task context.
> >
> > Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
> > header slot in the headers buffer. Payload buffers are selected from
> > the ring on demand and recycled after each request. Buffer ring usage is
> > set on a per-queue basis. All subsequent registration SQEs for the same
> > queue must use consistent flags.
> >
> > The headers are laid out contiguously and provided via iov[0]. Each slot
> > maps to ent->id:
> >
> > |<- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
> > +------------------------------+------------------------------+-----+
> > | struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
> > | [ent id=0] | [ent id=1] | |
> > +------------------------------+------------------------------+-----+
> >
> > On the server side, the ent id is used to determine where in the headers
> > buffer the headers data for the ent resides. This is done by
> > calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
> > into the headers buffer.
> >
> > The buffer ring is backed by the payload buffer, which is contiguous but
> > partitioned into individual bufs according to the buf_size passed in at
> > registration.
> >
> > PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
> > |<-------------- payload_size ------------>|
> > +--------- --+-----------+-----------+-----+
> > | buf [0] | buf [1] | buf [2] | ... |
> > | buf_size | buf_size | buf_size | ... |
> > +--------- --+-----------+-----------+-----+
> >
> > buffer ring state (struct fuse_bufring, kernel-internal):
> > bufs[]: [ used | used | FREE | FREE | FREE ]
> > ^^^^^^^^^^^^^^^^^^^
> > available for selection
> >
> > The buffer ring logic is as follows:
> > select: buf = bufs[head % nbufs]; head++
> > recycle: bufs[tail % nbufs] = buf; tail++
> > empty: tail == head (no buffers available)
> > full: tail - head >= nbufs
> >
> > Buffer ring request flow
> > ------------------------
> > | Kernel | FUSE daemon
> > | |
> > | [client request arrives] |
> > | >fuse_uring_send() |
> > | [select payload buf from ring] |
> > | >fuse_uring_select_buffer() |
> > | [copy headers to ent's header slot] |
> > | >copy_header_to_ring() |
> > | [copy payload to selected buf] |
> > | >fuse_uring_copy_to_ring() |
> > | [set buf_id in ent_in_out header] |
> > | >io_uring_cmd_done() |
> > | | [CQE received]
> > | | [read headers from header
> > | | slot]
> > | | [read payload from buf_id]
> > | | [process request]
> > | | [write reply to header
> > | | slot]
> > | | [write reply payload to
> > | | buf]
> > | | >io_uring_submit()
> > | | COMMIT_AND_FETCH
> > | >fuse_uring_commit_fetch() |
> > | >fuse_uring_commit() |
> > | [copy reply from ring] |
> > | >fuse_uring_recycle_buffer() |
> > | >fuse_uring_get_next_fuse_req() |
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
> > fs/fuse/dev_uring_i.h | 45 ++++-
> > include/uapi/linux/fuse.h | 27 ++-
> > 3 files changed, 381 insertions(+), 54 deletions(-)
> >
> > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > index a061f175b3fd..9f14a2bcde3f 100644
> > diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> > index 349418db3374..66d5d5f8dc3f 100644
> > --- a/fs/fuse/dev_uring_i.h
> > +++ b/fs/fuse/dev_uring_i.h
> > @@ -36,11 +36,47 @@ enum fuse_ring_req_state {
> > FRRS_RELEASED,
> > };
> >
> > +struct fuse_bufring_buf {
> > + uintptr_t addr;
> > + unsigned int len;
> > + unsigned int id;
> > +};
> > +
> > +struct fuse_bufring {
> > + /* pointer to the headers buffer */
> > + void __user *headers;
> > +
> > + unsigned int queue_depth;
>
> Could we call this 'max_queue_depth'? I still think that it might be
> useful to register ring entries dynamically when needed at some point.
> And then this would become a 'max' value and not the actual value.
Sounds good, I will rename this (and the one below) to
max_queue_depth. Thanks for taking a look at the patches.
Thanks,
Joanne
>
> > +
> > + /* metadata tracking state of the bufring */
> > + unsigned int nbufs;
> > + unsigned int head;
> > + unsigned int tail;
> > +
> > + /* the buffers backing the ring */
> > + __DECLARE_FLEX_ARRAY(struct fuse_bufring_buf, bufs);
> > +};
> > +
> > /** A fuse ring entry, part of the ring queue */
> > struct fuse_ring_ent {
> > - /* userspace buffer */
> > - struct fuse_uring_req_header __user *headers;
> > - void __user *payload;
> > + union {
> > + /* if bufrings are not used */
> > + struct {
> > + /* userspace buffers */
> > + struct fuse_uring_req_header __user *headers;
> > + void __user *payload;
> > + };
> > + /* if bufrings are used */
> > + struct {
> > + /*
> > + * unique fixed id for the ent. used by kernel/server to
> > + * locate where in the headers buffer the data for this
> > + * ent resides
> > + */
> > + unsigned int id;
> > + struct fuse_bufring_buf payload_buf;
> > + };
> > + };
> >
> > /* the ring queue that owns the request */
> > struct fuse_ring_queue *queue;
> > @@ -99,6 +135,9 @@ struct fuse_ring_queue {
> > unsigned int active_background;
> >
> > bool stopped;
> > +
> > + /* only allocated if the server uses bufrings */
> > + struct fuse_bufring *bufring;
> > };
> >
> > /**
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index c13e1f9a2f12..8753de7eb189 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -240,6 +240,10 @@
> > * - add FUSE_COPY_FILE_RANGE_64
> > * - add struct fuse_copy_file_range_out
> > * - add FUSE_NOTIFY_PRUNE
> > + *
> > + * 7.46
> > + * - add FUSE_URING_BUFRING flag
> > + * - add fuse_uring_cmd_req init struct
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -1263,7 +1267,13 @@ struct fuse_uring_ent_in_out {
> >
> > /* size of user payload buffer */
> > uint32_t payload_sz;
> > - uint32_t padding;
> > +
> > + /*
> > + * if using bufrings, this is the id of the selected buffer.
> > + * the selected buffer holds the request payload
> > + */
> > + uint16_t buf_id;
> > + uint16_t padding;
> >
> > uint64_t reserved;
> > };
> > @@ -1294,6 +1304,9 @@ enum fuse_uring_cmd {
> > FUSE_IO_URING_CMD_COMMIT_AND_FETCH = 2,
> > };
> >
> > +/* fuse_uring_cmd_req flags */
> > +#define FUSE_URING_BUFRING (1 << 0)
> > +
> > /**
> > * In the 80B command area of the SQE.
> > */
> > @@ -1305,7 +1318,17 @@ struct fuse_uring_cmd_req {
> >
> > /* queue the command is for (queue index) */
> > uint16_t qid;
> > - uint8_t padding[6];
> > + uint16_t padding;
> > +
> > + union {
> > + struct {
> > + /* size of the bufring's backing buffers */
> > + uint32_t buf_size;
> > + /* number of entries in the queue */
> > + uint16_t queue_depth;
>
> If you agree to change, also needs to be changed here
>
> > + uint16_t padding;
> > + } init;
> > + };
> > };
> >
> > #endif /* _LINUX_FUSE_H */
>
>
> Thanks,
> Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-15 10:55 ` Bernd Schubert
@ 2026-04-15 22:40 ` Joanne Koong
0 siblings, 0 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-15 22:40 UTC (permalink / raw)
To: Bernd Schubert; +Cc: miklos, axboe, linux-fsdevel
On Wed, Apr 15, 2026 at 3:55 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> On 4/15/26 03:10, Joanne Koong wrote:
> > On Tue, Apr 14, 2026 at 2:05 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>
> >> On 4/2/26 18:28, Joanne Koong wrote:
> >>> Add documentation for fuse over io-uring usage of buffer rings and
> >>> zero-copy.
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>> ---
> >>> .../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
> >>> 1 file changed, 189 insertions(+)
> >>>
> >>> diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
> >>> index d73dd0dbd238..bc47686c023f 100644
> >>> --- a/Documentation/filesystems/fuse/fuse-io-uring.rst
> >>> +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
> >>> @@ -95,5 +95,194 @@ Sending requests with CQEs
> >>> | <fuse_unlink() |
> >>> | <sys_unlink() |
> >>>
> >>> +Buffer rings
> >>> +============
> >>>
> >>> +Buffer rings have two main advantages:
> >>>
> >>> +* Reduced memory usage: payload buffers are pooled and selected on demand
> >>> + rather than dedicated per-entry, allowing fewer buffers than entries. This
>
> Then don't register that many entries? An entry is useless if it cannot
> carry data - why do you need to register that many entries then?
Registering more entries gives higher concurrency (since the number of
entries determines the max number of in-flight requests). Since not
all fuse operations require large payload buffers simultaneously,
buffer rings with incremental buffer consumption will allow servers to
support higher concurrency with a lot less memory than dedicating a
buffer per entry especially for metadata-heavy workloads.
>
> >>> + infrastructure also allows for future optimizations like incremental buffer
> >>> + consumption where non-overlapping parts of a buffer may be used across
> >>> + concurrent requests.
> >>> +* Foundation for pinned buffers: contiguous buffer allocations allow the
> >>> + kernel to pin and vmap the entire region, avoiding per-request page
> >>> + resolution overhead
>
> Pinning can be done per buffer as well. The part that is harder is
Yes it could be done per buffer but imo it is much cleaner to do it
contiguously.
> pinning of the headers - this is why libfuse currently allocates 4K for
> every header - prepares the pinning. From my point of view, we _should_
> make use of that and set at registration time that the header is
> allocated as 4K and then small requests can be inlined into the
> remaining part of those 4K. With that ring bufs become useful, because
> most metadata do not need a new payload buffer anymore.
> However, I think in your current design headers are mapped into a large
> region and there is no way to use extra space. I think that is fine, as
> long as we have the capability to have multi size buf pools.
I'm not sure if you are trying to make this point or just describing
the non-ringbuf "legacy" path, but I strongly disagree that pinning
should be done per-header. Each struct fuse_uring_req_header is 288
bytes, and rounding that up to 4K per header is not ideal and I don't
see any benefit from doing that. In the original (eg non-ringbuf) path
where each header is allocated at 4k, that can't be used with ringbufs
as using ringbufs mandates that the headers are allocated
contiguously.
>
>
> Contiguous buffer allocation can be done for entries as well - userspace
> just needs to assign it to buffers that way. It becomes it a bit harder
> with dynamic entry registration - entries buffers should then allocat in
> sizes of system huge pages.
>
>
> In fact I initially had that in libfuse and had allocated all userspace
> buffers as one big memory. Then 'temporarily' removed it because I had
> development stability issues - the single buffer needs to be marked with
> ASAN areas in order to catch issues. For initial development that was
> just overkill, but could be added now, in combination with ASAN buf marking.
> For pools it would be good think about ASAN as well.
>
> >>> +
> >>> +At a high-level, this is how fuse uses buffer rings:
> >>> +
> >>> +* The first REGISTER SQE for a queue creates the queue and sets up the
> >>> + buffer ring. The server provides two iovecs: one for headers and one for
> >>> + payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
> >>> + to a specific header slot.
> >>
> >> Hi Joanne,
> >>
> >> thanks a lot for this document! Could we discuss if we could just hook
> >> in here and allow SQEs with different iovecs for the payload buffer?
> >> Let's say fuse-server wants multiple IO sizes - it could easily do that
> >> via different pBufs and just needs to specify the dedicated IO size per
> >> pBuf. Those buffers could then get sorted into an array - we could
> >> define either via FUSE init the number of buf sizes or use a fixed size
> >> array. Fuse requests then would just need to pick the right array.
> >> This is basically what I'm currently working on for ublk.
> >>
> >> I think it would be good to agree on the design before it gets merged so
> >> that uapi doesn't change.
> >
> > Hi Bernd,
> >
> > I'm not certain I fully see the use case for a fuse server preferring
> > a static preallocation of multiple IO sizes over using incremental
> > buffer consumption, but in my mind to support multiple IO size
>
> I have to admit that I don't see why we need pbuf for dynamic
> allocation. While the io-uring ring has a fixed number of SQEs/CQEs and
If the ents are dynamically allocated, bufrings are still useful
because incremental buffer consumption still helps with memory
efficiency
> while libfuse currently strongly couples these to fuse buffers, there is
> no technical reason. Initially it was, because I had taken the 'tags'
> from ublk design, but then Miklos asked to make it lists that just get
> appended whenever a FUSE_IO_URING_CMD_REGISTER is send. Which means
> libfuse _could_ add new entries any time. You could start with 1 entry
> per queue, additionally with the reduce-nr-queue patches you could even
> start with a single queue and a single entry - and then extend that at
> any time to what libfuse or the application believes is needed.
> I.e. except of io-uring setup, adding or even removing ring entries and
> their buffers is mainly a missing userspace issue. In order to remove
> idle entries, we could add another notification type like
> FUSE_NOTIFY_WAKE_RING_ENTRIES and it would then wake a given amount per
> queue and maybe send via a new op code like FUSE_NOOP. All of that seems
> to be easy.
>
> > payloads, I was thinking something like this might work best:
> >
> > * iov[0] for the headers stays the same. no matter how many IO size
> > payloads there are, the ent always maps to a header and the headers
> > are a fixed size
> > * iov[1...x] are the payload buffers. From the uapi perspective, in
> > the fues_uring_cmd_req init struct, there would need to be an array of
> > uint32_t buf_sizes. Each index in the array would correspond to index
> > + 1 in the iov[] payloads passed
> > * on the fuse side, each of the buffer pools has its own ring. I think
> > this makes managing the different buffers a lot easier and gets rid of
> > having to do any array sorting, and makes buffer selection/recycling
> > O(1).
>
>
> Let's say we would have per queue
>
> struct fuse_bufring {
> bool use_pinned_headers: 1;
> bool use_zero_copy: 1;
> unsigned int max_queue_depth; /* headers buffer capacity; frozen
> at first REGISTER */
>
> union {
> void __user *headers;
> struct fuse_bufring_pinned pinned_headers;
> };
>
> unsigned int nr_pools;
> struct fuse_bufring_pool *pools[FUSE_URING_MAX_POOLS];
>
> /* lookup: order (req size) pool */
> struct fuse_bufring_pool *order_map[FUSE_URING_NR_ORDERS];
> };
>
>
> Order map is then dynamically created at buf pool registration time, and
> then we would eventually get to
>
> struct fuse_bufring_pool *pool = order_map[get_order(fuse_len_args())];
>
> (obviously the final code needs to get a check that we don't exceed max
> payload size.)
>
> The looked up pool can be stored into ring_ent for buf recycling.
>
Yes, this aligns with what I had in mind.
>
>
> And then
>
> struct fuse_uring_cmd_req {
>
> ...
> union {
> struct {
> __u32 max_queue_depth; /* (renamed from queue_depth) */
> __u32 buf_size;
>
> __u8 pool_idx;
> __u8 _pad[3];
>
> } init;
> ...
>
> };
>
> };
This makes sense to me.
>
>
>
> I think pool_idx is needed one way or the other, because the io-uring
> ring owner might have other pools for its own purposes.
>
>
> Thanks,
> Bernd
Thanks,
Joanne
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 01/14] fuse: separate next request fetching from sending logic
2026-04-02 16:28 ` [PATCH v2 01/14] fuse: separate next request fetching from sending logic Joanne Koong
@ 2026-04-29 11:52 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-29 11:52 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Simplify the logic for fetching + sending off the next request.
>
> This gets rid of fuse_uring_send_next_to_ring() which contained
> duplicated logic from fuse_uring_send(). This decouples request fetching
> from the send operation, which makes the control flow clearer and
> reduces unnecessary parameter passing.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 78 ++++++++++++++++-----------------------------
> 1 file changed, 28 insertions(+), 50 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 3a38b61aac26..54436d3fda4d 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -714,34 +714,6 @@ static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
> return err;
> }
>
> -/*
> - * Write data to the ring buffer and send the request to userspace,
> - * userspace will read it
> - * This is comparable with classical read(/dev/fuse)
> - */
> -static int fuse_uring_send_next_to_ring(struct fuse_ring_ent *ent,
> - struct fuse_req *req,
> - unsigned int issue_flags)
> -{
> - struct fuse_ring_queue *queue = ent->queue;
> - int err;
> - struct io_uring_cmd *cmd;
> -
> - err = fuse_uring_prepare_send(ent, req);
> - if (err)
> - return err;
> -
> - spin_lock(&queue->lock);
> - cmd = ent->cmd;
> - ent->cmd = NULL;
> - ent->state = FRRS_USERSPACE;
> - list_move_tail(&ent->list, &queue->ent_in_userspace);
> - spin_unlock(&queue->lock);
> -
> - io_uring_cmd_done(cmd, 0, issue_flags);
> - return 0;
> -}
> -
> /*
> * Make a ring entry available for fuse_req assignment
> */
> @@ -838,11 +810,13 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
> }
>
> /*
> - * Get the next fuse req and send it
> + * Get the next fuse req.
> + *
> + * Returns true if the next fuse request has been assigned to the ent.
> + * Else, there is no next fuse request and this returns false.
> */
> -static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
> - struct fuse_ring_queue *queue,
> - unsigned int issue_flags)
> +static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
> + struct fuse_ring_queue *queue)
> {
> int err;
> struct fuse_req *req;
> @@ -854,10 +828,12 @@ static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
> spin_unlock(&queue->lock);
>
> if (req) {
> - err = fuse_uring_send_next_to_ring(ent, req, issue_flags);
> + err = fuse_uring_prepare_send(ent, req);
> if (err)
> goto retry;
> }
> +
> + return req != NULL;
> }
>
> static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
> @@ -875,6 +851,20 @@ static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
> return 0;
> }
>
> +static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
> + ssize_t ret, unsigned int issue_flags)
> +{
> + struct fuse_ring_queue *queue = ent->queue;
> +
> + spin_lock(&queue->lock);
> + ent->state = FRRS_USERSPACE;
> + list_move_tail(&ent->list, &queue->ent_in_userspace);
> + ent->cmd = NULL;
> + spin_unlock(&queue->lock);
> +
> + io_uring_cmd_done(cmd, ret, issue_flags);
> +}
> +
> /* FUSE_URING_CMD_COMMIT_AND_FETCH handler */
> static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> struct fuse_conn *fc)
> @@ -947,7 +937,8 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> * and fetching is done in one step vs legacy fuse, which has separated
> * read (fetch request) and write (commit result).
> */
> - fuse_uring_next_fuse_req(ent, queue, issue_flags);
> + if (fuse_uring_get_next_fuse_req(ent, queue))
> + fuse_uring_send(ent, cmd, 0, issue_flags);
> return 0;
> }
>
> @@ -1196,20 +1187,6 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> return -EIOCBQUEUED;
> }
>
> -static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
> - ssize_t ret, unsigned int issue_flags)
> -{
> - struct fuse_ring_queue *queue = ent->queue;
> -
> - spin_lock(&queue->lock);
> - ent->state = FRRS_USERSPACE;
> - list_move_tail(&ent->list, &queue->ent_in_userspace);
> - ent->cmd = NULL;
> - spin_unlock(&queue->lock);
> -
> - io_uring_cmd_done(cmd, ret, issue_flags);
> -}
> -
> /*
> * This prepares and sends the ring request in fuse-uring task context.
> * User buffers are not mapped yet - the application does not have permission
> @@ -1226,8 +1203,9 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
> if (!tw.cancel) {
> err = fuse_uring_prepare_send(ent, ent->fuse_req);
> if (err) {
> - fuse_uring_next_fuse_req(ent, queue, issue_flags);
> - return;
> + if (!fuse_uring_get_next_fuse_req(ent, queue))
> + return;
> + err = 0;
> }
> } else {
> err = -ECANCELED;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 02/14] fuse: refactor io-uring header copying to ring
2026-04-02 16:28 ` [PATCH v2 02/14] fuse: refactor io-uring header copying to ring Joanne Koong
@ 2026-04-29 12:05 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-29 12:05 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Move header copying to ring logic into a new copy_header_to_ring()
> function. This makes the copy_to_user() logic more clear and centralizes
> error handling / rate-limited logging.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 39 +++++++++++++++++++++------------------
> 1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 54436d3fda4d..5fc8ca330595 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -575,6 +575,18 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
> return err;
> }
>
> +static __always_inline int copy_header_to_ring(void __user *ring,
> + const void *header,
> + size_t header_size)
> +{
> + if (copy_to_user(ring, header, header_size)) {
> + pr_info_ratelimited("Copying header to ring failed.\n");
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
> +
> static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> struct fuse_req *req,
> struct fuse_ring_ent *ent)
> @@ -637,13 +649,11 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> * Some op code have that as zero size.
> */
> if (args->in_args[0].size > 0) {
> - err = copy_to_user(&ent->headers->op_in, in_args->value,
> - in_args->size);
> - if (err) {
> - pr_info_ratelimited(
> - "Copying the header failed.\n");
> - return -EFAULT;
> - }
> + err = copy_header_to_ring(&ent->headers->op_in,
> + in_args->value,
> + in_args->size);
> + if (err)
> + return err;
> }
> in_args++;
> num_args--;
> @@ -659,9 +669,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> }
>
> ent_in_out.payload_sz = cs.ring.copied_sz;
> - err = copy_to_user(&ent->headers->ring_ent_in_out, &ent_in_out,
> - sizeof(ent_in_out));
> - return err ? -EFAULT : 0;
> + return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
> + sizeof(ent_in_out));
> }
>
> static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> @@ -690,14 +699,8 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> }
>
> /* copy fuse_in_header */
> - err = copy_to_user(&ent->headers->in_out, &req->in.h,
> - sizeof(req->in.h));
> - if (err) {
> - err = -EFAULT;
> - return err;
> - }
> -
> - return 0;
> + return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
> + sizeof(req->in.h));
> }
>
> static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
Nice little cleanup.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 03/14] fuse: refactor io-uring header copying from ring
2026-04-02 16:28 ` [PATCH v2 03/14] fuse: refactor io-uring header copying from ring Joanne Koong
@ 2026-04-29 12:06 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-29 12:06 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Move header copying from ring logic into a new copy_header_from_ring()
> function. This makes the copy_from_user() logic more clear and
> centralizes error handling / rate-limited logging.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 24 ++++++++++++++++++------
> 1 file changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 5fc8ca330595..86f9bb94b45a 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -587,6 +587,18 @@ static __always_inline int copy_header_to_ring(void __user *ring,
> return 0;
> }
>
> +static __always_inline int copy_header_from_ring(void *header,
> + const void __user *ring,
> + size_t header_size)
> +{
> + if (copy_from_user(header, ring, header_size)) {
> + pr_info_ratelimited("Copying header from ring failed.\n");
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
> +
> static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> struct fuse_req *req,
> struct fuse_ring_ent *ent)
> @@ -597,10 +609,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> int err;
> struct fuse_uring_ent_in_out ring_in_out;
>
> - err = copy_from_user(&ring_in_out, &ent->headers->ring_ent_in_out,
> - sizeof(ring_in_out));
> + err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
> + sizeof(ring_in_out));
> if (err)
> - return -EFAULT;
> + return err;
>
> err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
> &iter);
> @@ -794,10 +806,10 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
> struct fuse_conn *fc = ring->fc;
> ssize_t err = 0;
>
> - err = copy_from_user(&req->out.h, &ent->headers->in_out,
> - sizeof(req->out.h));
> + err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
> + sizeof(req->out.h));
> if (err) {
> - req->out.h.error = -EFAULT;
> + req->out.h.error = err;
> goto out;
> }
>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 04/14] fuse: use enum types for header copying
2026-04-02 16:28 ` [PATCH v2 04/14] fuse: use enum types for header copying Joanne Koong
@ 2026-04-30 8:04 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:04 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Use enum types to identify which part of the header needs to be copied.
> This improves the interface and will simplify both kernel-space and
> user-space header addresses copying when buffer rings are added.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 66 ++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 53 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 86f9bb94b45a..cca795dd72e1 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,6 +31,15 @@ struct fuse_uring_pdu {
>
> static const struct fuse_iqueue_ops fuse_io_uring_ops;
>
> +enum fuse_uring_header_type {
> + /* struct fuse_in_header / struct fuse_out_header */
> + FUSE_URING_HEADER_IN_OUT,
> + /* per op code header */
> + FUSE_URING_HEADER_OP,
> + /* struct fuse_uring_ent_in_out header */
> + FUSE_URING_HEADER_RING_ENT,
> +};
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -575,10 +584,33 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
> return err;
> }
>
> -static __always_inline int copy_header_to_ring(void __user *ring,
> - const void *header,
> - size_t header_size)
> +static int ring_header_type_offset(enum fuse_uring_header_type type)
> {
> + switch (type) {
> + case FUSE_URING_HEADER_IN_OUT:
> + return 0;
> + case FUSE_URING_HEADER_OP:
> + return offsetof(struct fuse_uring_req_header, op_in);
> + case FUSE_URING_HEADER_RING_ENT:
> + return offsetof(struct fuse_uring_req_header, ring_ent_in_out);
> + default:
> + WARN_ONCE(1, "Invalid header type: %d\n", type);
> + return -EINVAL;
> + }
> +}
> +
> +static int copy_header_to_ring(struct fuse_ring_ent *ent,
> + enum fuse_uring_header_type type,
> + const void *header, size_t header_size)
> +{
> + int offset = ring_header_type_offset(type);
> + void __user *ring;
> +
> + if (offset < 0)
> + return offset;
> +
> + ring = (void __user *)ent->headers + offset;
> +
> if (copy_to_user(ring, header, header_size)) {
> pr_info_ratelimited("Copying header to ring failed.\n");
> return -EFAULT;
> @@ -587,10 +619,18 @@ static __always_inline int copy_header_to_ring(void __user *ring,
> return 0;
> }
>
> -static __always_inline int copy_header_from_ring(void *header,
> - const void __user *ring,
> - size_t header_size)
nit: this patch drops the __always_inline's (which is probably a good
idea), but there is no mention of why you did it in the changelog.
> +static int copy_header_from_ring(struct fuse_ring_ent *ent,
> + enum fuse_uring_header_type type, void *header,
> + size_t header_size)
> {
> + int offset = ring_header_type_offset(type);
> + const void __user *ring;
> +
> + if (offset < 0)
> + return offset;
> +
> + ring = (void __user *)ent->headers + offset;
> +
> if (copy_from_user(header, ring, header_size)) {
> pr_info_ratelimited("Copying header from ring failed.\n");
> return -EFAULT;
> @@ -609,8 +649,8 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> int err;
> struct fuse_uring_ent_in_out ring_in_out;
>
> - err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
> - sizeof(ring_in_out));
> + err = copy_header_from_ring(ent, FUSE_URING_HEADER_RING_ENT,
> + &ring_in_out, sizeof(ring_in_out));
> if (err)
> return err;
>
> @@ -661,7 +701,7 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> * Some op code have that as zero size.
> */
> if (args->in_args[0].size > 0) {
> - err = copy_header_to_ring(&ent->headers->op_in,
> + err = copy_header_to_ring(ent, FUSE_URING_HEADER_OP,
> in_args->value,
> in_args->size);
> if (err)
> @@ -681,8 +721,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> }
>
> ent_in_out.payload_sz = cs.ring.copied_sz;
> - return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
> - sizeof(ent_in_out));
> + return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
> + &ent_in_out, sizeof(ent_in_out));
> }
>
> static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> @@ -711,7 +751,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> }
>
> /* copy fuse_in_header */
> - return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
> + return copy_header_to_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->in.h,
> sizeof(req->in.h));
> }
>
> @@ -806,7 +846,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
> struct fuse_conn *fc = ring->fc;
> ssize_t err = 0;
>
> - err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
> + err = copy_header_from_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->out.h,
> sizeof(req->out.h));
> if (err) {
> req->out.h.error = err;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying
2026-04-02 16:28 ` [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying Joanne Koong
@ 2026-04-30 8:06 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:06 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel, Bernd Schubert
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Add a new helper function setup_fuse_copy_state() to contain the logic
> for setting up the copy state for payload copying.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 38 ++++++++++++++++++++++++--------------
> 1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index cca795dd72e1..045394a7ae41 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -639,6 +639,27 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> return 0;
> }
>
> +static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> + struct fuse_ring *ring, struct fuse_req *req,
> + struct fuse_ring_ent *ent, int dir,
> + struct iov_iter *iter)
> +{
> + int err;
> +
> + err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
> + if (err) {
> + pr_info_ratelimited("fuse: Import of user buffer failed\n");
> + return err;
> + }
> +
> + fuse_copy_init(cs, dir == ITER_DEST, iter);
> +
> + cs->is_uring = true;
> + cs->req = req;
> +
> + return 0;
> +}
> +
> static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> struct fuse_req *req,
> struct fuse_ring_ent *ent)
> @@ -654,15 +675,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> if (err)
> return err;
>
> - err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
> - &iter);
> + err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_SOURCE, &iter);
> if (err)
> return err;
>
> - fuse_copy_init(&cs, false, &iter);
> - cs.is_uring = true;
> - cs.req = req;
> -
> err = fuse_copy_out_args(&cs, args, ring_in_out.payload_sz);
> fuse_copy_finish(&cs);
> return err;
> @@ -685,15 +701,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> .commit_id = req->in.h.unique,
> };
>
> - err = import_ubuf(ITER_DEST, ent->payload, ring->max_payload_sz, &iter);
> - if (err) {
> - pr_info_ratelimited("fuse: Import of user buffer failed\n");
> + err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> + if (err)
> return err;
> - }
> -
> - fuse_copy_init(&cs, true, &iter);
> - cs.is_uring = true;
> - cs.req = req;
>
> if (num_args > 0) {
> /*
Nice cleanup.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 06/14] fuse: support buffer copying for kernel addresses
2026-04-02 16:28 ` [PATCH v2 06/14] fuse: support buffer copying for kernel addresses Joanne Koong
@ 2026-04-30 8:19 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:19 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> This is a preparatory patch needed to support pinned buffers in
> fuse-over-io-uring. For pinned buffers, we get the vmapped address of
> the buffer, which we can directly use with memcpy.
>
> Currently, buffer copying in fuse only supports extracting underlying
> pages from an iov iter and kmapping them. This commit allows buffer
> copying to work directly on a kaddr.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
> ---
> fs/fuse/dev.c | 23 +++++++++++++++++++----
> fs/fuse/fuse_dev_i.h | 7 ++++++-
> 2 files changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0b0241f47170..a87939eaa103 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -848,6 +848,9 @@ void fuse_copy_init(struct fuse_copy_state *cs, bool write,
> /* Unmap and put previous page of userspace buffer */
> void fuse_copy_finish(struct fuse_copy_state *cs)
> {
> + if (cs->is_kaddr)
> + return;
> +
> if (cs->currbuf) {
> struct pipe_buffer *buf = cs->currbuf;
>
> @@ -873,6 +876,12 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
> struct page *page;
> int err;
>
> + if (cs->is_kaddr) {
> + if (!cs->len)
> + return -ENOBUFS;
> + return 0;
> + }
> +
> err = unlock_request(cs->req);
> if (err)
> return err;
> @@ -931,15 +940,21 @@ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
> {
> unsigned ncpy = min(*size, cs->len);
> if (val) {
> - void *pgaddr = kmap_local_page(cs->pg);
> - void *buf = pgaddr + cs->offset;
> + void *pgaddr, *buf;
> +
> + if (!cs->is_kaddr) {
> + pgaddr = kmap_local_page(cs->pg);
> + buf = pgaddr + cs->offset;
> + } else {
> + buf = cs->kaddr + cs->offset;
> + }
>
> if (cs->write)
> memcpy(buf, *val, ncpy);
> else
> memcpy(*val, buf, ncpy);
> -
> - kunmap_local(pgaddr);
> + if (!cs->is_kaddr)
> + kunmap_local(pgaddr);
> *val += ncpy;
> }
> *size -= ncpy;
> diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
> index 134bf44aff0d..aa1d25421054 100644
> --- a/fs/fuse/fuse_dev_i.h
> +++ b/fs/fuse/fuse_dev_i.h
> @@ -28,12 +28,17 @@ struct fuse_copy_state {
> struct pipe_buffer *currbuf;
> struct pipe_inode_info *pipe;
> unsigned long nr_segs;
> - struct page *pg;
> + union {
> + struct page *pg;
> + void *kaddr;
> + };
> unsigned int len;
> unsigned int offset;
> bool write:1;
> bool move_folios:1;
> bool is_uring:1;
> + /* if set, use kaddr; otherwise use pg */
> + bool is_kaddr:1;
> struct {
> unsigned int copied_sz; /* copied size into the user buffer */
> } ring;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices
2026-04-02 16:28 ` [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices Joanne Koong
2026-04-15 9:36 ` Bernd Schubert
@ 2026-04-30 8:20 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:20 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Replace magic indices 0 and 1 for the iovec array with named constants
> FUSE_URING_IOV_HEADERS and FUSE_URING_IOV_PAYLOAD. This makes the usages
> self-documenting and prepares for buffer ring support which will also
> reference these iovec slots by index.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 045394a7ae41..a85acd9c2b71 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -18,7 +18,8 @@ MODULE_PARM_DESC(enable_uring,
> "Enable userspace communication through io-uring");
>
> #define FUSE_URING_IOV_SEGS 2 /* header and payload */
> -
> +#define FUSE_URING_IOV_HEADERS 0
> +#define FUSE_URING_IOV_PAYLOAD 1
>
> bool fuse_uring_enabled(void)
> {
> @@ -1063,8 +1064,8 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> }
>
> /*
> - * sqe->addr is a ptr to an iovec array, iov[0] has the headers, iov[1]
> - * the payload
> + * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> + * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> */
> static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> struct iovec iov[FUSE_URING_IOV_SEGS])
> @@ -1094,8 +1095,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> {
> struct fuse_ring *ring = queue->ring;
> struct fuse_ring_ent *ent;
> - size_t payload_size;
> struct iovec iov[FUSE_URING_IOV_SEGS];
> + struct iovec *headers, *payload;
> int err;
>
> err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> @@ -1106,15 +1107,16 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> }
>
> err = -EINVAL;
> - if (iov[0].iov_len < sizeof(struct fuse_uring_req_header)) {
> - pr_info_ratelimited("Invalid header len %zu\n", iov[0].iov_len);
> + headers = &iov[FUSE_URING_IOV_HEADERS];
> + if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> + pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> return ERR_PTR(err);
> }
>
> - payload_size = iov[1].iov_len;
> - if (payload_size < ring->max_payload_sz) {
> + payload = &iov[FUSE_URING_IOV_PAYLOAD];
> + if (payload->iov_len < ring->max_payload_sz) {
> pr_info_ratelimited("Invalid req payload len %zu\n",
> - payload_size);
> + payload->iov_len);
> return ERR_PTR(err);
> }
>
> @@ -1126,8 +1128,8 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> INIT_LIST_HEAD(&ent->list);
>
> ent->queue = queue;
> - ent->headers = iov[0].iov_base;
> - ent->payload = iov[1].iov_base;
> + ent->headers = headers->iov_base;
> + ent->payload = payload->iov_base;
>
> atomic_inc(&ring->queue_refs);
> return ent;
Much more readable.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c
2026-04-02 16:28 ` [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c Joanne Koong
2026-04-15 9:40 ` Bernd Schubert
@ 2026-04-30 8:21 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:21 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Move fuse_uring_abort() out of the inline header definition and into
> dev_uring.c. This function calls several internal helpers (abort
> requests, stop queues) that are all defined in dev_uring.c so inlining
> it in the header unnecessarily exposes implementation details.
>
> This will make the subsequent commit that adds pinning capabilties for
> fuse buffers cleaner.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 17 +++++++++++++++--
> fs/fuse/dev_uring_i.h | 16 +---------------
> 2 files changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index a85acd9c2b71..cce8994241b7 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -129,7 +129,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
> fuse_dev_end_requests(&req_list);
> }
>
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> +static void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> {
> int qid;
> struct fuse_ring_queue *queue;
> @@ -477,7 +477,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
> /*
> * Stop the ring queues
> */
> -void fuse_uring_stop_queues(struct fuse_ring *ring)
> +static void fuse_uring_stop_queues(struct fuse_ring *ring)
> {
> int qid;
>
> @@ -501,6 +501,19 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
> }
> }
>
> +void fuse_uring_abort(struct fuse_conn *fc)
> +{
> + struct fuse_ring *ring = fc->ring;
> +
> + if (ring == NULL)
> + return;
> +
> + if (atomic_read(&ring->queue_refs) > 0) {
> + fuse_uring_abort_end_requests(ring);
> + fuse_uring_stop_queues(ring);
> + }
> +}
> +
> /*
> * Handle IO_URING_F_CANCEL, typically should come on daemon termination.
> *
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 51a563922ce1..349418db3374 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -137,27 +137,13 @@ struct fuse_ring {
>
> bool fuse_uring_enabled(void);
> void fuse_uring_destruct(struct fuse_conn *fc);
> -void fuse_uring_stop_queues(struct fuse_ring *ring);
> -void fuse_uring_abort_end_requests(struct fuse_ring *ring);
> +void fuse_uring_abort(struct fuse_conn *fc);
> int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req);
> bool fuse_uring_queue_bq_req(struct fuse_req *req);
> bool fuse_uring_remove_pending_req(struct fuse_req *req);
> bool fuse_uring_request_expired(struct fuse_conn *fc);
>
> -static inline void fuse_uring_abort(struct fuse_conn *fc)
> -{
> - struct fuse_ring *ring = fc->ring;
> -
> - if (ring == NULL)
> - return;
> -
> - if (atomic_read(&ring->queue_refs) > 0) {
> - fuse_uring_abort_end_requests(ring);
> - fuse_uring_stop_queues(ring);
> - }
> -}
> -
> static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
> {
> struct fuse_ring *ring = fc->ring;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic
2026-04-02 16:28 ` [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic Joanne Koong
2026-04-15 9:45 ` Bernd Schubert
@ 2026-04-30 8:24 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 8:24 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Move fuse_uring_get_iovec_from_sqe() to earlier in the file and
> move the allocation logic in fuse_uring_create_ring_ent() to the
> beginning of the function.
>
> There is no change in logic, this is done to make the subsequent commit
> that adds buffer rings easier to review.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 78 ++++++++++++++++++++++++---------------------
> 1 file changed, 41 insertions(+), 37 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index cce8994241b7..a061f175b3fd 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -277,6 +277,32 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> return res;
> }
>
> +/*
> + * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> + * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> + */
> +static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> + struct iovec iov[FUSE_URING_IOV_SEGS])
> +{
> + struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
> + struct iov_iter iter;
> + ssize_t ret;
> +
> + if (sqe->len != FUSE_URING_IOV_SEGS)
> + return -EINVAL;
> +
> + /*
> + * Direction for buffer access will actually be READ and WRITE,
> + * using write for the import should include READ access as well.
> + */
> + ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
> + FUSE_URING_IOV_SEGS, &iov, &iter);
> + if (ret < 0)
> + return ret;
> +
> + return 0;
> +}
> +
> static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> int qid)
> {
> @@ -1076,32 +1102,6 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> }
> }
>
> -/*
> - * sqe->addr is a ptr to an iovec array, iov[FUSE_URING_IOV_HEADERS] has the
> - * headers, iov[FUSE_URING_IOV_PAYLOAD] the payload
> - */
> -static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> - struct iovec iov[FUSE_URING_IOV_SEGS])
> -{
> - struct iovec __user *uiov = u64_to_user_ptr(READ_ONCE(sqe->addr));
> - struct iov_iter iter;
> - ssize_t ret;
> -
> - if (sqe->len != FUSE_URING_IOV_SEGS)
> - return -EINVAL;
> -
> - /*
> - * Direction for buffer access will actually be READ and WRITE,
> - * using write for the import should include READ access as well.
> - */
> - ret = import_iovec(WRITE, uiov, FUSE_URING_IOV_SEGS,
> - FUSE_URING_IOV_SEGS, &iov, &iter);
> - if (ret < 0)
> - return ret;
> -
> - return 0;
> -}
> -
> static struct fuse_ring_ent *
> fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_queue *queue)
> @@ -1112,40 +1112,44 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> struct iovec *headers, *payload;
> int err;
>
> + ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
> + if (!ent)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&ent->list);
> +
> + ent->queue = queue;
> +
> err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> if (err) {
> pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> err);
> - return ERR_PTR(err);
> + goto error;
> }
>
> err = -EINVAL;
> headers = &iov[FUSE_URING_IOV_HEADERS];
> if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> - return ERR_PTR(err);
> + goto error;
> }
>
> payload = &iov[FUSE_URING_IOV_PAYLOAD];
> if (payload->iov_len < ring->max_payload_sz) {
> pr_info_ratelimited("Invalid req payload len %zu\n",
> payload->iov_len);
> - return ERR_PTR(err);
> + goto error;
> }
>
> - err = -ENOMEM;
> - ent = kzalloc_obj(*ent, GFP_KERNEL_ACCOUNT);
> - if (!ent)
> - return ERR_PTR(err);
> -
> - INIT_LIST_HEAD(&ent->list);
> -
> - ent->queue = queue;
> ent->headers = headers->iov_base;
> ent->payload = payload->iov_base;
>
> atomic_inc(&ring->queue_refs);
> return ent;
> +
> +error:
> + kfree(ent);
> + return ERR_PTR(err);
> }
>
> /*
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
2026-04-15 9:48 ` Bernd Schubert
@ 2026-04-30 11:08 ` Jeff Layton
2026-04-30 12:44 ` Joanne Koong
2026-05-05 22:47 ` Bernd Schubert
2 siblings, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 11:08 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Add fuse buffer rings for servers communicating through the io-uring
> interface. To use this, the server must set the FUSE_URING_BUFRING
> flag and provide header and payload buffers via an iovec array in the
> sqe during registration. The payload buffers are used to back the buffer
> ring. The kernel manages buffer selection and recycling through a simple
> internal ring.
>
> This has the following advantages over the non-bufring (iovec) path:
> - Reduced memory usage: in the iovec path, each entry has its own
> dedicated payload buffer, requiring N buffers for N entries where each
> buffer must be large enough to accommodate the maximum possible
> payload size. With buffer rings, payload buffers are pooled and
> selected on demand. Entries only hold a buffer while actively
> processing a request with payload data. When incremental buffer
> consumption is added, this will allow non-overlapping regions of a
> single buffer to be used simultaneously across multiple requests,
> further reducing memory requirements.
> - Foundation for pinned buffers: the buffer ring headers and payloads
> are now each passed in as a contiguous memory allocation, which allows
> fuse to easily pin and vmap the entire region in one operation during
> queue setup. This will eliminate the per-request overhead of having to
> pin/unpin user pages and translate virtual addresses and is a
> prerequisite for future optimizations like performing data copies
> outside of the server's task context.
>
> Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
> header slot in the headers buffer. Payload buffers are selected from
> the ring on demand and recycled after each request. Buffer ring usage is
> set on a per-queue basis. All subsequent registration SQEs for the same
> queue must use consistent flags.
>
> The headers are laid out contiguously and provided via iov[0]. Each slot
> maps to ent->id:
>
> > <- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
> +------------------------------+------------------------------+-----+
> > struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
> > [ent id=0] | [ent id=1] | |
> +------------------------------+------------------------------+-----+
>
> On the server side, the ent id is used to determine where in the headers
> buffer the headers data for the ent resides. This is done by
> calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
> into the headers buffer.
>
> The buffer ring is backed by the payload buffer, which is contiguous but
> partitioned into individual bufs according to the buf_size passed in at
> registration.
>
> PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
> |<-------------- payload_size ------------>|
> +--------- --+-----------+-----------+-----+
> | buf [0] | buf [1] | buf [2] | ... |
> | buf_size | buf_size | buf_size | ... |
> +--------- --+-----------+-----------+-----+
>
> buffer ring state (struct fuse_bufring, kernel-internal):
> bufs[]: [ used | used | FREE | FREE | FREE ]
> ^^^^^^^^^^^^^^^^^^^
> available for selection
>
> The buffer ring logic is as follows:
> select: buf = bufs[head % nbufs]; head++
> recycle: bufs[tail % nbufs] = buf; tail++
> empty: tail == head (no buffers available)
> full: tail - head >= nbufs
>
> Buffer ring request flow
> ------------------------
> > Kernel | FUSE daemon
> > |
> > [client request arrives] |
> > >fuse_uring_send() |
> > [select payload buf from ring] |
> > >fuse_uring_select_buffer() |
> > [copy headers to ent's header slot] |
> > >copy_header_to_ring() |
> > [copy payload to selected buf] |
> > >fuse_uring_copy_to_ring() |
> > [set buf_id in ent_in_out header] |
> > >io_uring_cmd_done() |
> > | [CQE received]
> > | [read headers from header
> > | slot]
> > | [read payload from buf_id]
> > | [process request]
> > | [write reply to header
> > | slot]
> > | [write reply payload to
> > | buf]
> > | >io_uring_submit()
> > | COMMIT_AND_FETCH
> > >fuse_uring_commit_fetch() |
> > >fuse_uring_commit() |
> > [copy reply from ring] |
> > >fuse_uring_recycle_buffer() |
> > >fuse_uring_get_next_fuse_req() |
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 45 ++++-
> include/uapi/linux/fuse.h | 27 ++-
> 3 files changed, 381 insertions(+), 54 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index a061f175b3fd..9f14a2bcde3f 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -41,6 +41,11 @@ enum fuse_uring_header_type {
> FUSE_URING_HEADER_RING_ENT,
> };
>
> +static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring != NULL;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -222,6 +227,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> }
>
> kfree(queue->fpq.processing);
> + kfree(queue->bufring);
> kfree(queue);
> ring->queues[qid] = NULL;
> }
> @@ -303,20 +309,102 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> return 0;
> }
>
> -static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> - int qid)
> +static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> + struct fuse_ring_queue *queue)
> +{
> + const struct fuse_uring_cmd_req *cmd_req =
> + io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
> + u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
> + unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> + struct iovec iov[FUSE_URING_IOV_SEGS];
> + void __user *payload, *headers;
> + size_t headers_size, payload_size, ring_size;
> + struct fuse_bufring *br;
> + unsigned int nr_bufs, i;
> + uintptr_t payload_addr;
> + int err;
> +
> + if (!queue_depth || !buf_size)
> + return -EINVAL;
> +
> + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> + if (err)
> + return err;
> +
> + headers = iov[FUSE_URING_IOV_HEADERS].iov_base;
> + headers_size = iov[FUSE_URING_IOV_HEADERS].iov_len;
> + payload = iov[FUSE_URING_IOV_PAYLOAD].iov_base;
> + payload_size = iov[FUSE_URING_IOV_PAYLOAD].iov_len;
> +
> + /* check if there's enough space for all the headers */
> + if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> + return -EINVAL;
> +
> + if (buf_size < queue->ring->max_payload_sz)
> + return -EINVAL;
> +
> + nr_bufs = payload_size / buf_size;
> + if (!nr_bufs || nr_bufs > U16_MAX)
What's the significance of U16_MAX here? It looks like the br->nbufs
field is an unsigned int. Is it because struct fuse_uring_ent_in_out
has buf_id as a u16?
Not that I think you'll ever need more than 2^16 buffers, just curious
about the limitation.
> + return -EINVAL;
> +
> + /* create the ring buffer */
> + ring_size = struct_size(br, bufs, nr_bufs);
> + br = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
> + if (!br)
> + return -ENOMEM;
> +
> + br->queue_depth = queue_depth;
> + br->headers = headers;
> +
> + payload_addr = (uintptr_t)payload;
> +
> + /* populate the ring buffer */
> + for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
> + struct fuse_bufring_buf *buf = &br->bufs[i];
> +
> + buf->addr = payload_addr;
> + buf->len = buf_size;
> + buf->id = i;
> + }
> +
> + br->nbufs = nr_bufs;
> + br->tail = nr_bufs;
> +
> + queue->bufring = br;
> +
> + return 0;
> +}
> +
> +/*
> + * if the queue is already registered, check that the queue was initialized with
> + * the same init flags set for this FUSE_IO_URING_CMD_REGISTER cmd. all
> + * FUSE_IO_URING_CMD_REGISTER cmds should have the same init fields set on a
> + * per-queue basis.
> + */
> +static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> + u64 init_flags)
> {
> + bool bufring = init_flags & FUSE_URING_BUFRING;
> +
> + return bufring_enabled(queue) == bufring;
> +}
> +
> +static struct fuse_ring_queue *
> +fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> + int qid, u64 init_flags)
> +{
> + bool use_bufring = init_flags & FUSE_URING_BUFRING;
> struct fuse_conn *fc = ring->fc;
> struct fuse_ring_queue *queue;
> struct list_head *pq;
>
> queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
> if (!queue)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> pq = kzalloc_objs(struct list_head, FUSE_PQ_HASH_SIZE);
> if (!pq) {
> kfree(queue);
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> }
>
> queue->qid = qid;
> @@ -334,12 +422,29 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> queue->fpq.processing = pq;
> fuse_pqueue_init(&queue->fpq);
>
> + if (use_bufring) {
> + int err = fuse_uring_bufring_setup(cmd, queue);
> +
> + if (err) {
> + kfree(pq);
> + kfree(queue);
> + return ERR_PTR(err);
> + }
> + }
> +
> spin_lock(&fc->lock);
> + /* check if the queue creation raced with another thread */
> if (ring->queues[qid]) {
> spin_unlock(&fc->lock);
> kfree(queue->fpq.processing);
> + if (use_bufring)
> + kfree(queue->bufring);
nit: presumably you could skip the if here. If use_bufring is false,
then queue->bufring _should_ be NULL.
> kfree(queue);
> - return ring->queues[qid];
> +
> + queue = ring->queues[qid];
> + if (!queue_init_flags_consistent(queue, init_flags))
> + return ERR_PTR(-EINVAL);
> + return queue;
> }
>
> /*
> @@ -649,7 +754,14 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_to_user(ring, header, header_size)) {
> pr_info_ratelimited("Copying header to ring failed.\n");
> @@ -669,7 +781,14 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_from_user(header, ring, header_size)) {
> pr_info_ratelimited("Copying header from ring failed.\n");
> @@ -684,12 +803,20 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> struct fuse_ring_ent *ent, int dir,
> struct iov_iter *iter)
> {
> + void __user *payload;
> int err;
>
> - err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
> - if (err) {
> - pr_info_ratelimited("fuse: Import of user buffer failed\n");
> - return err;
> + if (bufring_enabled(ent->queue))
> + payload = (void __user *)ent->payload_buf.addr;
> + else
> + payload = ent->payload;
> +
> + if (payload) {
> + err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
> + if (err) {
> + pr_info_ratelimited("fuse: Import of user buffer failed\n");
> + return err;
> + }
> }
>
> fuse_copy_init(cs, dir == ITER_DEST, iter);
> @@ -741,6 +868,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> .commit_id = req->in.h.unique,
> };
>
> + if (bufring_enabled(ent->queue))
> + ent_in_out.buf_id = ent->payload_buf.id;
> +
> err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> if (err)
> return err;
> @@ -805,6 +935,96 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> sizeof(req->in.h));
> }
>
> +static bool fuse_uring_req_has_payload(struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + return args->in_numargs > 1 || args->out_numargs;
> +}
> +
> +static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
> + __must_hold(&ent->queue->lock)
> +{
> + struct fuse_ring_queue *queue = ent->queue;
> + struct fuse_bufring *br = queue->bufring;
> + struct fuse_bufring_buf *buf;
> + unsigned int tail = br->tail, head = br->head;
> +
> + lockdep_assert_held(&queue->lock);
> +
> + /* Get a buffer to use for the payload */
> + if (tail == head)
> + return -ENOBUFS;
> +
> + buf = &br->bufs[head % br->nbufs];
> + br->head++;
> +
> + ent->payload_buf = *buf;
> +
> + return 0;
> +}
> +
> +static void fuse_uring_recycle_buffer(struct fuse_ring_ent *ent)
> + __must_hold(&ent->queue->lock)
> +{
> + struct fuse_bufring_buf *ent_payload = &ent->payload_buf;
> + struct fuse_ring_queue *queue = ent->queue;
> + struct fuse_bufring_buf *buf;
> + struct fuse_bufring *br;
> +
> + lockdep_assert_held(&queue->lock);
> +
> + if (!bufring_enabled(queue) || !ent_payload->addr)
> + return;
> +
> + br = queue->bufring;
> +
> + /* ring should never be full */
> + WARN_ON_ONCE(br->tail - br->head >= br->nbufs);
> +
> + buf = &br->bufs[(br->tail) % br->nbufs];
> +
> + *buf = *ent_payload;
> +
> + br->tail++;
> +
> + memset(ent_payload, 0, sizeof(*ent_payload));
> +}
> +
> +static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
> + struct fuse_req *req)
> +{
> + bool buffer_selected;
> + bool has_payload;
> +
> + if (!bufring_enabled(ent->queue))
> + return 0;
> +
> + buffer_selected = !!ent->payload_buf.addr;
> + has_payload = fuse_uring_req_has_payload(req);
> +
> + if (has_payload && !buffer_selected)
> + return fuse_uring_select_buffer(ent);
> +
> + if (!has_payload && buffer_selected)
> + fuse_uring_recycle_buffer(ent);
> +
> + return 0;
> +}
> +
> +static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
> + struct fuse_req *req)
> +{
> + if (!bufring_enabled(ent->queue))
> + return 0;
> +
> + /* no payload to copy, can skip selecting a buffer */
> + if (!fuse_uring_req_has_payload(req))
> + return 0;
> +
> + return fuse_uring_select_buffer(ent);
> +}
> +
> static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
> struct fuse_req *req)
> {
> @@ -878,10 +1098,21 @@ static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
>
> /* get and assign the next entry while it is still holding the lock */
> req = list_first_entry_or_null(req_queue, struct fuse_req, list);
> - if (req)
> - fuse_uring_add_req_to_ring_ent(ent, req);
> + if (req) {
> + int err = fuse_uring_next_req_update_buffer(ent, req);
>
> - return req;
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + return req;
> + }
> + }
> +
> + /*
> + * Buffer selection may fail if all the buffers are currently saturated.
> + * The request will be serviced when a buffer is freed up.
> + */
> + fuse_uring_recycle_buffer(ent);
> + return NULL;
> }
>
> /*
> @@ -1041,6 +1272,12 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> * fuse requests would otherwise not get processed - committing
> * and fetching is done in one step vs legacy fuse, which has separated
> * read (fetch request) and write (commit result).
> + *
> + * If the server is using bufrings and has populated the ring with less
> + * payload buffers than ents, it is possible that there may not be an
> + * available buffer for the next request. If so, then the fetch is a
> + * no-op and the next request will be serviced when a buffer becomes
> + * available.
> */
> if (fuse_uring_get_next_fuse_req(ent, queue))
> fuse_uring_send(ent, cmd, 0, issue_flags);
> @@ -1120,30 +1357,38 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
>
> ent->queue = queue;
>
> - err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> - if (err) {
> - pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> - err);
> - goto error;
> - }
> + if (bufring_enabled(queue)) {
> + ent->id = READ_ONCE(cmd->sqe->buf_index);
> + if (ent->id >= queue->bufring->queue_depth) {
> + err = -EINVAL;
> + goto error;
> + }
> + } else {
> + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> + if (err) {
> + pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> + err);
> + goto error;
> + }
>
> - err = -EINVAL;
> - headers = &iov[FUSE_URING_IOV_HEADERS];
> - if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> - pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> - goto error;
> - }
> + err = -EINVAL;
> + headers = &iov[FUSE_URING_IOV_HEADERS];
> + if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> + pr_info_ratelimited("Invalid header len %zu\n",
> + headers->iov_len);
> + goto error;
> + }
>
> - payload = &iov[FUSE_URING_IOV_PAYLOAD];
> - if (payload->iov_len < ring->max_payload_sz) {
> - pr_info_ratelimited("Invalid req payload len %zu\n",
> - payload->iov_len);
> - goto error;
> + payload = &iov[FUSE_URING_IOV_PAYLOAD];
> + if (payload->iov_len < ring->max_payload_sz) {
> + pr_info_ratelimited("Invalid req payload len %zu\n",
> + payload->iov_len);
> + goto error;
> + }
> + ent->headers = headers->iov_base;
> + ent->payload = payload->iov_base;
> }
>
> - ent->headers = headers->iov_base;
> - ent->payload = payload->iov_base;
> -
> atomic_inc(&ring->queue_refs);
> return ent;
>
> @@ -1152,6 +1397,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> return ERR_PTR(err);
> }
>
> +static bool init_flags_valid(u64 init_flags)
> +{
> + u64 valid_flags = FUSE_URING_BUFRING;
> +
> + return !(init_flags & ~valid_flags);
> +}
> +
> /*
> * Register header and payload buffer with the kernel and puts the
> * entry as "ready to get fuse requests" on the queue
> @@ -1161,6 +1413,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> {
> const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe128_cmd(cmd->sqe,
> struct fuse_uring_cmd_req);
> + u64 init_flags = READ_ONCE(cmd_req->flags);
> struct fuse_ring *ring = smp_load_acquire(&fc->ring);
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent;
> @@ -1179,11 +1432,16 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> return -EINVAL;
> }
>
> + if (!init_flags_valid(init_flags))
> + return -EINVAL;
> +
> queue = ring->queues[qid];
> if (!queue) {
> - queue = fuse_uring_create_queue(ring, qid);
> - if (!queue)
> - return err;
> + queue = fuse_uring_create_queue(cmd, ring, qid, init_flags);
> + if (IS_ERR(queue))
> + return PTR_ERR(queue);
> + } else if (!queue_init_flags_consistent(queue, init_flags)) {
> + return -EINVAL;
> }
>
> /*
> @@ -1349,14 +1607,18 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> req->ring_queue = queue;
> ent = list_first_entry_or_null(&queue->ent_avail_queue,
> struct fuse_ring_ent, list);
> - if (ent)
> - fuse_uring_add_req_to_ring_ent(ent, req);
> - else
> - list_add_tail(&req->list, &queue->fuse_req_queue);
> - spin_unlock(&queue->lock);
> + if (ent) {
> + err = fuse_uring_prep_buffer(ent, req);
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + spin_unlock(&queue->lock);
> + fuse_uring_dispatch_ent(ent);
> + return;
> + }
> + }
>
> - if (ent)
> - fuse_uring_dispatch_ent(ent);
> + list_add_tail(&req->list, &queue->fuse_req_queue);
> + spin_unlock(&queue->lock);
>
> return;
>
> @@ -1406,14 +1668,17 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> req = list_first_entry_or_null(&queue->fuse_req_queue, struct fuse_req,
> list);
> if (ent && req) {
> - fuse_uring_add_req_to_ring_ent(ent, req);
> - spin_unlock(&queue->lock);
> + int err = fuse_uring_prep_buffer(ent, req);
>
> - fuse_uring_dispatch_ent(ent);
> - } else {
> - spin_unlock(&queue->lock);
> + if (!err) {
> + fuse_uring_add_req_to_ring_ent(ent, req);
> + spin_unlock(&queue->lock);
> + fuse_uring_dispatch_ent(ent);
> + return true;
> + }
> }
>
> + spin_unlock(&queue->lock);
> return true;
> }
>
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 349418db3374..66d5d5f8dc3f 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -36,11 +36,47 @@ enum fuse_ring_req_state {
> FRRS_RELEASED,
> };
>
> +struct fuse_bufring_buf {
> + uintptr_t addr;
> + unsigned int len;
> + unsigned int id;
> +};
> +
> +struct fuse_bufring {
> + /* pointer to the headers buffer */
> + void __user *headers;
> +
> + unsigned int queue_depth;
> +
> + /* metadata tracking state of the bufring */
> + unsigned int nbufs;
> + unsigned int head;
> + unsigned int tail;
> +
> + /* the buffers backing the ring */
> + __DECLARE_FLEX_ARRAY(struct fuse_bufring_buf, bufs);
> +};
> +
> /** A fuse ring entry, part of the ring queue */
> struct fuse_ring_ent {
> - /* userspace buffer */
> - struct fuse_uring_req_header __user *headers;
> - void __user *payload;
> + union {
> + /* if bufrings are not used */
> + struct {
> + /* userspace buffers */
> + struct fuse_uring_req_header __user *headers;
> + void __user *payload;
> + };
> + /* if bufrings are used */
> + struct {
> + /*
> + * unique fixed id for the ent. used by kernel/server to
> + * locate where in the headers buffer the data for this
> + * ent resides
> + */
> + unsigned int id;
> + struct fuse_bufring_buf payload_buf;
> + };
> + };
>
> /* the ring queue that owns the request */
> struct fuse_ring_queue *queue;
> @@ -99,6 +135,9 @@ struct fuse_ring_queue {
> unsigned int active_background;
>
> bool stopped;
> +
> + /* only allocated if the server uses bufrings */
> + struct fuse_bufring *bufring;
> };
>
> /**
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index c13e1f9a2f12..8753de7eb189 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -240,6 +240,10 @@
> * - add FUSE_COPY_FILE_RANGE_64
> * - add struct fuse_copy_file_range_out
> * - add FUSE_NOTIFY_PRUNE
> + *
> + * 7.46
> + * - add FUSE_URING_BUFRING flag
> + * - add fuse_uring_cmd_req init struct
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -1263,7 +1267,13 @@ struct fuse_uring_ent_in_out {
>
> /* size of user payload buffer */
> uint32_t payload_sz;
> - uint32_t padding;
> +
> + /*
> + * if using bufrings, this is the id of the selected buffer.
> + * the selected buffer holds the request payload
> + */
> + uint16_t buf_id;
> + uint16_t padding;
>
> uint64_t reserved;
> };
> @@ -1294,6 +1304,9 @@ enum fuse_uring_cmd {
> FUSE_IO_URING_CMD_COMMIT_AND_FETCH = 2,
> };
>
> +/* fuse_uring_cmd_req flags */
> +#define FUSE_URING_BUFRING (1 << 0)
> +
> /**
> * In the 80B command area of the SQE.
> */
> @@ -1305,7 +1318,17 @@ struct fuse_uring_cmd_req {
>
> /* queue the command is for (queue index) */
> uint16_t qid;
> - uint8_t padding[6];
> + uint16_t padding;
> +
> + union {
> + struct {
> + /* size of the bufring's backing buffers */
> + uint32_t buf_size;
> + /* number of entries in the queue */
> + uint16_t queue_depth;
> + uint16_t padding;
> + } init;
> + };
> };
>
> #endif /* _LINUX_FUSE_H */
Overall, this looks good though.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 11/14] fuse: add pinned headers capability for io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 11/14] fuse: add pinned headers capability for " Joanne Koong
2026-04-14 12:47 ` Bernd Schubert
@ 2026-04-30 11:22 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 11:22 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Allow fuse servers to pin their header buffers by setting the
> FUSE_URING_PINNED_HEADERS flag alongside FUSE_URING_BUFRING on REGISTER
> sqes. When set, the kernel pins the header pages, vmaps them for a
> kernel virtual address, and uses direct memcpy for copying. This avoids
> the per-request overhead of having to pin/unpin user pages and translate
> virtual addresses.
>
> Buffers must be page-aligned. The kernel accounts pinned pages against
> RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and tracks mm->pinned_vm.
> Unpinning is done in process context during connection abort, since vmap
> cannot run in softirq (where final destruction occurs via RCU).
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 228 ++++++++++++++++++++++++++++++++++++--
> fs/fuse/dev_uring_i.h | 23 +++-
> include/uapi/linux/fuse.h | 2 +
> 3 files changed, 243 insertions(+), 10 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 9f14a2bcde3f..79736b02cf9f 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -11,6 +11,7 @@
>
> #include <linux/fs.h>
> #include <linux/io_uring/cmd.h>
> +#include <linux/vmalloc.h>
>
> static bool __read_mostly enable_uring;
> module_param(enable_uring, bool, 0644);
> @@ -46,6 +47,11 @@ static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> return queue->bufring != NULL;
> }
>
> +static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_pinned_headers;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -200,6 +206,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
> return false;
> }
>
> +static void fuse_bufring_unpin_mem(struct fuse_bufring_pinned *mem)
> +{
> + struct page **pages = mem->pages;
> + unsigned int nr_pages = mem->nr_pages;
> + struct user_struct *user = mem->user;
> + struct mm_struct *mm_account = mem->mm_account;
> +
> + vunmap(mem->addr);
> + unpin_user_pages(pages, nr_pages);
> +
> + if (user) {
> + atomic_long_sub(nr_pages, &user->locked_vm);
> + free_uid(user);
> + }
> +
> + atomic64_sub(nr_pages, &mm_account->pinned_vm);
> + mmdrop(mm_account);
> +
> + kvfree(mem->pages);
> +}
> +
> +static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
> +{
> + struct fuse_bufring *br = queue->bufring;
> +
> + if (bufring_pinned_headers(queue)) {
> + fuse_bufring_unpin_mem(&br->pinned_headers);
> + br->use_pinned_headers = false;
> + }
> +}
> +
> void fuse_uring_destruct(struct fuse_conn *fc)
> {
> struct fuse_ring *ring = fc->ring;
> @@ -227,7 +264,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> }
>
> kfree(queue->fpq.processing);
> - kfree(queue->bufring);
> + if (bufring_enabled(queue)) {
> + fuse_uring_bufring_unpin(queue);
> + kfree(queue->bufring);
> + }
> kfree(queue);
> ring->queues[qid] = NULL;
> }
> @@ -309,14 +349,131 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> return 0;
> }
>
> +static struct page **fuse_uring_pin_user_pages(void __user *uaddr,
> + unsigned long len, int *npages)
> +{
> + unsigned long addr = (unsigned long)uaddr;
> + unsigned long start, end, nr_pages;
> + struct page **pages;
> + int pinned;
> +
> + if (check_add_overflow(addr, len, &end))
> + return ERR_PTR(-EOVERFLOW);
> + if (check_add_overflow(end, PAGE_SIZE - 1, &end))
> + return ERR_PTR(-EOVERFLOW);
> +
> + end = end >> PAGE_SHIFT;
> + start = addr >> PAGE_SHIFT;
> + nr_pages = end - start;
> + if (WARN_ON_ONCE(!nr_pages))
> + return ERR_PTR(-EINVAL);
> + if (WARN_ON_ONCE(nr_pages > INT_MAX))
> + return ERR_PTR(-EOVERFLOW);
> +
> + pages = kvmalloc_objs(struct page *, nr_pages, GFP_KERNEL_ACCOUNT);
> + if (!pages)
> + return ERR_PTR(-ENOMEM);
> +
> + pinned = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
> + pages);
> + /* success, mapped all pages */
> + if (pinned == nr_pages) {
> + *npages = nr_pages;
> + return pages;
> + }
> +
> + /* remove any partial pins */
> + if (pinned > 0)
> + unpin_user_pages(pages, pinned);
> +
> + kvfree(pages);
> +
> + return ERR_PTR(pinned < 0 ? pinned : -EFAULT);
> +}
> +
> +static int account_pinned_pages(struct fuse_bufring_pinned *mem,
> + struct page **pages, unsigned int nr_pages)
> +{
> + unsigned long page_limit, cur_pages, new_pages;
> + struct user_struct *user = current_user();
> +
> + if (!nr_pages)
> + return 0;
> +
> + if (!capable(CAP_IPC_LOCK)) {
> + /* Don't allow more pages than we can safely lock */
> + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +
> + cur_pages = atomic_long_read(&user->locked_vm);
> + do {
> + new_pages = cur_pages + nr_pages;
> + if (new_pages > page_limit)
> + return -ENOMEM;
> + } while (!atomic_long_try_cmpxchg(&user->locked_vm,
> + &cur_pages, new_pages));
> +
> + mem->user = get_uid(current_user());
> + }
> +
> + atomic64_add(nr_pages, ¤t->mm->pinned_vm);
> + mmgrab(current->mm);
> + mem->mm_account = current->mm;
> +
> + return 0;
> +}
> +
> +static int fuse_bufring_pin_mem(struct fuse_bufring_pinned *mem,
> + void __user *addr, size_t len)
> +{
> + struct page **pages = NULL;
> + int nr_pages;
> + int err;
> +
> + if (!PAGE_ALIGNED(addr))
> + return -EINVAL;
> +
> + pages = fuse_uring_pin_user_pages(addr, len, &nr_pages);
> + if (IS_ERR(pages))
> + return PTR_ERR(pages);
> +
> + err = account_pinned_pages(mem, pages, nr_pages);
> + if (err)
> + goto unpin;
> +
> + mem->addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> + if (!mem->addr) {
> + err = -ENOMEM;
> + goto unaccount;
> + }
> +
> + mem->pages = pages;
> + mem->nr_pages = nr_pages;
> +
> + return 0;
> +
> +unaccount:
> + if (mem->user) {
> + atomic_long_sub(nr_pages, &mem->user->locked_vm);
> + free_uid(mem->user);
> + }
> + atomic64_sub(nr_pages, ¤t->mm->pinned_vm);
> + mmdrop(mem->mm_account);
> +unpin:
> + unpin_user_pages(pages, nr_pages);
> + kvfree(pages);
> + return err;
> +}
> +
> static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> - struct fuse_ring_queue *queue)
> + struct fuse_ring_queue *queue,
> + u64 init_flags)
> {
> const struct fuse_uring_cmd_req *cmd_req =
> io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
> u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
> unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> struct iovec iov[FUSE_URING_IOV_SEGS];
> + bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> void __user *payload, *headers;
> size_t headers_size, payload_size, ring_size;
> struct fuse_bufring *br;
> @@ -354,7 +511,17 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> return -ENOMEM;
>
> br->queue_depth = queue_depth;
> - br->headers = headers;
> + if (pinned_headers) {
> + err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
> + headers_size);
> + if (err) {
> + kfree(br);
> + return err;
> + }
> + br->use_pinned_headers = true;
> + } else {
> + br->headers = headers;
> + }
>
> payload_addr = (uintptr_t)payload;
>
> @@ -385,8 +552,15 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> u64 init_flags)
> {
> bool bufring = init_flags & FUSE_URING_BUFRING;
> + bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> +
> + if (bufring_enabled(queue) != bufring)
> + return false;
> +
> + if (!bufring)
> + return true;
>
> - return bufring_enabled(queue) == bufring;
> + return bufring_pinned_headers(queue) == pinned_headers;
> }
>
> static struct fuse_ring_queue *
> @@ -423,7 +597,7 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> fuse_pqueue_init(&queue->fpq);
>
> if (use_bufring) {
> - int err = fuse_uring_bufring_setup(cmd, queue);
> + int err = fuse_uring_bufring_setup(cmd, queue, init_flags);
>
> if (err) {
> kfree(pq);
> @@ -437,8 +611,10 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> if (ring->queues[qid]) {
> spin_unlock(&fc->lock);
> kfree(queue->fpq.processing);
> - if (use_bufring)
> + if (use_bufring) {
> + fuse_uring_bufring_unpin(queue);
> kfree(queue->bufring);
> + }
> kfree(queue);
>
> queue = ring->queues[qid];
> @@ -605,6 +781,25 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
> }
> }
>
> +static void fuse_uring_unpin_queues(struct fuse_ring *ring)
> +{
> + int qid;
> +
> + for (qid = 0; qid < ring->nr_queues; qid++) {
> + struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
> + struct fuse_bufring *br;
> +
> + if (!queue)
> + continue;
> +
> + br = queue->bufring;
> + if (!br)
> + continue;
> +
> + fuse_uring_bufring_unpin(queue);
> + }
> +}
> +
> /*
> * Stop the ring queues
> */
> @@ -643,6 +838,9 @@ void fuse_uring_abort(struct fuse_conn *fc)
> fuse_uring_abort_end_requests(ring);
> fuse_uring_stop_queues(ring);
> }
> +
> + /* unpin while in process context - can't do this in softirq */
> + fuse_uring_unpin_queues(ring);
> }
>
> /*
> @@ -758,6 +956,11 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
> int buf_offset = offset +
> sizeof(struct fuse_uring_req_header) * ent->id;
>
> + if (bufring_pinned_headers(ent->queue)) {
> + memcpy(ent->queue->bufring->pinned_headers.addr + buf_offset,
> + header, header_size);
> + return 0;
> + }
> ring = ent->queue->bufring->headers + buf_offset;
> } else {
> ring = (void __user *)ent->headers + offset;
> @@ -785,6 +988,11 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> int buf_offset = offset +
> sizeof(struct fuse_uring_req_header) * ent->id;
>
> + if (bufring_pinned_headers(ent->queue)) {
> + memcpy(header, ent->queue->bufring->pinned_headers.addr + buf_offset,
> + header_size);
> + return 0;
> + }
> ring = ent->queue->bufring->headers + buf_offset;
> } else {
> ring = (void __user *)ent->headers + offset;
> @@ -1399,7 +1607,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
>
> static bool init_flags_valid(u64 init_flags)
> {
> - u64 valid_flags = FUSE_URING_BUFRING;
> + u64 valid_flags =
> + FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS;
> + bool bufring = init_flags & FUSE_URING_BUFRING;
> + bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> +
> + if (pinned_headers && !bufring)
> + return false;
>
> return !(init_flags & ~valid_flags);
> }
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 66d5d5f8dc3f..05c0f061a882 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -42,12 +42,29 @@ struct fuse_bufring_buf {
> unsigned int id;
> };
>
> -struct fuse_bufring {
> - /* pointer to the headers buffer */
> - void __user *headers;
> +struct fuse_bufring_pinned {
> + void *addr;
> + struct page **pages;
> + unsigned int nr_pages;
> +
> + /*
> + * need to track this so we can unpin / unaccount pages during teardown
> + * when not running in the server's task context
> + */
> + struct user_struct *user;
> + struct mm_struct *mm_account;
> +};
>
> +struct fuse_bufring {
> + bool use_pinned_headers: 1;
> unsigned int queue_depth;
>
> + union {
> + /* pointer to the headers buffer */
> + void __user *headers;
> + struct fuse_bufring_pinned pinned_headers;
> + };
> +
> /* metadata tracking state of the bufring */
> unsigned int nbufs;
> unsigned int head;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 8753de7eb189..e57244c03d42 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -244,6 +244,7 @@
> * 7.46
> * - add FUSE_URING_BUFRING flag
> * - add fuse_uring_cmd_req init struct
> + * - add FUSE_URING_PINNED_HEADERS flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -1306,6 +1307,7 @@ enum fuse_uring_cmd {
>
> /* fuse_uring_cmd_req flags */
> #define FUSE_URING_BUFRING (1 << 0)
> +#define FUSE_URING_PINNED_HEADERS (1 << 1)
>
> /**
> * In the 80B command area of the SQE.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 12/14] fuse: add pinned payload buffers capability for io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 12/14] fuse: add pinned payload buffers " Joanne Koong
@ 2026-04-30 11:29 ` Jeff Layton
0 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 11:29 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Extend the buffer ring pinning capability to payload buffers via the
> FUSE_URING_PINNED_BUFFERS flag. When set alongside FUSE_URING_BUFRING,
> the kernel pins and vmaps the payload buffer region during queue setup.
>
> With pinned payloads, the kernel uses direct memcpy for all payload
> buffer copies, avoiding the per-request overhead of pinning/unpinning
> user pages and translating virtual addresses. This is particularly
> beneficial for large payload copies.
>
> As with pinned headers, buffers must be page-aligned. Pinned pages are
> accounted against RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and
> unpinned in process context during connection abort.
>
> In benchmarks using passthrough_hp on a high-performance NVMe-backed
> system, pinned headers and pinned payload buffers showed around a 10%
> throughput improvement for direct randreads (~2150 MiB/s to ~2400
> MiB/s), a 4% improvement for direct sequential reads (~2510 MiB/s to
> ~2620 MiB/s), a 8% improvement for buffered randreads (~2100 MiB/s to
> ~2280 MiB/s), and a 6% improvement for buffered sequential reads (~2500
> MiB/s to ~2670 MiB/s).
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 54 +++++++++++++++++++++++++++++++++------
> fs/fuse/dev_uring_i.h | 4 +++
> include/uapi/linux/fuse.h | 2 ++
> 3 files changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 79736b02cf9f..06d3d8dc1c82 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -52,6 +52,11 @@ static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
> return queue->bufring->use_pinned_headers;
> }
>
> +static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_pinned_buffers;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -235,6 +240,11 @@ static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
> fuse_bufring_unpin_mem(&br->pinned_headers);
> br->use_pinned_headers = false;
> }
> +
> + if (bufring_pinned_buffers(queue)) {
> + fuse_bufring_unpin_mem(&br->pinned_bufs);
> + br->use_pinned_buffers = false;
> + }
> }
>
> void fuse_uring_destruct(struct fuse_conn *fc)
> @@ -474,6 +484,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> struct iovec iov[FUSE_URING_IOV_SEGS];
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> + bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> void __user *payload, *headers;
> size_t headers_size, payload_size, ring_size;
> struct fuse_bufring *br;
> @@ -523,7 +534,22 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> br->headers = headers;
> }
>
> - payload_addr = (uintptr_t)payload;
> + if (pinned_bufs) {
> + err = fuse_bufring_pin_mem(&br->pinned_bufs, payload,
> + payload_size);
> + if (err) {
> + if (pinned_headers)
> + fuse_bufring_unpin_mem(&br->pinned_headers);
> + kfree(br);
> + return err;
> + }
> + br->use_pinned_buffers = true;
> + }
> +
> + if (pinned_bufs)
> + payload_addr = (uintptr_t)br->pinned_bufs.addr;
> + else
> + payload_addr = (uintptr_t)payload;
>
> /* populate the ring buffer */
> for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
> @@ -553,6 +579,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> {
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> + bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
>
> if (bufring_enabled(queue) != bufring)
> return false;
> @@ -560,7 +587,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> if (!bufring)
> return true;
>
> - return bufring_pinned_headers(queue) == pinned_headers;
> + return bufring_pinned_headers(queue) == pinned_headers &&
> + bufring_pinned_buffers(queue) == pinned_bufs;
> }
>
> static struct fuse_ring_queue *
> @@ -1011,13 +1039,15 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> struct fuse_ring_ent *ent, int dir,
> struct iov_iter *iter)
> {
> - void __user *payload;
> + void __user *payload = NULL;
> + bool use_bufring = bufring_enabled(ent->queue);
> + bool pinned_buffers = use_bufring && bufring_pinned_buffers(ent->queue);
> int err;
>
> - if (bufring_enabled(ent->queue))
> - payload = (void __user *)ent->payload_buf.addr;
> - else
> + if (!use_bufring)
> payload = ent->payload;
> + else if (!pinned_buffers)
> + payload = (void __user *)ent->payload_buf.addr;
>
> if (payload) {
> err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
> @@ -1029,6 +1059,12 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
>
> fuse_copy_init(cs, dir == ITER_DEST, iter);
>
> + if (pinned_buffers) {
> + cs->is_kaddr = true;
> + cs->kaddr = (void *)ent->payload_buf.addr;
> + cs->len = ent->payload_buf.len;
> + }
> +
> cs->is_uring = true;
> cs->req = req;
>
> @@ -1608,11 +1644,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> static bool init_flags_valid(u64 init_flags)
> {
> u64 valid_flags =
> - FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS;
> + FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS |
> + FUSE_URING_PINNED_BUFFERS;
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> + bool pinned_buffers = init_flags & FUSE_URING_PINNED_BUFFERS;
>
> - if (pinned_headers && !bufring)
> + if (!bufring && (pinned_headers || pinned_buffers))
> return false;
>
> return !(init_flags & ~valid_flags);
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 05c0f061a882..859ee4e6ba03 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -57,6 +57,7 @@ struct fuse_bufring_pinned {
>
> struct fuse_bufring {
> bool use_pinned_headers: 1;
> + bool use_pinned_buffers: 1;
> unsigned int queue_depth;
>
> union {
> @@ -65,6 +66,9 @@ struct fuse_bufring {
> struct fuse_bufring_pinned pinned_headers;
> };
>
> + /* only used if the buffers are pinned */
> + struct fuse_bufring_pinned pinned_bufs;
> +
> /* metadata tracking state of the bufring */
> unsigned int nbufs;
> unsigned int head;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index e57244c03d42..51ecb66dd6eb 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -245,6 +245,7 @@
> * - add FUSE_URING_BUFRING flag
> * - add fuse_uring_cmd_req init struct
> * - add FUSE_URING_PINNED_HEADERS flag
> + * - add FUSE_URING_PINNED_BUFFERS flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -1308,6 +1309,7 @@ enum fuse_uring_cmd {
> /* fuse_uring_cmd_req flags */
> #define FUSE_URING_BUFRING (1 << 0)
> #define FUSE_URING_PINNED_HEADERS (1 << 1)
> +#define FUSE_URING_PINNED_BUFFERS (1 << 2)
>
> /**
> * In the 80B command area of the SQE.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
@ 2026-04-30 11:42 ` Jeff Layton
2026-04-30 12:35 ` Joanne Koong
2026-04-30 12:56 ` Jeff Layton
2026-05-05 23:45 ` Bernd Schubert
2 siblings, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 11:42 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Implement zero-copy data transfer for fuse over io-uring, eliminating
> memory copies between userspace, the kernel, and the fuse server for
> page-backed read/write operations.
>
> When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> the kernel registers the client's underlying pages as a sparse buffer at
> the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> then perform io_uring read/write operations directly on these pages.
> Non-page-backed args (eg out headers) go through the payload buffer as
> normal.
>
> This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> buffers. Gating on pinned headers and buffers keeps the configuration
> space small and avoids partially-optimized modes that are unlikely to be
> useful in practice. Pages are unregistered when the request completes.
>
Can you elaborate a bit more on why CAP_SYS_ADMIN is needed here? It's
not immediately obvious to me.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-30 11:42 ` Jeff Layton
@ 2026-04-30 12:35 ` Joanne Koong
2026-04-30 12:55 ` Jeff Layton
0 siblings, 1 reply; 49+ messages in thread
From: Joanne Koong @ 2026-04-30 12:35 UTC (permalink / raw)
To: Jeff Layton; +Cc: miklos, bernd, axboe, linux-fsdevel
On Thu, Apr 30, 2026 at 12:42 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> > Implement zero-copy data transfer for fuse over io-uring, eliminating
> > memory copies between userspace, the kernel, and the fuse server for
> > page-backed read/write operations.
> >
> > When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> > the kernel registers the client's underlying pages as a sparse buffer at
> > the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> > then perform io_uring read/write operations directly on these pages.
> > Non-page-backed args (eg out headers) go through the payload buffer as
> > normal.
> >
> > This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> > buffers. Gating on pinned headers and buffers keeps the configuration
> > space small and avoids partially-optimized modes that are unlikely to be
> > useful in practice. Pages are unregistered when the request completes.
> >
>
> Can you elaborate a bit more on why CAP_SYS_ADMIN is needed here? It's
> not immediately obvious to me.
Thank you for reviewing this series, Jeff!
This is gated behind CAP_SYS_ADMIN because zero-copy allows the server
direct access to the client's underlying pages, rather than operating
on an intermediary buffer that the contents of client's pages were
copied into. A malicious unprivileged server could keep direct access
to the client's pages (eg even if the client tries to cancel a
read/write, if the request was already sent to userspace, the server
will still have access to the underlying pages). In the non-zero-copy
path this isn't possible because the server only operates on the copy
of the pages and not on the actual pages.
Thanks,
Joanne
>
> --
> Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-30 11:08 ` Jeff Layton
@ 2026-04-30 12:44 ` Joanne Koong
0 siblings, 0 replies; 49+ messages in thread
From: Joanne Koong @ 2026-04-30 12:44 UTC (permalink / raw)
To: Jeff Layton; +Cc: miklos, bernd, axboe, linux-fsdevel
On Thu, Apr 30, 2026 at 12:08 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> > Add fuse buffer rings for servers communicating through the io-uring
> > interface. To use this, the server must set the FUSE_URING_BUFRING
> > flag and provide header and payload buffers via an iovec array in the
> > sqe during registration. The payload buffers are used to back the buffer
> > ring. The kernel manages buffer selection and recycling through a simple
> > internal ring.
> >
> > This has the following advantages over the non-bufring (iovec) path:
> > - Reduced memory usage: in the iovec path, each entry has its own
> > dedicated payload buffer, requiring N buffers for N entries where each
> > buffer must be large enough to accommodate the maximum possible
> > payload size. With buffer rings, payload buffers are pooled and
> > selected on demand. Entries only hold a buffer while actively
> > processing a request with payload data. When incremental buffer
> > consumption is added, this will allow non-overlapping regions of a
> > single buffer to be used simultaneously across multiple requests,
> > further reducing memory requirements.
> > - Foundation for pinned buffers: the buffer ring headers and payloads
> > are now each passed in as a contiguous memory allocation, which allows
> > fuse to easily pin and vmap the entire region in one operation during
> > queue setup. This will eliminate the per-request overhead of having to
> > pin/unpin user pages and translate virtual addresses and is a
> > prerequisite for future optimizations like performing data copies
> > outside of the server's task context.
> >
> > Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
> > header slot in the headers buffer. Payload buffers are selected from
> > the ring on demand and recycled after each request. Buffer ring usage is
> > set on a per-queue basis. All subsequent registration SQEs for the same
> > queue must use consistent flags.
> >
> > The headers are laid out contiguously and provided via iov[0]. Each slot
> > maps to ent->id:
> >
> > > <- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
> > +------------------------------+------------------------------+-----+
> > > struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
> > > [ent id=0] | [ent id=1] | |
> > +------------------------------+------------------------------+-----+
> >
> > On the server side, the ent id is used to determine where in the headers
> > buffer the headers data for the ent resides. This is done by
> > calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
> > into the headers buffer.
> >
> > The buffer ring is backed by the payload buffer, which is contiguous but
> > partitioned into individual bufs according to the buf_size passed in at
> > registration.
> >
> > PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
> > |<-------------- payload_size ------------>|
> > +--------- --+-----------+-----------+-----+
> > | buf [0] | buf [1] | buf [2] | ... |
> > | buf_size | buf_size | buf_size | ... |
> > +--------- --+-----------+-----------+-----+
> >
> > buffer ring state (struct fuse_bufring, kernel-internal):
> > bufs[]: [ used | used | FREE | FREE | FREE ]
> > ^^^^^^^^^^^^^^^^^^^
> > available for selection
> >
> > The buffer ring logic is as follows:
> > select: buf = bufs[head % nbufs]; head++
> > recycle: bufs[tail % nbufs] = buf; tail++
> > empty: tail == head (no buffers available)
> > full: tail - head >= nbufs
> >
> > Buffer ring request flow
> > ------------------------
> > > Kernel | FUSE daemon
> > > |
> > > [client request arrives] |
> > > >fuse_uring_send() |
> > > [select payload buf from ring] |
> > > >fuse_uring_select_buffer() |
> > > [copy headers to ent's header slot] |
> > > >copy_header_to_ring() |
> > > [copy payload to selected buf] |
> > > >fuse_uring_copy_to_ring() |
> > > [set buf_id in ent_in_out header] |
> > > >io_uring_cmd_done() |
> > > | [CQE received]
> > > | [read headers from header
> > > | slot]
> > > | [read payload from buf_id]
> > > | [process request]
> > > | [write reply to header
> > > | slot]
> > > | [write reply payload to
> > > | buf]
> > > | >io_uring_submit()
> > > | COMMIT_AND_FETCH
> > > >fuse_uring_commit_fetch() |
> > > >fuse_uring_commit() |
> > > [copy reply from ring] |
> > > >fuse_uring_recycle_buffer() |
> > > >fuse_uring_get_next_fuse_req() |
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
> > fs/fuse/dev_uring_i.h | 45 ++++-
> > include/uapi/linux/fuse.h | 27 ++-
> > 3 files changed, 381 insertions(+), 54 deletions(-)
> >
> > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > index a061f175b3fd..9f14a2bcde3f 100644
> > --- a/fs/fuse/dev_uring.c
> > +++ b/fs/fuse/dev_uring.c
> > @@ -41,6 +41,11 @@ enum fuse_uring_header_type {
> > FUSE_URING_HEADER_RING_ENT,
> > };
> >
> > +static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> > +{
> > + return queue->bufring != NULL;
> > +}
> > +
> > static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> > struct fuse_ring_ent *ring_ent)
> > {
> > @@ -222,6 +227,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> > }
> >
> > kfree(queue->fpq.processing);
> > + kfree(queue->bufring);
> > kfree(queue);
> > ring->queues[qid] = NULL;
> > }
> > @@ -303,20 +309,102 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> > return 0;
> > }
> >
> > -static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> > - int qid)
> > +static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> > + struct fuse_ring_queue *queue)
> > +{
> > + const struct fuse_uring_cmd_req *cmd_req =
> > + io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
> > + u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
> > + unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> > + struct iovec iov[FUSE_URING_IOV_SEGS];
> > + void __user *payload, *headers;
> > + size_t headers_size, payload_size, ring_size;
> > + struct fuse_bufring *br;
> > + unsigned int nr_bufs, i;
> > + uintptr_t payload_addr;
> > + int err;
> > +
> > + if (!queue_depth || !buf_size)
> > + return -EINVAL;
> > +
> > + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> > + if (err)
> > + return err;
> > +
> > + headers = iov[FUSE_URING_IOV_HEADERS].iov_base;
> > + headers_size = iov[FUSE_URING_IOV_HEADERS].iov_len;
> > + payload = iov[FUSE_URING_IOV_PAYLOAD].iov_base;
> > + payload_size = iov[FUSE_URING_IOV_PAYLOAD].iov_len;
> > +
> > + /* check if there's enough space for all the headers */
> > + if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> > + return -EINVAL;
> > +
> > + if (buf_size < queue->ring->max_payload_sz)
> > + return -EINVAL;
> > +
> > + nr_bufs = payload_size / buf_size;
> > + if (!nr_bufs || nr_bufs > U16_MAX)
>
> What's the significance of U16_MAX here? It looks like the br->nbufs
> field is an unsigned int. Is it because struct fuse_uring_ent_in_out
> has buf_id as a u16?
Yes, this is because in the uapi, struct fuse_uring_ent_in_out's
buf_id field is set to a uint16_t.
>
> Not that I think you'll ever need more than 2^16 buffers, just curious
> about the limitation.
>
> > + return -EINVAL;
> > +
> > + /* create the ring buffer */
> > + ring_size = struct_size(br, bufs, nr_bufs);
> > + br = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
> > + if (!br)
> > + return -ENOMEM;
> > +
> > + br->queue_depth = queue_depth;
> > + br->headers = headers;
> > +
> > + payload_addr = (uintptr_t)payload;
> > +
> > + /* populate the ring buffer */
> > + for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
> > + struct fuse_bufring_buf *buf = &br->bufs[i];
> > +
> > + buf->addr = payload_addr;
> > + buf->len = buf_size;
> > + buf->id = i;
> > + }
> > +
> > + br->nbufs = nr_bufs;
> > + br->tail = nr_bufs;
> > +
> > + queue->bufring = br;
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * if the queue is already registered, check that the queue was initialized with
> > + * the same init flags set for this FUSE_IO_URING_CMD_REGISTER cmd. all
> > + * FUSE_IO_URING_CMD_REGISTER cmds should have the same init fields set on a
> > + * per-queue basis.
> > + */
> > +static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> > + u64 init_flags)
> > {
> > + bool bufring = init_flags & FUSE_URING_BUFRING;
> > +
> > + return bufring_enabled(queue) == bufring;
> > +}
> > +
> > +static struct fuse_ring_queue *
> > +fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> > + int qid, u64 init_flags)
> > +{
> > + bool use_bufring = init_flags & FUSE_URING_BUFRING;
> > struct fuse_conn *fc = ring->fc;
> > struct fuse_ring_queue *queue;
> > struct list_head *pq;
> >
> > queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
> > if (!queue)
> > - return NULL;
> > + return ERR_PTR(-ENOMEM);
> > pq = kzalloc_objs(struct list_head, FUSE_PQ_HASH_SIZE);
> > if (!pq) {
> > kfree(queue);
> > - return NULL;
> > + return ERR_PTR(-ENOMEM);
> > }
> >
> > queue->qid = qid;
> > @@ -334,12 +422,29 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> > queue->fpq.processing = pq;
> > fuse_pqueue_init(&queue->fpq);
> >
> > + if (use_bufring) {
> > + int err = fuse_uring_bufring_setup(cmd, queue);
> > +
> > + if (err) {
> > + kfree(pq);
> > + kfree(queue);
> > + return ERR_PTR(err);
> > + }
> > + }
> > +
> > spin_lock(&fc->lock);
> > + /* check if the queue creation raced with another thread */
> > if (ring->queues[qid]) {
> > spin_unlock(&fc->lock);
> > kfree(queue->fpq.processing);
> > + if (use_bufring)
> > + kfree(queue->bufring);
>
> nit: presumably you could skip the if here. If use_bufring is false,
> then queue->bufring _should_ be NULL.
I will drop this if check in the next version.
Thank you for taking the time to review this series.
Thanks,
Joanne
>
> > kfree(queue);
> > - return ring->queues[qid];
> > +
> > + queue = ring->queues[qid];
> > + if (!queue_init_flags_consistent(queue, init_flags))
> > + return ERR_PTR(-EINVAL);
> > + return queue;
> > }
> >
> > /*
> > @@ -649,7 +754,14 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
> > if (offset < 0)
> > return offset;
> >
> > - ring = (void __user *)ent->headers + offset;
> > + if (bufring_enabled(ent->queue)) {
> > + int buf_offset = offset +
> > + sizeof(struct fuse_uring_req_header) * ent->id;
> > +
> > + ring = ent->queue->bufring->headers + buf_offset;
> > + } else {
> > + ring = (void __user *)ent->headers + offset;
> > + }
> >
> > if (copy_to_user(ring, header, header_size)) {
> > pr_info_ratelimited("Copying header to ring failed.\n");
> > @@ -669,7 +781,14 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> > if (offset < 0)
> > return offset;
> >
> > - ring = (void __user *)ent->headers + offset;
> > + if (bufring_enabled(ent->queue)) {
> > + int buf_offset = offset +
> > + sizeof(struct fuse_uring_req_header) * ent->id;
> > +
> > + ring = ent->queue->bufring->headers + buf_offset;
> > + } else {
> > + ring = (void __user *)ent->headers + offset;
> > + }
> >
> > if (copy_from_user(header, ring, header_size)) {
> > pr_info_ratelimited("Copying header from ring failed.\n");
> > @@ -684,12 +803,20 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> > struct fuse_ring_ent *ent, int dir,
> > struct iov_iter *iter)
> > {
> > + void __user *payload;
> > int err;
> >
> > - err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
> > - if (err) {
> > - pr_info_ratelimited("fuse: Import of user buffer failed\n");
> > - return err;
> > + if (bufring_enabled(ent->queue))
> > + payload = (void __user *)ent->payload_buf.addr;
> > + else
> > + payload = ent->payload;
> > +
> > + if (payload) {
> > + err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
> > + if (err) {
> > + pr_info_ratelimited("fuse: Import of user buffer failed\n");
> > + return err;
> > + }
> > }
> >
> > fuse_copy_init(cs, dir == ITER_DEST, iter);
> > @@ -741,6 +868,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> > .commit_id = req->in.h.unique,
> > };
> >
> > + if (bufring_enabled(ent->queue))
> > + ent_in_out.buf_id = ent->payload_buf.id;
> > +
> > err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> > if (err)
> > return err;
> > @@ -805,6 +935,96 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> > sizeof(req->in.h));
> > }
> >
> > +static bool fuse_uring_req_has_payload(struct fuse_req *req)
> > +{
> > + struct fuse_args *args = req->args;
> > +
> > + return args->in_numargs > 1 || args->out_numargs;
> > +}
> > +
> > +static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
> > + __must_hold(&ent->queue->lock)
> > +{
> > + struct fuse_ring_queue *queue = ent->queue;
> > + struct fuse_bufring *br = queue->bufring;
> > + struct fuse_bufring_buf *buf;
> > + unsigned int tail = br->tail, head = br->head;
> > +
> > + lockdep_assert_held(&queue->lock);
> > +
> > + /* Get a buffer to use for the payload */
> > + if (tail == head)
> > + return -ENOBUFS;
> > +
> > + buf = &br->bufs[head % br->nbufs];
> > + br->head++;
> > +
> > + ent->payload_buf = *buf;
> > +
> > + return 0;
> > +}
> > +
> > +static void fuse_uring_recycle_buffer(struct fuse_ring_ent *ent)
> > + __must_hold(&ent->queue->lock)
> > +{
> > + struct fuse_bufring_buf *ent_payload = &ent->payload_buf;
> > + struct fuse_ring_queue *queue = ent->queue;
> > + struct fuse_bufring_buf *buf;
> > + struct fuse_bufring *br;
> > +
> > + lockdep_assert_held(&queue->lock);
> > +
> > + if (!bufring_enabled(queue) || !ent_payload->addr)
> > + return;
> > +
> > + br = queue->bufring;
> > +
> > + /* ring should never be full */
> > + WARN_ON_ONCE(br->tail - br->head >= br->nbufs);
> > +
> > + buf = &br->bufs[(br->tail) % br->nbufs];
> > +
> > + *buf = *ent_payload;
> > +
> > + br->tail++;
> > +
> > + memset(ent_payload, 0, sizeof(*ent_payload));
> > +}
> > +
> > +static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
> > + struct fuse_req *req)
> > +{
> > + bool buffer_selected;
> > + bool has_payload;
> > +
> > + if (!bufring_enabled(ent->queue))
> > + return 0;
> > +
> > + buffer_selected = !!ent->payload_buf.addr;
> > + has_payload = fuse_uring_req_has_payload(req);
> > +
> > + if (has_payload && !buffer_selected)
> > + return fuse_uring_select_buffer(ent);
> > +
> > + if (!has_payload && buffer_selected)
> > + fuse_uring_recycle_buffer(ent);
> > +
> > + return 0;
> > +}
> > +
> > +static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
> > + struct fuse_req *req)
> > +{
> > + if (!bufring_enabled(ent->queue))
> > + return 0;
> > +
> > + /* no payload to copy, can skip selecting a buffer */
> > + if (!fuse_uring_req_has_payload(req))
> > + return 0;
> > +
> > + return fuse_uring_select_buffer(ent);
> > +}
> > +
> > static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
> > struct fuse_req *req)
> > {
> > @@ -878,10 +1098,21 @@ static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
> >
> > /* get and assign the next entry while it is still holding the lock */
> > req = list_first_entry_or_null(req_queue, struct fuse_req, list);
> > - if (req)
> > - fuse_uring_add_req_to_ring_ent(ent, req);
> > + if (req) {
> > + int err = fuse_uring_next_req_update_buffer(ent, req);
> >
> > - return req;
> > + if (!err) {
> > + fuse_uring_add_req_to_ring_ent(ent, req);
> > + return req;
> > + }
> > + }
> > +
> > + /*
> > + * Buffer selection may fail if all the buffers are currently saturated.
> > + * The request will be serviced when a buffer is freed up.
> > + */
> > + fuse_uring_recycle_buffer(ent);
> > + return NULL;
> > }
> >
> > /*
> > @@ -1041,6 +1272,12 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> > * fuse requests would otherwise not get processed - committing
> > * and fetching is done in one step vs legacy fuse, which has separated
> > * read (fetch request) and write (commit result).
> > + *
> > + * If the server is using bufrings and has populated the ring with less
> > + * payload buffers than ents, it is possible that there may not be an
> > + * available buffer for the next request. If so, then the fetch is a
> > + * no-op and the next request will be serviced when a buffer becomes
> > + * available.
> > */
> > if (fuse_uring_get_next_fuse_req(ent, queue))
> > fuse_uring_send(ent, cmd, 0, issue_flags);
> > @@ -1120,30 +1357,38 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> >
> > ent->queue = queue;
> >
> > - err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> > - if (err) {
> > - pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> > - err);
> > - goto error;
> > - }
> > + if (bufring_enabled(queue)) {
> > + ent->id = READ_ONCE(cmd->sqe->buf_index);
> > + if (ent->id >= queue->bufring->queue_depth) {
> > + err = -EINVAL;
> > + goto error;
> > + }
> > + } else {
> > + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> > + if (err) {
> > + pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
> > + err);
> > + goto error;
> > + }
> >
> > - err = -EINVAL;
> > - headers = &iov[FUSE_URING_IOV_HEADERS];
> > - if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> > - pr_info_ratelimited("Invalid header len %zu\n", headers->iov_len);
> > - goto error;
> > - }
> > + err = -EINVAL;
> > + headers = &iov[FUSE_URING_IOV_HEADERS];
> > + if (headers->iov_len < sizeof(struct fuse_uring_req_header)) {
> > + pr_info_ratelimited("Invalid header len %zu\n",
> > + headers->iov_len);
> > + goto error;
> > + }
> >
> > - payload = &iov[FUSE_URING_IOV_PAYLOAD];
> > - if (payload->iov_len < ring->max_payload_sz) {
> > - pr_info_ratelimited("Invalid req payload len %zu\n",
> > - payload->iov_len);
> > - goto error;
> > + payload = &iov[FUSE_URING_IOV_PAYLOAD];
> > + if (payload->iov_len < ring->max_payload_sz) {
> > + pr_info_ratelimited("Invalid req payload len %zu\n",
> > + payload->iov_len);
> > + goto error;
> > + }
> > + ent->headers = headers->iov_base;
> > + ent->payload = payload->iov_base;
> > }
> >
> > - ent->headers = headers->iov_base;
> > - ent->payload = payload->iov_base;
> > -
> > atomic_inc(&ring->queue_refs);
> > return ent;
> >
> > @@ -1152,6 +1397,13 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
> > return ERR_PTR(err);
> > }
> >
> > +static bool init_flags_valid(u64 init_flags)
> > +{
> > + u64 valid_flags = FUSE_URING_BUFRING;
> > +
> > + return !(init_flags & ~valid_flags);
> > +}
> > +
> > /*
> > * Register header and payload buffer with the kernel and puts the
> > * entry as "ready to get fuse requests" on the queue
> > @@ -1161,6 +1413,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> > {
> > const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe128_cmd(cmd->sqe,
> > struct fuse_uring_cmd_req);
> > + u64 init_flags = READ_ONCE(cmd_req->flags);
> > struct fuse_ring *ring = smp_load_acquire(&fc->ring);
> > struct fuse_ring_queue *queue;
> > struct fuse_ring_ent *ent;
> > @@ -1179,11 +1432,16 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> > return -EINVAL;
> > }
> >
> > + if (!init_flags_valid(init_flags))
> > + return -EINVAL;
> > +
> > queue = ring->queues[qid];
> > if (!queue) {
> > - queue = fuse_uring_create_queue(ring, qid);
> > - if (!queue)
> > - return err;
> > + queue = fuse_uring_create_queue(cmd, ring, qid, init_flags);
> > + if (IS_ERR(queue))
> > + return PTR_ERR(queue);
> > + } else if (!queue_init_flags_consistent(queue, init_flags)) {
> > + return -EINVAL;
> > }
> >
> > /*
> > @@ -1349,14 +1607,18 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> > req->ring_queue = queue;
> > ent = list_first_entry_or_null(&queue->ent_avail_queue,
> > struct fuse_ring_ent, list);
> > - if (ent)
> > - fuse_uring_add_req_to_ring_ent(ent, req);
> > - else
> > - list_add_tail(&req->list, &queue->fuse_req_queue);
> > - spin_unlock(&queue->lock);
> > + if (ent) {
> > + err = fuse_uring_prep_buffer(ent, req);
> > + if (!err) {
> > + fuse_uring_add_req_to_ring_ent(ent, req);
> > + spin_unlock(&queue->lock);
> > + fuse_uring_dispatch_ent(ent);
> > + return;
> > + }
> > + }
> >
> > - if (ent)
> > - fuse_uring_dispatch_ent(ent);
> > + list_add_tail(&req->list, &queue->fuse_req_queue);
> > + spin_unlock(&queue->lock);
> >
> > return;
> >
> > @@ -1406,14 +1668,17 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> > req = list_first_entry_or_null(&queue->fuse_req_queue, struct fuse_req,
> > list);
> > if (ent && req) {
> > - fuse_uring_add_req_to_ring_ent(ent, req);
> > - spin_unlock(&queue->lock);
> > + int err = fuse_uring_prep_buffer(ent, req);
> >
> > - fuse_uring_dispatch_ent(ent);
> > - } else {
> > - spin_unlock(&queue->lock);
> > + if (!err) {
> > + fuse_uring_add_req_to_ring_ent(ent, req);
> > + spin_unlock(&queue->lock);
> > + fuse_uring_dispatch_ent(ent);
> > + return true;
> > + }
> > }
> >
> > + spin_unlock(&queue->lock);
> > return true;
> > }
> >
> > diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> > index 349418db3374..66d5d5f8dc3f 100644
> > --- a/fs/fuse/dev_uring_i.h
> > +++ b/fs/fuse/dev_uring_i.h
> > @@ -36,11 +36,47 @@ enum fuse_ring_req_state {
> > FRRS_RELEASED,
> > };
> >
> > +struct fuse_bufring_buf {
> > + uintptr_t addr;
> > + unsigned int len;
> > + unsigned int id;
> > +};
> > +
> > +struct fuse_bufring {
> > + /* pointer to the headers buffer */
> > + void __user *headers;
> > +
> > + unsigned int queue_depth;
> > +
> > + /* metadata tracking state of the bufring */
> > + unsigned int nbufs;
> > + unsigned int head;
> > + unsigned int tail;
> > +
> > + /* the buffers backing the ring */
> > + __DECLARE_FLEX_ARRAY(struct fuse_bufring_buf, bufs);
> > +};
> > +
> > /** A fuse ring entry, part of the ring queue */
> > struct fuse_ring_ent {
> > - /* userspace buffer */
> > - struct fuse_uring_req_header __user *headers;
> > - void __user *payload;
> > + union {
> > + /* if bufrings are not used */
> > + struct {
> > + /* userspace buffers */
> > + struct fuse_uring_req_header __user *headers;
> > + void __user *payload;
> > + };
> > + /* if bufrings are used */
> > + struct {
> > + /*
> > + * unique fixed id for the ent. used by kernel/server to
> > + * locate where in the headers buffer the data for this
> > + * ent resides
> > + */
> > + unsigned int id;
> > + struct fuse_bufring_buf payload_buf;
> > + };
> > + };
> >
> > /* the ring queue that owns the request */
> > struct fuse_ring_queue *queue;
> > @@ -99,6 +135,9 @@ struct fuse_ring_queue {
> > unsigned int active_background;
> >
> > bool stopped;
> > +
> > + /* only allocated if the server uses bufrings */
> > + struct fuse_bufring *bufring;
> > };
> >
> > /**
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index c13e1f9a2f12..8753de7eb189 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -240,6 +240,10 @@
> > * - add FUSE_COPY_FILE_RANGE_64
> > * - add struct fuse_copy_file_range_out
> > * - add FUSE_NOTIFY_PRUNE
> > + *
> > + * 7.46
> > + * - add FUSE_URING_BUFRING flag
> > + * - add fuse_uring_cmd_req init struct
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -1263,7 +1267,13 @@ struct fuse_uring_ent_in_out {
> >
> > /* size of user payload buffer */
> > uint32_t payload_sz;
> > - uint32_t padding;
> > +
> > + /*
> > + * if using bufrings, this is the id of the selected buffer.
> > + * the selected buffer holds the request payload
> > + */
> > + uint16_t buf_id;
> > + uint16_t padding;
> >
> > uint64_t reserved;
> > };
> > @@ -1294,6 +1304,9 @@ enum fuse_uring_cmd {
> > FUSE_IO_URING_CMD_COMMIT_AND_FETCH = 2,
> > };
> >
> > +/* fuse_uring_cmd_req flags */
> > +#define FUSE_URING_BUFRING (1 << 0)
> > +
> > /**
> > * In the 80B command area of the SQE.
> > */
> > @@ -1305,7 +1318,17 @@ struct fuse_uring_cmd_req {
> >
> > /* queue the command is for (queue index) */
> > uint16_t qid;
> > - uint8_t padding[6];
> > + uint16_t padding;
> > +
> > + union {
> > + struct {
> > + /* size of the bufring's backing buffers */
> > + uint32_t buf_size;
> > + /* number of entries in the queue */
> > + uint16_t queue_depth;
> > + uint16_t padding;
> > + } init;
> > + };
> > };
> >
> > #endif /* _LINUX_FUSE_H */
>
> Overall, this looks good though.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-30 12:35 ` Joanne Koong
@ 2026-04-30 12:55 ` Jeff Layton
2026-05-05 22:55 ` Bernd Schubert
0 siblings, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 12:55 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, axboe, linux-fsdevel
On Thu, 2026-04-30 at 13:35 +0100, Joanne Koong wrote:
> On Thu, Apr 30, 2026 at 12:42 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> > > Implement zero-copy data transfer for fuse over io-uring, eliminating
> > > memory copies between userspace, the kernel, and the fuse server for
> > > page-backed read/write operations.
> > >
> > > When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> > > the kernel registers the client's underlying pages as a sparse buffer at
> > > the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> > > then perform io_uring read/write operations directly on these pages.
> > > Non-page-backed args (eg out headers) go through the payload buffer as
> > > normal.
> > >
> > > This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> > > buffers. Gating on pinned headers and buffers keeps the configuration
> > > space small and avoids partially-optimized modes that are unlikely to be
> > > useful in practice. Pages are unregistered when the request completes.
> > >
> >
> > Can you elaborate a bit more on why CAP_SYS_ADMIN is needed here? It's
> > not immediately obvious to me.
>
> Thank you for reviewing this series, Jeff!
>
> This is gated behind CAP_SYS_ADMIN because zero-copy allows the server
> direct access to the client's underlying pages, rather than operating
> on an intermediary buffer that the contents of client's pages were
> copied into. A malicious unprivileged server could keep direct access
> to the client's pages (eg even if the client tries to cancel a
> read/write, if the request was already sent to userspace, the server
> will still have access to the underlying pages). In the non-zero-copy
> path this isn't possible because the server only operates on the copy
> of the pages and not on the actual pages.
>
Thanks for the explanation. I'd suggest adding that to the commit
message (and maybe comments near the CAP_SYS_ADMIN checks) in case
others aren't clear why this is gated on that.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
2026-04-30 11:42 ` Jeff Layton
@ 2026-04-30 12:56 ` Jeff Layton
2026-05-05 23:45 ` Bernd Schubert
2 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 12:56 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Implement zero-copy data transfer for fuse over io-uring, eliminating
> memory copies between userspace, the kernel, and the fuse server for
> page-backed read/write operations.
>
> When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> the kernel registers the client's underlying pages as a sparse buffer at
> the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> then perform io_uring read/write operations directly on these pages.
> Non-page-backed args (eg out headers) go through the payload buffer as
> normal.
>
> This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> buffers. Gating on pinned headers and buffers keeps the configuration
> space small and avoids partially-optimized modes that are unlikely to be
> useful in practice. Pages are unregistered when the request completes.
>
> The request flow for the zero-copy write path (client writes data,
> server reads it) is as follows:
> =======================================================================
> > Kernel | FUSE server
> > |
> > "write(fd, buf, 1MB)" |
> > |
> > >sys_write() |
> > >fuse_file_write_iter() |
> > >fuse_send_one() |
> > [req->args->in_pages = true] |
> > [folios hold client write data] |
> > |
> > >fuse_uring_copy_to_ring() |
> > >copy_header_to_ring(IN_OUT) |
> > [memcpy fuse_in_header to |
> > pinned headers buf via kaddr] |
> > >copy_header_to_ring(OP) |
> > [memcpy write_in header] |
> > |
> > >fuse_uring_args_to_ring() |
> > >setup_fuse_copy_state() |
> > [is_kaddr = true] |
> > [skip_folio_copy = true] |
> > |
> > >fuse_uring_set_up_zero_copy() |
> > [folio_get for each client folio] |
> > [build bio_vec array from folios] |
> > >io_buffer_register_bvec() |
> > [register pages at ent->id] |
> > [ent->zero_copied = true] |
> > |
> > >fuse_copy_args() |
> > [skip_folio_copy => return 0 |
> > for page arg, skip data copy] |
> > |
> > >copy_header_to_ring(RING_ENT) |
> > [memcpy ent_in_out] |
> > >io_uring_cmd_done() |
> > |
> > | [CQE received]
> > |
> > | [issue io_uring READ at
> > | ent->id]
> > | [reads directly from
> > |client's pages (ZERO_COPY)]
> > |
> > | [write data to backing
> > | store]
> > | [submit COMMIT AND FETCH]
> > |
> > >fuse_uring_commit_fetch() |
> > >fuse_uring_commit() |
> > >fuse_uring_copy_from_ring() |
> > >fuse_uring_req_end() |
> > >io_buffer_unregister(ent->id) |
> > [unregister sparse buffer] |
> > >fuse_zero_copy_release() |
> > [folio_put for each folio] |
> > [ent->zero_copied = false] |
> > >fuse_request_end() |
> > [wake up client] |
>
> The zero-copy read path is analogous.
>
> Some requests may have both page-backed args and non-page-backed args.
> For these requests, the page-backed args are zero-copied while the
> non-page-backed args are copied to the buffer selected from the buffer
> ring:
> zero-copy: pages registered via io_buffer_register_bvec()
> non-page-backed: copied to payload buffer via fuse_copy_args()
>
> For a request whose payload is zero-copied, the
> registration/unregistration path looks like:
>
> register: fuse_uring_set_up_zero_copy()
> folio_get() for each folio
> io_buffer_register_bvec(ent->id)
>
> [server accesses pages via io_uring fixed buf at ent->id]
>
> unregister: fuse_uring_req_end()
> io_buffer_unregister(ent->id)
> -> fuse_zero_copy_release() callback
> folio_put() for each folio
>
> The throughput improvement from zero-copy depends on how much of the
> per-request latency is spent on data copying vs backing I/O. When
> backing I/O dominates, the saved memcpy is a negligible fraction of
> overall latency. Please also note that for the server to read/write
> into the zero-copied pages, the read/write must go through io-uring
> as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
> server's backing I/O is instantaneous (eg served from cache), the
> overhead of the additional io_uring operation may negate the savings
> from eliminating the memcpy.
>
> In benchmarks using passthrough_hp on a high-performance NVMe-backed
> system, zero-copy showed around a 35% throughput improvement for direct
> randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
> sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
> buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
> for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s).
>
> The benchmarks were run using:
> fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
> --size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev.c | 7 +-
> fs/fuse/dev_uring.c | 167 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 4 +
> fs/fuse/fuse_dev_i.h | 1 +
> include/uapi/linux/fuse.h | 5 ++
> 5 files changed, 160 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index a87939eaa103..cd326e61831b 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1233,10 +1233,13 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
>
> for (i = 0; !err && i < numargs; i++) {
> struct fuse_arg *arg = &args[i];
> - if (i == numargs - 1 && argpages)
> + if (i == numargs - 1 && argpages) {
> + if (cs->skip_folio_copy)
> + return 0;
> err = fuse_copy_folios(cs, arg->size, zeroing);
> - else
> + } else {
> err = fuse_copy_one(cs, arg->value, arg->size);
> + }
> }
> return err;
> }
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 06d3d8dc1c82..d9f1ee4beaf3 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,6 +31,11 @@ struct fuse_uring_pdu {
> struct fuse_ring_ent *ent;
> };
>
> +struct fuse_zero_copy_bvs {
> + unsigned int nr_bvs;
> + struct bio_vec bvs[];
> +};
> +
> static const struct fuse_iqueue_ops fuse_io_uring_ops;
>
> enum fuse_uring_header_type {
> @@ -57,6 +62,11 @@ static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
> return queue->bufring->use_pinned_buffers;
> }
>
> +static inline bool bufring_zero_copy(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_zero_copy;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -102,8 +112,18 @@ static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
> }
> }
>
> +static bool can_zero_copy_req(struct fuse_ring_ent *ent, struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + if (!bufring_enabled(ent->queue) || !bufring_zero_copy(ent->queue))
> + return false;
> +
> + return args->in_pages || args->out_pages;
> +}
> +
> static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
> - int error)
> + int error, unsigned int issue_flags)
> {
> struct fuse_ring_queue *queue = ent->queue;
> struct fuse_ring *ring = queue->ring;
> @@ -122,6 +142,11 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
>
> spin_unlock(&queue->lock);
>
> + if (ent->zero_copied) {
> + io_buffer_unregister(ent->cmd, ent->id, issue_flags);
> + ent->zero_copied = false;
> + }
> +
> if (error)
> req->out.h.error = error;
>
> @@ -485,6 +510,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> struct iovec iov[FUSE_URING_IOV_SEGS];
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
> void __user *payload, *headers;
> size_t headers_size, payload_size, ring_size;
> struct fuse_bufring *br;
> @@ -508,7 +534,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> return -EINVAL;
>
> - if (buf_size < queue->ring->max_payload_sz)
> + if (!zero_copy && buf_size < queue->ring->max_payload_sz)
> return -EINVAL;
>
> nr_bufs = payload_size / buf_size;
> @@ -521,6 +547,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (!br)
> return -ENOMEM;
>
> + br->use_zero_copy = zero_copy;
> br->queue_depth = queue_depth;
> if (pinned_headers) {
> err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
> @@ -580,6 +607,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
>
> if (bufring_enabled(queue) != bufring)
> return false;
> @@ -588,7 +616,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> return true;
>
> return bufring_pinned_headers(queue) == pinned_headers &&
> - bufring_pinned_buffers(queue) == pinned_bufs;
> + bufring_pinned_buffers(queue) == pinned_bufs &&
> + bufring_zero_copy(queue) == zero_copy;
> }
>
> static struct fuse_ring_queue *
> @@ -1063,6 +1092,7 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> cs->is_kaddr = true;
> cs->kaddr = (void *)ent->payload_buf.addr;
> cs->len = ent->payload_buf.len;
> + cs->skip_folio_copy = ent->zero_copied;
> }
>
> cs->is_uring = true;
> @@ -1095,11 +1125,70 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> return err;
> }
>
> +static void fuse_zero_copy_release(void *priv)
> +{
> + struct fuse_zero_copy_bvs *zc_bvs = priv;
> + unsigned int i;
> +
> + for (i = 0; i < zc_bvs->nr_bvs; i++)
> + folio_put(page_folio(zc_bvs->bvs[i].bv_page));
> +
> + kfree(zc_bvs);
> +}
> +
> +static int fuse_uring_set_up_zero_copy(struct fuse_ring_ent *ent,
> + struct fuse_req *req,
> + unsigned int issue_flags)
> +{
> + struct fuse_args_pages *ap;
> + int err, i, ddir = 0;
> + struct fuse_zero_copy_bvs *zc_bvs;
> + struct bio_vec *bvs;
> +
> + /* out_pages indicates a read, in_pages indicates a write */
> + if (req->args->out_pages)
> + ddir |= IO_BUF_DEST;
> + if (req->args->in_pages)
> + ddir |= IO_BUF_SOURCE;
> +
> + WARN_ON_ONCE(!ddir);
> +
> + ap = container_of(req->args, typeof(*ap), args);
> +
> + zc_bvs = kmalloc(struct_size(zc_bvs, bvs, ap->num_folios),
> + GFP_KERNEL_ACCOUNT);
> + if (!zc_bvs)
> + return -ENOMEM;
> +
> + zc_bvs->nr_bvs = ap->num_folios;
> + bvs = zc_bvs->bvs;
> + for (i = 0; i < ap->num_folios; i++) {
> + bvs[i].bv_page = folio_page(ap->folios[i], 0);
> + bvs[i].bv_offset = ap->descs[i].offset;
> + bvs[i].bv_len = ap->descs[i].length;
> + folio_get(ap->folios[i]);
> + }
> +
> + err = io_buffer_register_bvec(ent->cmd, bvs, ap->num_folios,
> + fuse_zero_copy_release, zc_bvs,
> + ddir, ent->id,
> + issue_flags);
> + if (err) {
> + fuse_zero_copy_release(zc_bvs);
> + return err;
> + }
> +
> + ent->zero_copied = true;
> +
> + return 0;
> +}
> +
> /*
> * Copy data from the req to the ring buffer
> */
> static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> - struct fuse_ring_ent *ent)
> + struct fuse_ring_ent *ent,
> + unsigned int issue_flags)
> {
> struct fuse_copy_state cs;
> struct fuse_args *args = req->args;
> @@ -1112,8 +1201,15 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> .commit_id = req->in.h.unique,
> };
>
> - if (bufring_enabled(ent->queue))
> + if (bufring_enabled(ent->queue)) {
> ent_in_out.buf_id = ent->payload_buf.id;
> + if (can_zero_copy_req(ent, req)) {
> + ent_in_out.flags |= FUSE_URING_ENT_ZERO_COPY;
> + err = fuse_uring_set_up_zero_copy(ent, req, issue_flags);
> + if (err)
> + return err;
> + }
> + }
>
> err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> if (err)
> @@ -1145,12 +1241,17 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> }
>
> ent_in_out.payload_sz = cs.ring.copied_sz;
> + if (cs.skip_folio_copy && args->in_pages)
> + ent_in_out.payload_sz +=
> + args->in_args[args->in_numargs - 1].size;
> +
> return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
> &ent_in_out, sizeof(ent_in_out));
> }
>
> static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> - struct fuse_req *req)
> + struct fuse_req *req,
> + unsigned int issue_flags)
> {
> struct fuse_ring_queue *queue = ent->queue;
> struct fuse_ring *ring = queue->ring;
> @@ -1168,7 +1269,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> return err;
>
> /* copy the request */
> - err = fuse_uring_args_to_ring(ring, req, ent);
> + err = fuse_uring_args_to_ring(ring, req, ent, issue_flags);
> if (unlikely(err)) {
> pr_info_ratelimited("Copy to ring failed: %d\n", err);
> return err;
> @@ -1179,11 +1280,25 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> sizeof(req->in.h));
> }
>
> -static bool fuse_uring_req_has_payload(struct fuse_req *req)
> +static bool fuse_uring_req_has_copyable_payload(struct fuse_ring_ent *ent,
> + struct fuse_req *req)
> {
> struct fuse_args *args = req->args;
>
> - return args->in_numargs > 1 || args->out_numargs;
> + if (!can_zero_copy_req(ent, req))
> + return args->in_numargs > 1 || args->out_numargs;
> +
> + /*
> + * the asymmetry between in_numargs > 2 and out_numargs > 1 is because
> + * the per-op header is extracted before fuse_copy_args() for inargs but
> + * not for outargs
> + */
> + if ((args->in_numargs > 1) && (!args->in_pages || args->in_numargs > 2))
> + return true;
> + if (args->out_numargs && (!args->out_pages || args->out_numargs > 1))
> + return true;
> +
> + return false;
> }
>
> static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
> @@ -1245,7 +1360,7 @@ static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
> return 0;
>
> buffer_selected = !!ent->payload_buf.addr;
> - has_payload = fuse_uring_req_has_payload(req);
> + has_payload = fuse_uring_req_has_copyable_payload(ent, req);
>
> if (has_payload && !buffer_selected)
> return fuse_uring_select_buffer(ent);
> @@ -1263,22 +1378,23 @@ static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
> return 0;
>
> /* no payload to copy, can skip selecting a buffer */
> - if (!fuse_uring_req_has_payload(req))
> + if (!fuse_uring_req_has_copyable_payload(ent, req))
> return 0;
>
> return fuse_uring_select_buffer(ent);
> }
>
> static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
> - struct fuse_req *req)
> + struct fuse_req *req,
> + unsigned int issue_flags)
> {
> int err;
>
> - err = fuse_uring_copy_to_ring(ent, req);
> + err = fuse_uring_copy_to_ring(ent, req, issue_flags);
> if (!err)
> set_bit(FR_SENT, &req->flags);
> else
> - fuse_uring_req_end(ent, req, err);
> + fuse_uring_req_end(ent, req, err, issue_flags);
>
> return err;
> }
> @@ -1386,7 +1502,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
>
> err = fuse_uring_copy_from_ring(ring, req, ent);
> out:
> - fuse_uring_req_end(ent, req, err);
> + fuse_uring_req_end(ent, req, err, issue_flags);
> }
>
> /*
> @@ -1396,7 +1512,8 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
> * Else, there is no next fuse request and this returns false.
> */
> static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
> - struct fuse_ring_queue *queue)
> + struct fuse_ring_queue *queue,
> + unsigned int issue_flags)
> {
> int err;
> struct fuse_req *req;
> @@ -1408,7 +1525,7 @@ static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
> spin_unlock(&queue->lock);
>
> if (req) {
> - err = fuse_uring_prepare_send(ent, req);
> + err = fuse_uring_prepare_send(ent, req, issue_flags);
> if (err)
> goto retry;
> }
> @@ -1523,7 +1640,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> * no-op and the next request will be serviced when a buffer becomes
> * available.
> */
> - if (fuse_uring_get_next_fuse_req(ent, queue))
> + if (fuse_uring_get_next_fuse_req(ent, queue, issue_flags))
> fuse_uring_send(ent, cmd, 0, issue_flags);
> return 0;
> }
> @@ -1645,12 +1762,17 @@ static bool init_flags_valid(u64 init_flags)
> {
> u64 valid_flags =
> FUSE_URING_BUFRING | FUSE_URING_PINNED_HEADERS |
> - FUSE_URING_PINNED_BUFFERS;
> + FUSE_URING_PINNED_BUFFERS | FUSE_URING_ZERO_COPY;
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_buffers = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
> +
> + if (!bufring && (pinned_headers || pinned_buffers || zero_copy))
> + return false;
>
> - if (!bufring && (pinned_headers || pinned_buffers))
> + if (zero_copy &&
> + (!capable(CAP_SYS_ADMIN) || !pinned_headers || !pinned_buffers))
> return false;
>
> return !(init_flags & ~valid_flags);
> @@ -1795,9 +1917,10 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
> int err;
>
> if (!tw.cancel) {
> - err = fuse_uring_prepare_send(ent, ent->fuse_req);
> + err = fuse_uring_prepare_send(ent, ent->fuse_req, issue_flags);
> if (err) {
> - if (!fuse_uring_get_next_fuse_req(ent, queue))
> + if (!fuse_uring_get_next_fuse_req(ent, queue,
> + issue_flags))
> return;
> err = 0;
> }
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 859ee4e6ba03..0546f719fc65 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -58,6 +58,8 @@ struct fuse_bufring_pinned {
> struct fuse_bufring {
> bool use_pinned_headers: 1;
> bool use_pinned_buffers: 1;
> + /* this is only allowed on privileged servers */
> + bool use_zero_copy: 1;
> unsigned int queue_depth;
>
> union {
> @@ -96,6 +98,8 @@ struct fuse_ring_ent {
> */
> unsigned int id;
> struct fuse_bufring_buf payload_buf;
> + /* true if the request's pages are being zero-copied */
> + bool zero_copied;
> };
> };
>
> diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
> index aa1d25421054..67b5bed451fe 100644
> --- a/fs/fuse/fuse_dev_i.h
> +++ b/fs/fuse/fuse_dev_i.h
> @@ -39,6 +39,7 @@ struct fuse_copy_state {
> bool is_uring:1;
> /* if set, use kaddr; otherwise use pg */
> bool is_kaddr:1;
> + bool skip_folio_copy:1;
> struct {
> unsigned int copied_sz; /* copied size into the user buffer */
> } ring;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 51ecb66dd6eb..c2e53886cf06 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -246,6 +246,7 @@
> * - add fuse_uring_cmd_req init struct
> * - add FUSE_URING_PINNED_HEADERS flag
> * - add FUSE_URING_PINNED_BUFFERS flag
> + * - add FUSE_URING_ZERO_COPY flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -1257,6 +1258,9 @@ struct fuse_supp_groups {
> #define FUSE_URING_IN_OUT_HEADER_SZ 128
> #define FUSE_URING_OP_IN_OUT_SZ 128
>
> +/* Set if the ent's payload is zero-copied */
> +#define FUSE_URING_ENT_ZERO_COPY (1 << 0)
> +
> /* Used as part of the fuse_uring_req_header */
> struct fuse_uring_ent_in_out {
> uint64_t flags;
> @@ -1310,6 +1314,7 @@ enum fuse_uring_cmd {
> #define FUSE_URING_BUFRING (1 << 0)
> #define FUSE_URING_PINNED_HEADERS (1 << 1)
> #define FUSE_URING_PINNED_BUFFERS (1 << 2)
> +#define FUSE_URING_ZERO_COPY (1 << 3)
>
> /**
> * In the 80B command area of the SQE.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation
2026-04-02 16:28 ` [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
2026-04-14 21:05 ` Bernd Schubert
@ 2026-04-30 12:57 ` Jeff Layton
1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 12:57 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> Add documentation for fuse over io-uring usage of buffer rings and
> zero-copy.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> .../filesystems/fuse/fuse-io-uring.rst | 189 ++++++++++++++++++
> 1 file changed, 189 insertions(+)
>
> diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
> index d73dd0dbd238..bc47686c023f 100644
> --- a/Documentation/filesystems/fuse/fuse-io-uring.rst
> +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
> @@ -95,5 +95,194 @@ Sending requests with CQEs
> | <fuse_unlink() |
> | <sys_unlink() |
>
> +Buffer rings
> +============
>
> +Buffer rings have two main advantages:
>
> +* Reduced memory usage: payload buffers are pooled and selected on demand
> + rather than dedicated per-entry, allowing fewer buffers than entries. This
> + infrastructure also allows for future optimizations like incremental buffer
> + consumption where non-overlapping parts of a buffer may be used across
> + concurrent requests.
> +* Foundation for pinned buffers: contiguous buffer allocations allow the
> + kernel to pin and vmap the entire region, avoiding per-request page
> + resolution overhead
> +
> +At a high-level, this is how fuse uses buffer rings:
> +
> +* The first REGISTER SQE for a queue creates the queue and sets up the
> + buffer ring. The server provides two iovecs: one for headers and one for
> + payload buffers. Each entry gets a fixed ID (sqe->buf_index) that maps
> + to a specific header slot.
> +* When a client request arrives, the kernel selects a payload buffer from
> + the ring (if the request has copyable data), copies headers and payload
> + data, and completes the sqe.
> +* The buf_id of the selected payload buffer is communicated to the server
> + via the fuse_uring_ent_in_out header. The server uses this to locate the
> + payload data in its buffer.
> +* The server processes the request and sends a COMMIT_AND_FETCH SQE with
> + the reply. The kernel processes the reply and recycles the buffer.
> +
> +Visually, this looks like::
> +
> + Headers buffer:
> + +-----------------------+-----------------------+-----+
> + | fuse_uring_req_header | fuse_uring_req_header | ... |
> + | [ent 0] | [ent 1] | |
> + +-----------------------+-----------------------+-----+
> + ^ ^
> + | |
> + ent 0 header slot ent 1 header slot
> + (sqe->buf_index=0) (sqe->buf_index=1)
> +
> + Payload buffer pool:
> + +-----------+-----------+-----------+-----+
> + | buf 0 | buf 1 | buf 2 | ... |
> + | (buf_size)| (buf_size)| (buf_size)| |
> + +-----------+-----------+-----------+-----+
> + selected on demand, recycled after each request
> +
> +Buffer ring request flow
> +------------------------::
> +
> +| Kernel | FUSE daemon
> +| |
> +| [client request arrives] |
> +| >fuse_uring_send() |
> +| [select payload buf from ring] |
> +| >fuse_uring_select_buffer() |
> +| [copy headers to ent's header slot] |
> +| >copy_header_to_ring() |
> +| [copy payload to selected buf] |
> +| >fuse_uring_copy_to_ring() |
> +| [set buf_id in ent_in_out header] |
> +| >io_uring_cmd_done() |
> +| | [CQE received]
> +| | [read headers from header slot]
> +| | [read payload from buf_id]
> +| | [process request]
> +| | [write reply to header slot]
> +| | [write reply payload to buf]
> +| | >io_uring_submit()
> +| | COMMIT_AND_FETCH
> +| >fuse_uring_commit_fetch() |
> +| >fuse_uring_commit() |
> +| [copy reply from ring] |
> +| >fuse_uring_recycle_buffer() |
> +| >fuse_uring_get_next_fuse_req() |
> +
> +Pinned buffers
> +==============
> +
> +Servers can optionally pin their header and/or payload buffers by setting
> +FUSE_URING_PINNED_HEADERS and/or FUSE_URING_PINNED_BUFFERS flags. When
> +set, the kernel pins the user pages and vmaps them during queue setup,
> +enabling memcpy to/from the kernel virtual address instead of
> +copy_to_user/copy_from_user.
> +
> +This avoids the per-request cost of pinning/unpinning user pages and
> +translating virtual addresses. Buffers must be page-aligned. The pinned pages
> +are accounted against RLIMIT_MEMLOCK (bypassable with CAP_IPC_LOCK).
> +
> +Zero-copy
> +=========
> +
> +Fuse io-uring zero-copy allows the server to directly read from / write to
> +the client's pages, bypassing any intermediary buffer copies. This requires
> +the FUSE_URING_ZERO_COPY flag, buffer rings with pinned headers and buffers,
> +and CAP_SYS_ADMIN.
> +
> +The kernel registers the client's underlying pages as a sparse buffer at
> +the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> +then perform io_uring read/write operations directly on these pages.
> +Non-page-backed args (eg out headers) go through the payload buffer as
> +normal. Pages are unregistered when the request completes.
> +
> +The request flow for the zero-copy write path (client writes data, server
> +reads it) is as follows:
> +
> +Zero-copy write
> +---------------::
> +| Kernel | FUSE server
> +| |
> +| "write(fd, buf, 1MB)" |
> +| |
> +| >sys_write() |
> +| >fuse_file_write_iter() |
> +| >fuse_send_one() |
> +| [req->args->in_pages = true] |
> +| [folios hold client write data] |
> +| |
> +| >fuse_uring_copy_to_ring() |
> +| >copy_header_to_ring(IN_OUT) |
> +| [memcpy fuse_in_header to |
> +| pinned headers buf via kaddr] |
> +| >copy_header_to_ring(OP) |
> +| [memcpy write_in header] |
> +| |
> +| >fuse_uring_args_to_ring() |
> +| >setup_fuse_copy_state() |
> +| [is_kaddr = true] |
> +| [skip_folio_copy = true] |
> +| |
> +| >fuse_uring_set_up_zero_copy() |
> +| [folio_get for each client folio] |
> +| [build bio_vec array from folios] |
> +| >io_buffer_register_bvec() |
> +| [register pages at ent->id] |
> +| [ent->zero_copied = true] |
> +| |
> +| >fuse_copy_args() |
> +| [skip_folio_copy => return 0 |
> +| for page arg, skip data copy] |
> +| |
> +| >copy_header_to_ring(RING_ENT) |
> +| [memcpy ent_in_out] |
> +| >io_uring_cmd_done() |
> +| |
> +| | [CQE received]
> +| |
> +| | [issue io_uring READ at
> +| | ent->id]
> +| | [reads directly from
> +| | client's pages (ZERO_COPY)]
> +| |
> +| | [write data to backing
> +| | store]
> +| | [submit COMMIT AND FETCH]
> +| |
> +| >fuse_uring_commit_fetch() |
> +| >fuse_uring_commit() |
> +| >fuse_uring_copy_from_ring() |
> +| >fuse_uring_req_end() |
> +| >io_buffer_unregister(ent->id) |
> +| [unregister sparse buffer] |
> +| >fuse_zero_copy_release() |
> +| [folio_put for each folio] |
> +| [ent->zero_copied = false] |
> +| >fuse_request_end() |
> +| [wake up client] |
> +
> +The zero-copy read path is analogous.
> +
> +Some requests may have both page-backed args and non-page-backed args.
> +For these requests, the page-backed args are zero-copied while the
> +non-page-backed args are copied to the buffer selected from the buffer
> +ring:
> + zero-copy: pages registered via io_buffer_register_bvec()
> + non-page-backed: copied to payload buffer via fuse_copy_args()
> +
> +For a request whose payload is zero-copied, the registration/unregistration
> +path looks like:
> +
> +register: fuse_uring_set_up_zero_copy()
> + folio_get() for each folio
> + io_buffer_register_bvec(ent->id)
> +
> +[server accesses pages via io_uring fixed buf at ent->id]
> +
> +unregister: fuse_uring_req_end()
> + io_buffer_unregister(ent->id)
> + -> fuse_zero_copy_release() callback
> + folio_put() for each folio
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
` (13 preceding siblings ...)
2026-04-02 16:28 ` [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
@ 2026-04-30 12:59 ` Jeff Layton
14 siblings, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2026-04-30 12:59 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: bernd, axboe, linux-fsdevel
On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
> This series adds buffer ring and zero-copy capabilities to fuse over io-uring.
>
> Using buffer rings has advantages over the non-buffer-ring (iovec) path:
> - Reduced memory usage: in the iovec path, each entry has its own
> dedicated payload buffer, requiring N buffers for N entries where each
> buffer must be large enough to accommodate the maximum possible
> payload size. With buffer rings, payload buffers are pooled and
> selected on demand. Entries only hold a buffer while actively
> processing a request with payload data. When incremental buffer
> consumption is added, this will allow non-overlapping regions of a
> single buffer to be used simultaneously across multiple requests,
> further reducing memory requirements.
> - Foundation for pinned buffers: the buffer ring headers and payloads
> are now each passed in as a contiguous memory allocation, which allows
> fuse to easily pin and vmap the entire region in one operation during
> queue setup. This will eliminate the per-request overhead of having to
> pin/unpin user pages and translate virtual addresses and is a
> prerequisite for future optimizations like performing data copies
> outside of the server's task context.
>
> This series adds the capability to pin the underlying header and payload
> buffers by setting init flags at registration time, depending on the user's
> mlock limit.
>
> Zero-copy (only for privileged servers) is also opt-in by setting an init flag
> at registration time. Zero-copy eliminates the memory copies between kernel and
> userspace for read/write/payload-heavy operations by allowing the server to
> directly operate on the client's underlying pages.
>
> This series has a dependency on io-uring registered bvec buffers changes
> in [1].
>
> The throughput improvements from pinned buffers and zero-copy depends on how
> much of the server's per-request latency is spent on data copying vs backing
> I/O. When backing I/O dominates, the saved memcpy is a negligible fraction of
> overall latency. Please also note that for the server to read/write
> into the zero-copied pages, the read/write must go through io-uring
> as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
> server's backing I/O is instantaneous (eg served from cache), the
> overhead of the additional io_uring operation may negate the savings
> from eliminating the memcpy.
>
> In benchmarks using passthrough_hp on a high-performance NVMe-backed
> system, pinned headers and pinned payload buffers showed around a 10%
> throughput improvement for direct randreads (~2150 MiB/s to ~2400
> MiB/s), a 4% improvement for direct sequential reads (~2510 MiB/s to
> ~2620 MiB/s), a 8% improvement for buffered randreads (~2100 MiB/s to
> ~2280 MiB/s), and a 6% improvement for buffered sequential reads (~2500
> MiB/s to ~2670 MiB/s).
>
> Zero-copy showed around a 35% throughput improvement for direct
> randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
> sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
> buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
> for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s). I didn't see
> enough of a clear improvement for writes due to write latency being I/O
> dominated.
>
> The benchmarks were run using:
> fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
> --size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
>
> To run the benchmark, please also add this patch [2].
>
> The libfuse changes can be found in [3]. To test the server, run:
> sudo ~/libfuse/build/example/passthrough_hp ~/src ~/mounts/tmp
> --nopassthrough -o io_uring_zero_copy -o io_uring_q_depth=8
> Once this series is merged, the libfuse changes will be tidied up and
> submitted upstream.
>
> Further optimizations for incremental buffer consumption, request
> dispatching in current task context, and backing buffer integration with
> IORING_OP_READ/IORING_OP_WRITE operations will be submitted as part of a
> separate series.
>
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/io-uring/20260402160929.2749744-1-joannelkoong@gmail.com/T/#t
> [2] https://lore.kernel.org/linux-fsdevel/20260326215127.3857682-2-joannelkoong@gmail.com/
> [3] https://github.com/joannekoong/libfuse/commits/zero_copy_v2/
>
> Changelog
> ---------
> v1: https://lore.kernel.org/linux-fsdevel/20260324224532.3733468-1-joannelkoong@gmail.com/
> v1 -> v2:
> * Drop kernel managed buffers from io-uring infrastructure and instead move
> logic into fuse. To later use buffers with io-uring requests natively will
> require fuse to place the backing buffer as a fixed buffer in a sparse slot
> for the server, but that will be added as an optimization in a separate
> series. This makes the io-uring code cleaner and accomodates for more flexible
> fuse user configurations (eg mlock limits) and easier setup (me)
> * Run more benchmarks and get more numbers (me)
> * Add visual diagrams and more documentatoin to commit messages and
> documentation patch (Bernd)
>
> Joanne Koong (14):
> fuse: separate next request fetching from sending logic
> fuse: refactor io-uring header copying to ring
> fuse: refactor io-uring header copying from ring
> fuse: use enum types for header copying
> fuse: refactor setting up copy state for payload copying
> fuse: support buffer copying for kernel addresses
> fuse: use named constants for io-uring iovec indices
> fuse: move fuse_uring_abort() from header to dev_uring.c
> fuse: rearrange io-uring iovec and ent allocation logic
> fuse: add io-uring buffer rings
> fuse: add pinned headers capability for io-uring buffer rings
> fuse: add pinned payload buffers capability for io-uring buffer rings
> fuse: add zero-copy over io-uring
> docs: fuse: add io-uring bufring and zero-copy documentation
>
> .../filesystems/fuse/fuse-io-uring.rst | 189 +++
> fs/fuse/dev.c | 30 +-
> fs/fuse/dev_uring.c | 1042 ++++++++++++++---
> fs/fuse/dev_uring_i.h | 86 +-
> fs/fuse/fuse_dev_i.h | 8 +-
> include/uapi/linux/fuse.h | 36 +-
> 6 files changed, 1194 insertions(+), 197 deletions(-)
>
>
> base-commit: 619fa72e875483dabf7683001496cc0ca4480aa6
Nice work, Joanne! This seems to be in great shape overall.
I'll note that the first 9 patches or so (maybe modulo patch #6) could
be merged in advance of the bigger io_uring changes.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 10/14] fuse: add io-uring buffer rings
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
2026-04-15 9:48 ` Bernd Schubert
2026-04-30 11:08 ` Jeff Layton
@ 2026-05-05 22:47 ` Bernd Schubert
2 siblings, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-05-05 22:47 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Add fuse buffer rings for servers communicating through the io-uring
> interface. To use this, the server must set the FUSE_URING_BUFRING
> flag and provide header and payload buffers via an iovec array in the
> sqe during registration. The payload buffers are used to back the buffer
> ring. The kernel manages buffer selection and recycling through a simple
> internal ring.
>
> This has the following advantages over the non-bufring (iovec) path:
> - Reduced memory usage: in the iovec path, each entry has its own
> dedicated payload buffer, requiring N buffers for N entries where each
> buffer must be large enough to accommodate the maximum possible
> payload size. With buffer rings, payload buffers are pooled and
> selected on demand. Entries only hold a buffer while actively
> processing a request with payload data. When incremental buffer
> consumption is added, this will allow non-overlapping regions of a
> single buffer to be used simultaneously across multiple requests,
> further reducing memory requirements.
> - Foundation for pinned buffers: the buffer ring headers and payloads
> are now each passed in as a contiguous memory allocation, which allows
> fuse to easily pin and vmap the entire region in one operation during
> queue setup. This will eliminate the per-request overhead of having to
> pin/unpin user pages and translate virtual addresses and is a
> prerequisite for future optimizations like performing data copies
> outside of the server's task context.
>
> Each ring entry gets a fixed ID (sqe->buf_index) that maps to a specific
> header slot in the headers buffer. Payload buffers are selected from
> the ring on demand and recycled after each request. Buffer ring usage is
> set on a per-queue basis. All subsequent registration SQEs for the same
> queue must use consistent flags.
>
> The headers are laid out contiguously and provided via iov[0]. Each slot
> maps to ent->id:
>
> |<- headers_size (>= queue_depth * sizeof(fuse_uring_req_header)) ->|
> +------------------------------+------------------------------+-----+
> | struct fuse_uring_req_header | struct fuse_uring_req_header | ... |
> | [ent id=0] | [ent id=1] | |
> +------------------------------+------------------------------+-----+
>
> On the server side, the ent id is used to determine where in the headers
> buffer the headers data for the ent resides. This is done by
> calculating ent_id * sizeof(struct fuse_uring_req_header) as the offset
> into the headers buffer.
>
> The buffer ring is backed by the payload buffer, which is contiguous but
> partitioned into individual bufs according to the buf_size passed in at
> registration.
>
> PAYLOAD BUFFER POOL (contiguous, provided via iov[1]):
> |<-------------- payload_size ------------>|
> +--------- --+-----------+-----------+-----+
> | buf [0] | buf [1] | buf [2] | ... |
> | buf_size | buf_size | buf_size | ... |
> +--------- --+-----------+-----------+-----+
>
> buffer ring state (struct fuse_bufring, kernel-internal):
> bufs[]: [ used | used | FREE | FREE | FREE ]
> ^^^^^^^^^^^^^^^^^^^
> available for selection
>
> The buffer ring logic is as follows:
> select: buf = bufs[head % nbufs]; head++
> recycle: bufs[tail % nbufs] = buf; tail++
> empty: tail == head (no buffers available)
> full: tail - head >= nbufs
>
> Buffer ring request flow
> ------------------------
> | Kernel | FUSE daemon
> | |
> | [client request arrives] |
> | >fuse_uring_send() |
> | [select payload buf from ring] |
> | >fuse_uring_select_buffer() |
> | [copy headers to ent's header slot] |
> | >copy_header_to_ring() |
> | [copy payload to selected buf] |
> | >fuse_uring_copy_to_ring() |
> | [set buf_id in ent_in_out header] |
> | >io_uring_cmd_done() |
> | | [CQE received]
> | | [read headers from header
> | | slot]
> | | [read payload from buf_id]
> | | [process request]
> | | [write reply to header
> | | slot]
> | | [write reply payload to
> | | buf]
> | | >io_uring_submit()
> | | COMMIT_AND_FETCH
> | >fuse_uring_commit_fetch() |
> | >fuse_uring_commit() |
> | [copy reply from ring] |
> | >fuse_uring_recycle_buffer() |
> | >fuse_uring_get_next_fuse_req() |
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev_uring.c | 363 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 45 ++++-
> include/uapi/linux/fuse.h | 27 ++-
> 3 files changed, 381 insertions(+), 54 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index a061f175b3fd..9f14a2bcde3f 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -41,6 +41,11 @@ enum fuse_uring_header_type {
> FUSE_URING_HEADER_RING_ENT,
> };
>
> +static inline bool bufring_enabled(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring != NULL;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -222,6 +227,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> }
>
> kfree(queue->fpq.processing);
> + kfree(queue->bufring);
> kfree(queue);
> ring->queues[qid] = NULL;
> }
> @@ -303,20 +309,102 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
> return 0;
> }
>
> -static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> - int qid)
> +static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> + struct fuse_ring_queue *queue)
> +{
> + const struct fuse_uring_cmd_req *cmd_req =
> + io_uring_sqe128_cmd(cmd->sqe, struct fuse_uring_cmd_req);
> + u16 queue_depth = READ_ONCE(cmd_req->init.queue_depth);
> + unsigned int buf_size = READ_ONCE(cmd_req->init.buf_size);
> + struct iovec iov[FUSE_URING_IOV_SEGS];
> + void __user *payload, *headers;
> + size_t headers_size, payload_size, ring_size;
> + struct fuse_bufring *br;
> + unsigned int nr_bufs, i;
> + uintptr_t payload_addr;
> + int err;
> +
> + if (!queue_depth || !buf_size)
> + return -EINVAL;
> +
> + err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
> + if (err)
> + return err;
> +
> + headers = iov[FUSE_URING_IOV_HEADERS].iov_base;
> + headers_size = iov[FUSE_URING_IOV_HEADERS].iov_len;
> + payload = iov[FUSE_URING_IOV_PAYLOAD].iov_base;
> + payload_size = iov[FUSE_URING_IOV_PAYLOAD].iov_len;
> +
> + /* check if there's enough space for all the headers */
> + if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> + return -EINVAL;
> +
> + if (buf_size < queue->ring->max_payload_sz)
> + return -EINVAL;
> +
> + nr_bufs = payload_size / buf_size;
> + if (!nr_bufs || nr_bufs > U16_MAX)
> + return -EINVAL;
> +
> + /* create the ring buffer */
> + ring_size = struct_size(br, bufs, nr_bufs);
> + br = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
> + if (!br)
> + return -ENOMEM;
> +
> + br->queue_depth = queue_depth;
> + br->headers = headers;
> +
> + payload_addr = (uintptr_t)payload;
> +
> + /* populate the ring buffer */
> + for (i = 0; i < nr_bufs; i++, payload_addr += buf_size) {
> + struct fuse_bufring_buf *buf = &br->bufs[i];
> +
> + buf->addr = payload_addr;
> + buf->len = buf_size;
> + buf->id = i;
> + }
> +
> + br->nbufs = nr_bufs;
> + br->tail = nr_bufs;
> +
> + queue->bufring = br;
> +
> + return 0;
> +}
> +
> +/*
> + * if the queue is already registered, check that the queue was initialized with
> + * the same init flags set for this FUSE_IO_URING_CMD_REGISTER cmd. all
> + * FUSE_IO_URING_CMD_REGISTER cmds should have the same init fields set on a
> + * per-queue basis.
> + */
> +static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> + u64 init_flags)
> {
> + bool bufring = init_flags & FUSE_URING_BUFRING;
> +
> + return bufring_enabled(queue) == bufring;
> +}
> +
> +static struct fuse_ring_queue *
> +fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
> + int qid, u64 init_flags)
> +{
> + bool use_bufring = init_flags & FUSE_URING_BUFRING;
> struct fuse_conn *fc = ring->fc;
> struct fuse_ring_queue *queue;
> struct list_head *pq;
>
> queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
> if (!queue)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> pq = kzalloc_objs(struct list_head, FUSE_PQ_HASH_SIZE);
> if (!pq) {
> kfree(queue);
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> }
>
> queue->qid = qid;
> @@ -334,12 +422,29 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> queue->fpq.processing = pq;
> fuse_pqueue_init(&queue->fpq);
>
> + if (use_bufring) {
> + int err = fuse_uring_bufring_setup(cmd, queue);
> +
> + if (err) {
> + kfree(pq);
> + kfree(queue);
> + return ERR_PTR(err);
> + }
> + }
> +
> spin_lock(&fc->lock);
> + /* check if the queue creation raced with another thread */
> if (ring->queues[qid]) {
> spin_unlock(&fc->lock);
> kfree(queue->fpq.processing);
> + if (use_bufring)
> + kfree(queue->bufring);
> kfree(queue);
> - return ring->queues[qid];
> +
> + queue = ring->queues[qid];
> + if (!queue_init_flags_consistent(queue, init_flags))
> + return ERR_PTR(-EINVAL);
> + return queue;
> }
>
> /*
> @@ -649,7 +754,14 @@ static int copy_header_to_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_to_user(ring, header, header_size)) {
> pr_info_ratelimited("Copying header to ring failed.\n");
> @@ -669,7 +781,14 @@ static int copy_header_from_ring(struct fuse_ring_ent *ent,
> if (offset < 0)
> return offset;
>
> - ring = (void __user *)ent->headers + offset;
> + if (bufring_enabled(ent->queue)) {
> + int buf_offset = offset +
> + sizeof(struct fuse_uring_req_header) * ent->id;
> +
> + ring = ent->queue->bufring->headers + buf_offset;
> + } else {
> + ring = (void __user *)ent->headers + offset;
> + }
>
> if (copy_from_user(header, ring, header_size)) {
> pr_info_ratelimited("Copying header from ring failed.\n");
> @@ -684,12 +803,20 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> struct fuse_ring_ent *ent, int dir,
> struct iov_iter *iter)
> {
> + void __user *payload;
> int err;
>
> - err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
> - if (err) {
> - pr_info_ratelimited("fuse: Import of user buffer failed\n");
> - return err;
> + if (bufring_enabled(ent->queue))
> + payload = (void __user *)ent->payload_buf.addr;
> + else
> + payload = ent->payload;
> +
> + if (payload) {
> + err = import_ubuf(dir, payload, ring->max_payload_sz, iter);
> + if (err) {
> + pr_info_ratelimited("fuse: Import of user buffer failed\n");
> + return err;
> + }
> }
>
> fuse_copy_init(cs, dir == ITER_DEST, iter);
> @@ -741,6 +868,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> .commit_id = req->in.h.unique,
> };
>
> + if (bufring_enabled(ent->queue))
> + ent_in_out.buf_id = ent->payload_buf.id;
> +
> err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
> if (err)
> return err;
> @@ -805,6 +935,96 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
> sizeof(req->in.h));
> }
>
> +static bool fuse_uring_req_has_payload(struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + return args->in_numargs > 1 || args->out_numargs;
> +}
> +
> +static int fuse_uring_select_buffer(struct fuse_ring_ent *ent)
> + __must_hold(&ent->queue->lock)
> +{
> + struct fuse_ring_queue *queue = ent->queue;
> + struct fuse_bufring *br = queue->bufring;
> + struct fuse_bufring_buf *buf;
> + unsigned int tail = br->tail, head = br->head;
> +
> + lockdep_assert_held(&queue->lock);
> +
> + /* Get a buffer to use for the payload */
> + if (tail == head)
> + return -ENOBUFS;
> +
> + buf = &br->bufs[head % br->nbufs];
> + br->head++;
Just a minor annotation and we can do this any time later. For cache
effects (mostly large L3) it might be worth to update buffer selection
and buffer recycling to LIFO.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 11/14] fuse: add pinned headers capability for io-uring buffer rings
2026-04-15 0:48 ` Joanne Koong
@ 2026-05-05 22:51 ` Bernd Schubert
0 siblings, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-05-05 22:51 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, axboe, linux-fsdevel
On 4/15/26 02:48, Joanne Koong wrote:
> On Tue, Apr 14, 2026 at 5:47 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>>
>>
>> On 4/2/26 18:28, Joanne Koong wrote:
>>> Allow fuse servers to pin their header buffers by setting the
>>> FUSE_URING_PINNED_HEADERS flag alongside FUSE_URING_BUFRING on REGISTER
>>> sqes. When set, the kernel pins the header pages, vmaps them for a
>>> kernel virtual address, and uses direct memcpy for copying. This avoids
>>> the per-request overhead of having to pin/unpin user pages and translate
>>> virtual addresses.
>>>
>>> Buffers must be page-aligned. The kernel accounts pinned pages against
>>> RLIMIT_MEMLOCK (bypassed with CAP_IPC_LOCK) and tracks mm->pinned_vm.
>>> Unpinning is done in process context during connection abort, since vmap
>>> cannot run in softirq (where final destruction occurs via RCU).
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>> fs/fuse/dev_uring.c | 228 ++++++++++++++++++++++++++++++++++++--
>>> fs/fuse/dev_uring_i.h | 23 +++-
>>> include/uapi/linux/fuse.h | 2 +
>>> 3 files changed, 243 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>> index 9f14a2bcde3f..79736b02cf9f 100644
>>> --- a/fs/fuse/dev_uring.c
>>> +++ b/fs/fuse/dev_uring.c
>>> @@ -11,6 +11,7 @@
>>>
>>> #include <linux/fs.h>
>>> #include <linux/io_uring/cmd.h>
>>> +#include <linux/vmalloc.h>
>>>
>>> static bool __read_mostly enable_uring;
>>> module_param(enable_uring, bool, 0644);
>>> @@ -46,6 +47,11 @@ static inline bool bufring_enabled(struct fuse_ring_queue *queue)
>>> return queue->bufring != NULL;
>>> }
>>>
>>> +static inline bool bufring_pinned_headers(struct fuse_ring_queue *queue)
>>> +{
>>> + return queue->bufring->use_pinned_headers;
>>> +}
>>> +
>>> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
>>> struct fuse_ring_ent *ring_ent)
>>> {
>>> @@ -200,6 +206,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
>>> return false;
>>> }
>>>
>>> +static void fuse_bufring_unpin_mem(struct fuse_bufring_pinned *mem)
>>> +{
>>> + struct page **pages = mem->pages;
>>> + unsigned int nr_pages = mem->nr_pages;
>>> + struct user_struct *user = mem->user;
>>> + struct mm_struct *mm_account = mem->mm_account;
>>> +
>>> + vunmap(mem->addr);
>>> + unpin_user_pages(pages, nr_pages);
>>> +
>>> + if (user) {
>>> + atomic_long_sub(nr_pages, &user->locked_vm);
>>> + free_uid(user);
>>> + }
>>> +
>>> + atomic64_sub(nr_pages, &mm_account->pinned_vm);
>>> + mmdrop(mm_account);
>>> +
>>> + kvfree(mem->pages);
>>> +}
>>> +
>>> +static void fuse_uring_bufring_unpin(struct fuse_ring_queue *queue)
>>> +{
>>> + struct fuse_bufring *br = queue->bufring;
>>> +
>>> + if (bufring_pinned_headers(queue)) {
>>> + fuse_bufring_unpin_mem(&br->pinned_headers);
>>> + br->use_pinned_headers = false;
>>> + }
>>> +}
>>> +
>>> void fuse_uring_destruct(struct fuse_conn *fc)
>>> {
>>> struct fuse_ring *ring = fc->ring;
>>> @@ -227,7 +264,10 @@ void fuse_uring_destruct(struct fuse_conn *fc)
>>> }
>>>
>>> kfree(queue->fpq.processing);
>>> - kfree(queue->bufring);
>>> + if (bufring_enabled(queue)) {
>>> + fuse_uring_bufring_unpin(queue);
>>> + kfree(queue->bufring);
>>> + }
>>> kfree(queue);
>>> ring->queues[qid] = NULL;
>>> }
>>> @@ -309,14 +349,131 @@ static int fuse_uring_get_iovec_from_sqe(const struct io_uring_sqe *sqe,
>>> return 0;
>>> }
>>>
>>> +static struct page **fuse_uring_pin_user_pages(void __user *uaddr,
>>> + unsigned long len, int *npages)
>>
>> I think this is a duplicate of io_pin_pages(), can we just export that
>> and use here? I'm basically going to propose to use the same technique
>> in ublk - would be another duplicate.
>>
>
> Tbh I think this is generic logic that makes more sense to live in the
> mm layer instead of fuse calling this as an exported io-uring
> function. The memory it's passing in is not related to io-uring, so
> that was my hesitation. For your ublk use case, is the memory you're
> passing into this user-allocated memory that's not part of io-uring?
> If so, then maybe it's best to move io_pin_pages() out of io-uring and
> into generic mm.
Maybe we could quickly check with Jens tomorrow how to proceed? Maybe
export it in io-uring for now and then later move it to mm/?
I definitely have exactly the same function in the ublk multi buf size
patch (maybe I sent it out as alpha-RFC in the morning).
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-30 12:55 ` Jeff Layton
@ 2026-05-05 22:55 ` Bernd Schubert
0 siblings, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-05-05 22:55 UTC (permalink / raw)
To: Jeff Layton, Joanne Koong; +Cc: miklos, axboe, linux-fsdevel
On 4/30/26 14:55, Jeff Layton wrote:
> On Thu, 2026-04-30 at 13:35 +0100, Joanne Koong wrote:
>> On Thu, Apr 30, 2026 at 12:42 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>
>>> On Thu, 2026-04-02 at 09:28 -0700, Joanne Koong wrote:
>>>> Implement zero-copy data transfer for fuse over io-uring, eliminating
>>>> memory copies between userspace, the kernel, and the fuse server for
>>>> page-backed read/write operations.
>>>>
>>>> When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
>>>> the kernel registers the client's underlying pages as a sparse buffer at
>>>> the entry's fixed id via io_buffer_register_bvec(). The fuse server can
>>>> then perform io_uring read/write operations directly on these pages.
>>>> Non-page-backed args (eg out headers) go through the payload buffer as
>>>> normal.
>>>>
>>>> This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
>>>> buffers. Gating on pinned headers and buffers keeps the configuration
>>>> space small and avoids partially-optimized modes that are unlikely to be
>>>> useful in practice. Pages are unregistered when the request completes.
>>>>
>>>
>>> Can you elaborate a bit more on why CAP_SYS_ADMIN is needed here? It's
>>> not immediately obvious to me.
>>
>> Thank you for reviewing this series, Jeff!
>>
>> This is gated behind CAP_SYS_ADMIN because zero-copy allows the server
>> direct access to the client's underlying pages, rather than operating
>> on an intermediary buffer that the contents of client's pages were
>> copied into. A malicious unprivileged server could keep direct access
>> to the client's pages (eg even if the client tries to cancel a
>> read/write, if the request was already sent to userspace, the server
>> will still have access to the underlying pages). In the non-zero-copy
>> path this isn't possible because the server only operates on the copy
>> of the pages and not on the actual pages.
>>
>
> Thanks for the explanation. I'd suggest adding that to the commit
> message (and maybe comments near the CAP_SYS_ADMIN checks) in case
> others aren't clear why this is gated on that.
Silly question, isn't it very splice like? Fuse-server doesn't get the
access to the buffer, actually, but only forwards via io-uring. Splice
is allowed without CAP_SYS_ADMIN, so why does this need to have it?
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 13/14] fuse: add zero-copy over io-uring
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
2026-04-30 11:42 ` Jeff Layton
2026-04-30 12:56 ` Jeff Layton
@ 2026-05-05 23:45 ` Bernd Schubert
2 siblings, 0 replies; 49+ messages in thread
From: Bernd Schubert @ 2026-05-05 23:45 UTC (permalink / raw)
To: Joanne Koong, miklos; +Cc: axboe, linux-fsdevel
On 4/2/26 18:28, Joanne Koong wrote:
> Implement zero-copy data transfer for fuse over io-uring, eliminating
> memory copies between userspace, the kernel, and the fuse server for
> page-backed read/write operations.
>
> When the FUSE_URING_ZERO_COPY flag is set alongside FUSE_URING_BUFRING,
> the kernel registers the client's underlying pages as a sparse buffer at
> the entry's fixed id via io_buffer_register_bvec(). The fuse server can
> then perform io_uring read/write operations directly on these pages.
> Non-page-backed args (eg out headers) go through the payload buffer as
> normal.
>
> This requires CAP_SYS_ADMIN and buffer rings with pinned headers and
> buffers. Gating on pinned headers and buffers keeps the configuration
> space small and avoids partially-optimized modes that are unlikely to be
> useful in practice. Pages are unregistered when the request completes.
>
> The request flow for the zero-copy write path (client writes data,
> server reads it) is as follows:
> =======================================================================
> | Kernel | FUSE server
> | |
> | "write(fd, buf, 1MB)" |
> | |
> | >sys_write() |
> | >fuse_file_write_iter() |
> | >fuse_send_one() |
> | [req->args->in_pages = true] |
> | [folios hold client write data] |
> | |
> | >fuse_uring_copy_to_ring() |
> | >copy_header_to_ring(IN_OUT) |
> | [memcpy fuse_in_header to |
> | pinned headers buf via kaddr] |
> | >copy_header_to_ring(OP) |
> | [memcpy write_in header] |
> | |
> | >fuse_uring_args_to_ring() |
> | >setup_fuse_copy_state() |
> | [is_kaddr = true] |
> | [skip_folio_copy = true] |
> | |
> | >fuse_uring_set_up_zero_copy() |
> | [folio_get for each client folio] |
> | [build bio_vec array from folios] |
> | >io_buffer_register_bvec() |
> | [register pages at ent->id] |
Somehow I find ent->id really confusing here. ent->slot_idx? Or even
ent->tag?
> | [ent->zero_copied = true] |
> | |
> | >fuse_copy_args() |
> | [skip_folio_copy => return 0 |
> | for page arg, skip data copy] |
> | |
> | >copy_header_to_ring(RING_ENT) |
> | [memcpy ent_in_out] |
> | >io_uring_cmd_done() |
> | |
> | | [CQE received]
> | |
> | | [issue io_uring READ at
> | | ent->id]
> | | [reads directly from
> | |client's pages (ZERO_COPY)]
> | |
> | | [write data to backing
> | | store]
> | | [submit COMMIT AND FETCH]
> | |
> | >fuse_uring_commit_fetch() |
> | >fuse_uring_commit() |
> | >fuse_uring_copy_from_ring() |
> | >fuse_uring_req_end() |
> | >io_buffer_unregister(ent->id) |
> | [unregister sparse buffer] |
> | >fuse_zero_copy_release() |
> | [folio_put for each folio] |
> | [ent->zero_copied = false] |
> | >fuse_request_end() |
> | [wake up client] |
>
> The zero-copy read path is analogous.
>
> Some requests may have both page-backed args and non-page-backed args.
> For these requests, the page-backed args are zero-copied while the
> non-page-backed args are copied to the buffer selected from the buffer
> ring:
> zero-copy: pages registered via io_buffer_register_bvec()
> non-page-backed: copied to payload buffer via fuse_copy_args()
>
> For a request whose payload is zero-copied, the
> registration/unregistration path looks like:
>
> register: fuse_uring_set_up_zero_copy()
> folio_get() for each folio
> io_buffer_register_bvec(ent->id)
>
> [server accesses pages via io_uring fixed buf at ent->id]
>
> unregister: fuse_uring_req_end()
> io_buffer_unregister(ent->id)
> -> fuse_zero_copy_release() callback
> folio_put() for each folio
>
> The throughput improvement from zero-copy depends on how much of the
> per-request latency is spent on data copying vs backing I/O. When
> backing I/O dominates, the saved memcpy is a negligible fraction of
> overall latency. Please also note that for the server to read/write
> into the zero-copied pages, the read/write must go through io-uring
> as an IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED operation. If the
> server's backing I/O is instantaneous (eg served from cache), the
> overhead of the additional io_uring operation may negate the savings
> from eliminating the memcpy.
>
> In benchmarks using passthrough_hp on a high-performance NVMe-backed
> system, zero-copy showed around a 35% throughput improvement for direct
> randreads (~2150 MiB/s to ~2900 MiB/s), a 15% improvement for direct
> sequential reads (~2510 MiB/s to ~2900 MiB/s), a 15% improvement for
> buffered randreads (~2100 MiB/s to ~2470 MiB/s), and a 10% improvement
> for buffered sequential reads (~2500 MiB/s to ~2750 MiB/s).
>
> The benchmarks were run using:
> fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
> --size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> fs/fuse/dev.c | 7 +-
> fs/fuse/dev_uring.c | 167 +++++++++++++++++++++++++++++++++-----
> fs/fuse/dev_uring_i.h | 4 +
> fs/fuse/fuse_dev_i.h | 1 +
> include/uapi/linux/fuse.h | 5 ++
> 5 files changed, 160 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index a87939eaa103..cd326e61831b 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1233,10 +1233,13 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
>
> for (i = 0; !err && i < numargs; i++) {
> struct fuse_arg *arg = &args[i];
> - if (i == numargs - 1 && argpages)
> + if (i == numargs - 1 && argpages) {
> + if (cs->skip_folio_copy)
> + return 0;
> err = fuse_copy_folios(cs, arg->size, zeroing);
> - else
> + } else {
> err = fuse_copy_one(cs, arg->value, arg->size);
> + }
> }
> return err;
> }
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 06d3d8dc1c82..d9f1ee4beaf3 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,6 +31,11 @@ struct fuse_uring_pdu {
> struct fuse_ring_ent *ent;
> };
>
> +struct fuse_zero_copy_bvs {
> + unsigned int nr_bvs;
> + struct bio_vec bvs[];
> +};
> +
> static const struct fuse_iqueue_ops fuse_io_uring_ops;
>
> enum fuse_uring_header_type {
> @@ -57,6 +62,11 @@ static inline bool bufring_pinned_buffers(struct fuse_ring_queue *queue)
> return queue->bufring->use_pinned_buffers;
> }
>
> +static inline bool bufring_zero_copy(struct fuse_ring_queue *queue)
> +{
> + return queue->bufring->use_zero_copy;
> +}
> +
> static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
> struct fuse_ring_ent *ring_ent)
> {
> @@ -102,8 +112,18 @@ static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
> }
> }
>
> +static bool can_zero_copy_req(struct fuse_ring_ent *ent, struct fuse_req *req)
> +{
> + struct fuse_args *args = req->args;
> +
> + if (!bufring_enabled(ent->queue) || !bufring_zero_copy(ent->queue))
> + return false;
> +
> + return args->in_pages || args->out_pages;
> +}
> +
> static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
> - int error)
> + int error, unsigned int issue_flags)
> {
> struct fuse_ring_queue *queue = ent->queue;
> struct fuse_ring *ring = queue->ring;
> @@ -122,6 +142,11 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
>
> spin_unlock(&queue->lock);
>
> + if (ent->zero_copied) {
> + io_buffer_unregister(ent->cmd, ent->id, issue_flags);
> + ent->zero_copied = false;
> + }
> +
> if (error)
> req->out.h.error = error;
>
> @@ -485,6 +510,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> struct iovec iov[FUSE_URING_IOV_SEGS];
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
> void __user *payload, *headers;
> size_t headers_size, payload_size, ring_size;
> struct fuse_bufring *br;
> @@ -508,7 +534,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (headers_size < queue_depth * sizeof(struct fuse_uring_req_header))
> return -EINVAL;
>
> - if (buf_size < queue->ring->max_payload_sz)
> + if (!zero_copy && buf_size < queue->ring->max_payload_sz)
> return -EINVAL;
>
> nr_bufs = payload_size / buf_size;
> @@ -521,6 +547,7 @@ static int fuse_uring_bufring_setup(struct io_uring_cmd *cmd,
> if (!br)
> return -ENOMEM;
>
> + br->use_zero_copy = zero_copy;
> br->queue_depth = queue_depth;
> if (pinned_headers) {
> err = fuse_bufring_pin_mem(&br->pinned_headers, headers,
> @@ -580,6 +607,7 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> bool bufring = init_flags & FUSE_URING_BUFRING;
> bool pinned_headers = init_flags & FUSE_URING_PINNED_HEADERS;
> bool pinned_bufs = init_flags & FUSE_URING_PINNED_BUFFERS;
> + bool zero_copy = init_flags & FUSE_URING_ZERO_COPY;
>
> if (bufring_enabled(queue) != bufring)
> return false;
> @@ -588,7 +616,8 @@ static bool queue_init_flags_consistent(struct fuse_ring_queue *queue,
> return true;
>
> return bufring_pinned_headers(queue) == pinned_headers &&
> - bufring_pinned_buffers(queue) == pinned_bufs;
> + bufring_pinned_buffers(queue) == pinned_bufs &&
> + bufring_zero_copy(queue) == zero_copy;
> }
>
> static struct fuse_ring_queue *
> @@ -1063,6 +1092,7 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
> cs->is_kaddr = true;
> cs->kaddr = (void *)ent->payload_buf.addr;
> cs->len = ent->payload_buf.len;
> + cs->skip_folio_copy = ent->zero_copied;
> }
>
> cs->is_uring = true;
> @@ -1095,11 +1125,70 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> return err;
> }
>
> +static void fuse_zero_copy_release(void *priv)
> +{
> + struct fuse_zero_copy_bvs *zc_bvs = priv;
> + unsigned int i;
> +
> + for (i = 0; i < zc_bvs->nr_bvs; i++)
> + folio_put(page_folio(zc_bvs->bvs[i].bv_page));
> +
> + kfree(zc_bvs);
> +}
> +
> +static int fuse_uring_set_up_zero_copy(struct fuse_ring_ent *ent,
> + struct fuse_req *req,
> + unsigned int issue_flags)
> +{
> + struct fuse_args_pages *ap;
> + int err, i, ddir = 0;
> + struct fuse_zero_copy_bvs *zc_bvs;
> + struct bio_vec *bvs;
> +
> + /* out_pages indicates a read, in_pages indicates a write */
> + if (req->args->out_pages)
> + ddir |= IO_BUF_DEST;
> + if (req->args->in_pages)
> + ddir |= IO_BUF_SOURCE;
> +
> + WARN_ON_ONCE(!ddir);
> +
> + ap = container_of(req->args, typeof(*ap), args);
> +
> + zc_bvs = kmalloc(struct_size(zc_bvs, bvs, ap->num_folios),
> + GFP_KERNEL_ACCOUNT);
> + if (!zc_bvs)
> + return -ENOMEM;
> +
> + zc_bvs->nr_bvs = ap->num_folios;
> + bvs = zc_bvs->bvs;
> + for (i = 0; i < ap->num_folios; i++) {
> + bvs[i].bv_page = folio_page(ap->folios[i], 0);
Hmm, I thought everything was prepared for huge folios? Shouldn't this
function here be updated to handle that? Iterate over all folios add up
the number of pages and then iterate over all folios and pages?
> + bvs[i].bv_offset = ap->descs[i].offset;
> + bvs[i].bv_len = ap->descs[i].length;
> + folio_get(ap->folios[i]);
> + }
> +
Maybe a commment here like
/* ent->id is used in fuse-server with io_uring_prep_{write,read}_fixed */ ?
Thanks,
Bernd
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2026-05-05 23:45 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-02 16:28 [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Joanne Koong
2026-04-02 16:28 ` [PATCH v2 01/14] fuse: separate next request fetching from sending logic Joanne Koong
2026-04-29 11:52 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 02/14] fuse: refactor io-uring header copying to ring Joanne Koong
2026-04-29 12:05 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 03/14] fuse: refactor io-uring header copying from ring Joanne Koong
2026-04-29 12:06 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 04/14] fuse: use enum types for header copying Joanne Koong
2026-04-30 8:04 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 05/14] fuse: refactor setting up copy state for payload copying Joanne Koong
2026-04-30 8:06 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 06/14] fuse: support buffer copying for kernel addresses Joanne Koong
2026-04-30 8:19 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 07/14] fuse: use named constants for io-uring iovec indices Joanne Koong
2026-04-15 9:36 ` Bernd Schubert
2026-04-30 8:20 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 08/14] fuse: move fuse_uring_abort() from header to dev_uring.c Joanne Koong
2026-04-15 9:40 ` Bernd Schubert
2026-04-30 8:21 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 09/14] fuse: rearrange io-uring iovec and ent allocation logic Joanne Koong
2026-04-15 9:45 ` Bernd Schubert
2026-04-30 8:24 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 10/14] fuse: add io-uring buffer rings Joanne Koong
2026-04-15 9:48 ` Bernd Schubert
2026-04-15 21:40 ` Joanne Koong
2026-04-30 11:08 ` Jeff Layton
2026-04-30 12:44 ` Joanne Koong
2026-05-05 22:47 ` Bernd Schubert
2026-04-02 16:28 ` [PATCH v2 11/14] fuse: add pinned headers capability for " Joanne Koong
2026-04-14 12:47 ` Bernd Schubert
2026-04-15 0:48 ` Joanne Koong
2026-05-05 22:51 ` Bernd Schubert
2026-04-30 11:22 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 12/14] fuse: add pinned payload buffers " Joanne Koong
2026-04-30 11:29 ` Jeff Layton
2026-04-02 16:28 ` [PATCH v2 13/14] fuse: add zero-copy over io-uring Joanne Koong
2026-04-30 11:42 ` Jeff Layton
2026-04-30 12:35 ` Joanne Koong
2026-04-30 12:55 ` Jeff Layton
2026-05-05 22:55 ` Bernd Schubert
2026-04-30 12:56 ` Jeff Layton
2026-05-05 23:45 ` Bernd Schubert
2026-04-02 16:28 ` [PATCH v2 14/14] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
2026-04-14 21:05 ` Bernd Schubert
2026-04-15 1:10 ` Joanne Koong
2026-04-15 10:55 ` Bernd Schubert
2026-04-15 22:40 ` Joanne Koong
2026-04-30 12:57 ` Jeff Layton
2026-04-30 12:59 ` [PATCH v2 00/14] fuse: add io-uring buffer rings and zero-copy Jeff Layton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox