* [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy
From: Joanne Koong @ 2025-12-03 0:34 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
This series adds buffer ring and zero-copy capabilities to fuse over io-uring.
This requires adding a new kernel-managed buffer (kmbuf) ring type to
io-uring, where the buffers are provided and managed by the kernel
instead of by userspace.
On the io-uring side, the kmbuf interface is basically identical to pbufs.
The two differ mostly in how the memory region is set up and in whether it
is userspace or the kernel that recycles buffers back onto the ring.
Internally, the IOBL_KERNEL_MANAGED flag marks a buffer ring as kernel-managed.
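As a rough illustration (a sketch, not authoritative: error handling is
elided, ring_fd is an assumed io_uring fd, and the raw register syscall
is used since liburing wrappers are not part of this series),
registering a kmbuf ring with the UAPI added in patch 3 looks something
like:

        struct io_uring_buf_reg reg = {
                .buf_size     = 4096, /* non-zero and page-aligned */
                .ring_entries = 8,    /* power of 2, < 65536 */
                .bgid         = 0,    /* buffer group id */
                /* .flags and .resv must be zero for kmbuf rings */
        };

        ret = syscall(__NR_io_uring_register, ring_fd,
                      IORING_REGISTER_KMBUF_RING, &reg, 1);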
Patches 6 and 7 add the capability to pin buffer rings and the fixed buffer
table. While originally intended as an optimization, this is a necessity for
fuse because the ring entry (ent) headers reside at a different index than
the sqe's buf index, which would otherwise require tracking the refcount for
the imported buffer in a much messier way. There are also cases where fuse
needs to select buffers from the buffer ring in atomic contexts where the
uring mutex is not held; pinning the buffer ring allows buffers to be
selected through the underlying buffer list pointer, with synchronization
provided by the fuse queue spinlock.
The zero-copy work builds on top of the infrastructure added for
kernel-managed buffer rings (the bulk of which is in patch 21: "fuse: add
io-uring kernel-managed buffer ring") and that informs some of the design
choices for how fuse uses the kernel-managed buffer ring without zero-copy.
There was a previous submission for supporting registered buffers in fuse [1],
but that was abandoned in favor of kernel-managed buffer rings. Once
incremental buffer consumption is added in a later patchset, these give
significant memory-usage advantages by allowing the full buffer capacity to
be utilized across multiple requests, and they offer more flexibility for
future additions. They also make the userspace setup simpler.
The relevant fuse refactoring patches from the previous submission are
carried over into this one.
Benchmarks for zero-copy (patch 29) show approximately the following
differences in throughput for bs=1M:
direct randreads: ~20% increase (~2100 MB/s -> ~2600 MB/s)
buffered randreads: ~25% increase (~1900 MB/s -> ~2400 MB/s)
direct randwrites: no difference (~750 MB/s)
buffered randwrites: ~10% increase (~950 MB/s -> ~1050 MB/s)
The benchmark was run using fio on the passthrough_hp server:
fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
--size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
This series is on top of commit 5d24321e4c15 ("io_uring: Introduce
sockname...") in the io-uring tree, plus two locally applied fixups
[2] and [3].
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/20251027222808.2332692-1-joannelkoong@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20251125181347.667883-1-joannelkoong@gmail.com/
[3] https://lore.kernel.org/linux-fsdevel/20251021-io-uring-fixes-copy-finish-v1-0-913ecf8aa945@ddn.com/
Joanne Koong (30):
io_uring/kbuf: refactor io_buf_pbuf_register() logic into generic
helpers
io_uring/kbuf: rename io_unregister_pbuf_ring() to
io_unregister_buf_ring()
io_uring/kbuf: add support for kernel-managed buffer rings
io_uring/kbuf: add mmap support for kernel-managed buffer rings
io_uring/kbuf: support kernel-managed buffer rings in buffer selection
io_uring/kbuf: add buffer ring pinning/unpinning
io_uring/rsrc: add fixed buffer table pinning/unpinning
io_uring/kbuf: add recycling for pinned kernel managed buffer rings
io_uring: add io_uring_cmd_import_fixed_index()
io_uring/kbuf: add io_uring_is_kmbuf_ring()
io_uring/kbuf: return buffer id in buffer selection
io_uring/kbuf: export io_ring_buffer_select()
io_uring/cmd: set selected buffer index in __io_uring_cmd_done()
io_uring: add release callback for ring death
fuse: refactor io-uring logic for getting next fuse request
fuse: refactor io-uring header copying to ring
fuse: refactor io-uring header copying from ring
fuse: use enum types for header copying
fuse: refactor setting up copy state for payload copying
fuse: support buffer copying for kernel addresses
fuse: add io-uring kernel-managed buffer ring
io_uring/rsrc: refactor
io_buffer_register_bvec()/io_buffer_unregister_bvec()
io_uring/rsrc: split io_buffer_register_request() logic
io_uring/rsrc: Allow buffer release callback to be optional
io_uring/rsrc: add io_buffer_register_bvec()
io_uring/rsrc: export io_buffer_unregister
fuse: rename fuse_set_zero_arg0() to fuse_zero_in_arg0()
fuse: enforce op header for every payload reply
fuse: add zero-copy over io-uring
docs: fuse: add io-uring bufring and zero-copy documentation
Documentation/block/ublk.rst | 15 +-
.../filesystems/fuse/fuse-io-uring.rst | 55 +-
drivers/block/ublk_drv.c | 20 +-
fs/fuse/dax.c | 2 +-
fs/fuse/dev.c | 32 +-
fs/fuse/dev_uring.c | 775 +++++++++++++++---
fs/fuse/dev_uring_i.h | 47 +-
fs/fuse/dir.c | 13 +-
fs/fuse/file.c | 11 +-
fs/fuse/fuse_dev_i.h | 8 +-
fs/fuse/fuse_i.h | 8 +-
fs/fuse/readdir.c | 2 +-
fs/fuse/xattr.c | 18 +-
include/linux/io_uring.h | 9 +
include/linux/io_uring/buf.h | 98 +++
include/linux/io_uring/cmd.h | 25 +-
include/linux/io_uring_types.h | 21 +-
include/uapi/linux/fuse.h | 15 +-
include/uapi/linux/io_uring.h | 17 +-
io_uring/io_uring.c | 15 +
io_uring/kbuf.c | 337 ++++++--
io_uring/kbuf.h | 19 +-
io_uring/memmap.c | 117 ++-
io_uring/memmap.h | 4 +
io_uring/register.c | 9 +-
io_uring/rsrc.c | 188 ++++-
io_uring/rsrc.h | 6 +
io_uring/uring_cmd.c | 39 +-
28 files changed, 1632 insertions(+), 293 deletions(-)
create mode 100644 include/linux/io_uring/buf.h
--
2.47.3
* [PATCH v1 01/30] io_uring/kbuf: refactor io_buf_pbuf_register() logic into generic helpers
From: Joanne Koong @ 2025-12-03 0:34 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Refactor the logic in io_register_pbuf_ring() into generic helpers:
- io_validate_buf_reg(): Validate user input and buffer registration
parameters
- io_alloc_new_buffer_list(): Allocate and initialize a new buffer
list for the given buffer group ID
- io_setup_pbuf_ring(): Set up the provided buffer ring region and
handle its memory mapping
This is a preparatory change for upcoming kernel-managed buffer ring
support which will need to reuse some of these helpers.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
io_uring/kbuf.c | 123 ++++++++++++++++++++++++++++++++----------------
1 file changed, 82 insertions(+), 41 deletions(-)
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 8a329556f8df..c656cb433099 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -596,55 +596,71 @@ int io_manage_buffers_legacy(struct io_kiocb *req, unsigned int issue_flags)
return IOU_COMPLETE;
}
-int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+static int io_validate_buf_reg(struct io_uring_buf_reg *reg,
+ unsigned int permitted_flags)
{
- struct io_uring_buf_reg reg;
- struct io_buffer_list *bl;
- struct io_uring_region_desc rd;
- struct io_uring_buf_ring *br;
- unsigned long mmap_offset;
- unsigned long ring_size;
- int ret;
-
- lockdep_assert_held(&ctx->uring_lock);
-
- if (copy_from_user(&reg, arg, sizeof(reg)))
- return -EFAULT;
- if (!mem_is_zero(reg.resv, sizeof(reg.resv)))
+ if (!mem_is_zero(reg->resv, sizeof(reg->resv)))
return -EINVAL;
- if (reg.flags & ~(IOU_PBUF_RING_MMAP | IOU_PBUF_RING_INC))
+ if (reg->flags & ~permitted_flags)
return -EINVAL;
- if (!is_power_of_2(reg.ring_entries))
+ if (!is_power_of_2(reg->ring_entries))
return -EINVAL;
/* cannot disambiguate full vs empty due to head/tail size */
- if (reg.ring_entries >= 65536)
+ if (reg->ring_entries >= 65536)
return -EINVAL;
+ return 0;
+}
- bl = io_buffer_get_list(ctx, reg.bgid);
- if (bl) {
+static int io_alloc_new_buffer_list(struct io_ring_ctx *ctx,
+ struct io_uring_buf_reg *reg,
+ struct io_buffer_list **bl)
+{
+ struct io_buffer_list *list;
+
+ list = io_buffer_get_list(ctx, reg->bgid);
+ if (list) {
/* if mapped buffer ring OR classic exists, don't allow */
- if (bl->flags & IOBL_BUF_RING || !list_empty(&bl->buf_list))
+ if (list->flags & IOBL_BUF_RING || !list_empty(&list->buf_list))
return -EEXIST;
- io_destroy_bl(ctx, bl);
+ io_destroy_bl(ctx, list);
}
- bl = kzalloc(sizeof(*bl), GFP_KERNEL_ACCOUNT);
- if (!bl)
+ list = kzalloc(sizeof(*list), GFP_KERNEL_ACCOUNT);
+ if (!list)
return -ENOMEM;
- mmap_offset = (unsigned long)reg.bgid << IORING_OFF_PBUF_SHIFT;
- ring_size = flex_array_size(br, bufs, reg.ring_entries);
+ list->nr_entries = reg->ring_entries;
+ list->mask = reg->ring_entries - 1;
+ list->flags = IOBL_BUF_RING;
+
+ *bl = list;
+
+ return 0;
+}
+
+static int io_setup_pbuf_ring(struct io_ring_ctx *ctx,
+ struct io_uring_buf_reg *reg,
+ struct io_buffer_list *bl)
+{
+ struct io_uring_region_desc rd;
+ unsigned long mmap_offset;
+ unsigned long ring_size;
+ int ret;
+
+ mmap_offset = (unsigned long)reg->bgid << IORING_OFF_PBUF_SHIFT;
+ ring_size = flex_array_size(bl->buf_ring, bufs, reg->ring_entries);
memset(&rd, 0, sizeof(rd));
rd.size = PAGE_ALIGN(ring_size);
- if (!(reg.flags & IOU_PBUF_RING_MMAP)) {
- rd.user_addr = reg.ring_addr;
+ if (!(reg->flags & IOU_PBUF_RING_MMAP)) {
+ rd.user_addr = reg->ring_addr;
rd.flags |= IORING_MEM_REGION_TYPE_USER;
}
+
ret = io_create_region(ctx, &bl->region, &rd, mmap_offset);
if (ret)
- goto fail;
- br = io_region_get_ptr(&bl->region);
+ return ret;
+ bl->buf_ring = io_region_get_ptr(&bl->region);
#ifdef SHM_COLOUR
/*
@@ -656,25 +672,50 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
* should use IOU_PBUF_RING_MMAP instead, and liburing will handle
* this transparently.
*/
- if (!(reg.flags & IOU_PBUF_RING_MMAP) &&
- ((reg.ring_addr | (unsigned long)br) & (SHM_COLOUR - 1))) {
- ret = -EINVAL;
- goto fail;
+ if (!(reg->flags & IOU_PBUF_RING_MMAP) &&
+ ((reg->ring_addr | (unsigned long)bl->buf_ring) &
+ (SHM_COLOUR - 1))) {
+ io_free_region(ctx->user, &bl->region);
+ return -EINVAL;
}
#endif
- bl->nr_entries = reg.ring_entries;
- bl->mask = reg.ring_entries - 1;
- bl->flags |= IOBL_BUF_RING;
- bl->buf_ring = br;
+ return 0;
+}
+
+int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+{
+ unsigned int permitted_flags;
+ struct io_uring_buf_reg reg;
+ struct io_buffer_list *bl;
+ int ret;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+
+ permitted_flags = IOU_PBUF_RING_MMAP | IOU_PBUF_RING_INC;
+ ret = io_validate_buf_reg(&reg, permitted_flags);
+ if (ret)
+ return ret;
+
+ ret = io_alloc_new_buffer_list(ctx, &reg, &bl);
+ if (ret)
+ return ret;
+
+ ret = io_setup_pbuf_ring(ctx, &reg, bl);
+ if (ret) {
+ kfree(bl);
+ return ret;
+ }
+
if (reg.flags & IOU_PBUF_RING_INC)
bl->flags |= IOBL_INC;
+
io_buffer_add_list(ctx, bl, reg.bgid);
+
return 0;
-fail:
- io_free_region(ctx->user, &bl->region);
- kfree(bl);
- return ret;
}
int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
--
2.47.3
* [PATCH v1 02/30] io_uring/kbuf: rename io_unregister_pbuf_ring() to io_unregister_buf_ring()
From: Joanne Koong @ 2025-12-03 0:34 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Use the more generic name io_unregister_buf_ring() as this function will
be used for unregistering both provided buffer rings and kernel-managed
buffer rings.
This is a preparatory change for upcoming kernel-managed buffer ring
support.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
io_uring/kbuf.c | 2 +-
io_uring/kbuf.h | 2 +-
io_uring/register.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index c656cb433099..8f7ec4ebd990 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -718,7 +718,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
return 0;
}
-int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+int io_unregister_buf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
struct io_uring_buf_reg reg;
struct io_buffer_list *bl;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index bf15e26520d3..40b44f4fdb15 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -74,7 +74,7 @@ int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
int io_manage_buffers_legacy(struct io_kiocb *req, unsigned int issue_flags);
int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
-int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
+int io_unregister_buf_ring(struct io_ring_ctx *ctx, void __user *arg);
int io_register_pbuf_status(struct io_ring_ctx *ctx, void __user *arg);
bool io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
diff --git a/io_uring/register.c b/io_uring/register.c
index 62d39b3ff317..4c6879698844 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -750,7 +750,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
ret = -EINVAL;
if (!arg || nr_args != 1)
break;
- ret = io_unregister_pbuf_ring(ctx, arg);
+ ret = io_unregister_buf_ring(ctx, arg);
break;
case IORING_REGISTER_SYNC_CANCEL:
ret = -EINVAL;
--
2.47.3
* [PATCH v1 03/30] io_uring/kbuf: add support for kernel-managed buffer rings
From: Joanne Koong @ 2025-12-03 0:34 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add support for kernel-managed buffer rings (kmbuf rings), which allow
the kernel to allocate and manage the backing buffers for a buffer
ring, rather than requiring the application to provide and manage them.
This introduces two new registration opcodes:
- IORING_REGISTER_KMBUF_RING: Register a kernel-managed buffer ring
- IORING_UNREGISTER_KMBUF_RING: Unregister a kernel-managed buffer ring
The existing io_uring_buf_reg structure is extended with a union to
support both application-provided buffer rings (pbuf) and kernel-managed
buffer rings (kmbuf):
- For pbuf rings: ring_addr specifies the user-provided ring address
- For kmbuf rings: buf_size specifies the size of each buffer. buf_size
must be non-zero and page-aligned.
The implementation follows the same pattern as pbuf ring registration,
reusing the validation and buffer list allocation helpers introduced in
earlier refactoring. The IOBL_KERNEL_MANAGED flag marks buffer lists as
kernel-managed for appropriate handling in the I/O path.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/uapi/linux/io_uring.h | 15 ++++-
io_uring/kbuf.c | 76 +++++++++++++++++++++++
io_uring/kbuf.h | 7 ++-
io_uring/memmap.c | 112 ++++++++++++++++++++++++++++++++++
io_uring/memmap.h | 4 ++
io_uring/register.c | 7 +++
6 files changed, 217 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b5b23c0d5283..589755a4e2b4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -700,6 +700,10 @@ enum io_uring_register_op {
/* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
IORING_REGISTER_ZCRX_CTRL = 36,
+ /* register/unregister kernel-managed ring buffer group */
+ IORING_REGISTER_KMBUF_RING = 37,
+ IORING_UNREGISTER_KMBUF_RING = 38,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -869,9 +873,16 @@ enum io_uring_register_pbuf_ring_flags {
IOU_PBUF_RING_INC = 2,
};
-/* argument for IORING_(UN)REGISTER_PBUF_RING */
+/* argument for IORING_(UN)REGISTER_PBUF_RING and
+ * IORING_(UN)REGISTER_KMBUF_RING
+ */
struct io_uring_buf_reg {
- __u64 ring_addr;
+ union {
+ /* used for pbuf rings */
+ __u64 ring_addr;
+ /* used for kmbuf rings */
+ __u32 buf_size;
+ };
__u32 ring_entries;
__u16 bgid;
__u16 flags;
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 8f7ec4ebd990..1668718ac8fd 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -778,3 +778,79 @@ struct io_mapped_region *io_pbuf_get_region(struct io_ring_ctx *ctx,
return NULL;
return &bl->region;
}
+
+static int io_setup_kmbuf_ring(struct io_ring_ctx *ctx,
+ struct io_buffer_list *bl,
+ struct io_uring_buf_reg *reg)
+{
+ struct io_uring_buf_ring *ring;
+ unsigned long ring_size;
+ void *buf_region;
+ unsigned int i;
+ int ret;
+
+ /* allocate pages for the ring structure */
+ ring_size = flex_array_size(ring, bufs, bl->nr_entries);
+ ring = kzalloc(ring_size, GFP_KERNEL_ACCOUNT);
+ if (!ring)
+ return -ENOMEM;
+
+ ret = io_create_region_multi_buf(ctx, &bl->region, bl->nr_entries,
+ reg->buf_size);
+ if (ret) {
+ kfree(ring);
+ return ret;
+ }
+
+ /* initialize ring buf entries to point to the buffers */
+ buf_region = bl->region.ptr;
+ for (i = 0; i < bl->nr_entries; i++) {
+ struct io_uring_buf *buf = &ring->bufs[i];
+
+ buf->addr = (u64)buf_region;
+ buf->len = reg->buf_size;
+ buf->bid = i;
+
+ buf_region += reg->buf_size;
+ }
+ ring->tail = bl->nr_entries;
+
+ bl->buf_ring = ring;
+
+ return 0;
+}
+
+int io_register_kmbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+{
+ struct io_uring_buf_reg reg;
+ struct io_buffer_list *bl;
+ int ret;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+
+ ret = io_validate_buf_reg(&reg, 0);
+ if (ret)
+ return ret;
+
+ if (!reg.buf_size || !PAGE_ALIGNED(reg.buf_size))
+ return -EINVAL;
+
+ ret = io_alloc_new_buffer_list(ctx, &reg, &bl);
+ if (ret)
+ return ret;
+
+ ret = io_setup_kmbuf_ring(ctx, bl, &reg);
+ if (ret) {
+ kfree(bl);
+ return ret;
+ }
+
+ bl->flags |= IOBL_KERNEL_MANAGED;
+
+ io_buffer_add_list(ctx, bl, reg.bgid);
+
+ return 0;
+}
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 40b44f4fdb15..62c80a1ebf03 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -7,9 +7,11 @@
enum {
/* ring mapped provided buffers */
- IOBL_BUF_RING = 1,
+ IOBL_BUF_RING = 1,
/* buffers are consumed incrementally rather than always fully */
- IOBL_INC = 2,
+ IOBL_INC = 2,
+ /* buffers are kernel managed */
+ IOBL_KERNEL_MANAGED = 4,
};
struct io_buffer_list {
@@ -74,6 +76,7 @@ int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
int io_manage_buffers_legacy(struct io_kiocb *req, unsigned int issue_flags);
int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
+int io_register_kmbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
int io_unregister_buf_ring(struct io_ring_ctx *ctx, void __user *arg);
int io_register_pbuf_status(struct io_ring_ctx *ctx, void __user *arg);
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index dc4bfc5b6fb8..a46b027882f8 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -15,6 +15,28 @@
#include "rsrc.h"
#include "zcrx.h"
+static void release_multi_buf_pages(struct page **pages, unsigned long nr_pages)
+{
+ struct page *page;
+ unsigned int nr, i = 0;
+
+ while (nr_pages) {
+ page = pages[i];
+
+ if (!page || WARN_ON_ONCE(page != compound_head(page)))
+ return;
+
+ nr = compound_nr(page);
+ put_page(page);
+
+ if (WARN_ON_ONCE(nr > nr_pages))
+ return;
+
+ i += nr;
+ nr_pages -= nr;
+ }
+}
+
static bool io_mem_alloc_compound(struct page **pages, int nr_pages,
size_t size, gfp_t gfp)
{
@@ -86,6 +108,8 @@ enum {
IO_REGION_F_USER_PROVIDED = 2,
/* only the first page in the array is ref'ed */
IO_REGION_F_SINGLE_REF = 4,
+ /* pages in the array belong to multiple discrete allocations */
+ IO_REGION_F_MULTI_BUF = 8,
};
void io_free_region(struct user_struct *user, struct io_mapped_region *mr)
@@ -98,6 +122,8 @@ void io_free_region(struct user_struct *user, struct io_mapped_region *mr)
if (mr->flags & IO_REGION_F_USER_PROVIDED)
unpin_user_pages(mr->pages, nr_refs);
+ else if (mr->flags & IO_REGION_F_MULTI_BUF)
+ release_multi_buf_pages(mr->pages, nr_refs);
else
release_pages(mr->pages, nr_refs);
@@ -149,6 +175,54 @@ static int io_region_pin_pages(struct io_mapped_region *mr,
return 0;
}
+static int io_region_allocate_pages_multi_buf(struct io_mapped_region *mr,
+ unsigned int nr_bufs,
+ unsigned int buf_size)
+{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN;
+ struct page **pages, **cur_pages;
+ unsigned int nr_allocated;
+ unsigned int buf_pages;
+ unsigned int i;
+
+ if (!PAGE_ALIGNED(buf_size))
+ return -EINVAL;
+
+ buf_pages = buf_size >> PAGE_SHIFT;
+
+ pages = kvmalloc_array(mr->nr_pages, sizeof(*pages), gfp);
+ if (!pages)
+ return -ENOMEM;
+
+ cur_pages = pages;
+
+ for (i = 0; i < nr_bufs; i++) {
+ if (io_mem_alloc_compound(cur_pages, buf_pages, buf_size,
+ gfp)) {
+ cur_pages += buf_pages;
+ continue;
+ }
+
+ nr_allocated = alloc_pages_bulk_node(gfp, NUMA_NO_NODE,
+ buf_pages, cur_pages);
+ if (nr_allocated != buf_pages) {
+ unsigned int total =
+ (cur_pages - pages) + nr_allocated;
+
+ release_multi_buf_pages(pages, total);
+ kvfree(pages);
+ return -ENOMEM;
+ }
+
+ cur_pages += buf_pages;
+ }
+
+ mr->flags |= IO_REGION_F_MULTI_BUF;
+ mr->pages = pages;
+
+ return 0;
+}
+
static int io_region_allocate_pages(struct io_mapped_region *mr,
struct io_uring_region_desc *reg,
unsigned long mmap_offset)
@@ -181,6 +255,44 @@ static int io_region_allocate_pages(struct io_mapped_region *mr,
return 0;
}
+int io_create_region_multi_buf(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ unsigned int nr_bufs, unsigned int buf_size)
+{
+ unsigned long nr_pages;
+ int ret;
+
+ if (WARN_ON_ONCE(mr->pages || mr->ptr || mr->nr_pages))
+ return -EFAULT;
+
+ if (WARN_ON_ONCE(!nr_bufs || !buf_size))
+ return -EINVAL;
+
+ nr_pages = ((size_t)buf_size * nr_bufs) >> PAGE_SHIFT;
+ if (nr_pages > UINT_MAX)
+ return -E2BIG;
+
+ if (ctx->user) {
+ ret = __io_account_mem(ctx->user, nr_pages);
+ if (ret)
+ return ret;
+ }
+ mr->nr_pages = nr_pages;
+
+ ret = io_region_allocate_pages_multi_buf(mr, nr_bufs, buf_size);
+ if (ret)
+ goto out_free;
+
+ ret = io_region_init_ptr(mr);
+ if (ret)
+ goto out_free;
+
+ return 0;
+out_free:
+ io_free_region(ctx->user, mr);
+ return ret;
+}
+
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg,
unsigned long mmap_offset)
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index a39d9e518905..b09fc34d5eb9 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -21,6 +21,10 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg,
unsigned long mmap_offset);
+int io_create_region_multi_buf(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ unsigned int nr_bufs, unsigned int buf_size);
+
static inline void *io_region_get_ptr(struct io_mapped_region *mr)
{
return mr->ptr;
diff --git a/io_uring/register.c b/io_uring/register.c
index 4c6879698844..4aabf6e44083 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -746,7 +746,14 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_pbuf_ring(ctx, arg);
break;
+ case IORING_REGISTER_KMBUF_RING:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_kmbuf_ring(ctx, arg);
+ break;
case IORING_UNREGISTER_PBUF_RING:
+ case IORING_UNREGISTER_KMBUF_RING:
ret = -EINVAL;
if (!arg || nr_args != 1)
break;
--
2.47.3
* [PATCH v1 04/30] io_uring/kbuf: add mmap support for kernel-managed buffer rings
From: Joanne Koong @ 2025-12-03 0:34 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add support for mmapping kernel-managed buffer rings (kmbuf) to
userspace, allowing applications to access the kernel-allocated buffers.
Similar to application-provided buffer rings (pbuf), kmbuf rings use the
buffer group ID encoded in the mmap offset to identify which buffer ring
to map. The implementation follows the same pattern as pbuf rings.
New mmap offset constants are introduced:
- IORING_OFF_KMBUF_RING (0x88000000): Base offset for kmbuf mappings
- IORING_OFF_KMBUF_SHIFT (16): Shift value to encode buffer group ID
The mmap offset is calculated during registration, encoding the bgid
shifted by IORING_OFF_KMBUF_SHIFT. The io_buf_get_region() helper
retrieves the appropriate region.
This allows userspace to mmap the kernel-allocated buffer region and
access the buffers directly.
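As a hedged userspace sketch (buf_size, nr_entries, and bgid are
assumed to match what was registered; this mirrors the pbuf mmap
convention):

        __u64 off = IORING_OFF_KMBUF_RING |
                    ((__u64)bgid << IORING_OFF_KMBUF_SHIFT);
        void *bufs = mmap(NULL, (size_t)buf_size * nr_entries,
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, ring_fd, off);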
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/uapi/linux/io_uring.h | 2 ++
io_uring/kbuf.c | 11 +++++++++--
io_uring/kbuf.h | 5 +++--
io_uring/memmap.c | 5 ++++-
4 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 589755a4e2b4..96e936503ef6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -533,6 +533,8 @@ struct io_uring_cqe {
#define IORING_OFF_SQES 0x10000000ULL
#define IORING_OFF_PBUF_RING 0x80000000ULL
#define IORING_OFF_PBUF_SHIFT 16
+#define IORING_OFF_KMBUF_RING 0x88000000ULL
+#define IORING_OFF_KMBUF_SHIFT 16
#define IORING_OFF_MMAP_MASK 0xf8000000ULL
/*
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 1668718ac8fd..619bba43dda3 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -766,16 +766,23 @@ int io_register_pbuf_status(struct io_ring_ctx *ctx, void __user *arg)
return 0;
}
-struct io_mapped_region *io_pbuf_get_region(struct io_ring_ctx *ctx,
- unsigned int bgid)
+struct io_mapped_region *io_buf_get_region(struct io_ring_ctx *ctx,
+ unsigned int bgid,
+ bool kernel_managed)
{
struct io_buffer_list *bl;
+ bool is_kernel_managed;
lockdep_assert_held(&ctx->mmap_lock);
bl = xa_load(&ctx->io_bl_xa, bgid);
if (!bl || !(bl->flags & IOBL_BUF_RING))
return NULL;
+
+ is_kernel_managed = !!(bl->flags & IOBL_KERNEL_MANAGED);
+ if (is_kernel_managed != kernel_managed)
+ return NULL;
+
return &bl->region;
}
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 62c80a1ebf03..11d165888b8e 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -88,8 +88,9 @@ unsigned int __io_put_kbufs(struct io_kiocb *req, struct io_buffer_list *bl,
bool io_kbuf_commit(struct io_kiocb *req,
struct io_buffer_list *bl, int len, int nr);
-struct io_mapped_region *io_pbuf_get_region(struct io_ring_ctx *ctx,
- unsigned int bgid);
+struct io_mapped_region *io_buf_get_region(struct io_ring_ctx *ctx,
+ unsigned int bgid,
+ bool kernel_managed);
static inline bool io_kbuf_recycle_ring(struct io_kiocb *req,
struct io_buffer_list *bl)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index a46b027882f8..1832ef923e99 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -357,7 +357,10 @@ static struct io_mapped_region *io_mmap_get_region(struct io_ring_ctx *ctx,
return &ctx->sq_region;
case IORING_OFF_PBUF_RING:
id = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT;
- return io_pbuf_get_region(ctx, id);
+ return io_buf_get_region(ctx, id, false);
+ case IORING_OFF_KMBUF_RING:
+ id = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_KMBUF_SHIFT;
+ return io_buf_get_region(ctx, id, true);
case IORING_MAP_OFF_PARAM_REGION:
return &ctx->param_region;
case IORING_MAP_OFF_ZCRX_REGION:
--
2.47.3
* [PATCH v1 05/30] io_uring/kbuf: support kernel-managed buffer rings in buffer selection
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Allow kernel-managed buffers to be selected. This requires modifying the
io_br_sel struct to separate the address and val fields: a kernel
address, interpreted as a signed value, is negative, so it cannot be
distinguished from a negative error val while the two share a union.
Auto-commit any selected kernel-managed buffer.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring_types.h | 8 ++++----
io_uring/kbuf.c | 16 ++++++++++++----
2 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e1adb0d20a0a..36fac08db636 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -93,13 +93,13 @@ struct io_mapped_region {
*/
struct io_br_sel {
struct io_buffer_list *buf_list;
- /*
- * Some selection parts return the user address, others return an error.
- */
union {
+ /* for classic/ring provided buffers */
void __user *addr;
- ssize_t val;
+ /* for kernel-managed buffers */
+ void *kaddr;
};
+ ssize_t val;
};
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 619bba43dda3..00ab17a034b5 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -155,7 +155,8 @@ static int io_provided_buffers_select(struct io_kiocb *req, size_t *len,
return 1;
}
-static bool io_should_commit(struct io_kiocb *req, unsigned int issue_flags)
+static bool io_should_commit(struct io_kiocb *req, struct io_buffer_list *bl,
+ unsigned int issue_flags)
{
/*
* If we came in unlocked, we have no choice but to consume the
@@ -170,7 +171,11 @@ static bool io_should_commit(struct io_kiocb *req, unsigned int issue_flags)
if (issue_flags & IO_URING_F_UNLOCKED)
return true;
- /* uring_cmd commits kbuf upfront, no need to auto-commit */
+ /* kernel-managed buffers are auto-committed */
+ if (bl->flags & IOBL_KERNEL_MANAGED)
+ return true;
+
+ /* multishot uring_cmd commits kbuf upfront, no need to auto-commit */
if (!io_file_can_poll(req) && req->opcode != IORING_OP_URING_CMD)
return true;
return false;
@@ -200,9 +205,12 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
req->buf_index = buf->bid;
sel.buf_list = bl;
- sel.addr = u64_to_user_ptr(buf->addr);
+ if (bl->flags & IOBL_KERNEL_MANAGED)
+ sel.kaddr = (void *)buf->addr;
+ else
+ sel.addr = u64_to_user_ptr(buf->addr);
- if (io_should_commit(req, issue_flags)) {
+ if (io_should_commit(req, bl, issue_flags)) {
io_kbuf_commit(req, sel.buf_list, *len, 1);
sel.buf_list = NULL;
}
--
2.47.3
* [PATCH v1 06/30] io_uring/kbuf: add buffer ring pinning/unpinning
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add kernel APIs to pin and unpin buffer rings, preventing userspace from
unregistering a buffer ring while it is pinned by the kernel.
This provides a mechanism for kernel subsystems to safely access buffer
ring contents while ensuring the buffer ring remains valid. A pinned
buffer ring cannot be unregistered until explicitly unpinned; a
userspace attempt to unregister a pinned buffer ring fails with -EBUSY.
Pinning an already-pinned bufring is acceptable and returns 0.
The API accepts a "struct io_ring_ctx *ctx" rather than a cmd pointer,
as the buffer ring may need to be unpinned in contexts where a cmd is
not readily available.
This is a preparatory change for upcoming fuse usage of kernel-managed
buffer rings. Fuse needs to pin the buffer ring because it may need to
select a buffer in atomic contexts, which it can only do by using the
underlying buffer list pointer.
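A minimal kernel-side sketch of the intended usage (ctx, buf_group, and
issue_flags are assumed to come from the consumer):

        struct io_buffer_list *bl;
        int err;

        err = io_uring_buf_ring_pin(ctx, buf_group, issue_flags, &bl);
        if (err)
                return err;
        /* bl stays valid until unpinned; a userspace unregister attempt
         * now fails with -EBUSY */
        io_uring_buf_ring_unpin(ctx, buf_group, issue_flags);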
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 28 +++++++++++++++++++++++
io_uring/kbuf.c | 43 ++++++++++++++++++++++++++++++++++++
io_uring/kbuf.h | 5 +++++
3 files changed, 76 insertions(+)
create mode 100644 include/linux/io_uring/buf.h
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
new file mode 100644
index 000000000000..7a1cf197434d
--- /dev/null
+++ b/include/linux/io_uring/buf.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_BUF_H
+#define _LINUX_IO_URING_BUF_H
+
+#include <linux/io_uring_types.h>
+
+#if defined(CONFIG_IO_URING)
+int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
+ unsigned issue_flags, struct io_buffer_list **bl);
+int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
+ unsigned issue_flags);
+#else
+static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
+ unsigned buf_group,
+ unsigned issue_flags,
+ struct io_buffer_list **bl)
+{
+ return -EOPNOTSUPP;
+}
+static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
+ unsigned buf_group,
+ unsigned issue_flags)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_IO_URING */
+
+#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 00ab17a034b5..ddda1338e652 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -9,6 +9,7 @@
#include <linux/poll.h>
#include <linux/vmalloc.h>
#include <linux/io_uring.h>
+#include <linux/io_uring/buf.h>
#include <uapi/linux/io_uring.h>
@@ -237,6 +238,46 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
return sel;
}
+int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
+ unsigned issue_flags, struct io_buffer_list **bl)
+{
+ struct io_buffer_list *buffer_list;
+ int ret = -EINVAL;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ buffer_list = io_buffer_get_list(ctx, buf_group);
+ if (likely(buffer_list) && (buffer_list->flags & IOBL_BUF_RING)) {
+ buffer_list->flags |= IOBL_PINNED;
+ ret = 0;
+ *bl = buffer_list;
+ }
+
+ io_ring_submit_unlock(ctx, issue_flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(io_uring_buf_ring_pin);
+
+int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
+ unsigned issue_flags)
+{
+ struct io_buffer_list *bl;
+ int ret = -EINVAL;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ bl = io_buffer_get_list(ctx, buf_group);
+ if (likely(bl) && (bl->flags & IOBL_BUF_RING) &&
+ (bl->flags & IOBL_PINNED)) {
+ bl->flags &= ~IOBL_PINNED;
+ ret = 0;
+ }
+
+ io_ring_submit_unlock(ctx, issue_flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(io_uring_buf_ring_unpin);
+
/* cap it at a reasonable 256, will be one page even for 4K */
#define PEEK_MAX_IMPORT 256
@@ -743,6 +784,8 @@ int io_unregister_buf_ring(struct io_ring_ctx *ctx, void __user *arg)
return -ENOENT;
if (!(bl->flags & IOBL_BUF_RING))
return -EINVAL;
+ if (bl->flags & IOBL_PINNED)
+ return -EBUSY;
scoped_guard(mutex, &ctx->mmap_lock)
xa_erase(&ctx->io_bl_xa, bl->bgid);
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 11d165888b8e..781630c2cc10 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -12,6 +12,11 @@ enum {
IOBL_INC = 2,
/* buffers are kernel managed */
IOBL_KERNEL_MANAGED = 4,
+ /*
+ * buffer ring is pinned and cannot be unregistered by userspace until
+ * it has been unpinned
+ */
+ IOBL_PINNED = 8,
};
struct io_buffer_list {
--
2.47.3
* [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add kernel APIs to pin and unpin the buffer table for fixed buffers,
preventing userspace from unregistering or updating the fixed buffers
table while it is pinned by the kernel.
This has two advantages:
a) Eliminating the overhead of having to fetch and construct an iter
for a fixed buffer on every cmd. Instead, the caller can pin the buffer
table, fetch/construct the iter once, and use it across cmds for as
long as it needs, until it is ready to unpin the buffer table.
b) Allowing a fixed buffer lookup at any index. The buffer table must
be pinned to allow this; otherwise we would have to keep track of all
the nodes looked up by the io_kiocb so that their refcounts could be
adjusted properly. Requiring the buffer table to be pinned before a
buffer can be fetched at an arbitrary index keeps the bookkeeping much
simpler.
This is a preparatory patch for fuse io-uring's usage of fixed buffers.
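A hedged sketch of the intended pattern:

        err = io_uring_buf_table_pin(ctx, issue_flags);
        if (err)
                return err;
        /* the table is now stable: construct the iter once and reuse it
         * across cmds; userspace update/unregister of the table fails
         * with -EBUSY until the unpin below */
        io_uring_buf_table_unpin(ctx, issue_flags);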
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 13 +++++++++++
include/linux/io_uring_types.h | 9 ++++++++
io_uring/rsrc.c | 42 ++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index 7a1cf197434d..c997c01c24c4 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -9,6 +9,9 @@ int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
unsigned issue_flags, struct io_buffer_list **bl);
int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
unsigned issue_flags);
+
+int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags);
+int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -23,6 +26,16 @@ static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
{
return -EOPNOTSUPP;
}
+static inline int io_uring_buf_table_pin(struct io_ring_ctx *ctx,
+ unsigned issue_flags)
+{
+ return -EOPNOTSUPP;
+}
+static inline int io_uring_buf_table_unpin(struct io_ring_ctx *ctx,
+ unsigned issue_flags)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 36fac08db636..e1a75cfe57d9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -57,8 +57,17 @@ struct io_wq_work {
int cancel_seq;
};
+/*
+ * struct io_rsrc_data flag values:
+ *
+ * IO_RSRC_DATA_PINNED: data is pinned and cannot be unregistered by userspace
+ * until it has been unpinned. Currently this is only possible on buffer tables.
+ */
+#define IO_RSRC_DATA_PINNED BIT(0)
+
struct io_rsrc_data {
unsigned int nr;
+ u8 flags;
struct io_rsrc_node **nodes;
};
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 3765a50329a8..67331cae0a5a 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -9,6 +9,7 @@
#include <linux/hugetlb.h>
#include <linux/compat.h>
#include <linux/io_uring.h>
+#include <linux/io_uring/buf.h>
#include <linux/io_uring/cmd.h>
#include <uapi/linux/io_uring.h>
@@ -304,6 +305,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
return -ENXIO;
if (up->offset + nr_args > ctx->buf_table.nr)
return -EINVAL;
+ if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
+ return -EBUSY;
for (done = 0; done < nr_args; done++) {
struct io_rsrc_node *node;
@@ -615,6 +618,8 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
{
if (!ctx->buf_table.nr)
return -ENXIO;
+ if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
+ return -EBUSY;
io_rsrc_data_free(ctx, &ctx->buf_table);
return 0;
}
@@ -1580,3 +1585,40 @@ int io_prep_reg_iovec(struct io_kiocb *req, struct iou_vec *iv,
req->flags |= REQ_F_IMPORT_BUFFER;
return 0;
}
+
+int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags)
+{
+ struct io_rsrc_data *data;
+ int err = 0;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ data = &ctx->buf_table;
+ /* There was nothing registered. There is nothing to pin */
+ if (!data->nr)
+ err = -ENXIO;
+ else
+ data->flags |= IO_RSRC_DATA_PINNED;
+
+ io_ring_submit_unlock(ctx, issue_flags);
+ return err;
+}
+EXPORT_SYMBOL_GPL(io_uring_buf_table_pin);
+
+int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags)
+{
+ struct io_rsrc_data *data;
+ int err = 0;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ data = &ctx->buf_table;
+ if (WARN_ON_ONCE(!(data->flags & IO_RSRC_DATA_PINNED)))
+ err = -EINVAL;
+ else
+ data->flags &= ~IO_RSRC_DATA_PINNED;
+
+ io_ring_submit_unlock(ctx, issue_flags);
+ return err;
+}
+EXPORT_SYMBOL_GPL(io_uring_buf_table_unpin);
--
2.47.3
* [PATCH v1 08/30] io_uring/kbuf: add recycling for pinned kernel managed buffer rings
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add an interface for buffers to be recycled back into a pinned
kernel-managed buffer ring. This requires the caller to synchronize
recycling and selection.
This is a preparatory patch for fuse io-uring, which requires buffers to
be recycled without contending for the uring mutex, as the buffer may
need to be selected/recycled in an atomic context.
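A hedged sketch, assuming a consumer-side queue->lock that serializes
recycling against selection:

        spin_lock(&queue->lock);
        err = io_uring_kmbuf_recycle_pinned(req, bl, buf_addr, buf_len,
                                            bid);
        spin_unlock(&queue->lock);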
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 10 ++++++++++
io_uring/kbuf.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index c997c01c24c4..839c5a0b3bf3 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -12,6 +12,10 @@ int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags);
int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
+
+int io_uring_kmbuf_recycle_pinned(struct io_kiocb *req,
+ struct io_buffer_list *bl, u64 addr,
+ unsigned int len, unsigned int bid);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -36,6 +40,12 @@ static inline int io_uring_buf_table_unpin(struct io_ring_ctx *ctx,
{
return -EOPNOTSUPP;
}
+static inline int io_uring_kmbuf_recycle_pinned(struct io_kiocb *req,
+ struct io_buffer_list *bl, u64 addr,
+ unsigned int len, unsigned int bid)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index ddda1338e652..82a4c550633d 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -102,6 +102,39 @@ void io_kbuf_drop_legacy(struct io_kiocb *req)
req->kbuf = NULL;
}
+/* The caller is responsible for synchronizing recycling and selection */
+int io_uring_kmbuf_recycle_pinned(struct io_kiocb *req,
+ struct io_buffer_list *bl, u64 addr,
+ unsigned int len, unsigned int bid)
+{
+ struct io_uring_buf *buf;
+ struct io_uring_buf_ring *br;
+
+ if (WARN_ON_ONCE(req->flags & REQ_F_BUFFERS_COMMIT) ||
+ WARN_ON_ONCE(!(bl->flags & IOBL_PINNED)) ||
+ WARN_ON_ONCE(!(bl->flags & IOBL_BUF_RING)) ||
+ WARN_ON_ONCE(!(bl->flags & IOBL_KERNEL_MANAGED)))
+ return -EINVAL;
+
+ br = bl->buf_ring;
+
+ if (WARN_ON_ONCE((br->tail - bl->head) >= bl->nr_entries))
+ return -EINVAL;
+
+ buf = &br->bufs[(br->tail) & bl->mask];
+
+ buf->addr = addr;
+ buf->len = len;
+ buf->bid = bid;
+
+ req->flags &= ~REQ_F_BUFFER_RING;
+
+ br->tail++;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(io_uring_kmbuf_recycle_pinned);
+
bool io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
--
2.47.3
* [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add a new helper, io_uring_cmd_import_fixed_index(), which imports a
fixed buffer by buffer index. This requires the buffer table to have
been pinned beforehand; the caller is responsible for ensuring it does
not use the returned iter after the buffer table has been unpinned.
This is a preparatory patch needed for fuse-over-io-uring support, as
the metadata for fuse requests will be stored at the last index, which
will be different from the sqe's buffer index.
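A hedged sketch (hdr_idx is a hypothetical index; the buffer table must
already have been pinned via io_uring_buf_table_pin()):

        struct iov_iter iter;
        int err;

        err = io_uring_cmd_import_fixed_index(ioucmd, hdr_idx, ITER_DEST,
                                              &iter, issue_flags);
        if (err)
                return err;
        /* iter is only valid while the buffer table remains pinned */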
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/cmd.h | 10 ++++++++++
io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
io_uring/rsrc.h | 2 ++
io_uring/uring_cmd.c | 11 +++++++++++
4 files changed, 54 insertions(+)
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 375fd048c4cb..a4b5eae2e5d1 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -44,6 +44,9 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
size_t uvec_segs,
int ddir, struct iov_iter *iter,
unsigned issue_flags);
+int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
+ int ddir, struct iov_iter *iter,
+ unsigned int issue_flags);
/*
* Completes the request, i.e. posts an io_uring CQE and deallocates @ioucmd
@@ -100,6 +103,13 @@ static inline int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
{
return -EOPNOTSUPP;
}
+static inline int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd,
+ u16 buf_index, int ddir,
+ struct iov_iter *iter,
+ unsigned int issue_flags)
+{
+ return -EOPNOTSUPP;
+}
static inline void __io_uring_cmd_done(struct io_uring_cmd *cmd, s32 ret,
u64 ret2, unsigned issue_flags, bool is_cqe32)
{
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 67331cae0a5a..b6dd62118311 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
}
+int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
+ u16 buf_index, int ddir, unsigned issue_flags)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_rsrc_node *node;
+ struct io_mapped_ubuf *imu;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ if (buf_index >= req->ctx->buf_table.nr ||
+ !(ctx->buf_table.flags & IO_RSRC_DATA_PINNED)) {
+ io_ring_submit_unlock(ctx, issue_flags);
+ return -EINVAL;
+ }
+
+ /*
+ * We don't have to grab the reference on the node because the buffer
+ * table is pinned. The caller is responsible for ensuring the iter
+ * isn't used after the buffer table has been unpinned.
+ */
+ node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
+ io_ring_submit_unlock(ctx, issue_flags);
+
+ if (!node || !node->buf)
+ return -EFAULT;
+
+ imu = node->buf;
+
+ return io_import_fixed(ddir, iter, imu, imu->ubuf, imu->len);
+}
+
/* Lock two rings at once. The rings must be different! */
static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
{
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index d603f6a47f5e..658934f4d3ff 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -64,6 +64,8 @@ struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
u64 buf_addr, size_t len, int ddir,
unsigned issue_flags);
+int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
+ u16 buf_index, int ddir, unsigned issue_flags);
int io_import_reg_vec(int ddir, struct iov_iter *iter,
struct io_kiocb *req, struct iou_vec *vec,
unsigned nr_iovs, unsigned issue_flags);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 197474911f04..e077eba00efe 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -314,6 +314,17 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
}
EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_vec);
+int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
+ int ddir, struct iov_iter *iter,
+ unsigned int issue_flags)
+{
+ struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+
+ return io_import_reg_buf_index(req, iter, buf_index, ddir,
+ issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_index);
+
void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
{
struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
--
2.47.3
* [PATCH v1 10/30] io_uring/kbuf: add io_uring_is_kmbuf_ring()
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add a function io_uring_is_kmbuf_ring() that returns true if there is a
kernel-managed buffer ring at the specified buffer group.
This is a preparatory patch for upcoming fuse kernel-managed buffer
support, which needs to ensure the buffer ring registered by the server
is a kernel-managed buffer ring.
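For example, a hypothetical server-facing setup path might validate the
registered group along these lines:

        if (!io_uring_is_kmbuf_ring(ctx, buf_group, issue_flags))
                return -EINVAL;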
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 9 +++++++++
io_uring/kbuf.c | 19 +++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index 839c5a0b3bf3..90ab5cde7d11 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -16,6 +16,9 @@ int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
int io_uring_kmbuf_recycle_pinned(struct io_kiocb *req,
struct io_buffer_list *bl, u64 addr,
unsigned int len, unsigned int bid);
+
+bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx, unsigned int buf_group,
+ unsigned int issue_flags);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -46,6 +49,12 @@ static inline int io_uring_kmbuf_recycle_pinned(
{
return -EOPNOTSUPP;
}
+static inline bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx,
+ unsigned int buf_group,
+ unsigned int issue_flags)
+{
+ return false;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 82a4c550633d..8a94de6e530f 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -945,3 +945,22 @@ int io_register_kmbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
return 0;
}
+
+bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx, unsigned int buf_group,
+ unsigned int issue_flags)
+{
+ struct io_buffer_list *bl;
+ bool is_kmbuf_ring = false;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ bl = io_buffer_get_list(ctx, buf_group);
+ if (likely(bl) && (bl->flags & IOBL_KERNEL_MANAGED)) {
+ WARN_ON_ONCE(!(bl->flags & IOBL_BUF_RING));
+ is_kmbuf_ring = true;
+ }
+
+ io_ring_submit_unlock(ctx, issue_flags);
+ return is_kmbuf_ring;
+}
+EXPORT_SYMBOL_GPL(io_uring_is_kmbuf_ring);
--
2.47.3
* [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection
From: Joanne Koong @ 2025-12-03 0:35 UTC
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Return the id of the selected buffer in io_buffer_select(). This is
needed for kernel-managed buffer rings to later recycle the selected
buffer.
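A hedged sketch of how the id ties selection to a later recycle
(buf_group and the pinned ring are assumed consumer state):

        sel = io_uring_cmd_buffer_select(ioucmd, buf_group, &len,
                                         issue_flags);
        bid = sel.buf_id; /* saved for a later recycle of this buffer */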
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/cmd.h | 2 +-
include/linux/io_uring_types.h | 2 ++
io_uring/kbuf.c | 7 +++++--
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index a4b5eae2e5d1..795b846d1e11 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -74,7 +74,7 @@ void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd);
/*
* Select a buffer from the provided buffer group for multishot uring_cmd.
- * Returns the selected buffer address and size.
+ * Returns the selected buffer address, size, and id.
*/
struct io_br_sel io_uring_cmd_buffer_select(struct io_uring_cmd *ioucmd,
unsigned buf_group, size_t *len,
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e1a75cfe57d9..dcc95e73f12f 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -109,6 +109,8 @@ struct io_br_sel {
void *kaddr;
};
ssize_t val;
+ /* id of the selected buffer */
+ unsigned buf_id;
};
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 8a94de6e530f..3ecb6494adea 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -239,6 +239,7 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
req->buf_index = buf->bid;
sel.buf_list = bl;
+ sel.buf_id = buf->bid;
if (bl->flags & IOBL_KERNEL_MANAGED)
sel.kaddr = (void *)buf->addr;
else
@@ -262,10 +263,12 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
bl = io_buffer_get_list(ctx, buf_group);
if (likely(bl)) {
- if (bl->flags & IOBL_BUF_RING)
+ if (bl->flags & IOBL_BUF_RING) {
sel = io_ring_buffer_select(req, len, bl, issue_flags);
- else
+ } else {
sel.addr = io_provided_buffer_select(req, len, bl);
+ sel.buf_id = req->buf_index;
+ }
}
io_ring_submit_unlock(req->ctx, issue_flags);
return sel;
--
2.47.3
* [PATCH v1 12/30] io_uring/kbuf: export io_ring_buffer_select()
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (10 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 13/30] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
` (17 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Export io_ring_buffer_select() so that it may be used by callers who
pass in a pinned bufring without needing to grab the io_uring mutex.
This is a preparatory patch for fuse io-uring, which needs to select a
buffer from a kernel-managed bufring while the uring mutex may already
be held by in-progress commits, and may need to do so in atomic
contexts.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
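A rough sketch of the call pattern this enables, where synchronization
comes from the caller's own spinlock on a pinned buffer list rather
than the uring mutex (my_queue and pinned_bl are hypothetical; the
error convention matches kernel-managed rings, where sel.kaddr holds
the buffer address):

    struct io_br_sel sel;
    size_t len = 0;

    spin_lock(&my_queue->lock);
    sel = io_ring_buffer_select(req, &len, my_queue->pinned_bl, issue_flags);
    spin_unlock(&my_queue->lock);
    if (sel.val)
        return sel.val;
    if (!sel.kaddr)
        return -ENOENT;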
include/linux/io_uring/buf.h | 15 +++++++++++++++
io_uring/kbuf.c | 7 ++++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index 90ab5cde7d11..2b49c01fe2f5 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -19,6 +19,10 @@ int io_uring_kmbuf_recycle_pinned(struct io_kiocb *req,
bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx, unsigned int buf_group,
unsigned int issue_flags);
+
+struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
+ struct io_buffer_list *bl,
+ unsigned int issue_flags);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -55,6 +59,17 @@ static inline bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx,
{
return false;
}
+static inline struct io_br_sel io_ring_buffer_select(struct io_kiocb *req,
+ size_t *len,
+ struct io_buffer_list *bl,
+ unsigned int issue_flags)
+{
+ struct io_br_sel sel = {
+ .val = -EOPNOTSUPP,
+ };
+
+ return sel;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 3ecb6494adea..74804bf631e9 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -215,9 +215,9 @@ static bool io_should_commit(struct io_kiocb *req, struct io_buffer_list *bl,
return false;
}
-static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
- struct io_buffer_list *bl,
- unsigned int issue_flags)
+struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
+ struct io_buffer_list *bl,
+ unsigned int issue_flags)
{
struct io_uring_buf_ring *br = bl->buf_ring;
__u16 tail, head = bl->head;
@@ -251,6 +251,7 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
}
return sel;
}
+EXPORT_SYMBOL_GPL(io_ring_buffer_select);
struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
unsigned buf_group, unsigned int issue_flags)
--
2.47.3
* [PATCH v1 13/30] io_uring/cmd: set selected buffer index in __io_uring_cmd_done()
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (11 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 12/30] io_uring/kbuf: export io_ring_buffer_select() Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 14/30] io_uring: add release callback for ring death Joanne Koong
` (16 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
When a uring_cmd operation selects a buffer, the completion queue entry
should indicate which buffer was selected.
If a buffer was selected, set IORING_CQE_F_BUFFER in the completion
flags and encode the buffer index in the upper flag bits.
This will be needed for fuse, which needs to relay to userspace which
selected buffer contains the data.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
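On the userspace side, the selected buffer can then be recovered from
the CQE the same way as for provided buffers:

    if (cqe->flags & IORING_CQE_F_BUFFER)
        buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;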
io_uring/uring_cmd.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e077eba00efe..3eb10bbba177 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -142,6 +142,7 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2,
unsigned issue_flags, bool is_cqe32)
{
struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+ u32 cflags = 0;
if (WARN_ON_ONCE(req->flags & REQ_F_APOLL_MULTISHOT))
return;
@@ -151,7 +152,10 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2,
if (ret < 0)
req_set_fail(req);
- io_req_set_res(req, ret, 0);
+ if (req->flags & (REQ_F_BUFFER_SELECTED | REQ_F_BUFFER_RING))
+ cflags |= IORING_CQE_F_BUFFER |
+ (req->buf_index << IORING_CQE_BUFFER_SHIFT);
+ io_req_set_res(req, ret, cflags);
if (is_cqe32) {
if (req->ctx->flags & IORING_SETUP_CQE_MIXED)
req->cqe.flags |= IORING_CQE_F_32;
--
2.47.3
* [PATCH v1 14/30] io_uring: add release callback for ring death
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (12 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 13/30] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 22:25 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 15/30] fuse: refactor io-uring logic for getting next fuse request Joanne Koong
` (15 subsequent siblings)
29 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Allow registering a release callback on a ring context that will be
called when the ring is about to be destroyed.
This is a preparatory patch for fuse. Fuse will be pinning buffers and
registering bvecs, which requires cleanup whenever a server
disconnects. It needs to know whether the ring is still alive when the
server disconnects, to avoid double-freeing or accessing invalid memory.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
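A minimal sketch of the intended usage, mirroring what the fuse patches
later in this series do (my_queue_release() and my_queue are
hypothetical):

    static void my_queue_release(void *priv)
    {
        struct my_queue *q = priv;

        /* the ring is going away; skip unpinning against it later */
        q->ring_killed = true;
    }

    /* at setup time, e.g. from a uring_cmd handler: */
    io_uring_set_release_callback(ctx, my_queue_release, q, issue_flags);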
include/linux/io_uring.h | 9 +++++++++
include/linux/io_uring_types.h | 2 ++
io_uring/io_uring.c | 15 +++++++++++++++
3 files changed, 26 insertions(+)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..327fd8ac6e42 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_IO_URING_H
#define _LINUX_IO_URING_H
+#include <linux/io_uring_types.h>
#include <linux/sched.h>
#include <linux/xarray.h>
#include <uapi/linux/io_uring.h>
@@ -28,6 +29,9 @@ static inline void io_uring_free(struct task_struct *tsk)
if (tsk->io_uring)
__io_uring_free(tsk);
}
+void io_uring_set_release_callback(struct io_ring_ctx *ctx,
+ void (*release)(void *), void *priv,
+ unsigned int issue_flags);
#else
static inline void io_uring_task_cancel(void)
{
@@ -46,6 +50,11 @@ static inline bool io_is_uring_fops(struct file *file)
{
return false;
}
+static inline void
+io_uring_set_release_callback(struct io_ring_ctx *ctx, void (*release)(void *),
+ void *priv, unsigned int issue_flags)
+{
+}
#endif
#endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index dcc95e73f12f..67c66658e3ec 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -441,6 +441,8 @@ struct io_ring_ctx {
struct work_struct exit_work;
struct list_head tctx_list;
struct completion ref_comp;
+ void (*release)(void *);
+ void *priv;
/* io-wq management, e.g. thread count */
u32 iowq_limits[2];
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1e58fc1d5667..04ffcfa6f2d6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2952,6 +2952,19 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
return mask;
}
+void io_uring_set_release_callback(struct io_ring_ctx *ctx,
+ void (*release)(void *), void *priv,
+ unsigned int issue_flags)
+{
+ io_ring_submit_lock(ctx, issue_flags);
+
+ ctx->release = release;
+ ctx->priv = priv;
+
+ io_ring_submit_unlock(ctx, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_uring_set_release_callback);
+
struct io_tctx_exit {
struct callback_head task_work;
struct completion completion;
@@ -3099,6 +3112,8 @@ static int io_uring_release(struct inode *inode, struct file *file)
struct io_ring_ctx *ctx = file->private_data;
file->private_data = NULL;
+ if (ctx->release)
+ ctx->release(ctx->priv);
io_ring_ctx_wait_and_kill(ctx);
return 0;
}
--
2.47.3
* [PATCH v1 15/30] fuse: refactor io-uring logic for getting next fuse request
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (13 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 14/30] io_uring: add release callback for ring death Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 16/30] fuse: refactor io-uring header copying to ring Joanne Koong
` (14 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Simplify the logic for getting the next fuse request.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 78 ++++++++++++++++-----------------------------
1 file changed, 28 insertions(+), 50 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 5ceb217ced1b..1efee4391af5 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -714,34 +714,6 @@ static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
return err;
}
-/*
- * Write data to the ring buffer and send the request to userspace,
- * userspace will read it
- * This is comparable with classical read(/dev/fuse)
- */
-static int fuse_uring_send_next_to_ring(struct fuse_ring_ent *ent,
- struct fuse_req *req,
- unsigned int issue_flags)
-{
- struct fuse_ring_queue *queue = ent->queue;
- int err;
- struct io_uring_cmd *cmd;
-
- err = fuse_uring_prepare_send(ent, req);
- if (err)
- return err;
-
- spin_lock(&queue->lock);
- cmd = ent->cmd;
- ent->cmd = NULL;
- ent->state = FRRS_USERSPACE;
- list_move_tail(&ent->list, &queue->ent_in_userspace);
- spin_unlock(&queue->lock);
-
- io_uring_cmd_done(cmd, 0, issue_flags);
- return 0;
-}
-
/*
* Make a ring entry available for fuse_req assignment
*/
@@ -838,11 +810,13 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
}
/*
- * Get the next fuse req and send it
+ * Get the next fuse req.
+ *
+ * Returns true if the next fuse request has been assigned to the ent.
+ * Else, there is no next fuse request and this returns false.
*/
-static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
- struct fuse_ring_queue *queue,
- unsigned int issue_flags)
+static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
+ struct fuse_ring_queue *queue)
{
int err;
struct fuse_req *req;
@@ -854,10 +828,12 @@ static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ent,
spin_unlock(&queue->lock);
if (req) {
- err = fuse_uring_send_next_to_ring(ent, req, issue_flags);
+ err = fuse_uring_prepare_send(ent, req);
if (err)
goto retry;
}
+
+ return req != NULL;
}
static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
@@ -875,6 +851,20 @@ static int fuse_ring_ent_set_commit(struct fuse_ring_ent *ent)
return 0;
}
+static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
+ ssize_t ret, unsigned int issue_flags)
+{
+ struct fuse_ring_queue *queue = ent->queue;
+
+ spin_lock(&queue->lock);
+ ent->state = FRRS_USERSPACE;
+ list_move_tail(&ent->list, &queue->ent_in_userspace);
+ ent->cmd = NULL;
+ spin_unlock(&queue->lock);
+
+ io_uring_cmd_done(cmd, ret, issue_flags);
+}
+
/* FUSE_URING_CMD_COMMIT_AND_FETCH handler */
static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
struct fuse_conn *fc)
@@ -946,7 +936,8 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
* and fetching is done in one step vs legacy fuse, which has separated
* read (fetch request) and write (commit result).
*/
- fuse_uring_next_fuse_req(ent, queue, issue_flags);
+ if (fuse_uring_get_next_fuse_req(ent, queue))
+ fuse_uring_send(ent, cmd, 0, issue_flags);
return 0;
}
@@ -1194,20 +1185,6 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
return -EIOCBQUEUED;
}
-static void fuse_uring_send(struct fuse_ring_ent *ent, struct io_uring_cmd *cmd,
- ssize_t ret, unsigned int issue_flags)
-{
- struct fuse_ring_queue *queue = ent->queue;
-
- spin_lock(&queue->lock);
- ent->state = FRRS_USERSPACE;
- list_move_tail(&ent->list, &queue->ent_in_userspace);
- ent->cmd = NULL;
- spin_unlock(&queue->lock);
-
- io_uring_cmd_done(cmd, ret, issue_flags);
-}
-
/*
* This prepares and sends the ring request in fuse-uring task context.
* User buffers are not mapped yet - the application does not have permission
@@ -1224,8 +1201,9 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
if (!tw.cancel) {
err = fuse_uring_prepare_send(ent, ent->fuse_req);
if (err) {
- fuse_uring_next_fuse_req(ent, queue, issue_flags);
- return;
+ if (!fuse_uring_get_next_fuse_req(ent, queue))
+ return;
+ err = 0;
}
} else {
err = -ECANCELED;
--
2.47.3
* [PATCH v1 16/30] fuse: refactor io-uring header copying to ring
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (14 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 15/30] fuse: refactor io-uring logic for getting next fuse request Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 17/30] fuse: refactor io-uring header copying from ring Joanne Koong
` (13 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Move the logic for copying headers to the ring into a new
copy_header_to_ring() helper. This consolidates the error handling.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 1efee4391af5..7962a9876031 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -575,6 +575,18 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
return err;
}
+static __always_inline int copy_header_to_ring(void __user *ring,
+ const void *header,
+ size_t header_size)
+{
+ if (copy_to_user(ring, header, header_size)) {
+ pr_info_ratelimited("Copying header to ring failed.\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -637,13 +649,11 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
* Some op code have that as zero size.
*/
if (args->in_args[0].size > 0) {
- err = copy_to_user(&ent->headers->op_in, in_args->value,
- in_args->size);
- if (err) {
- pr_info_ratelimited(
- "Copying the header failed.\n");
- return -EFAULT;
- }
+ err = copy_header_to_ring(&ent->headers->op_in,
+ in_args->value,
+ in_args->size);
+ if (err)
+ return err;
}
in_args++;
num_args--;
@@ -659,9 +669,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
- err = copy_to_user(&ent->headers->ring_ent_in_out, &ent_in_out,
- sizeof(ent_in_out));
- return err ? -EFAULT : 0;
+ return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
+ sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
@@ -690,14 +699,8 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
}
/* copy fuse_in_header */
- err = copy_to_user(&ent->headers->in_out, &req->in.h,
- sizeof(req->in.h));
- if (err) {
- err = -EFAULT;
- return err;
- }
-
- return 0;
+ return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
+ sizeof(req->in.h));
}
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
--
2.47.3
* [PATCH v1 17/30] fuse: refactor io-uring header copying from ring
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (15 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 16/30] fuse: refactor io-uring header copying to ring Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 18/30] fuse: use enum types for header copying Joanne Koong
` (12 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Move the logic for copying headers from the ring into a new
copy_header_from_ring() helper. This consolidates the error handling.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7962a9876031..e8ee51bfa5fc 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -587,6 +587,18 @@ static __always_inline int copy_header_to_ring(void __user *ring,
return 0;
}
+static __always_inline int copy_header_from_ring(void *header,
+ const void __user *ring,
+ size_t header_size)
+{
+ if (copy_from_user(header, ring, header_size)) {
+ pr_info_ratelimited("Copying header from ring failed.\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -597,10 +609,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
int err;
struct fuse_uring_ent_in_out ring_in_out;
- err = copy_from_user(&ring_in_out, &ent->headers->ring_ent_in_out,
- sizeof(ring_in_out));
+ err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
+ sizeof(ring_in_out));
if (err)
- return -EFAULT;
+ return err;
err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
&iter);
@@ -794,10 +806,10 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
struct fuse_conn *fc = ring->fc;
ssize_t err = 0;
- err = copy_from_user(&req->out.h, &ent->headers->in_out,
- sizeof(req->out.h));
+ err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
+ sizeof(req->out.h));
if (err) {
- req->out.h.error = -EFAULT;
+ req->out.h.error = err;
goto out;
}
--
2.47.3
* [PATCH v1 18/30] fuse: use enum types for header copying
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (16 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 17/30] fuse: refactor io-uring header copying from ring Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 19/30] fuse: refactor setting up copy state for payload copying Joanne Koong
` (11 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Use an enum type to identify which part of the header needs to be
copied. This improves the interface and will simplify resolving both
kernel-space and user-space header addresses once kernel-managed buffer
rings are added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 57 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 47 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e8ee51bfa5fc..d16f6b3489c1 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -31,6 +31,15 @@ struct fuse_uring_pdu {
static const struct fuse_iqueue_ops fuse_io_uring_ops;
+enum fuse_uring_header_type {
+ /* struct fuse_in_header / struct fuse_out_header */
+ FUSE_URING_HEADER_IN_OUT,
+ /* per op code header */
+ FUSE_URING_HEADER_OP,
+ /* struct fuse_uring_ent_in_out header */
+ FUSE_URING_HEADER_RING_ENT,
+};
+
static void uring_cmd_set_ring_ent(struct io_uring_cmd *cmd,
struct fuse_ring_ent *ring_ent)
{
@@ -575,10 +584,32 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
return err;
}
-static __always_inline int copy_header_to_ring(void __user *ring,
+static void __user *get_user_ring_header(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type)
+{
+ switch (type) {
+ case FUSE_URING_HEADER_IN_OUT:
+ return &ent->headers->in_out;
+ case FUSE_URING_HEADER_OP:
+ return &ent->headers->op_in;
+ case FUSE_URING_HEADER_RING_ENT:
+ return &ent->headers->ring_ent_in_out;
+ }
+
+ WARN_ON_ONCE(1);
+ return NULL;
+}
+
+static __always_inline int copy_header_to_ring(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type,
const void *header,
size_t header_size)
{
+ void __user *ring = get_user_ring_header(ent, type);
+
+ if (!ring)
+ return -EINVAL;
+
if (copy_to_user(ring, header, header_size)) {
pr_info_ratelimited("Copying header to ring failed.\n");
return -EFAULT;
@@ -587,10 +618,16 @@ static __always_inline int copy_header_to_ring(void __user *ring,
return 0;
}
-static __always_inline int copy_header_from_ring(void *header,
- const void __user *ring,
+static __always_inline int copy_header_from_ring(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type,
+ void *header,
size_t header_size)
{
+ const void __user *ring = get_user_ring_header(ent, type);
+
+ if (!ring)
+ return -EINVAL;
+
if (copy_from_user(header, ring, header_size)) {
pr_info_ratelimited("Copying header from ring failed.\n");
return -EFAULT;
@@ -609,8 +646,8 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
int err;
struct fuse_uring_ent_in_out ring_in_out;
- err = copy_header_from_ring(&ring_in_out, &ent->headers->ring_ent_in_out,
- sizeof(ring_in_out));
+ err = copy_header_from_ring(ent, FUSE_URING_HEADER_RING_ENT,
+ &ring_in_out, sizeof(ring_in_out));
if (err)
return err;
@@ -661,7 +698,7 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
* Some op code have that as zero size.
*/
if (args->in_args[0].size > 0) {
- err = copy_header_to_ring(&ent->headers->op_in,
+ err = copy_header_to_ring(ent, FUSE_URING_HEADER_OP,
in_args->value,
in_args->size);
if (err)
@@ -681,8 +718,8 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
- return copy_header_to_ring(&ent->headers->ring_ent_in_out, &ent_in_out,
- sizeof(ent_in_out));
+ return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
+ &ent_in_out, sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
@@ -711,7 +748,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
}
/* copy fuse_in_header */
- return copy_header_to_ring(&ent->headers->in_out, &req->in.h,
+ return copy_header_to_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->in.h,
sizeof(req->in.h));
}
@@ -806,7 +843,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
struct fuse_conn *fc = ring->fc;
ssize_t err = 0;
- err = copy_header_from_ring(&req->out.h, &ent->headers->in_out,
+ err = copy_header_from_ring(ent, FUSE_URING_HEADER_IN_OUT, &req->out.h,
sizeof(req->out.h));
if (err) {
req->out.h.error = err;
--
2.47.3
* [PATCH v1 19/30] fuse: refactor setting up copy state for payload copying
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (17 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 18/30] fuse: use enum types for header copying Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 20/30] fuse: support buffer copying for kernel addresses Joanne Koong
` (10 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add a new helper function setup_fuse_copy_state() to contain the logic
for setting up the copy state for payload copying.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index d16f6b3489c1..b57871f92d08 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -636,6 +636,27 @@ static __always_inline int copy_header_from_ring(struct fuse_ring_ent *ent,
return 0;
}
+static int setup_fuse_copy_state(struct fuse_copy_state *cs,
+ struct fuse_ring *ring, struct fuse_req *req,
+ struct fuse_ring_ent *ent, int dir,
+ struct iov_iter *iter)
+{
+ int err;
+
+ err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
+ if (err) {
+ pr_info_ratelimited("fuse: Import of user buffer failed\n");
+ return err;
+ }
+
+ fuse_copy_init(cs, dir == ITER_DEST, iter);
+
+ cs->is_uring = true;
+ cs->req = req;
+
+ return 0;
+}
+
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
struct fuse_ring_ent *ent)
@@ -651,15 +672,10 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
if (err)
return err;
- err = import_ubuf(ITER_SOURCE, ent->payload, ring->max_payload_sz,
- &iter);
+ err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_SOURCE, &iter);
if (err)
return err;
- fuse_copy_init(&cs, false, &iter);
- cs.is_uring = true;
- cs.req = req;
-
err = fuse_copy_out_args(&cs, args, ring_in_out.payload_sz);
fuse_copy_finish(&cs);
return err;
@@ -682,15 +698,9 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
.commit_id = req->in.h.unique,
};
- err = import_ubuf(ITER_DEST, ent->payload, ring->max_payload_sz, &iter);
- if (err) {
- pr_info_ratelimited("fuse: Import of user buffer failed\n");
+ err = setup_fuse_copy_state(&cs, ring, req, ent, ITER_DEST, &iter);
+ if (err)
return err;
- }
-
- fuse_copy_init(&cs, true, &iter);
- cs.is_uring = true;
- cs.req = req;
if (num_args > 0) {
/*
--
2.47.3
* [PATCH v1 20/30] fuse: support buffer copying for kernel addresses
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (18 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 19/30] fuse: refactor setting up copy state for payload copying Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 21/30] fuse: add io-uring kernel-managed buffer ring Joanne Koong
` (9 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
This is a preparatory patch needed to support kernel-managed ring
buffers in fuse-over-io-uring. For kernel-managed ring buffers, we get
the vmapped address of the buffer, which we can use directly.
Currently, buffer copying in fuse only supports extracting underlying
pages from an iov iter and kmapping them. This commit allows buffer
copying to work directly on a kaddr.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
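As a sketch, a caller holding a vmapped buffer sets up the copy state
roughly as follows (mirroring what a later patch does in
setup_fuse_copy_state(); kaddr and len are assumed to come from a
selected kernel-managed ring buffer, and the iter is not consumed in
kaddr mode):

    struct fuse_copy_state cs;
    struct iov_iter iter;

    fuse_copy_init(&cs, /* write */ true, &iter);
    cs.is_kaddr = true;
    cs.kaddr = kaddr;  /* vmapped, so no kmap_local_page() needed */
    cs.len = len;
    cs.is_uring = true;
    cs.req = req;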
fs/fuse/dev.c | 23 +++++++++++++++++------
fs/fuse/fuse_dev_i.h | 7 ++++++-
2 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 49b18d7accb3..820d02f01b47 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -848,6 +848,9 @@ void fuse_copy_init(struct fuse_copy_state *cs, bool write,
/* Unmap and put previous page of userspace buffer */
void fuse_copy_finish(struct fuse_copy_state *cs)
{
+ if (cs->is_kaddr)
+ return;
+
if (cs->currbuf) {
struct pipe_buffer *buf = cs->currbuf;
@@ -873,6 +876,9 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
struct page *page;
int err;
+ if (cs->is_kaddr)
+ return 0;
+
err = unlock_request(cs->req);
if (err)
return err;
@@ -931,15 +937,20 @@ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
{
unsigned ncpy = min(*size, cs->len);
if (val) {
- void *pgaddr = kmap_local_page(cs->pg);
- void *buf = pgaddr + cs->offset;
+ void *pgaddr, *buf;
+ if (!cs->is_kaddr) {
+ pgaddr = kmap_local_page(cs->pg);
+ buf = pgaddr + cs->offset;
+ } else {
+ buf = cs->kaddr + cs->offset;
+ }
if (cs->write)
memcpy(buf, *val, ncpy);
else
memcpy(*val, buf, ncpy);
-
- kunmap_local(pgaddr);
+ if (!cs->is_kaddr)
+ kunmap_local(pgaddr);
*val += ncpy;
}
*size -= ncpy;
@@ -1127,7 +1138,7 @@ static int fuse_copy_folio(struct fuse_copy_state *cs, struct folio **foliop,
}
while (count) {
- if (cs->write && cs->pipebufs && folio) {
+ if (cs->write && cs->pipebufs && folio && !cs->is_kaddr) {
/*
* Can't control lifetime of pipe buffers, so always
* copy user pages.
@@ -1139,7 +1150,7 @@ static int fuse_copy_folio(struct fuse_copy_state *cs, struct folio **foliop,
} else {
return fuse_ref_folio(cs, folio, offset, count);
}
- } else if (!cs->len) {
+ } else if (!cs->len && !cs->is_kaddr) {
if (cs->move_folios && folio &&
offset == 0 && count == size) {
err = fuse_try_move_folio(cs, foliop);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 134bf44aff0d..aa1d25421054 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -28,12 +28,17 @@ struct fuse_copy_state {
struct pipe_buffer *currbuf;
struct pipe_inode_info *pipe;
unsigned long nr_segs;
- struct page *pg;
+ union {
+ struct page *pg;
+ void *kaddr;
+ };
unsigned int len;
unsigned int offset;
bool write:1;
bool move_folios:1;
bool is_uring:1;
+ /* if set, use kaddr; otherwise use pg */
+ bool is_kaddr:1;
struct {
unsigned int copied_sz; /* copied size into the user buffer */
} ring;
--
2.47.3
* [PATCH v1 21/30] fuse: add io-uring kernel-managed buffer ring
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (19 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 20/30] fuse: support buffer copying for kernel addresses Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec() Joanne Koong
` (8 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add io-uring kernel-managed buffer ring capability for fuse daemons
communicating through the io-uring interface.
This has two benefits:
a) eliminates the overhead of pinning/unpinning user pages and
translating virtual addresses for every server-kernel interaction
b) reduces the amount of memory needed for the buffers per queue and
allows buffers to be reused across entries. Incremental buffer
consumption, when added, will allow a buffer to be used across multiple
requests.
Buffer ring usage is set on a per-queue basis. In order to use this, the
daemon needs to have preregistered a kernel-managed buffer ring and a
fixed buffer at index 0 that will hold all the headers, and set the
"use_bufring" field during registration. The kernel-managed buffer ring
and fixed buffer will be pinned for the lifetime of the connection.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
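On the daemon side, opting in looks roughly like the following sketch
(liburing setup and the buffer-ring registration itself are elided; the
field names match the uapi change below, everything else is schematic):

    struct fuse_uring_cmd_req *req = (struct fuse_uring_cmd_req *)sqe->cmd;

    /* headers live in the fixed buffer at index 0; buf_index picks
     * this entry's header slot within it */
    sqe->buf_index = ent_idx;
    sqe->uring_cmd_flags |= IORING_URING_CMD_FIXED;

    req->qid = qid;
    req->init.use_bufring = true; /* per-queue opt-in at registration */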
fs/fuse/dev_uring.c | 452 +++++++++++++++++++++++++++++++++-----
fs/fuse/dev_uring_i.h | 35 ++-
include/uapi/linux/fuse.h | 12 +-
3 files changed, 437 insertions(+), 62 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index b57871f92d08..3600892ba837 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -10,6 +10,8 @@
#include "fuse_trace.h"
#include <linux/fs.h>
+#include <linux/io_uring.h>
+#include <linux/io_uring/buf.h>
#include <linux/io_uring/cmd.h>
static bool __read_mostly enable_uring;
@@ -19,6 +21,8 @@ MODULE_PARM_DESC(enable_uring,
#define FUSE_URING_IOV_SEGS 2 /* header and payload */
+#define FUSE_URING_RINGBUF_GROUP 0
+#define FUSE_URING_FIXED_HEADERS_INDEX 0
bool fuse_uring_enabled(void)
{
@@ -194,6 +198,37 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
return false;
}
+static void fuse_uring_teardown_buffers(struct fuse_ring_queue *queue,
+ unsigned int issue_flags)
+{
+ if (!queue->use_bufring)
+ return;
+
+ spin_lock(&queue->lock);
+
+ if (queue->ring_killed) {
+ spin_unlock(&queue->lock);
+ return;
+ }
+
+ /*
+ * Try to get a reference on it so the ctx isn't killed while we're
+ * unpinning
+ */
+ if (!percpu_ref_tryget_live(&queue->ring_ctx->refs)) {
+ spin_unlock(&queue->lock);
+ return;
+ }
+
+ spin_unlock(&queue->lock);
+
+ WARN_ON_ONCE(io_uring_buf_table_unpin(queue->ring_ctx, issue_flags));
+ WARN_ON_ONCE(io_uring_buf_ring_unpin(queue->ring_ctx,
+ FUSE_URING_RINGBUF_GROUP,
+ issue_flags));
+ percpu_ref_put(&queue->ring_ctx->refs);
+}
+
void fuse_uring_destruct(struct fuse_conn *fc)
{
struct fuse_ring *ring = fc->ring;
@@ -276,20 +311,76 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
return res;
}
-static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
- int qid)
+static void io_ring_killed(void *priv)
+{
+ struct fuse_ring_queue *queue = (struct fuse_ring_queue *)priv;
+
+ spin_lock(&queue->lock);
+ queue->ring_killed = true;
+ spin_unlock(&queue->lock);
+}
+
+static int fuse_uring_buf_ring_setup(struct io_uring_cmd *cmd,
+ struct fuse_ring_queue *queue,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ring_ctx = cmd_to_io_kiocb(cmd)->ctx;
+ int err;
+
+ err = io_uring_buf_ring_pin(ring_ctx, FUSE_URING_RINGBUF_GROUP,
+ issue_flags, &queue->bufring);
+ if (err)
+ return err;
+
+ if (!io_uring_is_kmbuf_ring(ring_ctx, FUSE_URING_RINGBUF_GROUP,
+ issue_flags)) {
+ err = -EINVAL;
+ goto error;
+ }
+
+ err = io_uring_buf_table_pin(ring_ctx, issue_flags);
+ if (err)
+ goto error;
+
+ err = io_uring_cmd_import_fixed_index(cmd,
+ FUSE_URING_FIXED_HEADERS_INDEX,
+ ITER_DEST, &queue->headers_iter,
+ issue_flags);
+ if (err) {
+ io_uring_buf_table_unpin(ring_ctx, issue_flags);
+ goto error;
+ }
+
+ io_uring_set_release_callback(ring_ctx, io_ring_killed, queue,
+ issue_flags);
+ queue->ring_ctx = ring_ctx;
+
+ queue->use_bufring = true;
+
+ return 0;
+
+error:
+ io_uring_buf_ring_unpin(ring_ctx, FUSE_URING_RINGBUF_GROUP,
+ issue_flags);
+ return err;
+}
+
+static struct fuse_ring_queue *
+fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
+ int qid, bool use_bufring, unsigned int issue_flags)
{
struct fuse_conn *fc = ring->fc;
struct fuse_ring_queue *queue;
struct list_head *pq;
+ int err = 0;
queue = kzalloc(sizeof(*queue), GFP_KERNEL_ACCOUNT);
if (!queue)
- return NULL;
+ return ERR_PTR(-ENOMEM);
pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL);
if (!pq) {
kfree(queue);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
queue->qid = qid;
@@ -307,6 +398,15 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
queue->fpq.processing = pq;
fuse_pqueue_init(&queue->fpq);
+ if (use_bufring) {
+ err = fuse_uring_buf_ring_setup(cmd, queue, issue_flags);
+ if (err) {
+ kfree(pq);
+ kfree(queue);
+ return ERR_PTR(err);
+ }
+ }
+
spin_lock(&fc->lock);
if (ring->queues[qid]) {
spin_unlock(&fc->lock);
@@ -452,6 +552,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
continue;
fuse_uring_teardown_entries(queue);
+ fuse_uring_teardown_buffers(queue, IO_URING_F_UNLOCKED);
}
/*
@@ -487,6 +588,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
continue;
fuse_uring_teardown_entries(queue);
+ fuse_uring_teardown_buffers(queue, IO_URING_F_UNLOCKED);
}
if (atomic_read(&ring->queue_refs) > 0) {
@@ -584,6 +686,35 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
return err;
}
+static int get_kernel_ring_header(struct fuse_ring_ent *ent,
+ enum fuse_uring_header_type type,
+ struct iov_iter *headers_iter)
+{
+ size_t offset;
+
+ switch (type) {
+ case FUSE_URING_HEADER_IN_OUT:
+ /* No offset - start of header */
+ offset = 0;
+ break;
+ case FUSE_URING_HEADER_OP:
+ offset = offsetof(struct fuse_uring_req_header, op_in);
+ break;
+ case FUSE_URING_HEADER_RING_ENT:
+ offset = offsetof(struct fuse_uring_req_header, ring_ent_in_out);
+ break;
+ default:
+ WARN_ONCE(1, "Invalid header type: %d\n", type);
+ return -EINVAL;
+ }
+
+ *headers_iter = ent->headers_iter;
+ if (offset)
+ iov_iter_advance(headers_iter, offset);
+
+ return 0;
+}
+
static void __user *get_user_ring_header(struct fuse_ring_ent *ent,
enum fuse_uring_header_type type)
{
@@ -605,17 +736,38 @@ static __always_inline int copy_header_to_ring(struct fuse_ring_ent *ent,
const void *header,
size_t header_size)
{
- void __user *ring = get_user_ring_header(ent, type);
+ bool use_bufring = ent->queue->use_bufring;
+ int err = 0;
- if (!ring)
- return -EINVAL;
+ if (use_bufring) {
+ struct iov_iter iter;
+
+ err = get_kernel_ring_header(ent, type, &iter);
+ if (err)
+ goto done;
- if (copy_to_user(ring, header, header_size)) {
- pr_info_ratelimited("Copying header to ring failed.\n");
- return -EFAULT;
+ if (copy_to_iter(header, header_size, &iter) != header_size)
+ err = -EFAULT;
+ } else {
+ void __user *ring = get_user_ring_header(ent, type);
+
+ if (!ring) {
+ err = -EINVAL;
+ goto done;
+ }
+
+ if (copy_to_user(ring, header, header_size))
+ err = -EFAULT;
}
- return 0;
+done:
+ if (err)
+ pr_info_ratelimited("Copying header to ring failed: "
+ "header_type=%u, header_size=%lu, "
+ "use_bufring=%d\n", type, header_size,
+ use_bufring);
+
+ return err;
}
static __always_inline int copy_header_from_ring(struct fuse_ring_ent *ent,
@@ -623,17 +775,38 @@ static __always_inline int copy_header_from_ring(struct fuse_ring_ent *ent,
void *header,
size_t header_size)
{
- const void __user *ring = get_user_ring_header(ent, type);
+ bool use_bufring = ent->queue->use_bufring;
+ int err = 0;
- if (!ring)
- return -EINVAL;
+ if (use_bufring) {
+ struct iov_iter iter;
- if (copy_from_user(header, ring, header_size)) {
- pr_info_ratelimited("Copying header from ring failed.\n");
- return -EFAULT;
+ err = get_kernel_ring_header(ent, type, &iter);
+ if (err)
+ goto done;
+
+ if (copy_from_iter(header, header_size, &iter) != header_size)
+ err = -EFAULT;
+ } else {
+ const void __user *ring = get_user_ring_header(ent, type);
+
+ if (!ring) {
+ err = -EINVAL;
+ goto done;
+ }
+
+ if (copy_from_user(header, ring, header_size))
+ err = -EFAULT;
}
- return 0;
+done:
+ if (err)
+ pr_info_ratelimited("Copying header from ring failed: "
+ "header_type=%u, header_size=%lu, "
+ "use_bufring=%d\n", type, header_size,
+ use_bufring);
+
+ return err;
}
static int setup_fuse_copy_state(struct fuse_copy_state *cs,
@@ -643,14 +816,23 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
{
int err;
- err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
- if (err) {
- pr_info_ratelimited("fuse: Import of user buffer failed\n");
- return err;
+ if (!ent->queue->use_bufring) {
+ err = import_ubuf(dir, ent->payload, ring->max_payload_sz, iter);
+ if (err) {
+ pr_info_ratelimited("fuse: Import of user buffer "
+ "failed\n");
+ return err;
+ }
}
fuse_copy_init(cs, dir == ITER_DEST, iter);
+ if (ent->queue->use_bufring) {
+ cs->is_kaddr = true;
+ cs->len = ent->payload_kvec.iov_len;
+ cs->kaddr = ent->payload_kvec.iov_base;
+ }
+
cs->is_uring = true;
cs->req = req;
@@ -762,6 +944,108 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
sizeof(req->in.h));
}
+static bool fuse_uring_req_has_payload(struct fuse_req *req)
+{
+ struct fuse_args *args = req->args;
+
+ return args->in_numargs > 1 || args->out_numargs;
+}
+
+static int fuse_uring_select_buffer(struct fuse_ring_ent *ent,
+ unsigned int issue_flags)
+ __must_hold(&queue->lock)
+{
+ struct io_br_sel sel;
+ size_t len = 0;
+
+ lockdep_assert_held(&ent->queue->lock);
+
+ /* Get a buffer to use for the payload */
+ sel = io_ring_buffer_select(cmd_to_io_kiocb(ent->cmd), &len,
+ ent->queue->bufring, issue_flags);
+ if (sel.val)
+ return sel.val;
+ if (!sel.kaddr)
+ return -ENOENT;
+
+ ent->payload_kvec.iov_base = sel.kaddr;
+ ent->payload_kvec.iov_len = len;
+ ent->ringbuf_buf_id = sel.buf_id;
+
+ return 0;
+}
+
+static int fuse_uring_clean_up_buffer(struct fuse_ring_ent *ent,
+ struct io_uring_cmd *cmd)
+ __must_hold(&queue->lock)
+{
+ struct kvec *kvec = &ent->payload_kvec;
+ int err;
+
+ lockdep_assert_held(&ent->queue->lock);
+
+ if (!ent->queue->use_bufring)
+ return 0;
+
+ if (kvec->iov_base) {
+ err = io_uring_kmbuf_recycle_pinned(cmd_to_io_kiocb(ent->cmd),
+ ent->queue->bufring,
+ (u64)kvec->iov_base,
+ kvec->iov_len,
+ ent->ringbuf_buf_id);
+ if (WARN_ON_ONCE(err))
+ return err;
+ memset(kvec, 0, sizeof(*kvec));
+ }
+
+ return 0;
+}
+
+static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
+ struct fuse_req *req,
+ unsigned int issue_flags)
+{
+ bool buffer_selected;
+ bool has_payload;
+
+ if (!ent->queue->use_bufring)
+ return 0;
+
+ ent->headers_iter.data_source = false;
+
+ buffer_selected = ent->payload_kvec.iov_base != NULL;
+ has_payload = fuse_uring_req_has_payload(req);
+
+ if (has_payload && !buffer_selected)
+ return fuse_uring_select_buffer(ent, issue_flags);
+
+ if (!has_payload && buffer_selected)
+ fuse_uring_clean_up_buffer(ent, ent->cmd);
+
+ return 0;
+}
+
+static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
+ struct fuse_req *req, unsigned int dir,
+ unsigned issue_flags)
+{
+ if (!ent->queue->use_bufring)
+ return 0;
+
+ if (dir == ITER_SOURCE) {
+ ent->headers_iter.data_source = true;
+ return 0;
+ }
+
+ ent->headers_iter.data_source = false;
+
+ /* no payload to copy, can skip selecting a buffer */
+ if (!fuse_uring_req_has_payload(req))
+ return 0;
+
+ return fuse_uring_select_buffer(ent, issue_flags);
+}
+
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
struct fuse_req *req)
{
@@ -824,7 +1108,8 @@ static void fuse_uring_add_req_to_ring_ent(struct fuse_ring_ent *ent,
}
/* Fetch the next fuse request if available */
-static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
+static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent,
+ unsigned int issue_flags)
__must_hold(&queue->lock)
{
struct fuse_req *req;
@@ -835,8 +1120,13 @@ static struct fuse_req *fuse_uring_ent_assign_req(struct fuse_ring_ent *ent)
/* get and assign the next entry while it is still holding the lock */
req = list_first_entry_or_null(req_queue, struct fuse_req, list);
- if (req)
+ if (req) {
+ if (fuse_uring_next_req_update_buffer(ent, req, issue_flags))
+ return NULL;
fuse_uring_add_req_to_ring_ent(ent, req);
+ } else {
+ fuse_uring_clean_up_buffer(ent, ent->cmd);
+ }
return req;
}
@@ -878,7 +1168,8 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
* Else, there is no next fuse request and this returns false.
*/
static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
- struct fuse_ring_queue *queue)
+ struct fuse_ring_queue *queue,
+ unsigned int issue_flags)
{
int err;
struct fuse_req *req;
@@ -886,7 +1177,7 @@ static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
retry:
spin_lock(&queue->lock);
fuse_uring_ent_avail(ent, queue);
- req = fuse_uring_ent_assign_req(ent);
+ req = fuse_uring_ent_assign_req(ent, issue_flags);
spin_unlock(&queue->lock);
if (req) {
@@ -990,7 +1281,12 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
/* without the queue lock, as other locks are taken */
fuse_uring_prepare_cancel(cmd, issue_flags, ent);
- fuse_uring_commit(ent, req, issue_flags);
+
+ err = fuse_uring_prep_buffer(ent, req, ITER_SOURCE, issue_flags);
+ if (WARN_ON_ONCE(err))
+ fuse_uring_req_end(ent, req, err);
+ else
+ fuse_uring_commit(ent, req, issue_flags);
/*
* Fetching the next request is absolutely required as queued
@@ -998,7 +1294,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
* and fetching is done in one step vs legacy fuse, which has separated
* read (fetch request) and write (commit result).
*/
- if (fuse_uring_get_next_fuse_req(ent, queue))
+ if (fuse_uring_get_next_fuse_req(ent, queue, issue_flags))
fuse_uring_send(ent, cmd, 0, issue_flags);
return 0;
}
@@ -1094,39 +1390,64 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
struct iovec iov[FUSE_URING_IOV_SEGS];
int err;
+ err = -ENOMEM;
+ ent = kzalloc(sizeof(*ent), GFP_KERNEL_ACCOUNT);
+ if (!ent)
+ return ERR_PTR(err);
+
+ INIT_LIST_HEAD(&ent->list);
+
+ ent->queue = queue;
+
+ err = -EINVAL;
+ if (queue->use_bufring) {
+ size_t header_size = sizeof(struct fuse_uring_req_header);
+ u16 buf_index;
+
+ if (!(cmd->flags & IORING_URING_CMD_FIXED))
+ goto error;
+
+ buf_index = READ_ONCE(cmd->sqe->buf_index);
+
+ /* set up the headers */
+ ent->headers_iter = queue->headers_iter;
+ iov_iter_advance(&ent->headers_iter, buf_index * header_size);
+ iov_iter_truncate(&ent->headers_iter, header_size);
+ if (iov_iter_count(&ent->headers_iter) != header_size)
+ goto error;
+
+ atomic_inc(&ring->queue_refs);
+ return ent;
+ }
+
err = fuse_uring_get_iovec_from_sqe(cmd->sqe, iov);
if (err) {
pr_info_ratelimited("Failed to get iovec from sqe, err=%d\n",
err);
- return ERR_PTR(err);
+ goto error;
}
err = -EINVAL;
if (iov[0].iov_len < sizeof(struct fuse_uring_req_header)) {
pr_info_ratelimited("Invalid header len %zu\n", iov[0].iov_len);
- return ERR_PTR(err);
+ goto error;
}
payload_size = iov[1].iov_len;
if (payload_size < ring->max_payload_sz) {
pr_info_ratelimited("Invalid req payload len %zu\n",
payload_size);
- return ERR_PTR(err);
+ goto error;
}
-
- err = -ENOMEM;
- ent = kzalloc(sizeof(*ent), GFP_KERNEL_ACCOUNT);
- if (!ent)
- return ERR_PTR(err);
-
- INIT_LIST_HEAD(&ent->list);
-
- ent->queue = queue;
ent->headers = iov[0].iov_base;
ent->payload = iov[1].iov_base;
atomic_inc(&ring->queue_refs);
return ent;
+
+error:
+ kfree(ent);
+ return ERR_PTR(err);
}
/*
@@ -1137,6 +1458,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
unsigned int issue_flags, struct fuse_conn *fc)
{
const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
+ bool use_bufring = READ_ONCE(cmd_req->init.use_bufring);
struct fuse_ring *ring = smp_load_acquire(&fc->ring);
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent;
@@ -1157,9 +1479,13 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
queue = ring->queues[qid];
if (!queue) {
- queue = fuse_uring_create_queue(ring, qid);
- if (!queue)
- return err;
+ queue = fuse_uring_create_queue(cmd, ring, qid, use_bufring,
+ issue_flags);
+ if (IS_ERR(queue))
+ return PTR_ERR(queue);
+ } else {
+ if (queue->use_bufring != use_bufring)
+ return -EINVAL;
}
/*
@@ -1263,7 +1589,8 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
if (!tw.cancel) {
err = fuse_uring_prepare_send(ent, ent->fuse_req);
if (err) {
- if (!fuse_uring_get_next_fuse_req(ent, queue))
+ if (!fuse_uring_get_next_fuse_req(ent, queue,
+ issue_flags))
return;
err = 0;
}
@@ -1325,14 +1652,20 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
req->ring_queue = queue;
ent = list_first_entry_or_null(&queue->ent_avail_queue,
struct fuse_ring_ent, list);
- if (ent)
- fuse_uring_add_req_to_ring_ent(ent, req);
- else
- list_add_tail(&req->list, &queue->fuse_req_queue);
- spin_unlock(&queue->lock);
+ if (ent) {
+ err = fuse_uring_prep_buffer(ent, req, ITER_DEST,
+ IO_URING_F_UNLOCKED);
+ if (!err) {
+ fuse_uring_add_req_to_ring_ent(ent, req);
+ spin_unlock(&queue->lock);
+ fuse_uring_dispatch_ent(ent);
+ return;
+ }
+ WARN_ON_ONCE(err != -ENOENT);
+ }
- if (ent)
- fuse_uring_dispatch_ent(ent);
+ list_add_tail(&req->list, &queue->fuse_req_queue);
+ spin_unlock(&queue->lock);
return;
@@ -1350,6 +1683,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
struct fuse_ring *ring = fc->ring;
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent = NULL;
+ int err;
queue = fuse_uring_task_to_queue(ring);
if (!queue)
@@ -1382,14 +1716,16 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
req = list_first_entry_or_null(&queue->fuse_req_queue, struct fuse_req,
list);
if (ent && req) {
- fuse_uring_add_req_to_ring_ent(ent, req);
- spin_unlock(&queue->lock);
-
- fuse_uring_dispatch_ent(ent);
- } else {
- spin_unlock(&queue->lock);
+ err = fuse_uring_prep_buffer(ent, req, ITER_DEST,
+ IO_URING_F_UNLOCKED);
+ if (!err) {
+ fuse_uring_add_req_to_ring_ent(ent, req);
+ spin_unlock(&queue->lock);
+ fuse_uring_dispatch_ent(ent);
+ return true;
+ }
}
-
+ spin_unlock(&queue->lock);
return true;
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce1..a8a849c3497e 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -7,6 +7,8 @@
#ifndef _FS_FUSE_DEV_URING_I_H
#define _FS_FUSE_DEV_URING_I_H
+#include <linux/uio.h>
+
#include "fuse_i.h"
#ifdef CONFIG_FUSE_IO_URING
@@ -38,9 +40,24 @@ enum fuse_ring_req_state {
/** A fuse ring entry, part of the ring queue */
struct fuse_ring_ent {
- /* userspace buffer */
- struct fuse_uring_req_header __user *headers;
- void __user *payload;
+ union {
+ /* queue->use_bufring == false */
+ struct {
+ /* userspace buffers */
+ struct fuse_uring_req_header __user *headers;
+ void __user *payload;
+ };
+ /* queue->use_bufring == true */
+ struct {
+ struct iov_iter headers_iter;
+ struct kvec payload_kvec;
+ /*
+ * This needs to be tracked in order to properly recycle
+ * the buffer when done with it
+ */
+ unsigned int ringbuf_buf_id;
+ };
+ };
/* the ring queue that owns the request */
struct fuse_ring_queue *queue;
@@ -99,6 +116,18 @@ struct fuse_ring_queue {
unsigned int active_background;
bool stopped;
+
+ bool ring_killed : 1;
+
+ /* true if kernel-managed buffer ring is used */
+ bool use_bufring: 1;
+
+ /* the below fields are only used if the bufring is used */
+ struct io_ring_ctx *ring_ctx;
+ /* iter for the headers buffer for all the ents */
+ struct iov_iter headers_iter;
+ /* synchronized by the queue lock */
+ struct io_buffer_list *bufring;
};
/**
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12..3041177e3dd8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,9 @@
* - add FUSE_COPY_FILE_RANGE_64
* - add struct fuse_copy_file_range_out
* - add FUSE_NOTIFY_PRUNE
+ *
+ * 7.46
+ * - add fuse_uring_cmd_req use_bufring
*/
#ifndef _LINUX_FUSE_H
@@ -1305,7 +1308,14 @@ struct fuse_uring_cmd_req {
/* queue the command is for (queue index) */
uint16_t qid;
- uint8_t padding[6];
+
+ union {
+ struct {
+ bool use_bufring;
+ } init;
+ };
+
+ uint8_t padding[5];
};
#endif /* _LINUX_FUSE_H */
--
2.47.3
* [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec()
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (20 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 21/30] fuse: add io-uring kernel-managed buffer ring Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-07 8:33 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic Joanne Koong
` (7 subsequent siblings)
29 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Changes:
- Rename io_buffer_register_bvec() to io_buffer_register_request()
- Rename io_buffer_unregister_bvec() to io_buffer_unregister()
- Add cmd wrappers for io_buffer_register_request() and
io_buffer_unregister() for ublk to use
This is in preparation for supporting kernel-populated buffers in fuse
io-uring, which will need to register bvecs directly (not through a
block-based request) and will need to do unregistration through an
io_ring_ctx directly.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
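The cmd wrappers are thin; a plausible shape for the uring_cmd.c
additions (sketch, not the exact hunk), resolving the ctx the same way
the old io_buffer_register_bvec() did:

    int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
                                             struct request *rq,
                                             void (*release)(void *),
                                             unsigned int index,
                                             unsigned int issue_flags)
    {
        struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;

        return io_buffer_register_request(ctx, rq, release, index,
                                          issue_flags);
    }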
Documentation/block/ublk.rst | 15 ++++++++-------
drivers/block/ublk_drv.c | 20 +++++++++++---------
include/linux/io_uring/cmd.h | 13 ++++++++-----
io_uring/rsrc.c | 14 +++++---------
io_uring/rsrc.h | 7 +++++++
io_uring/uring_cmd.c | 21 +++++++++++++++++++++
6 files changed, 60 insertions(+), 30 deletions(-)
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 8c4030bcabb6..1546477e768b 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -326,16 +326,17 @@ Zero copy
---------
ublk zero copy relies on io_uring's fixed kernel buffer, which provides
-two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec`.
+two APIs: `io_uring_cmd_buffer_register_request()` and
+`io_uring_cmd_buffer_unregister`.
ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call
-`io_buffer_register_bvec()` for ublk server to register client request
-buffer into io_uring buffer table, then ublk server can submit io_uring
+`io_uring_cmd_buffer_register_request()` for ublk server to register client
+request buffer into io_uring buffer table, then ublk server can submit io_uring
IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF`
-calls `io_buffer_unregister_bvec()` to unregister the buffer, which is
-guaranteed to be live between calling `io_buffer_register_bvec()` and
-`io_buffer_unregister_bvec()`. Any io_uring operation which supports this
-kind of kernel buffer will grab one reference of the buffer until the
+calls `io_uring_cmd_buffer_unregister()` to unregister the buffer, which is
+guaranteed to be live between calling `io_uring_cmd_buffer_register_request()`
+and `io_uring_cmd_buffer_unregister()`. Any io_uring operation which supports
+this kind of kernel buffer will grab one reference of the buffer until the
operation is completed.
ublk server implementing zero copy or user copy has to be CAP_SYS_ADMIN and
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e0c601128efa..d671d08533c9 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1246,8 +1246,9 @@ static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
{
int ret;
- ret = io_buffer_register_bvec(io->cmd, req, ublk_io_release,
- io->buf.index, issue_flags);
+ ret = io_uring_cmd_buffer_register_request(io->cmd, req,
+ ublk_io_release,
+ io->buf.index, issue_flags);
if (ret) {
if (io->buf.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
ublk_auto_buf_reg_fallback(ubq, io);
@@ -2204,8 +2205,8 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
if (!req)
return -EINVAL;
- ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
- issue_flags);
+ ret = io_uring_cmd_buffer_register_request(cmd, req, ublk_io_release,
+ index, issue_flags);
if (ret) {
ublk_put_req_ref(io, req);
return ret;
@@ -2236,8 +2237,8 @@ ublk_daemon_register_io_buf(struct io_uring_cmd *cmd,
if (!ublk_dev_support_zero_copy(ub) || !ublk_rq_has_data(req))
return -EINVAL;
- ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
- issue_flags);
+ ret = io_uring_cmd_buffer_register_request(cmd, req, ublk_io_release,
+ index, issue_flags);
if (ret)
return ret;
@@ -2252,7 +2253,7 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
if (!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY))
return -EINVAL;
- return io_buffer_unregister_bvec(cmd, index, issue_flags);
+ return io_uring_cmd_buffer_unregister(cmd, index, issue_flags);
}
static int ublk_check_fetch_buf(const struct ublk_device *ub, __u64 buf_addr)
@@ -2386,7 +2387,7 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
goto out;
/*
- * io_buffer_unregister_bvec() doesn't access the ubq or io,
+ * io_uring_cmd_buffer_unregister() doesn't access the ubq or io,
* so no need to validate the q_id, tag, or task
*/
if (_IOC_NR(cmd_op) == UBLK_IO_UNREGISTER_IO_BUF)
@@ -2456,7 +2457,8 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
/* can't touch 'ublk_io' any more */
if (buf_idx != UBLK_INVALID_BUF_IDX)
- io_buffer_unregister_bvec(cmd, buf_idx, issue_flags);
+ io_uring_cmd_buffer_unregister(cmd, buf_idx,
+ issue_flags);
if (req_op(req) == REQ_OP_ZONE_APPEND)
req->__sector = addr;
if (compl)
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 795b846d1e11..fc956f8f7ed2 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -185,10 +185,13 @@ static inline void io_uring_cmd_done32(struct io_uring_cmd *ioucmd, s32 ret,
return __io_uring_cmd_done(ioucmd, ret, res2, issue_flags, true);
}
-int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
- void (*release)(void *), unsigned int index,
- unsigned int issue_flags);
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
- unsigned int issue_flags);
+int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
+ struct request *rq,
+ void (*release)(void *),
+ unsigned int index,
+ unsigned int issue_flags);
+
+int io_uring_cmd_buffer_unregister(struct io_uring_cmd *cmd, unsigned int index,
+ unsigned int issue_flags);
#endif /* _LINUX_IO_URING_CMD_H */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b6dd62118311..59cafe63d187 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -941,11 +941,10 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
-int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
- void (*release)(void *), unsigned int index,
- unsigned int issue_flags)
+int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
+ void (*release)(void *), unsigned int index,
+ unsigned int issue_flags)
{
- struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
struct io_rsrc_data *data = &ctx->buf_table;
struct req_iterator rq_iter;
struct io_mapped_ubuf *imu;
@@ -1003,12 +1002,10 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
io_ring_submit_unlock(ctx, issue_flags);
return ret;
}
-EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
- unsigned int issue_flags)
+int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
+ unsigned int issue_flags)
{
- struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
struct io_rsrc_data *data = &ctx->buf_table;
struct io_rsrc_node *node;
int ret = 0;
@@ -1036,7 +1033,6 @@ int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
io_ring_submit_unlock(ctx, issue_flags);
return ret;
}
-EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
static int validate_fixed_range(u64 buf_addr, size_t len,
const struct io_mapped_ubuf *imu)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 658934f4d3ff..d1ca33f3319a 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -91,6 +91,13 @@ int io_validate_user_buf_range(u64 uaddr, u64 ulen);
bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
struct io_imu_folio_data *data);
+int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
+ void (*release)(void *), unsigned int index,
+ unsigned int issue_flags);
+
+int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
+ unsigned int issue_flags);
+
static inline struct io_rsrc_node *io_rsrc_node_lookup(struct io_rsrc_data *data,
int index)
{
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 3eb10bbba177..3922ac86b481 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -383,6 +383,27 @@ struct io_br_sel io_uring_cmd_buffer_select(struct io_uring_cmd *ioucmd,
}
EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_select);
+int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
+ struct request *rq,
+ void (*release)(void *),
+ unsigned int index,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+
+ return io_buffer_register_request(ctx, rq, release, index, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_register_request);
+
+int io_uring_cmd_buffer_unregister(struct io_uring_cmd *cmd, unsigned int index,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+
+ return io_buffer_unregister(ctx, index, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_unregister);
+
/*
* Return true if this multishot uring_cmd needs to be completed, otherwise
* the event CQE is posted successfully.
--
2.47.3
* [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (21 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec() Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-07 8:41 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional Joanne Koong
` (6 subsequent siblings)
29 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Split the main initialization logic in io_buffer_register_request() into
a helper function.
This is a preparatory patch for supporting kernel-populated buffers in
fuse io-uring, which will reuse this logic.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
io_uring/rsrc.c | 80 +++++++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 32 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 59cafe63d187..18abba6f6b86 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -941,63 +941,79 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
-int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
- void (*release)(void *), unsigned int index,
- unsigned int issue_flags)
+static int io_buffer_init(struct io_ring_ctx *ctx, unsigned int nr_bvecs,
+ unsigned int total_bytes, u8 dir,
+ void (*release)(void *), void *priv,
+ unsigned int index)
{
struct io_rsrc_data *data = &ctx->buf_table;
- struct req_iterator rq_iter;
struct io_mapped_ubuf *imu;
struct io_rsrc_node *node;
- struct bio_vec bv;
- unsigned int nr_bvecs = 0;
- int ret = 0;
- io_ring_submit_lock(ctx, issue_flags);
- if (index >= data->nr) {
- ret = -EINVAL;
- goto unlock;
- }
+ if (index >= data->nr)
+ return -EINVAL;
index = array_index_nospec(index, data->nr);
- if (data->nodes[index]) {
- ret = -EBUSY;
- goto unlock;
- }
+ if (data->nodes[index])
+ return -EBUSY;
node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
- if (!node) {
- ret = -ENOMEM;
- goto unlock;
- }
+ if (!node)
+ return -ENOMEM;
- /*
- * blk_rq_nr_phys_segments() may overestimate the number of bvecs
- * but avoids needing to iterate over the bvecs
- */
- imu = io_alloc_imu(ctx, blk_rq_nr_phys_segments(rq));
+ imu = io_alloc_imu(ctx, nr_bvecs);
if (!imu) {
kfree(node);
- ret = -ENOMEM;
- goto unlock;
+ return -ENOMEM;
}
imu->ubuf = 0;
- imu->len = blk_rq_bytes(rq);
+ imu->len = total_bytes;
imu->acct_pages = 0;
imu->folio_shift = PAGE_SHIFT;
+ imu->nr_bvecs = nr_bvecs;
refcount_set(&imu->refs, 1);
imu->release = release;
- imu->priv = rq;
+ imu->priv = priv;
imu->is_kbuf = true;
- imu->dir = 1 << rq_data_dir(rq);
+ imu->dir = 1 << dir;
+
+ node->buf = imu;
+ data->nodes[index] = node;
+
+ return 0;
+}
+
+int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
+ void (*release)(void *), unsigned int index,
+ unsigned int issue_flags)
+{
+ struct req_iterator rq_iter;
+ struct io_mapped_ubuf *imu;
+ struct bio_vec bv;
+ unsigned int nr_bvecs;
+ unsigned int total_bytes;
+ int ret;
+
+ io_ring_submit_lock(ctx, issue_flags);
+
+ /*
+ * blk_rq_nr_phys_segments() may overestimate the number of bvecs
+ * but avoids needing to iterate over the bvecs
+ */
+ nr_bvecs = blk_rq_nr_phys_segments(rq);
+ total_bytes = blk_rq_bytes(rq);
+ ret = io_buffer_init(ctx, nr_bvecs, total_bytes, rq_data_dir(rq), release, rq,
+ index);
+ if (ret)
+ goto unlock;
+ imu = ctx->buf_table.nodes[index]->buf;
+ nr_bvecs = 0;
rq_for_each_bvec(bv, rq, rq_iter)
imu->bvec[nr_bvecs++] = bv;
imu->nr_bvecs = nr_bvecs;
- node->buf = imu;
- data->nodes[index] = node;
unlock:
io_ring_submit_unlock(ctx, issue_flags);
return ret;
--
2.47.3
* [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (22 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-07 8:42 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 25/30] io_uring/rsrc: add io_buffer_register_bvec() Joanne Koong
` (5 subsequent siblings)
29 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
This is a preparatory patch for supporting kernel-populated buffers in
fuse io-uring, which does not need a release callback.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
io_uring/rsrc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18abba6f6b86..a5605c35d857 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -149,7 +149,8 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
if (imu->acct_pages)
io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
- imu->release(imu->priv);
+ if (imu->release)
+ imu->release(imu->priv);
io_free_imu(ctx, imu);
}
--
2.47.3
* [PATCH v1 25/30] io_uring/rsrc: add io_buffer_register_bvec()
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (23 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 26/30] io_uring/rsrc: export io_buffer_unregister Joanne Koong
` (4 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add io_buffer_register_bvec() for registering a bvec array.
This is a preparatory patch for fuse-over-io-uring zero-copy.
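A minimal usage sketch (caller-side, in the spirit of the fuse caller
added later in this series; page0/page1, the byte counts, and the index
are made up for illustration):

	struct bio_vec bvs[2] = {
		{ .bv_page = page0, .bv_offset = 0, .bv_len = PAGE_SIZE },
		{ .bv_page = page1, .bv_offset = 0, .bv_len = 512 },
	};
	int err;

	/* Register the pages at slot `index` of ctx's fixed buffer table;
	 * ITER_SOURCE marks the data as a source for fixed-buffer ops. */
	err = io_buffer_register_bvec(ctx, bvs, ARRAY_SIZE(bvs),
				      PAGE_SIZE + 512, ITER_SOURCE,
				      index, issue_flags);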
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 14 ++++++++++++++
io_uring/rsrc.c | 25 +++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index 2b49c01fe2f5..ff6b81bb95e5 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -23,6 +23,11 @@ bool io_uring_is_kmbuf_ring(struct io_ring_ctx *ctx, unsigned int buf_group,
struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
struct io_buffer_list *bl,
unsigned int issue_flags);
+
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct bio_vec *bvs,
+ unsigned int nr_bvecs, unsigned int total_bytes,
+ u8 dir, unsigned int index,
+ unsigned int issue_flags);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -70,6 +75,15 @@ static inline struct io_br_sel io_ring_buffer_select(struct io_kiocb *req,
return sel;
}
+static inline int io_buffer_register_bvec(struct io_ring_ctx *ctx,
+ struct bio_vec *bvs,
+ unsigned int nr_bvecs,
+ unsigned int total_bytes, u8 dir,
+ unsigned int index,
+ unsigned int issue_flags)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index a5605c35d857..7358f153d136 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1020,6 +1020,31 @@ int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
return ret;
}
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct bio_vec *bvs,
+ unsigned int nr_bvecs, unsigned int total_bytes,
+ u8 dir, unsigned int index,
+ unsigned int issue_flags)
+{
+ struct io_rsrc_data *data = &ctx->buf_table;
+ struct bio_vec *bvec;
+ int ret, i;
+
+ io_ring_submit_lock(ctx, issue_flags);
+ ret = io_buffer_init(ctx, nr_bvecs, total_bytes, dir, NULL, NULL,
+ index);
+ if (ret)
+ goto unlock;
+
+ bvec = data->nodes[index]->buf->bvec;
+ for (i = 0; i < nr_bvecs; i++)
+ bvec[i] = bvs[i];
+
+unlock:
+ io_ring_submit_unlock(ctx, issue_flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
unsigned int issue_flags)
{
--
2.47.3
* [PATCH v1 26/30] io_uring/rsrc: export io_buffer_unregister
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (24 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 25/30] io_uring/rsrc: add io_buffer_register_bvec() Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 27/30] fuse: rename fuse_set_zero_arg0() to fuse_zero_in_arg0() Joanne Koong
` (3 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
This is a preparatory patch for fuse-over-io-uring zero-copy.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
include/linux/io_uring/buf.h | 9 +++++++++
io_uring/rsrc.c | 1 +
io_uring/rsrc.h | 3 ---
io_uring/uring_cmd.c | 1 +
4 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
index ff6b81bb95e5..078e48b63bff 100644
--- a/include/linux/io_uring/buf.h
+++ b/include/linux/io_uring/buf.h
@@ -28,6 +28,8 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct bio_vec *bvs,
unsigned int nr_bvecs, unsigned int total_bytes,
u8 dir, unsigned int index,
unsigned int issue_flags);
+int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
+ unsigned int issue_flags);
#else
static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
unsigned buf_group,
@@ -84,6 +86,13 @@ static inline int io_buffer_register_bvec(struct io_ring_ctx *ctx,
{
return -EOPNOTSUPP;
}
+
+static inline int io_buffer_unregister(struct io_ring_ctx *ctx,
+ unsigned int index,
+ unsigned int issue_flags)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_IO_URING */
#endif /* _LINUX_IO_URING_BUF_H */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 7358f153d136..08634254ab7c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1075,6 +1075,7 @@ int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
io_ring_submit_unlock(ctx, issue_flags);
return ret;
}
+EXPORT_SYMBOL_GPL(io_buffer_unregister);
static int validate_fixed_range(u64 buf_addr, size_t len,
const struct io_mapped_ubuf *imu)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index d1ca33f3319a..c3b21aaaf984 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -95,9 +95,6 @@ int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
void (*release)(void *), unsigned int index,
unsigned int issue_flags);
-int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
- unsigned int issue_flags);
-
static inline struct io_rsrc_node *io_rsrc_node_lookup(struct io_rsrc_data *data,
int index)
{
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 3922ac86b481..dfceec36f101 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -2,6 +2,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/file.h>
+#include <linux/io_uring/buf.h>
#include <linux/io_uring/cmd.h>
#include <linux/security.h>
#include <linux/nospec.h>
--
2.47.3
* [PATCH v1 27/30] fuse: rename fuse_set_zero_arg0() to fuse_zero_in_arg0()
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (25 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 26/30] io_uring/rsrc: export io_buffer_unregister Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 28/30] fuse: enforce op header for every payload reply Joanne Koong
` (2 subsequent siblings)
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
The fuse_set_zero_arg0() function is used for setting a no-op header for
in args, but to support fuse io-uring zero-copy, the first parameter of
outargs will also need to be set to a no-op header if the request
contains a payload but no out header.
Rename fuse_set_zero_arg0() to fuse_zero_in_arg0() to indicate this is
for the in arg header. Later, fuse_zero_out_arg0() will be added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dax.c | 2 +-
fs/fuse/dev.c | 2 +-
fs/fuse/dir.c | 8 ++++----
fs/fuse/fuse_i.h | 2 +-
fs/fuse/xattr.c | 2 +-
5 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ac6d4c1064cc..b4bf586d1fd1 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -240,7 +240,7 @@ static int fuse_send_removemapping(struct inode *inode,
args.opcode = FUSE_REMOVEMAPPING;
args.nodeid = fi->nodeid;
args.in_numargs = 3;
- fuse_set_zero_arg0(&args);
+ fuse_zero_in_arg0(&args);
args.in_args[1].size = sizeof(*inargp);
args.in_args[1].value = inargp;
args.in_args[2].size = inargp->count * sizeof(*remove_one);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 820d02f01b47..7d39c80da554 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1943,7 +1943,7 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
}
ra->inarg.offset = outarg->offset;
ra->inarg.size = total_len;
- fuse_set_zero_arg0(args);
+ fuse_zero_in_arg0(args);
args->in_args[1].size = sizeof(ra->inarg);
args->in_args[1].value = &ra->inarg;
args->in_args[2].size = total_len;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ecaec0fea3a1..b79be8bbbaf8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -176,7 +176,7 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
args->opcode = FUSE_LOOKUP;
args->nodeid = nodeid;
args->in_numargs = 3;
- fuse_set_zero_arg0(args);
+ fuse_zero_in_arg0(args);
args->in_args[1].size = name->len;
args->in_args[1].value = name->name;
args->in_args[2].size = 1;
@@ -943,7 +943,7 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
args.opcode = FUSE_SYMLINK;
args.in_numargs = 3;
- fuse_set_zero_arg0(&args);
+ fuse_zero_in_arg0(&args);
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
args.in_args[2].size = len;
@@ -1008,7 +1008,7 @@ static int fuse_unlink(struct inode *dir, struct dentry *entry)
args.opcode = FUSE_UNLINK;
args.nodeid = get_node_id(dir);
args.in_numargs = 2;
- fuse_set_zero_arg0(&args);
+ fuse_zero_in_arg0(&args);
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
err = fuse_simple_request(fm, &args);
@@ -1032,7 +1032,7 @@ static int fuse_rmdir(struct inode *dir, struct dentry *entry)
args.opcode = FUSE_RMDIR;
args.nodeid = get_node_id(dir);
args.in_numargs = 2;
- fuse_set_zero_arg0(&args);
+ fuse_zero_in_arg0(&args);
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
err = fuse_simple_request(fm, &args);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c2f2a48156d6..34541801d950 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1020,7 +1020,7 @@ struct fuse_mount {
*/
struct fuse_zero_header {};
-static inline void fuse_set_zero_arg0(struct fuse_args *args)
+static inline void fuse_zero_in_arg0(struct fuse_args *args)
{
args->in_args[0].size = sizeof(struct fuse_zero_header);
args->in_args[0].value = NULL;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 93dfb06b6cea..aa0881162287 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -165,7 +165,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
args.opcode = FUSE_REMOVEXATTR;
args.nodeid = get_node_id(inode);
args.in_numargs = 2;
- fuse_set_zero_arg0(&args);
+ fuse_zero_in_arg0(&args);
args.in_args[1].size = strlen(name) + 1;
args.in_args[1].value = name;
err = fuse_simple_request(fm, &args);
--
2.47.3
* [PATCH v1 28/30] fuse: enforce op header for every payload reply
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (26 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 27/30] fuse: rename fuse_set_zero_arg0() to fuse_zero_in_arg0() Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 29/30] fuse: add zero-copy over io-uring Joanne Koong
2025-12-03 0:35 ` [PATCH v1 30/30] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
In order to support fuse io-uring zero-copy, the payload and the headers
for a request/reply must reside in separate buffers, since any
zero-copied payload will be transparent to the daemon, but the headers
need to be accessible.
Currently, a fuse reply can be either:
* arg[0] = op header, arg[1] = payload
* arg[0] = payload
* arg[0] = NULL
Fuse io-uring needs to differentiate between the first two for copying
to/from the ring.
Enforce that all fuse replies that have a payload also have an op
header. If there is only a payload to send in the reply, then the
header will be a zero-size no-op header.
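For example, a read reply that was previously set up as a bare payload:

	args->out_numargs = 1;
	args->out_args[0].size = count;		/* payload */

now carries the no-op header in arg 0 (mirroring the
fuse_read_args_fill() hunk below):

	args->out_numargs = 2;
	fuse_zero_out_arg0(args);		/* zero-size no-op header */
	args->out_args[1].size = count;		/* payload */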
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dir.c | 5 +++--
fs/fuse/file.c | 11 ++++++-----
fs/fuse/fuse_i.h | 6 ++++++
fs/fuse/readdir.c | 2 +-
fs/fuse/xattr.c | 16 ++++++++++------
5 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b79be8bbbaf8..238fa1bab3c9 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1630,8 +1630,9 @@ static int fuse_readlink_folio(struct inode *inode, struct folio *folio)
ap.args.out_pages = true;
ap.args.out_argvar = true;
ap.args.page_zeroing = true;
- ap.args.out_numargs = 1;
- ap.args.out_args[0].size = desc.length;
+ ap.args.out_numargs = 2;
+ fuse_zero_out_arg0(&ap.args);
+ ap.args.out_args[1].size = desc.length;
res = fuse_simple_request(fm, &ap.args);
fuse_invalidate_atime(inode);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ef77a0be05..ff6c287bc4ed 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -581,8 +581,9 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
args->in_args[0].size = sizeof(ia->read.in);
args->in_args[0].value = &ia->read.in;
args->out_argvar = true;
- args->out_numargs = 1;
- args->out_args[0].size = count;
+ args->out_numargs = 2;
+ fuse_zero_out_arg0(args);
+ args->out_args[1].size = count;
}
static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
@@ -711,7 +712,7 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
ia->write.out.size;
}
} else {
- u32 outsize = args->out_args[0].size;
+ u32 outsize = args->out_args[1].size;
nres = outsize;
if (ia->read.in.size != outsize)
@@ -870,7 +871,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
struct fuse_io_args *ia = container_of(args, typeof(*ia), ap.args);
struct fuse_args_pages *ap = &ia->ap;
size_t count = ia->read.in.size;
- size_t num_read = args->out_args[0].size;
+ size_t num_read = args->out_args[1].size;
struct address_space *mapping;
struct inode *inode;
@@ -1506,7 +1507,7 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
if (write)
ap->args.in_args[1].value = user_addr;
else
- ap->args.out_args[0].value = user_addr;
+ ap->args.out_args[1].value = user_addr;
iov_iter_advance(ii, frag_size);
*nbytesp = frag_size;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 34541801d950..e45126d792a6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1026,6 +1026,12 @@ static inline void fuse_zero_in_arg0(struct fuse_args *args)
args->in_args[0].value = NULL;
}
+static inline void fuse_zero_out_arg0(struct fuse_args *args)
+{
+ args->out_args[0].size = sizeof(struct fuse_zero_header);
+ args->out_args[0].value = NULL;
+}
+
static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
{
return sb->s_fs_info;
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index c2aae2eef086..d80cd2bedabe 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -349,7 +349,7 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
if (!buf)
return -ENOMEM;
- args->out_args[0].value = buf;
+ args->out_args[1].value = buf;
plus = fuse_use_readdirplus(inode, ctx);
if (plus) {
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index aa0881162287..4011a99abd52 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -70,12 +70,14 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
args.in_args[1].size = strlen(name) + 1;
args.in_args[1].value = name;
/* This is really two different operations rolled into one */
- args.out_numargs = 1;
if (size) {
args.out_argvar = true;
- args.out_args[0].size = size;
- args.out_args[0].value = value;
+ args.out_numargs = 2;
+ fuse_zero_out_arg0(&args);
+ args.out_args[1].size = size;
+ args.out_args[1].value = value;
} else {
+ args.out_numargs = 1;
args.out_args[0].size = sizeof(outarg);
args.out_args[0].value = &outarg;
}
@@ -132,12 +134,14 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
args.in_args[0].size = sizeof(inarg);
args.in_args[0].value = &inarg;
/* This is really two different operations rolled into one */
- args.out_numargs = 1;
if (size) {
args.out_argvar = true;
- args.out_args[0].size = size;
- args.out_args[0].value = list;
+ args.out_numargs = 2;
+ fuse_zero_out_arg0(&args);
+ args.out_args[1].size = size;
+ args.out_args[1].value = list;
} else {
+ args.out_numargs = 1;
args.out_args[0].size = sizeof(outarg);
args.out_args[0].value = &outarg;
}
--
2.47.3
* [PATCH v1 29/30] fuse: add zero-copy over io-uring
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (27 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 28/30] fuse: enforce op header for every payload reply Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 30/30] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Implement zero-copy data transfer for fuse over io-uring, eliminating
memory copies between kernel and userspace for read/write operations.
This is only allowed on privileged servers and requires the server to
preregister the following:
a) a sparse buffer corresponding to the queue depth
b) a fixed buffer at index queue_depth (the tail of the buffers)
c) a kernel-managed buffer ring
The sparse buffer is where the client's pages reside. The fixed buffer
at the tail is where the headers (struct fuse_uring_req_header) are
placed. The kernel-managed buffer ring is where any non-zero-copied args
reside (e.g. out headers).
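As a rough illustration of the server side (a sketch only, assuming
liburing and made-up names backing_fd, ent_buf_index, len, and
file_off), a zero-copied FUSE_WRITE can be pushed straight to the
backing file with a fixed-buffer write against the registered sparse
entry:

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* The buffer "address" is an offset into the registered kernel
	 * buffer (imu->ubuf is 0 for kernel-provided bvecs), so NULL
	 * means start of the client's pages. */
	io_uring_prep_write_fixed(sqe, backing_fd, NULL, len, file_off,
				  ent_buf_index);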
Benchmarks with bs=1M showed approximately the following differences in
throughput:
direct randreads: ~20% increase (~2100 MB/s -> ~2600 MB/s)
buffered randreads: ~25% increase (~1900 MB/s -> 2400 MB/s)
direct randwrites: no difference (~750 MB/s)
buffered randwrites: ~10% increase (950 MB/s -> 1050 MB/s)
The benchmark was run using fio on the passthrough_hp server:
fio --name=test_run --ioengine=sync --rw=rand{read,write} --bs=1M
--size=1G --numjobs=2 --ramp_time=30 --group_reporting=1
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev.c | 7 +-
fs/fuse/dev_uring.c | 191 ++++++++++++++++++++++++++++++++------
fs/fuse/dev_uring_i.h | 12 +++
fs/fuse/fuse_dev_i.h | 1 +
include/uapi/linux/fuse.h | 5 +-
5 files changed, 187 insertions(+), 29 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 7d39c80da554..0e9c9d006118 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1229,8 +1229,11 @@ int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
for (i = 0; !err && i < numargs; i++) {
struct fuse_arg *arg = &args[i];
- if (i == numargs - 1 && argpages)
- err = fuse_copy_folios(cs, arg->size, zeroing);
+ if (i == numargs - 1 && argpages) {
+ if (cs->skip_folio_copy)
+ return 0;
+ return fuse_copy_folios(cs, arg->size, zeroing);
+ }
else
err = fuse_copy_one(cs, arg->value, arg->size);
}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 3600892ba837..02846203960f 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -89,12 +89,19 @@ static void fuse_uring_flush_bg(struct fuse_ring_queue *queue)
}
}
+static bool can_zero_copy_req(struct fuse_ring_ent *ent, struct fuse_req *req)
+{
+ return ent->queue->use_zero_copy &&
+ (req->args->in_pages || req->args->out_pages);
+}
+
static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
- int error)
+ int error, unsigned issue_flags)
{
struct fuse_ring_queue *queue = ent->queue;
struct fuse_ring *ring = queue->ring;
struct fuse_conn *fc = ring->fc;
+ int err;
lockdep_assert_not_held(&queue->lock);
spin_lock(&queue->lock);
@@ -109,6 +116,13 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
spin_unlock(&queue->lock);
+ if (ent->zero_copied) {
+ err = io_buffer_unregister(ent->queue->ring_ctx,
+ ent->zero_copy_buf_id, issue_flags);
+ WARN_ON_ONCE(err);
+ ent->zero_copied = false;
+ }
+
if (error)
req->out.h.error = error;
@@ -198,6 +212,31 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
return false;
}
+static void fuse_uring_zero_copy_teardown(struct fuse_ring_ent *ent,
+ unsigned int issue_flags)
+{
+ struct fuse_ring_queue *queue = ent->queue;
+
+ spin_lock(&queue->lock);
+
+ if (queue->ring_killed) {
+ spin_unlock(&queue->lock);
+ return;
+ }
+
+ if (!percpu_ref_tryget_live(&queue->ring_ctx->refs)) {
+ spin_unlock(&queue->lock);
+ return;
+ }
+
+ spin_unlock(&queue->lock);
+
+ io_buffer_unregister(queue->ring_ctx, ent->zero_copy_buf_id,
+ issue_flags);
+
+ percpu_ref_put(&queue->ring_ctx->refs);
+}
+
static void fuse_uring_teardown_buffers(struct fuse_ring_queue *queue,
unsigned int issue_flags)
{
@@ -322,9 +361,12 @@ static void io_ring_killed(void *priv)
static int fuse_uring_buf_ring_setup(struct io_uring_cmd *cmd,
struct fuse_ring_queue *queue,
+ bool zero_copy,
unsigned int issue_flags)
{
struct io_ring_ctx *ring_ctx = cmd_to_io_kiocb(cmd)->ctx;
+ const struct fuse_uring_cmd_req *cmd_req;
+ u16 headers_index;
int err;
err = io_uring_buf_ring_pin(ring_ctx, FUSE_URING_RINGBUF_GROUP,
@@ -342,8 +384,24 @@ static int fuse_uring_buf_ring_setup(struct io_uring_cmd *cmd,
if (err)
goto error;
- err = io_uring_cmd_import_fixed_index(cmd,
- FUSE_URING_FIXED_HEADERS_INDEX,
+ if (zero_copy) {
+ err = -EINVAL;
+ if (!capable(CAP_SYS_ADMIN))
+ goto error;
+
+ queue->use_zero_copy = true;
+
+ cmd_req = io_uring_sqe_cmd(cmd->sqe);
+ queue->depth = READ_ONCE(cmd_req->init.queue_depth);
+ if (!queue->depth)
+ goto error;
+
+ headers_index = queue->depth;
+ } else {
+ headers_index = FUSE_URING_FIXED_HEADERS_INDEX;
+ }
+
+ err = io_uring_cmd_import_fixed_index(cmd, headers_index,
ITER_DEST, &queue->headers_iter,
issue_flags);
if (err) {
@@ -367,7 +425,8 @@ static int fuse_uring_buf_ring_setup(struct io_uring_cmd *cmd,
static struct fuse_ring_queue *
fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
- int qid, bool use_bufring, unsigned int issue_flags)
+ int qid, bool use_bufring, bool zero_copy,
+ unsigned int issue_flags)
{
struct fuse_conn *fc = ring->fc;
struct fuse_ring_queue *queue;
@@ -399,12 +458,13 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
fuse_pqueue_init(&queue->fpq);
if (use_bufring) {
- err = fuse_uring_buf_ring_setup(cmd, queue, issue_flags);
- if (err) {
- kfree(pq);
- kfree(queue);
- return ERR_PTR(err);
- }
+ err = fuse_uring_buf_ring_setup(cmd, queue, zero_copy,
+ issue_flags);
+ if (err)
+ goto cleanup;
+ } else if (zero_copy) {
+ err = -EINVAL;
+ goto cleanup;
}
spin_lock(&fc->lock);
@@ -422,6 +482,11 @@ fuse_uring_create_queue(struct io_uring_cmd *cmd, struct fuse_ring *ring,
spin_unlock(&fc->lock);
return queue;
+
+cleanup:
+ kfree(pq);
+ kfree(queue);
+ return ERR_PTR(err);
}
static void fuse_uring_stop_fuse_req_end(struct fuse_req *req)
@@ -466,6 +531,9 @@ static void fuse_uring_entry_teardown(struct fuse_ring_ent *ent)
if (req)
fuse_uring_stop_fuse_req_end(req);
+
+ if (ent->zero_copied)
+ fuse_uring_zero_copy_teardown(ent, IO_URING_F_UNLOCKED);
}
static void fuse_uring_stop_list_entries(struct list_head *head,
@@ -831,6 +899,7 @@ static int setup_fuse_copy_state(struct fuse_copy_state *cs,
cs->is_kaddr = true;
cs->len = ent->payload_kvec.iov_len;
cs->kaddr = ent->payload_kvec.iov_base;
+ cs->skip_folio_copy = can_zero_copy_req(ent, req);
}
cs->is_uring = true;
@@ -863,11 +932,56 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
return err;
}
+
+static int fuse_uring_set_up_zero_copy(struct fuse_ring_ent *ent,
+ struct fuse_req *req,
+ unsigned issue_flags)
+{
+ struct fuse_args_pages *ap;
+ size_t total_bytes = 0;
+ u16 buf_index;
+ struct bio_vec *bvs;
+ int err, ddir, i;
+
+ buf_index = ent->zero_copy_buf_id;
+
+ /* out_pages indicates a read, in_pages indicates a write */
+ ddir = req->args->out_pages ? ITER_DEST : ITER_SOURCE;
+
+ ap = container_of(req->args, typeof(*ap), args);
+
+ /*
+ * We can avoid having to allocate the bvs array when folios and
+ * descriptors are represented by bvecs in fuse
+ */
+ bvs = kcalloc(ap->num_folios, sizeof(*bvs), GFP_KERNEL_ACCOUNT);
+ if (!bvs)
+ return -ENOMEM;
+
+ for (i = 0; i < ap->num_folios; i++) {
+ total_bytes += ap->descs[i].length;
+ bvs[i].bv_page = folio_page(ap->folios[i], 0);
+ bvs[i].bv_offset = ap->descs[i].offset;
+ bvs[i].bv_len = ap->descs[i].length;
+ }
+
+ err = io_buffer_register_bvec(ent->queue->ring_ctx, bvs, ap->num_folios,
+ total_bytes, ddir, buf_index, issue_flags);
+ kfree(bvs);
+ if (err)
+ return err;
+
+ ent->zero_copied = true;
+
+ return 0;
+}
+
/*
* Copy data from the req to the ring buffer
*/
static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
- struct fuse_ring_ent *ent)
+ struct fuse_ring_ent *ent,
+ unsigned int issue_flags)
{
struct fuse_copy_state cs;
struct fuse_args *args = req->args;
@@ -900,6 +1014,11 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
num_args--;
}
+ if (can_zero_copy_req(ent, req)) {
+ err = fuse_uring_set_up_zero_copy(ent, req, issue_flags);
+ if (err)
+ return err;
+ }
/* copy the payload */
err = fuse_copy_args(&cs, num_args, args->in_pages,
(struct fuse_arg *)in_args, 0);
@@ -910,12 +1029,17 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
}
ent_in_out.payload_sz = cs.ring.copied_sz;
+ if (cs.skip_folio_copy && args->in_pages)
+ ent_in_out.payload_sz +=
+ args->in_args[args->in_numargs - 1].size;
+
return copy_header_to_ring(ent, FUSE_URING_HEADER_RING_ENT,
&ent_in_out, sizeof(ent_in_out));
}
static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
- struct fuse_req *req)
+ struct fuse_req *req,
+ unsigned int issue_flags)
{
struct fuse_ring_queue *queue = ent->queue;
struct fuse_ring *ring = queue->ring;
@@ -933,7 +1057,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
return err;
/* copy the request */
- err = fuse_uring_args_to_ring(ring, req, ent);
+ err = fuse_uring_args_to_ring(ring, req, ent, issue_flags);
if (unlikely(err)) {
pr_info_ratelimited("Copy to ring failed: %d\n", err);
return err;
@@ -944,11 +1068,20 @@ static int fuse_uring_copy_to_ring(struct fuse_ring_ent *ent,
sizeof(req->in.h));
}
-static bool fuse_uring_req_has_payload(struct fuse_req *req)
+static bool fuse_uring_req_has_copyable_payload(struct fuse_ring_ent *ent,
+ struct fuse_req *req)
{
struct fuse_args *args = req->args;
- return args->in_numargs > 1 || args->out_numargs;
+ if (!can_zero_copy_req(ent, req))
+ return args->in_numargs > 1 || args->out_numargs;
+
+ if ((args->in_numargs > 1) && (!args->in_pages || args->in_numargs > 2))
+ return true;
+ if (args->out_numargs && (!args->out_pages || args->out_numargs > 1))
+ return true;
+
+ return false;
}
static int fuse_uring_select_buffer(struct fuse_ring_ent *ent,
@@ -1014,7 +1147,7 @@ static int fuse_uring_next_req_update_buffer(struct fuse_ring_ent *ent,
ent->headers_iter.data_source = false;
buffer_selected = ent->payload_kvec.iov_base != 0;
- has_payload = fuse_uring_req_has_payload(req);
+ has_payload = fuse_uring_req_has_copyable_payload(ent, req);
if (has_payload && !buffer_selected)
return fuse_uring_select_buffer(ent, issue_flags);
@@ -1040,22 +1173,23 @@ static int fuse_uring_prep_buffer(struct fuse_ring_ent *ent,
ent->headers_iter.data_source = false;
/* no payload to copy, can skip selecting a buffer */
- if (!fuse_uring_req_has_payload(req))
+ if (!fuse_uring_req_has_copyable_payload(ent, req))
return 0;
return fuse_uring_select_buffer(ent, issue_flags);
}
static int fuse_uring_prepare_send(struct fuse_ring_ent *ent,
- struct fuse_req *req)
+ struct fuse_req *req,
+ unsigned int issue_flags)
{
int err;
- err = fuse_uring_copy_to_ring(ent, req);
+ err = fuse_uring_copy_to_ring(ent, req, issue_flags);
if (!err)
set_bit(FR_SENT, &req->flags);
else
- fuse_uring_req_end(ent, req, err);
+ fuse_uring_req_end(ent, req, err, issue_flags);
return err;
}
@@ -1158,7 +1292,7 @@ static void fuse_uring_commit(struct fuse_ring_ent *ent, struct fuse_req *req,
err = fuse_uring_copy_from_ring(ring, req, ent);
out:
- fuse_uring_req_end(ent, req, err);
+ fuse_uring_req_end(ent, req, err, issue_flags);
}
/*
@@ -1181,7 +1315,7 @@ static bool fuse_uring_get_next_fuse_req(struct fuse_ring_ent *ent,
spin_unlock(&queue->lock);
if (req) {
- err = fuse_uring_prepare_send(ent, req);
+ err = fuse_uring_prepare_send(ent, req, issue_flags);
if (err)
goto retry;
}
@@ -1284,7 +1418,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
err = fuse_uring_prep_buffer(ent, req, ITER_SOURCE, issue_flags);
if (WARN_ON_ONCE(err))
- fuse_uring_req_end(ent, req, err);
+ fuse_uring_req_end(ent, req, err, issue_flags);
else
fuse_uring_commit(ent, req, issue_flags);
@@ -1409,6 +1543,9 @@ fuse_uring_create_ring_ent(struct io_uring_cmd *cmd,
buf_index = READ_ONCE(cmd->sqe->buf_index);
+ if (queue->use_zero_copy)
+ ent->zero_copy_buf_id = buf_index;
+
/* set up the headers */
ent->headers_iter = queue->headers_iter;
iov_iter_advance(&ent->headers_iter, buf_index * header_size);
@@ -1459,6 +1596,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
{
const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
bool use_bufring = READ_ONCE(cmd_req->init.use_bufring);
+ bool zero_copy = READ_ONCE(cmd_req->init.zero_copy);
struct fuse_ring *ring = smp_load_acquire(&fc->ring);
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent;
@@ -1480,11 +1618,12 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
queue = ring->queues[qid];
if (!queue) {
queue = fuse_uring_create_queue(cmd, ring, qid, use_bufring,
- issue_flags);
+ zero_copy, issue_flags);
if (IS_ERR(queue))
return PTR_ERR(queue);
} else {
- if (queue->use_bufring != use_bufring)
+ if ((queue->use_bufring != use_bufring) ||
+ (queue->use_zero_copy != zero_copy))
return -EINVAL;
}
@@ -1587,7 +1726,7 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
int err;
if (!tw.cancel) {
- err = fuse_uring_prepare_send(ent, ent->fuse_req);
+ err = fuse_uring_prepare_send(ent, ent->fuse_req, issue_flags);
if (err) {
if (!fuse_uring_get_next_fuse_req(ent, queue,
issue_flags))
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index a8a849c3497e..3398b43fb1df 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -56,6 +56,11 @@ struct fuse_ring_ent {
* the buffer when done with it
*/
unsigned int ringbuf_buf_id;
+
+ /* True if the request's pages are being zero-copied */
+ bool zero_copied;
+ /* Buf id for this ent's zero-copied pages */
+ unsigned int zero_copy_buf_id;
};
};
@@ -128,6 +133,13 @@ struct fuse_ring_queue {
struct iov_iter headers_iter;
/* synchronized by the queue lock */
struct io_buffer_list *bufring;
+ /*
+ * True if zero copy should be used for payloads. This is only enabled
+ * on privileged servers. Kernel-managed ring buffers must be enabled
+ * in order to use zero copy.
+ */
+ bool use_zero_copy : 1;
+ unsigned int depth;
};
/**
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index aa1d25421054..67b5bed451fe 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -39,6 +39,7 @@ struct fuse_copy_state {
bool is_uring:1;
/* if set, use kaddr; otherwise use pg */
bool is_kaddr:1;
+ bool skip_folio_copy:1;
struct {
unsigned int copied_sz; /* copied size into the user buffer */
} ring;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3041177e3dd8..c98ea7a4ddde 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -243,6 +243,7 @@
*
* 7.46
* - add fuse_uring_cmd_req use_bufring
+ * - add fuse_uring_cmd_req zero_copy and queue_depth
*/
#ifndef _LINUX_FUSE_H
@@ -1312,10 +1313,12 @@ struct fuse_uring_cmd_req {
union {
struct {
bool use_bufring;
+ bool zero_copy;
+ uint16_t queue_depth;
} init;
};
- uint8_t padding[5];
+ uint8_t padding[2];
};
#endif /* _LINUX_FUSE_H */
--
2.47.3
* [PATCH v1 30/30] docs: fuse: add io-uring bufring and zero-copy documentation
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
` (28 preceding siblings ...)
2025-12-03 0:35 ` [PATCH v1 29/30] fuse: add zero-copy over io-uring Joanne Koong
@ 2025-12-03 0:35 ` Joanne Koong
29 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 0:35 UTC (permalink / raw)
To: miklos, axboe
Cc: bschubert, asml.silence, io-uring, csander, xiaobing.li,
linux-fsdevel
Add documentation for fuse over io-uring usage of kernel-managed
bufrings and zero-copy.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
.../filesystems/fuse/fuse-io-uring.rst | 55 ++++++++++++++++++-
1 file changed, 54 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
index d73dd0dbd238..9f98289b0734 100644
--- a/Documentation/filesystems/fuse/fuse-io-uring.rst
+++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
@@ -95,5 +95,58 @@ Sending requests with CQEs
| <fuse_unlink() |
| <sys_unlink() |
+Kernel-managed buffer rings
+===========================
-
+Kernel-managed buffer rings have two main advantages:
+* eliminating the overhead of pinning/unpinning user pages and translating
+ virtual addresses for every server-kernel interaction
+* reducing buffer memory allocation requirements
+
+In order to use buffer rings, the server must preregister the following:
+* a fixed buffer at index 0. This is where the headers will reside
+* a kernel-managed buffer ring. This is where the payload will reside
+
+At a high-level, this is how fuse uses buffer rings:
+* The server registers a kernel-managed buffer ring. In the kernel this
+ allocates the pages needed for the buffers and vmaps them. The server
+ obtains the virtual address for the buffers through an mmap call on the ring
+ fd.
+* When there is a request from a client, fuse will select a buffer from the
+ ring if there is any payload that needs to be copied, copy over the payload
+ to the selected buffer, and copy over the headers to the fixed buffer at
+ index 0, at the buffer id that corresponds to the server (which the server
+ needs to specify through sqe->buf_index).
+* The server obtains a cqe representing the request. The cqe flag will have
+ IORING_CQE_F_BUFFER set if a selected buffer was used for the payload. The
+ buffer id is stashed in cqe->flags (through IORING_CQE_BUFFER_SHIFT). The
+ server can directly access the payload by using that buffer id to calculate
+ the offset into the virtual address obtained for the buffers.
+* The server processes the request and then sends a
+ FUSE_URING_CMD_COMMIT_AND_FETCH sqe with the reply.
+* When the kernel handles the sqe, it will process the reply and if there is a
+ next request, it will reuse the same selected buffer for the request. If
+ there is no next request, it will recycle the buffer back to the ring.
+
+Zero-copy
+=========
+
+Fuse io-uring zero-copy allows the server to directly read from / write to the
+client's pages and bypass any intermediary buffer copies. This is only allowed
+on privileged servers.
+
+In order to use zero-copy, the server must preregister the following:
+* a sparse buffer for every entry in the queue. This is where the client's
+ pages will reside
+* a fixed buffer at index queue_depth (tailing the sparse buffer).
+ This is where the headers will reside
+* a kernel-managed buffer ring. This is where any non-zero-copied payload (eg
+ out headers) will reside
+
+When the client issues a read/write, fuse stores the client's underlying pages
+in the sparse buffer entry corresponding to the ent in the queue. The server
+can then issue reads/writes on these pages through io_uring rw operations.
+Please note that the server is not able to directly access these pages; it
+must go through the io-uring interface to read/write to them. The pages are
+unregistered once the server replies to the request. Non-zero-copyable
+payload (if needed) is placed in a buffer from the kernel-managed buffer ring.
--
2.47.3
* Re: [PATCH v1 06/30] io_uring/kbuf: add buffer ring pinning/unpinning
2025-12-03 0:35 ` [PATCH v1 06/30] io_uring/kbuf: add buffer ring pinning/unpinning Joanne Koong
@ 2025-12-03 4:13 ` Caleb Sander Mateos
2025-12-04 18:41 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-03 4:13 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add kernel APIs to pin and unpin buffer rings, preventing userspace from
> unregistering a buffer ring while it is pinned by the kernel.
>
> This provides a mechanism for kernel subsystems to safely access buffer
> ring contents while ensuring the buffer ring remains valid. A pinned
> buffer ring cannot be unregistered until explicitly unpinned. On the
> userspace side, trying to unregister a pinned buffer will return -EBUSY.
> Pinning an already-pinned bufring is acceptable and returns 0.
>
> The API accepts a "struct io_ring_ctx *ctx" rather than a cmd pointer,
> as the buffer ring may need to be unpinned in contexts where a cmd is
> not readily available.
>
> This is a preparatory change for upcoming fuse usage of kernel-managed
> buffer rings. It is necessary for fuse to pin the buffer ring because
> fuse may need to select a buffer in atomic contexts, which it can only
> do by using the underlying buffer list pointer.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/io_uring/buf.h | 28 +++++++++++++++++++++++
> io_uring/kbuf.c | 43 ++++++++++++++++++++++++++++++++++++
> io_uring/kbuf.h | 5 +++++
> 3 files changed, 76 insertions(+)
> create mode 100644 include/linux/io_uring/buf.h
>
> diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
> new file mode 100644
> index 000000000000..7a1cf197434d
> --- /dev/null
> +++ b/include/linux/io_uring/buf.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef _LINUX_IO_URING_BUF_H
> +#define _LINUX_IO_URING_BUF_H
> +
> +#include <linux/io_uring_types.h>
> +
> +#if defined(CONFIG_IO_URING)
> +int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> + unsigned issue_flags, struct io_buffer_list **bl);
> +int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
> + unsigned issue_flags);
> +#else
> +static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
> + unsigned buf_group,
> + unsigned issue_flags,
> + struct io_buffer_list **bl);
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
> + unsigned buf_group,
> + unsigned issue_flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_IO_URING */
> +
> +#endif /* _LINUX_IO_URING_BUF_H */
> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> index 00ab17a034b5..ddda1338e652 100644
> --- a/io_uring/kbuf.c
> +++ b/io_uring/kbuf.c
> @@ -9,6 +9,7 @@
> #include <linux/poll.h>
> #include <linux/vmalloc.h>
> #include <linux/io_uring.h>
> +#include <linux/io_uring/buf.h>
>
> #include <uapi/linux/io_uring.h>
>
> @@ -237,6 +238,46 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
> return sel;
> }
>
> +int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> + unsigned issue_flags, struct io_buffer_list **bl)
> +{
> + struct io_buffer_list *buffer_list;
> + int ret = -EINVAL;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + buffer_list = io_buffer_get_list(ctx, buf_group);
> + if (likely(buffer_list) && (buffer_list->flags & IOBL_BUF_RING)) {
Since there's no reference-counting of pins, I think it might make
more sense to fail io_uring_buf_ring_pin() if the buffer ring is
already pinned. Otherwise, the buffer ring will be unpinned in the
first call to io_uring_buf_ring_unpin(), when it might still be in use
by another caller of io_uring_buf_ring_pin().
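Something like the following, perhaps (sketch only; "pin_refs" is an
assumed field on struct io_buffer_list, not something in this series):

	buffer_list = io_buffer_get_list(ctx, buf_group);
	if (buffer_list && (buffer_list->flags & IOBL_BUF_RING)) {
		buffer_list->pin_refs++;	/* count pinners */
		*bl = buffer_list;
		ret = 0;
	}

with io_uring_buf_ring_unpin() decrementing the count and
io_unregister_buf_ring() refusing with -EBUSY until it drops to zero.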
Best,
Caleb
> + buffer_list->flags |= IOBL_PINNED;
> + ret = 0;
> + *bl = buffer_list;
> + }
> +
> + io_ring_submit_unlock(ctx, issue_flags);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(io_uring_buf_ring_pin);
> +
> +int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
> + unsigned issue_flags)
> +{
> + struct io_buffer_list *bl;
> + int ret = -EINVAL;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + bl = io_buffer_get_list(ctx, buf_group);
> + if (likely(bl) && (bl->flags & IOBL_BUF_RING) &&
> + (bl->flags & IOBL_PINNED)) {
> + bl->flags &= ~IOBL_PINNED;
> + ret = 0;
> + }
> +
> + io_ring_submit_unlock(ctx, issue_flags);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(io_uring_buf_ring_unpin);
> +
> /* cap it at a reasonable 256, will be one page even for 4K */
> #define PEEK_MAX_IMPORT 256
>
> @@ -743,6 +784,8 @@ int io_unregister_buf_ring(struct io_ring_ctx *ctx, void __user *arg)
> return -ENOENT;
> if (!(bl->flags & IOBL_BUF_RING))
> return -EINVAL;
> + if (bl->flags & IOBL_PINNED)
> + return -EBUSY;
>
> scoped_guard(mutex, &ctx->mmap_lock)
> xa_erase(&ctx->io_bl_xa, bl->bgid);
> diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
> index 11d165888b8e..781630c2cc10 100644
> --- a/io_uring/kbuf.h
> +++ b/io_uring/kbuf.h
> @@ -12,6 +12,11 @@ enum {
> IOBL_INC = 2,
> /* buffers are kernel managed */
> IOBL_KERNEL_MANAGED = 4,
> + /*
> + * buffer ring is pinned and cannot be unregistered by userspace until
> + * it has been unpinned
> + */
> + IOBL_PINNED = 8,
> };
>
> struct io_buffer_list {
> --
> 2.47.3
>
* Re: [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
2025-12-03 0:35 ` [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning Joanne Koong
@ 2025-12-03 4:49 ` Caleb Sander Mateos
2025-12-03 22:52 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-03 4:49 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add kernel APIs to pin and unpin the buffer table for fixed buffers,
> preventing userspace from unregistering or updating the fixed buffers
> table while it is pinned by the kernel.
>
> This has two advantages:
> a) Eliminating the overhead of having to fetch and construct an iter for
> a fixed buffer per every cmd. Instead, the caller can pin the buffer
> table, fetch/construct the iter once, and use that across cmds for
> however long it needs to until it is ready to unpin the buffer table.
>
> b) Allowing a fixed buffer lookup at any index. The buffer table must be
> pinned in order to allow this; otherwise we would have to keep track of
> all the nodes that have been looked up by the io_kiocb so that we can
> properly adjust the refcounts for those nodes. Ensuring that the buffer
> table must first be pinned before being able to fetch a buffer at any
> index makes things logistically a lot neater.
Why is it necessary to pin the entire buffer table rather than
specific entries? That's the purpose of the existing io_rsrc_node refs
field.
>
> This is a preparatory patch for fuse io-uring's usage of fixed buffers.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/io_uring/buf.h | 13 +++++++++++
> include/linux/io_uring_types.h | 9 ++++++++
> io_uring/rsrc.c | 42 ++++++++++++++++++++++++++++++++++
> 3 files changed, 64 insertions(+)
>
> diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
> index 7a1cf197434d..c997c01c24c4 100644
> --- a/include/linux/io_uring/buf.h
> +++ b/include/linux/io_uring/buf.h
> @@ -9,6 +9,9 @@ int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> unsigned issue_flags, struct io_buffer_list **bl);
> int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
> unsigned issue_flags);
> +
> +int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags);
> +int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
> #else
> static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
> unsigned buf_group,
> @@ -23,6 +26,16 @@ static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
> {
> return -EOPNOTSUPP;
> }
> +static inline int io_uring_buf_table_pin(struct io_ring_ctx *ctx,
> + unsigned issue_flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline int io_uring_buf_table_unpin(struct io_ring_ctx *ctx,
> + unsigned issue_flags)
> +{
> + return -EOPNOTSUPP;
> +}
> #endif /* CONFIG_IO_URING */
>
> #endif /* _LINUX_IO_URING_BUF_H */
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 36fac08db636..e1a75cfe57d9 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -57,8 +57,17 @@ struct io_wq_work {
> int cancel_seq;
> };
>
> +/*
> + * struct io_rsrc_data flag values:
> + *
> + * IO_RSRC_DATA_PINNED: data is pinned and cannot be unregistered by userspace
> + * until it has been unpinned. Currently this is only possible on buffer tables.
> + */
> +#define IO_RSRC_DATA_PINNED BIT(0)
> +
> struct io_rsrc_data {
> unsigned int nr;
> + u8 flags;
> struct io_rsrc_node **nodes;
> };
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 3765a50329a8..67331cae0a5a 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -9,6 +9,7 @@
> #include <linux/hugetlb.h>
> #include <linux/compat.h>
> #include <linux/io_uring.h>
> +#include <linux/io_uring/buf.h>
> #include <linux/io_uring/cmd.h>
>
> #include <uapi/linux/io_uring.h>
> @@ -304,6 +305,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> return -ENXIO;
> if (up->offset + nr_args > ctx->buf_table.nr)
> return -EINVAL;
> + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> + return -EBUSY;
IORING_REGISTER_CLONE_BUFFERS can also be used to unregister existing
buffers, so it may need the check too?
>
> for (done = 0; done < nr_args; done++) {
> struct io_rsrc_node *node;
> @@ -615,6 +618,8 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
> {
> if (!ctx->buf_table.nr)
> return -ENXIO;
> + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> + return -EBUSY;
io_buffer_unregister_bvec() can also be used to unregister ublk
zero-copy buffers (also under control of userspace), so it may need
the check too? But maybe fuse ensures that it never uses a ublk
zero-copy buffer?
Best,
Caleb
> io_rsrc_data_free(ctx, &ctx->buf_table);
> return 0;
> }
> @@ -1580,3 +1585,40 @@ int io_prep_reg_iovec(struct io_kiocb *req, struct iou_vec *iv,
> req->flags |= REQ_F_IMPORT_BUFFER;
> return 0;
> }
> +
> +int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags)
> +{
> + struct io_rsrc_data *data;
> + int err = 0;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + data = &ctx->buf_table;
> + /* There was nothing registered. There is nothing to pin */
> + if (!data->nr)
> + err = -ENXIO;
> + else
> + data->flags |= IO_RSRC_DATA_PINNED;
> +
> + io_ring_submit_unlock(ctx, issue_flags);
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(io_uring_buf_table_pin);
> +
> +int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags)
> +{
> + struct io_rsrc_data *data;
> + int err = 0;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + data = &ctx->buf_table;
> + if (WARN_ON_ONCE(!(data->flags & IO_RSRC_DATA_PINNED)))
> + err = -EINVAL;
> + else
> + data->flags &= ~IO_RSRC_DATA_PINNED;
> +
> + io_ring_submit_unlock(ctx, issue_flags);
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(io_uring_buf_table_unpin);
> --
> 2.47.3
>
* Re: [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
2025-12-03 0:35 ` [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index() Joanne Koong
@ 2025-12-03 21:43 ` Caleb Sander Mateos
2025-12-04 18:56 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-03 21:43 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add a new helper, io_uring_cmd_import_fixed_index(). This takes in a
> buffer index. This requires the buffer table to have been pinned
> beforehand. The caller is responsible for ensuring it does not use the
> returned iter after the buffer table has been unpinned.
>
> This is a preparatory patch needed for fuse-over-io-uring support, as
> the metadata for fuse requests will be stored at the last index, which
> will be different from the sqe's buffer index.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/io_uring/cmd.h | 10 ++++++++++
> io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
> io_uring/rsrc.h | 2 ++
> io_uring/uring_cmd.c | 11 +++++++++++
> 4 files changed, 54 insertions(+)
>
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 375fd048c4cb..a4b5eae2e5d1 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -44,6 +44,9 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
> size_t uvec_segs,
> int ddir, struct iov_iter *iter,
> unsigned issue_flags);
> +int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
> + int ddir, struct iov_iter *iter,
> + unsigned int issue_flags);
>
> /*
> * Completes the request, i.e. posts an io_uring CQE and deallocates @ioucmd
> @@ -100,6 +103,13 @@ static inline int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
> {
> return -EOPNOTSUPP;
> }
> +static inline int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd,
> + u16 buf_index, int ddir,
> + struct iov_iter *iter,
> + unsigned int issue_flags)
> +{
> + return -EOPNOTSUPP;
> +}
> static inline void __io_uring_cmd_done(struct io_uring_cmd *cmd, s32 ret,
> u64 ret2, unsigned issue_flags, bool is_cqe32)
> {
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 67331cae0a5a..b6dd62118311 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
> }
>
> +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> + u16 buf_index, int ddir, unsigned issue_flags)
> +{
> + struct io_ring_ctx *ctx = req->ctx;
> + struct io_rsrc_node *node;
> + struct io_mapped_ubuf *imu;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + if (buf_index >= req->ctx->buf_table.nr ||
This condition is already checked in io_rsrc_node_lookup() below.
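For reference, io_rsrc_node_lookup() is roughly the following
(paraphrased from io_uring/rsrc.h; see the tree for the exact form):

    static inline struct io_rsrc_node *
    io_rsrc_node_lookup(struct io_rsrc_data *data, int index)
    {
            if (index < data->nr)
                    return data->nodes[array_index_nospec(index, data->nr)];
            return NULL;    /* out-of-range indices come back as NULL */
    }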
> + !(ctx->buf_table.flags & IO_RSRC_DATA_PINNED)) {
> + io_ring_submit_unlock(ctx, issue_flags);
> + return -EINVAL;
> + }
> +
> + /*
> + * We don't have to grab the reference on the node because the buffer
> + * table is pinned. The caller is responsible for ensuring the iter
> + * isn't used after the buffer table has been unpinned.
> + */
> + node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
> + io_ring_submit_unlock(ctx, issue_flags);
> +
> + if (!node || !node->buf)
> + return -EFAULT;
> +
> + imu = node->buf;
> +
> + return io_import_fixed(ddir, iter, imu, imu->ubuf, imu->len);
> +}
> +
> /* Lock two rings at once. The rings must be different! */
> static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
> {
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index d603f6a47f5e..658934f4d3ff 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -64,6 +64,8 @@ struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
> int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> u64 buf_addr, size_t len, int ddir,
> unsigned issue_flags);
> +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> + u16 buf_index, int ddir, unsigned issue_flags);
> int io_import_reg_vec(int ddir, struct iov_iter *iter,
> struct io_kiocb *req, struct iou_vec *vec,
> unsigned nr_iovs, unsigned issue_flags);
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 197474911f04..e077eba00efe 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -314,6 +314,17 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
> }
> EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_vec);
>
> +int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
> + int ddir, struct iov_iter *iter,
> + unsigned int issue_flags)
> +{
> + struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> +
> + return io_import_reg_buf_index(req, iter, buf_index, ddir,
> + issue_flags);
> +}
Probably would make sense to make this an inline function, since it
immediately defers to io_import_reg_buf_index().
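Something like this, as a sketch, assuming io_import_reg_buf_index()
were exported or otherwise made visible to the header:

    static inline int
    io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd,
                                    u16 buf_index, int ddir,
                                    struct iov_iter *iter,
                                    unsigned int issue_flags)
    {
            return io_import_reg_buf_index(cmd_to_io_kiocb(ioucmd), iter,
                                           buf_index, ddir, issue_flags);
    }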
Best,
Caleb
> +EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_index);
> +
> void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
> {
> struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> --
> 2.47.3
>
* Re: [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection
2025-12-03 0:35 ` [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
@ 2025-12-03 21:53 ` Caleb Sander Mateos
2025-12-04 19:22 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-03 21:53 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Return the id of the selected buffer in io_buffer_select(). This is
> needed for kernel-managed buffer rings to later recycle the selected
> buffer.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/io_uring/cmd.h | 2 +-
> include/linux/io_uring_types.h | 2 ++
> io_uring/kbuf.c | 7 +++++--
> 3 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index a4b5eae2e5d1..795b846d1e11 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -74,7 +74,7 @@ void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd);
>
> /*
> * Select a buffer from the provided buffer group for multishot uring_cmd.
> - * Returns the selected buffer address and size.
> + * Returns the selected buffer address, size, and id.
> */
> struct io_br_sel io_uring_cmd_buffer_select(struct io_uring_cmd *ioucmd,
> unsigned buf_group, size_t *len,
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e1a75cfe57d9..dcc95e73f12f 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -109,6 +109,8 @@ struct io_br_sel {
> void *kaddr;
> };
> ssize_t val;
> + /* id of the selected buffer */
> + unsigned buf_id;
Looks like this could be unioned with val? I think val's size can be
reduced to an int since only int values are assigned to it.
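i.e. something along these lines (sketch only; the pointer members
stay as in the diff):

    struct io_br_sel {
            /* buf_list and the addr/kaddr union stay as in the diff */
            ...
            union {
                    int val;          /* result or error code */
                    unsigned buf_id;  /* id of the selected buffer */
            };
    };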
> };
>
>
> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> index 8a94de6e530f..3ecb6494adea 100644
> --- a/io_uring/kbuf.c
> +++ b/io_uring/kbuf.c
> @@ -239,6 +239,7 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
> req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
> req->buf_index = buf->bid;
> sel.buf_list = bl;
> + sel.buf_id = buf->bid;
This is userspace mapped, so probably should be using READ_ONCE() and
reusing the value between req->buf_index and buf->bid? Looks like an
existing bug that the reads of buf->bid and buf->addr aren't using
READ_ONCE().
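A minimal sketch of what I mean:

    u16 bid = READ_ONCE(buf->bid);

    req->buf_index = bid;
    sel.buf_id = bid;
    /* the read of buf->addr below would want READ_ONCE() as well */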
> if (bl->flags & IOBL_KERNEL_MANAGED)
> sel.kaddr = (void *)buf->addr;
> else
> @@ -262,10 +263,12 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
>
> bl = io_buffer_get_list(ctx, buf_group);
> if (likely(bl)) {
> - if (bl->flags & IOBL_BUF_RING)
> + if (bl->flags & IOBL_BUF_RING) {
> sel = io_ring_buffer_select(req, len, bl, issue_flags);
> - else
> + } else {
> sel.addr = io_provided_buffer_select(req, len, bl);
> + sel.buf_id = req->buf_index;
Could this cover both IOBL_BUF_RING and !IOBL_BUF_RING cases to avoid
the additional logic in io_ring_buffer_select()?
Best,
Caleb
> + }
> }
> io_ring_submit_unlock(req->ctx, issue_flags);
> return sel;
> --
> 2.47.3
>
* Re: [PATCH v1 14/30] io_uring: add release callback for ring death
2025-12-03 0:35 ` [PATCH v1 14/30] io_uring: add release callback for ring death Joanne Koong
@ 2025-12-03 22:25 ` Caleb Sander Mateos
2025-12-03 22:54 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-03 22:25 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Allow registering a release callback on a ring context that will be
> called when the ring is about to be destroyed.
>
> This is a preparatory patch for fuse. Fuse will be pinning buffers and
> registering bvecs, which requires cleanup whenever a server
> disconnects. It needs to know if the ring is alive when the server has
> disconnected, to avoid double-freeing or accessing invalid memory.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> include/linux/io_uring.h | 9 +++++++++
> include/linux/io_uring_types.h | 2 ++
> io_uring/io_uring.c | 15 +++++++++++++++
> 3 files changed, 26 insertions(+)
>
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c..327fd8ac6e42 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -2,6 +2,7 @@
> #ifndef _LINUX_IO_URING_H
> #define _LINUX_IO_URING_H
>
> +#include <linux/io_uring_types.h>
> #include <linux/sched.h>
> #include <linux/xarray.h>
> #include <uapi/linux/io_uring.h>
> @@ -28,6 +29,9 @@ static inline void io_uring_free(struct task_struct *tsk)
> if (tsk->io_uring)
> __io_uring_free(tsk);
> }
> +void io_uring_set_release_callback(struct io_ring_ctx *ctx,
> + void (*release)(void *), void *priv,
> + unsigned int issue_flags);
> #else
> static inline void io_uring_task_cancel(void)
> {
> @@ -46,6 +50,11 @@ static inline bool io_is_uring_fops(struct file *file)
> {
> return false;
> }
> +static inline void
> +io_uring_set_release_callback(struct io_ring_ctx *ctx, void (*release)(void *),
> + void *priv, unsigned int issue_flags)
> +{
> +}
> #endif
>
> #endif
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index dcc95e73f12f..67c66658e3ec 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -441,6 +441,8 @@ struct io_ring_ctx {
> struct work_struct exit_work;
> struct list_head tctx_list;
> struct completion ref_comp;
> + void (*release)(void *);
> + void *priv;
>
> /* io-wq management, e.g. thread count */
> u32 iowq_limits[2];
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1e58fc1d5667..04ffcfa6f2d6 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2952,6 +2952,19 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
> return mask;
> }
>
> +void io_uring_set_release_callback(struct io_ring_ctx *ctx,
> + void (*release)(void *), void *priv,
> + unsigned int issue_flags)
> +{
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + ctx->release = release;
> + ctx->priv = priv;
Looks like this doesn't support the registration of multiple release
callbacks. Should there be a WARN_ON() to that effect?
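e.g. something like (hypothetical):

    io_ring_submit_lock(ctx, issue_flags);
    /* only a single release callback is supported */
    WARN_ON_ONCE(release && ctx->release);
    ctx->release = release;
    ctx->priv = priv;
    io_ring_submit_unlock(ctx, issue_flags);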
Best,
Caleb
> +
> + io_ring_submit_unlock(ctx, issue_flags);
> +}
> +EXPORT_SYMBOL_GPL(io_uring_set_release_callback);
> +
> struct io_tctx_exit {
> struct callback_head task_work;
> struct completion completion;
> @@ -3099,6 +3112,8 @@ static int io_uring_release(struct inode *inode, struct file *file)
> struct io_ring_ctx *ctx = file->private_data;
>
> file->private_data = NULL;
> + if (ctx->release)
> + ctx->release(ctx->priv);
> io_ring_ctx_wait_and_kill(ctx);
> return 0;
> }
> --
> 2.47.3
>
* Re: [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
2025-12-03 4:49 ` Caleb Sander Mateos
@ 2025-12-03 22:52 ` Joanne Koong
2025-12-04 1:24 ` Caleb Sander Mateos
0 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 22:52 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 8:49 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add kernel APIs to pin and unpin the buffer table for fixed buffers,
> > preventing userspace from unregistering or updating the fixed buffers
> > table while it is pinned by the kernel.
> >
> > This has two advantages:
> > a) Eliminating the overhead of having to fetch and construct an iter for
> > a fixed buffer per every cmd. Instead, the caller can pin the buffer
> > table, fetch/construct the iter once, and use that across cmds for
> > however long it needs to until it is ready to unpin the buffer table.
> >
> > b) Allowing a fixed buffer lookup at any index. The buffer table must be
> > pinned in order to allow this, otherwise we would have to keep track of
> > all the nodes that have been looked up by the io_kiocb so that we can
> > properly adjust the refcounts for those nodes. Ensuring that the buffer
> > table must first be pinned before being able to fetch a buffer at any
> > index makes things logistically a lot neater.
>
> Why is it necessary to pin the entire buffer table rather than
> specific entries? That's the purpose of the existing io_rsrc_node refs
> field.
How would this work with userspace buffer unregistration (which works
at the table level)? If buffer unregistration should still succeed,
then fuse would need a way to be notified that the buffer has been
unregistered, since the buffer belongs to userspace (eg it would be
wrong for fuse to keep using the buffer even though it retains a
refcount on it). If buffer unregistration should fail, then we would
need to track the pinned state inside the node instead of relying on
the refs field alone, since buffers can be unregistered even while
there are in-flight refs (eg we would need to differentiate a ref
taken by a pin from an ordinary ref). I think this would also make
unregistration more cumbersome (eg we would have to iterate through
all the entries to check whether any are pinned before iterating
through them again to do the actual unregistration).
>
> >
> > This is a preparatory patch for fuse io-uring's usage of fixed buffers.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/io_uring/buf.h | 13 +++++++++++
> > include/linux/io_uring_types.h | 9 ++++++++
> > io_uring/rsrc.c | 42 ++++++++++++++++++++++++++++++++++
> > 3 files changed, 64 insertions(+)
> >
> > diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
> > index 7a1cf197434d..c997c01c24c4 100644
> > --- a/include/linux/io_uring/buf.h
> > +++ b/include/linux/io_uring/buf.h
> > @@ -9,6 +9,9 @@ int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> > unsigned issue_flags, struct io_buffer_list **bl);
> > int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
> > unsigned issue_flags);
> > +
> > +int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags);
> > +int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
> > #else
> > static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
> > unsigned buf_group,
> > @@ -23,6 +26,16 @@ static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
> > {
> > return -EOPNOTSUPP;
> > }
> > +static inline int io_uring_buf_table_pin(struct io_ring_ctx *ctx,
> > + unsigned issue_flags)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +static inline int io_uring_buf_table_unpin(struct io_ring_ctx *ctx,
> > + unsigned issue_flags)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > #endif /* CONFIG_IO_URING */
> >
> > #endif /* _LINUX_IO_URING_BUF_H */
> > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > index 36fac08db636..e1a75cfe57d9 100644
> > --- a/include/linux/io_uring_types.h
> > +++ b/include/linux/io_uring_types.h
> > @@ -57,8 +57,17 @@ struct io_wq_work {
> > int cancel_seq;
> > };
> >
> > +/*
> > + * struct io_rsrc_data flag values:
> > + *
> > + * IO_RSRC_DATA_PINNED: data is pinned and cannot be unregistered by userspace
> > + * until it has been unpinned. Currently this is only possible on buffer tables.
> > + */
> > +#define IO_RSRC_DATA_PINNED BIT(0)
> > +
> > struct io_rsrc_data {
> > unsigned int nr;
> > + u8 flags;
> > struct io_rsrc_node **nodes;
> > };
> >
> > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > index 3765a50329a8..67331cae0a5a 100644
> > --- a/io_uring/rsrc.c
> > +++ b/io_uring/rsrc.c
> > @@ -9,6 +9,7 @@
> > #include <linux/hugetlb.h>
> > #include <linux/compat.h>
> > #include <linux/io_uring.h>
> > +#include <linux/io_uring/buf.h>
> > #include <linux/io_uring/cmd.h>
> >
> > #include <uapi/linux/io_uring.h>
> > @@ -304,6 +305,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> > return -ENXIO;
> > if (up->offset + nr_args > ctx->buf_table.nr)
> > return -EINVAL;
> > + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> > + return -EBUSY;
>
> IORING_REGISTER_CLONE_BUFFERS can also be used to unregister existing
> buffers, so it may need the check too?
Ah, I didn't realize this existed, thanks. IMO it's okay to clone the
buffers in a source ring's pinned buffer table to the destination ring
(where the destination ring's buffer table is unpinned), since the
clone acquires its own refcounts on the underlying nodes and is its
own entity. Do you think this makes sense, or do you think it's better
to just not allow this?
>
> >
> > for (done = 0; done < nr_args; done++) {
> > struct io_rsrc_node *node;
> > @@ -615,6 +618,8 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
> > {
> > if (!ctx->buf_table.nr)
> > return -ENXIO;
> > + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> > + return -EBUSY;
>
> io_buffer_unregister_bvec() can also be used to unregister ublk
> zero-copy buffers (also under control of userspace), so it may need
> the check too? But maybe fuse ensures that it never uses a ublk
> zero-copy buffer?
fuse doesn't expose a way for userspace to unregister a zero-copy
buffer, but thanks for considering this possibility.
Thanks,
Joanne
>
> Best,
> Caleb
* Re: [PATCH v1 14/30] io_uring: add release callback for ring death
2025-12-03 22:25 ` Caleb Sander Mateos
@ 2025-12-03 22:54 ` Joanne Koong
0 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-03 22:54 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Wed, Dec 3, 2025 at 2:25 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Allow registering a release callback on a ring context that will be
> > called when the ring is about to be destroyed.
> >
> > This is a preparatory patch for fuse. Fuse will be pinning buffers and
> > registering bvecs, which requires cleanup whenever a server
> > disconnects. It needs to know if the ring is alive when the server has
> > disconnected, to avoid double-freeing or accessing invalid memory.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/io_uring.h | 9 +++++++++
> > include/linux/io_uring_types.h | 2 ++
> > io_uring/io_uring.c | 15 +++++++++++++++
> > 3 files changed, 26 insertions(+)
> >
> > diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> > index 85fe4e6b275c..327fd8ac6e42 100644
> >
> > +void io_uring_set_release_callback(struct io_ring_ctx *ctx,
> > + void (*release)(void *), void *priv,
> > + unsigned int issue_flags)
> > +{
> > + io_ring_submit_lock(ctx, issue_flags);
> > +
> > + ctx->release = release;
> > + ctx->priv = priv;
>
> Looks like this doesn't support the registration of multiple release
> callbacks. Should there be a WARN_ON() to that effect?
Great idea, I'll add that in for v2.
Thanks,
Joanne
>
> Best,
> Caleb
* Re: [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
2025-12-03 22:52 ` Joanne Koong
@ 2025-12-04 1:24 ` Caleb Sander Mateos
2025-12-04 20:07 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-04 1:24 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Wed, Dec 3, 2025 at 2:52 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Dec 2, 2025 at 8:49 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Add kernel APIs to pin and unpin the buffer table for fixed buffers,
> > > preventing userspace from unregistering or updating the fixed buffers
> > > table while it is pinned by the kernel.
> > >
> > > This has two advantages:
> > > a) Eliminating the overhead of having to fetch and construct an iter for
> > > a fixed buffer per every cmd. Instead, the caller can pin the buffer
> > > table, fetch/construct the iter once, and use that across cmds for
> > > however long it needs to until it is ready to unpin the buffer table.
> > >
> > > b) Allowing a fixed buffer lookup at any index. The buffer table must be
> > > pinned in order to allow this, otherwise we would have to keep track of
> > > all the nodes that have been looked up by the io_kiocb so that we can
> > > properly adjust the refcounts for those nodes. Ensuring that the buffer
> > > table must first be pinned before being able to fetch a buffer at any
> > > index makes things logistically a lot neater.
> >
> > Why is it necessary to pin the entire buffer table rather than
> > specific entries? That's the purpose of the existing io_rsrc_node refs
> > field.
>
> How would this work with userspace buffer unregistration (which works
> at the table level)? If buffer unregistration should still succeed,
> then fuse would need a way to be notified that the buffer has been
> unregistered, since the buffer belongs to userspace (eg it would be
> wrong for fuse to keep using the buffer even though it retains a
> refcount on it). If buffer unregistration should fail, then we would
> need to track the pinned state inside the node instead of relying on
> the refs field alone, since buffers can be unregistered even while
> there are in-flight refs (eg we would need to differentiate a ref
> taken by a pin from an ordinary ref). I think this would also make
> unregistration more cumbersome (eg we would have to iterate through
> all the entries to check whether any are pinned before iterating
> through them again to do the actual unregistration).
Not sure I would say buffer unregistration operates on the table as a
whole. Each registered buffer node is unregistered individually and
stores its own reference count. io_put_rsrc_node() will be called on
each buffer node in the table. However, io_put_rsrc_node() just
removes the one reference from the buffer node. If there are other
references on the buffer node (such as an inflight io_uring request
using it), io_free_rsrc_node() won't be called to free the buffer node
until all those references are dropped too. So fuse holding a
reference on the buffer node would allow it to be unregistered, but
prevent it from being freed until fuse dropped its reference.
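Roughly, the put path looks like this (paraphrased from
io_uring/rsrc.h):

    static inline void io_put_rsrc_node(struct io_ring_ctx *ctx,
                                        struct io_rsrc_node *node)
    {
            if (node && !--node->refs)  /* freed only on the last put */
                    io_free_rsrc_node(ctx, node);
    }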
I'm not sure I understand the problem with fuse continuing to hold
onto a registered buffer node after userspace has unregistered it from
the buffer table. (It looks like the buffer node in question is the
one at FUSE_URING_FIXED_HEADERS_INDEX?) Wouldn't pinning the buffer
table present similar issues? How would userspace get fuse to drop its
pin if it wants to modify the buffer registrations? I would imagine
the code path that calls io_uring_buf_table_unpin() currently could
instead call into io_put_rsrc_node() (maybe by completing an io_uring
request that has imported the registered buffer) to release its
reference on the buffer node. For ublk, userspace can request to stop
a ublk device or the kernel will do so automatically if userspace
drops its file handle (e.g. if the process exits), which will release
any io_uring resources the ublk device is using.
>
> >
> > >
> > > This is a preparatory patch for fuse io-uring's usage of fixed buffers.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > > include/linux/io_uring/buf.h | 13 +++++++++++
> > > include/linux/io_uring_types.h | 9 ++++++++
> > > io_uring/rsrc.c | 42 ++++++++++++++++++++++++++++++++++
> > > 3 files changed, 64 insertions(+)
> > >
> > > diff --git a/include/linux/io_uring/buf.h b/include/linux/io_uring/buf.h
> > > index 7a1cf197434d..c997c01c24c4 100644
> > > --- a/include/linux/io_uring/buf.h
> > > +++ b/include/linux/io_uring/buf.h
> > > @@ -9,6 +9,9 @@ int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> > > unsigned issue_flags, struct io_buffer_list **bl);
> > > int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx, unsigned buf_group,
> > > unsigned issue_flags);
> > > +
> > > +int io_uring_buf_table_pin(struct io_ring_ctx *ctx, unsigned issue_flags);
> > > +int io_uring_buf_table_unpin(struct io_ring_ctx *ctx, unsigned issue_flags);
> > > #else
> > > static inline int io_uring_buf_ring_pin(struct io_ring_ctx *ctx,
> > > unsigned buf_group,
> > > @@ -23,6 +26,16 @@ static inline int io_uring_buf_ring_unpin(struct io_ring_ctx *ctx,
> > > {
> > > return -EOPNOTSUPP;
> > > }
> > > +static inline int io_uring_buf_table_pin(struct io_ring_ctx *ctx,
> > > + unsigned issue_flags)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +static inline int io_uring_buf_table_unpin(struct io_ring_ctx *ctx,
> > > + unsigned issue_flags)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > #endif /* CONFIG_IO_URING */
> > >
> > > #endif /* _LINUX_IO_URING_BUF_H */
> > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > index 36fac08db636..e1a75cfe57d9 100644
> > > --- a/include/linux/io_uring_types.h
> > > +++ b/include/linux/io_uring_types.h
> > > @@ -57,8 +57,17 @@ struct io_wq_work {
> > > int cancel_seq;
> > > };
> > >
> > > +/*
> > > + * struct io_rsrc_data flag values:
> > > + *
> > > + * IO_RSRC_DATA_PINNED: data is pinned and cannot be unregistered by userspace
> > > + * until it has been unpinned. Currently this is only possible on buffer tables.
> > > + */
> > > +#define IO_RSRC_DATA_PINNED BIT(0)
> > > +
> > > struct io_rsrc_data {
> > > unsigned int nr;
> > > + u8 flags;
> > > struct io_rsrc_node **nodes;
> > > };
> > >
> > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > index 3765a50329a8..67331cae0a5a 100644
> > > --- a/io_uring/rsrc.c
> > > +++ b/io_uring/rsrc.c
> > > @@ -9,6 +9,7 @@
> > > #include <linux/hugetlb.h>
> > > #include <linux/compat.h>
> > > #include <linux/io_uring.h>
> > > +#include <linux/io_uring/buf.h>
> > > #include <linux/io_uring/cmd.h>
> > >
> > > #include <uapi/linux/io_uring.h>
> > > @@ -304,6 +305,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> > > return -ENXIO;
> > > if (up->offset + nr_args > ctx->buf_table.nr)
> > > return -EINVAL;
> > > + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> > > + return -EBUSY;
> >
> > IORING_REGISTER_CLONE_BUFFERS can also be used to unregister existing
> > buffers, so it may need the check too?
>
> Ah, I didn't realize this existed, thanks. IMO it's okay to clone the
> buffers in a source ring's pinned buffer table to the destination ring
> (where the destination ring's buffer table is unpinned), since the
> clone acquires its own refcounts on the underlying nodes and is its
> own entity. Do you think this makes sense, or do you think it's better
> to just not allow this?
I think cloning buffers to unused buffer table slots on another ring
is fine (analogous to registering new buffers in unused slots). But
with IORING_REGISTER_DST_REPLACE set, it can also be used to
unregister whatever existing buffers happen to be registered in those
slots.
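So the pin check may also be needed in io_clone_buffers(), something
like the following (hypothetical placement, reusing the flag from this
patch; arg is the clone-buffers argument struct):

    /* ctx here is the destination ring */
    if ((arg->flags & IORING_REGISTER_DST_REPLACE) &&
        (ctx->buf_table.flags & IO_RSRC_DATA_PINNED))
            return -EBUSY;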
Best,
Caleb
>
> >
> > >
> > > for (done = 0; done < nr_args; done++) {
> > > struct io_rsrc_node *node;
> > > @@ -615,6 +618,8 @@ int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
> > > {
> > > if (!ctx->buf_table.nr)
> > > return -ENXIO;
> > > + if (ctx->buf_table.flags & IO_RSRC_DATA_PINNED)
> > > + return -EBUSY;
> >
> > io_buffer_unregister_bvec() can also be used to unregister ublk
> > zero-copy buffers (also under control of userspace), so it may need
> > the check too? But maybe fuse ensures that it never uses a ublk
> > zero-copy buffer?
>
> fuse doesn't expose a way for userspace to unregister a zero-copy
> buffer, but thanks for considering this possibility.
>
> Thanks,
> Joanne
> >
> > Best,
> > Caleb
* Re: [PATCH v1 06/30] io_uring/kbuf: add buffer ring pinning/unpinning
2025-12-03 4:13 ` Caleb Sander Mateos
@ 2025-12-04 18:41 ` Joanne Koong
0 siblings, 0 replies; 51+ messages in thread
From: Joanne Koong @ 2025-12-04 18:41 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 8:13 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add kernel APIs to pin and unpin buffer rings, preventing userspace from
> > unregistering a buffer ring while it is pinned by the kernel.
> >
> > This provides a mechanism for kernel subsystems to safely access buffer
> > ring contents while ensuring the buffer ring remains valid. A pinned
> > buffer ring cannot be unregistered until explicitly unpinned. On the
> > userspace side, trying to unregister a pinned buffer will return -EBUSY.
> > Pinning an already-pinned bufring is acceptable and returns 0.
> >
> > The API accepts a "struct io_ring_ctx *ctx" rather than a cmd pointer,
> > as the buffer ring may need to be unpinned in contexts where a cmd is
> > not readily available.
> >
> > This is a preparatory change for upcoming fuse usage of kernel-managed
> > buffer rings. It is necessary for fuse to pin the buffer ring because
> > fuse may need to select a buffer in atomic contexts, which it can only
> > do so by using the underlying buffer list pointer.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/io_uring/buf.h | 28 +++++++++++++++++++++++
> > io_uring/kbuf.c | 43 ++++++++++++++++++++++++++++++++++++
> > io_uring/kbuf.h | 5 +++++
> > 3 files changed, 76 insertions(+)
> > create mode 100644 include/linux/io_uring/buf.h
> >
> > diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> > index 00ab17a034b5..ddda1338e652 100644
> > --- a/io_uring/kbuf.c
> > +++ b/io_uring/kbuf.c
> > @@ -9,6 +9,7 @@
> > #include <linux/poll.h>
> > #include <linux/vmalloc.h>
> > #include <linux/io_uring.h>
> > +#include <linux/io_uring/buf.h>
> >
> > #include <uapi/linux/io_uring.h>
> >
> > @@ -237,6 +238,46 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
> > return sel;
> > }
> >
> > +int io_uring_buf_ring_pin(struct io_ring_ctx *ctx, unsigned buf_group,
> > + unsigned issue_flags, struct io_buffer_list **bl)
> > +{
> > + struct io_buffer_list *buffer_list;
> > + int ret = -EINVAL;
> > +
> > + io_ring_submit_lock(ctx, issue_flags);
> > +
> > + buffer_list = io_buffer_get_list(ctx, buf_group);
> > + if (likely(buffer_list) && (buffer_list->flags & IOBL_BUF_RING)) {
>
> Since there's no reference-counting of pins, I think it might make
> more sense to fail io_uring_buf_ring_pin() if the buffer ring is
> already pinned. Otherwise, the buffer ring will be unpinned in the
> first call to io_uring_buf_ring_unpin(), when it might still be in use
> by another caller of io_uring_buf_ring_pin().
That makes sense. I'll change this to return -EALREADY if the buffer
ring is already pinned.
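i.e. roughly the following, where IOBL_PINNED stands in for whatever
flag name v2 ends up using:

    buffer_list = io_buffer_get_list(ctx, buf_group);
    if (likely(buffer_list) && (buffer_list->flags & IOBL_BUF_RING)) {
            if (buffer_list->flags & IOBL_PINNED) {
                    ret = -EALREADY;
            } else {
                    buffer_list->flags |= IOBL_PINNED;
                    *bl = buffer_list;
                    ret = 0;
            }
    }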
Thanks,
Joanne
>
> Best,
> Caleb
>
* Re: [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
2025-12-03 21:43 ` Caleb Sander Mateos
@ 2025-12-04 18:56 ` Joanne Koong
2025-12-05 16:56 ` Caleb Sander Mateos
0 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-04 18:56 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Wed, Dec 3, 2025 at 1:44 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add a new helper, io_uring_cmd_import_fixed_index(). This takes in a
> > buffer index. This requires the buffer table to have been pinned
> > beforehand. The caller is responsible for ensuring it does not use the
> > returned iter after the buffer table has been unpinned.
> >
> > This is a preparatory patch needed for fuse-over-io-uring support, as
> > the metadata for fuse requests will be stored at the last index, which
> > will be different from the sqe's buffer index.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/io_uring/cmd.h | 10 ++++++++++
> > io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
> > io_uring/rsrc.h | 2 ++
> > io_uring/uring_cmd.c | 11 +++++++++++
> > 4 files changed, 54 insertions(+)
> >
> > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > index 67331cae0a5a..b6dd62118311 100644
> > --- a/io_uring/rsrc.c
> > +++ b/io_uring/rsrc.c
> > @@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> > return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
> > }
> >
> > +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> > + u16 buf_index, int ddir, unsigned issue_flags)
> > +{
> > + struct io_ring_ctx *ctx = req->ctx;
> > + struct io_rsrc_node *node;
> > + struct io_mapped_ubuf *imu;
> > +
> > + io_ring_submit_lock(ctx, issue_flags);
> > +
> > + if (buf_index >= req->ctx->buf_table.nr ||
>
> This condition is already checked in io_rsrc_node_lookup() below.
I think we still need this check here to differentiate between -EINVAL
if buf_index is out of bounds and -EFAULT if the buf index was not out
of bounds but the lookup returned NULL.
>
> > + !(ctx->buf_table.flags & IO_RSRC_DATA_PINNED)) {
> > + io_ring_submit_unlock(ctx, issue_flags);
> > + return -EINVAL;
> > + }
> > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > index 197474911f04..e077eba00efe 100644
> > --- a/io_uring/uring_cmd.c
> > +++ b/io_uring/uring_cmd.c
> > @@ -314,6 +314,17 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
> > }
> > EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_vec);
> >
> > +int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
> > + int ddir, struct iov_iter *iter,
> > + unsigned int issue_flags)
> > +{
> > + struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> > +
> > + return io_import_reg_buf_index(req, iter, buf_index, ddir,
> > + issue_flags);
> > +}
>
> Probably would make sense to make this an inline function, since it
> immediately defers to io_import_reg_buf_index().
That makes sense to me, I'll make this change for v2.
Thanks,
Joanne
>
> Best,
> Caleb
* Re: [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection
2025-12-03 21:53 ` Caleb Sander Mateos
@ 2025-12-04 19:22 ` Joanne Koong
2025-12-04 21:57 ` Caleb Sander Mateos
0 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-04 19:22 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Wed, Dec 3, 2025 at 1:54 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Return the id of the selected buffer in io_buffer_select(). This is
> > needed for kernel-managed buffer rings to later recycle the selected
> > buffer.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > include/linux/io_uring/cmd.h | 2 +-
> > include/linux/io_uring_types.h | 2 ++
> > io_uring/kbuf.c | 7 +++++--
> > 3 files changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > index e1a75cfe57d9..dcc95e73f12f 100644
> > --- a/include/linux/io_uring_types.h
> > +++ b/include/linux/io_uring_types.h
> > @@ -109,6 +109,8 @@ struct io_br_sel {
> > void *kaddr;
> > };
> > ssize_t val;
> > + /* id of the selected buffer */
> > + unsigned buf_id;
>
> Looks like this could be unioned with val? I think val's size can be
> reduced to an int since only int values are assigned to it.
>
I'm not sure I see the advantage of this. IMO it makes the interface a
bit more confusing, as val also represents the error code and is
logically a separate entity from buf_id. I don't see struct io_br_sel
being stored anywhere where saving a few bytes seems important, but
maybe I'm missing something here.
> > };
> >
> >
> > diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> > index 8a94de6e530f..3ecb6494adea 100644
> > --- a/io_uring/kbuf.c
> > +++ b/io_uring/kbuf.c
> > @@ -239,6 +239,7 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
> > req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
> > req->buf_index = buf->bid;
> > sel.buf_list = bl;
> > + sel.buf_id = buf->bid;
>
> This is userspace mapped, so probably should be using READ_ONCE() and
> reusing the value between req->buf_index and buf->bid? Looks like an
> existing bug that the reads of buf->bid and buf->addr aren't using
> READ_ONCE().
Agreed. I think that existing bug should be its own patch. Are you
planning to submit that fix? Since you found the bug, I think you
should get authorship for it, but if it's annoying for whatever reason
and you'd prefer not to, I can roll it up into this patchset. Either
way, I'll rebase v2 on top of that and change this line to
"sel.buf_id = req->buf_index".
>
> > if (bl->flags & IOBL_KERNEL_MANAGED)
> > sel.kaddr = (void *)buf->addr;
> > else
> > @@ -262,10 +263,12 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
> >
> > bl = io_buffer_get_list(ctx, buf_group);
> > if (likely(bl)) {
> > - if (bl->flags & IOBL_BUF_RING)
> > + if (bl->flags & IOBL_BUF_RING) {
> > sel = io_ring_buffer_select(req, len, bl, issue_flags);
> > - else
> > + } else {
> > sel.addr = io_provided_buffer_select(req, len, bl);
> > + sel.buf_id = req->buf_index;
>
> Could this cover both IOBL_BUF_RING and !IOBL_BUF_RING cases to avoid
> the additional logic in io_ring_buffer_select()?
Ah, the next patch (patch 12/30: io_uring/kbuf: export
io_ring_buffer_select() [1]) exports io_ring_buffer_select() to be
called directly, so that was my motivation for having the sel.buf_id
assignment logic in io_ring_buffer_select(). I should have swapped the
ordering of those two patches to make that clearer; I'll do that for
v2.
[1] https://lore.kernel.org/linux-fsdevel/20251203003526.2889477-13-joannelkoong@gmail.com/
Thanks,
Joanne
>
> Best,
> Caleb
>
> > + }
> > }
> > io_ring_submit_unlock(req->ctx, issue_flags);
> > return sel;
> > --
> > 2.47.3
> >
* Re: [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
2025-12-04 1:24 ` Caleb Sander Mateos
@ 2025-12-04 20:07 ` Joanne Koong
2025-12-10 3:35 ` Caleb Sander Mateos
0 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-04 20:07 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Wed, Dec 3, 2025 at 5:24 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Wed, Dec 3, 2025 at 2:52 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 8:49 PM Caleb Sander Mateos
> > <csander@purestorage.com> wrote:
> > >
> > > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > Add kernel APIs to pin and unpin the buffer table for fixed buffers,
> > > > preventing userspace from unregistering or updating the fixed buffers
> > > > table while it is pinned by the kernel.
> > > >
> > > > This has two advantages:
> > > > a) Eliminating the overhead of having to fetch and construct an iter for
> > > > a fixed buffer per every cmd. Instead, the caller can pin the buffer
> > > > table, fetch/construct the iter once, and use that across cmds for
> > > > however long it needs to until it is ready to unpin the buffer table.
> > > >
> > > > b) Allowing a fixed buffer lookup at any index. The buffer table must be
> > > > pinned in order to allow this, otherwise we would have to keep track of
> > > > all the nodes that have been looked up by the io_kiocb so that we can
> > > > properly adjust the refcounts for those nodes. Ensuring that the buffer
> > > > table must first be pinned before being able to fetch a buffer at any
> > > > index makes things logistically a lot neater.
> > >
> > > Why is it necessary to pin the entire buffer table rather than
> > > specific entries? That's the purpose of the existing io_rsrc_node refs
> > > field.
> >
> > How would this work with userspace buffer unregistration (which works
> > at the table level)? If buffer unregistration should still succeed,
> > then fuse would need a way to be notified that the buffer has been
> > unregistered, since the buffer belongs to userspace (eg it would be
> > wrong for fuse to keep using the buffer even though it retains a
> > refcount on it). If buffer unregistration should fail, then we would
> > need to track the pinned state inside the node instead of relying on
> > the refs field alone, since buffers can be unregistered even while
> > there are in-flight refs (eg we would need to differentiate a ref
> > taken by a pin from an ordinary ref). I think this would also make
> > unregistration more cumbersome (eg we would have to iterate through
> > all the entries to check whether any are pinned before iterating
> > through them again to do the actual unregistration).
>
> Not sure I would say buffer unregistration operates on the table as a
> whole. Each registered buffer node is unregistered individually and
I'm looking at the liburing interface for it and I'm only seeing
io_uring_unregister_buffers() / IORING_UNREGISTER_BUFFERS, which works
on the entire table, so I'm wondering how that interface would work if
pinning/unpinning were at the entry level?
> stores its own reference count. io_put_rsrc_node() will be called on
> each buffer node in the table. However, io_put_rsrc_node() just
> removes the one reference from the buffer node. If there are other
> references on the buffer node (such as an inflight io_uring request
> using it), io_free_rsrc_node() won't be called to free the buffer node
> until all those references are dropped too. So fuse holding a
> reference on the buffer node would allow it to be unregistered, but
> prevent it from being freed until fuse dropped its reference.
> I'm not sure I understand the problem with fuse continuing to hold
> onto a registered buffer node after userspace has unregistered it from
> the buffer table. (It looks like the buffer node in question is the
Fuse holds the reference to the buffer for the lifetime of the
connection, which could be a very long time. I'm not seeing how we
could let userspace succeed in unregistering while fuse continues to
hold that reference, since, as I understand it conceptually,
unregistering the buffer should give ownership of the buffer
completely back to userspace.
> one at FUSE_URING_FIXED_HEADERS_INDEX?) Wouldn't pinning the buffer
Yep you have that right, the buffer node in question is the one at
FUSE_URING_FIXED_HEADERS_INDEX which is where all the headers for
requests are placed.
> table present similar issues? How would userspace get fuse to drop its
I don't think pinning the buffer table has a similar issue because we
disallow unregistration if it's pinned.
> pin if it wants to modify the buffer registrations? I would imagine
For the fuse use case, the server never really modifies its buffer
registrations as it sets up everything before initiating the
connection. But if it wanted to in the future, the server could send a
fuse notification to the kernel to unpin the buf table.
> the code path that calls io_uring_buf_table_unpin() currently could
> instead call into io_put_rsrc_node() (maybe by completing an io_uring
> request that has imported the registered buffer) to release its
> reference on the buffer node. For ublk, userspace can request to stop
> a ublk device or the kernel will do so automatically if userspace
> drops its file handle (e.g. if the process exits), which will release
> any io_uring resources the ublk device is using.
Fuse has something similar where the server can abort the connection,
and that will release the pin / other io uring resources.
Thanks,
Joanne
>
> >
> > >
> > > >
* Re: [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection
2025-12-04 19:22 ` Joanne Koong
@ 2025-12-04 21:57 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-04 21:57 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Thu, Dec 4, 2025 at 11:22 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 1:54 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Return the id of the selected buffer in io_buffer_select(). This is
> > > needed for kernel-managed buffer rings to later recycle the selected
> > > buffer.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > > include/linux/io_uring/cmd.h | 2 +-
> > > include/linux/io_uring_types.h | 2 ++
> > > io_uring/kbuf.c | 7 +++++--
> > > 3 files changed, 8 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > index e1a75cfe57d9..dcc95e73f12f 100644
> > > --- a/include/linux/io_uring_types.h
> > > +++ b/include/linux/io_uring_types.h
> > > @@ -109,6 +109,8 @@ struct io_br_sel {
> > > void *kaddr;
> > > };
> > > ssize_t val;
> > > + /* id of the selected buffer */
> > > + unsigned buf_id;
> >
> > Looks like this could be unioned with val? I think val's size can be
> > reduced to an int since only int values are assigned to it.
> >
>
> I'm not sure I see the advantage of this. IMO it makes the interface a
> bit more confusing, as val also represents the error code and is
> logically a separate entity from buf_id. I don't see struct io_br_sel
> being stored anywhere where saving a few bytes seems important, but
> maybe I'm missing something here.
Yeah, fair enough. Looks like it's only stored on the stack and
returned from functions. Splitting ssize_t val out of the union has
already increased its size beyond 16 bytes, so it was already too
large to fit in a register.
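For the record, assuming only the fields visible in the diff and the
kbuf.c hunks (buf_list outside the addr/kaddr union), the 64-bit
layout works out to:

    struct io_buffer_list *buf_list    8 bytes
    union { addr / kaddr }             8 bytes
    ssize_t val                        8 bytes
    unsigned buf_id                    4 bytes (+ 4 bytes padding)
                                      32 bytes total, vs 16 back when
                                      val lived in the union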
>
> > > };
> > >
> > >
> > > diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> > > index 8a94de6e530f..3ecb6494adea 100644
> > > --- a/io_uring/kbuf.c
> > > +++ b/io_uring/kbuf.c
> > > @@ -239,6 +239,7 @@ static struct io_br_sel io_ring_buffer_select(struct io_kiocb *req, size_t *len,
> > > req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
> > > req->buf_index = buf->bid;
> > > sel.buf_list = bl;
> > > + sel.buf_id = buf->bid;
> >
> > This is userspace mapped, so probably should be using READ_ONCE() and
> > reusing the value between req->buf_index and buf->bid? Looks like an
> > existing bug that the reads of buf->bid and buf->addr aren't using
> > READ_ONCE().
>
> Agreed. I think that existing bug should be its own patch. Are you
> planning to submit that fix? Since you found the bug, I think you
> should get authorship for it, but if it's annoying for whatever reason
> and you'd prefer not to, I can roll it up into this patchset. Either
> way, I'll rebase v2 on top of that and change this line to
> "sel.buf_id = req->buf_index".
Sure, happy to send a patch.
>
> >
> > > if (bl->flags & IOBL_KERNEL_MANAGED)
> > > sel.kaddr = (void *)buf->addr;
> > > else
> > > @@ -262,10 +263,12 @@ struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len,
> > >
> > > bl = io_buffer_get_list(ctx, buf_group);
> > > if (likely(bl)) {
> > > - if (bl->flags & IOBL_BUF_RING)
> > > + if (bl->flags & IOBL_BUF_RING) {
> > > sel = io_ring_buffer_select(req, len, bl, issue_flags);
> > > - else
> > > + } else {
> > > sel.addr = io_provided_buffer_select(req, len, bl);
> > > + sel.buf_id = req->buf_index;
> >
> > Could this cover both IOBL_BUF_RING and !IOBL_BUF_RING cases to avoid
> > the additional logic in io_ring_buffer_select()?
>
> Ah, the next patch (patch 12/30: io_uring/kbuf: export
> io_ring_buffer_select() [1]) exports io_ring_buffer_select() to be
> called directly, so that was my motivation for having the sel.buf_id
> assignment logic in io_ring_buffer_select(). I should have swapped the
> ordering of those two patches to make that clearer; I'll do that for
> v2.
Makes sense. No need to switch the order of the patches.
Best,
Caleb
* Re: [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
2025-12-04 18:56 ` Joanne Koong
@ 2025-12-05 16:56 ` Caleb Sander Mateos
2025-12-05 23:28 ` Joanne Koong
0 siblings, 1 reply; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-05 16:56 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Thu, Dec 4, 2025 at 10:56 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 1:44 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Add a new helper, io_uring_cmd_import_fixed_index(). This takes in a
> > > buffer index. This requires the buffer table to have been pinned
> > > beforehand. The caller is responsible for ensuring it does not use the
> > > returned iter after the buffer table has been unpinned.
> > >
> > > This is a preparatory patch needed for fuse-over-io-uring support, as
> > > the metadata for fuse requests will be stored at the last index, which
> > > will be different from the sqe's buffer index.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > > include/linux/io_uring/cmd.h | 10 ++++++++++
> > > io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
> > > io_uring/rsrc.h | 2 ++
> > > io_uring/uring_cmd.c | 11 +++++++++++
> > > 4 files changed, 54 insertions(+)
> > >
> > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > index 67331cae0a5a..b6dd62118311 100644
> > > --- a/io_uring/rsrc.c
> > > +++ b/io_uring/rsrc.c
> > > @@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> > > return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
> > > }
> > >
> > > +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> > > + u16 buf_index, int ddir, unsigned issue_flags)
> > > +{
> > > + struct io_ring_ctx *ctx = req->ctx;
> > > + struct io_rsrc_node *node;
> > > + struct io_mapped_ubuf *imu;
> > > +
> > > + io_ring_submit_lock(ctx, issue_flags);
> > > +
> > > + if (buf_index >= req->ctx->buf_table.nr ||
> >
> > This condition is already checked in io_rsrc_node_lookup() below.
>
> I think we still need this check here to differentiate between -EINVAL
> if buf_index is out of bounds and -EFAULT if the buf index was not out
> of bounds but the lookup returned NULL.
Is there a reason you prefer EINVAL over EFAULT? EFAULT seems
consistent with the errors returned from registered buffer lookups in
other cases.
Best,
Caleb
>
> >
> > > + !(ctx->buf_table.flags & IO_RSRC_DATA_PINNED)) {
> > > + io_ring_submit_unlock(ctx, issue_flags);
> > > + return -EINVAL;
> > > + }
> > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > index 197474911f04..e077eba00efe 100644
> > > --- a/io_uring/uring_cmd.c
> > > +++ b/io_uring/uring_cmd.c
> > > @@ -314,6 +314,17 @@ int io_uring_cmd_import_fixed_vec(struct io_uring_cmd *ioucmd,
> > > }
> > > EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed_vec);
> > >
> > > +int io_uring_cmd_import_fixed_index(struct io_uring_cmd *ioucmd, u16 buf_index,
> > > + int ddir, struct iov_iter *iter,
> > > + unsigned int issue_flags)
> > > +{
> > > + struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> > > +
> > > + return io_import_reg_buf_index(req, iter, buf_index, ddir,
> > > + issue_flags);
> > > +}
> >
> > Probably would make sense to make this an inline function, since it
> > immediately defers to io_import_reg_buf_index().
>
> That makes sense to me, I'll make this change for v2.
>
> Thanks,
> Joanne
>
> >
> > Best,
> > Caleb
* Re: [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
2025-12-05 16:56 ` Caleb Sander Mateos
@ 2025-12-05 23:28 ` Joanne Koong
2025-12-11 2:57 ` Caleb Sander Mateos
0 siblings, 1 reply; 51+ messages in thread
From: Joanne Koong @ 2025-12-05 23:28 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Fri, Dec 5, 2025 at 8:56 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Thu, Dec 4, 2025 at 10:56 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Dec 3, 2025 at 1:44 PM Caleb Sander Mateos
> > <csander@purestorage.com> wrote:
> > >
> > > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > Add a new helper, io_uring_cmd_import_fixed_index(). This takes in a
> > > > buffer index. This requires the buffer table to have been pinned
> > > > beforehand. The caller is responsible for ensuring it does not use the
> > > > returned iter after the buffer table has been unpinned.
> > > >
> > > > This is a preparatory patch needed for fuse-over-io-uring support, as
> > > > the metadata for fuse requests will be stored at the last index, which
> > > > will be different from the sqe's buffer index.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > > include/linux/io_uring/cmd.h | 10 ++++++++++
> > > > io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
> > > > io_uring/rsrc.h | 2 ++
> > > > io_uring/uring_cmd.c | 11 +++++++++++
> > > > 4 files changed, 54 insertions(+)
> > > >
> > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > index 67331cae0a5a..b6dd62118311 100644
> > > > --- a/io_uring/rsrc.c
> > > > +++ b/io_uring/rsrc.c
> > > > @@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> > > > return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
> > > > }
> > > >
> > > > +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> > > > + u16 buf_index, int ddir, unsigned issue_flags)
> > > > +{
> > > > + struct io_ring_ctx *ctx = req->ctx;
> > > > + struct io_rsrc_node *node;
> > > > + struct io_mapped_ubuf *imu;
> > > > +
> > > > + io_ring_submit_lock(ctx, issue_flags);
> > > > +
> > > > + if (buf_index >= req->ctx->buf_table.nr ||
> > >
> > > This condition is already checked in io_rsrc_node_lookup() below.
> >
> > I think we still need this check here to differentiate between -EINVAL
> > if buf_index is out of bounds and -EFAULT if the buf index was not out
> > of bounds but the lookup returned NULL.
>
> Is there a reason you prefer EINVAL over EFAULT? EFAULT seems
> consistent with the errors returned from registered buffer lookups in
> other cases.
To me -EINVAL makes sense because the error stems from the user
passing in an invalid argument (e.g. a buffer index that exceeds the
number of buffers registered to the table). The comment in
errno-base.h for EINVAL is "Invalid argument". The -EFAULT used in the
other cases (e.g. io_import_reg_buf()) makes sense because there it
may be that, for whatever reason, the req->buf_index isn't found in
the table even though the index itself wasn't invalid.
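Concretely, a minimal sketch of the two paths (assuming the part of
the helper that was elided from the quoted diff does a node lookup
like the other registered-buffer import paths):

	io_ring_submit_lock(ctx, issue_flags);

	/* index beyond the registered table (or table not pinned):
	 * invalid argument from the caller
	 */
	if (buf_index >= ctx->buf_table.nr ||
	    !(ctx->buf_table.flags & IO_RSRC_DATA_PINNED)) {
		io_ring_submit_unlock(ctx, issue_flags);
		return -EINVAL;
	}

	/* index was in range but no buffer is registered in that slot */
	node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
	if (!node) {
		io_ring_submit_unlock(ctx, issue_flags);
		return -EFAULT;
	}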
Thanks,
Joanne
>
> Best,
> Caleb
>
* Re: [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec()
2025-12-03 0:35 ` [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec() Joanne Koong
@ 2025-12-07 8:33 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-07 8:33 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Changes:
> - Rename io_buffer_register_bvec() to io_buffer_register_request()
> - Rename io_buffer_unregister_bvec() to io_buffer_unregister()
> - Add cmd wrappers for io_buffer_register_request() and
> io_buffer_unregister() for ublk to use
I agree these names seem clearer.
>
> This is in preparation for supporting kernel-populated buffers in fuse
> io-uring, which will need to register bvecs directly (not through a
> block-based request) and will need to do unregistration through an
> io_ring_ctx directly.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> Documentation/block/ublk.rst | 15 ++++++++-------
> drivers/block/ublk_drv.c | 20 +++++++++++---------
> include/linux/io_uring/cmd.h | 13 ++++++++-----
> io_uring/rsrc.c | 14 +++++---------
> io_uring/rsrc.h | 7 +++++++
> io_uring/uring_cmd.c | 21 +++++++++++++++++++++
> 6 files changed, 60 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 8c4030bcabb6..1546477e768b 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -326,16 +326,17 @@ Zero copy
> ---------
>
> ublk zero copy relies on io_uring's fixed kernel buffer, which provides
> -two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec`.
> +two APIs: `io_uring_cmd_buffer_register_request()` and
> +`io_uring_cmd_buffer_unregister`.
>
> ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call
> -`io_buffer_register_bvec()` for ublk server to register client request
> -buffer into io_uring buffer table, then ublk server can submit io_uring
> +`io_uring_cmd_buffer_register_request()` for ublk server to register client
> +request buffer into io_uring buffer table, then ublk server can submit io_uring
> IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF`
> -calls `io_buffer_unregister_bvec()` to unregister the buffer, which is
> -guaranteed to be live between calling `io_buffer_register_bvec()` and
> -`io_buffer_unregister_bvec()`. Any io_uring operation which supports this
> -kind of kernel buffer will grab one reference of the buffer until the
> +calls `io_uring_cmd_buffer_unregister()` to unregister the buffer, which is
> +guaranteed to be live between calling `io_uring_cmd_buffer_register_request()`
> +and `io_uring_cmd_buffer_unregister()`. Any io_uring operation which supports
> +this kind of kernel buffer will grab one reference of the buffer until the
> operation is completed.
>
> ublk server implementing zero copy or user copy has to be CAP_SYS_ADMIN and
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index e0c601128efa..d671d08533c9 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1246,8 +1246,9 @@ static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
> {
> int ret;
>
> - ret = io_buffer_register_bvec(io->cmd, req, ublk_io_release,
> - io->buf.index, issue_flags);
> + ret = io_uring_cmd_buffer_register_request(io->cmd, req,
> + ublk_io_release,
> + io->buf.index, issue_flags);
> if (ret) {
> if (io->buf.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
> ublk_auto_buf_reg_fallback(ubq, io);
> @@ -2204,8 +2205,8 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
> if (!req)
> return -EINVAL;
>
> - ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
> - issue_flags);
> + ret = io_uring_cmd_buffer_register_request(cmd, req, ublk_io_release,
> + index, issue_flags);
> if (ret) {
> ublk_put_req_ref(io, req);
> return ret;
> @@ -2236,8 +2237,8 @@ ublk_daemon_register_io_buf(struct io_uring_cmd *cmd,
> if (!ublk_dev_support_zero_copy(ub) || !ublk_rq_has_data(req))
> return -EINVAL;
>
> - ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
> - issue_flags);
> + ret = io_uring_cmd_buffer_register_request(cmd, req, ublk_io_release,
> + index, issue_flags);
> if (ret)
> return ret;
>
> @@ -2252,7 +2253,7 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> if (!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY))
> return -EINVAL;
>
> - return io_buffer_unregister_bvec(cmd, index, issue_flags);
> + return io_uring_cmd_buffer_unregister(cmd, index, issue_flags);
> }
>
> static int ublk_check_fetch_buf(const struct ublk_device *ub, __u64 buf_addr)
> @@ -2386,7 +2387,7 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
> goto out;
>
> /*
> - * io_buffer_unregister_bvec() doesn't access the ubq or io,
> + * io_uring_cmd_buffer_unregister() doesn't access the ubq or io,
> * so no need to validate the q_id, tag, or task
> */
> if (_IOC_NR(cmd_op) == UBLK_IO_UNREGISTER_IO_BUF)
> @@ -2456,7 +2457,8 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
>
> /* can't touch 'ublk_io' any more */
> if (buf_idx != UBLK_INVALID_BUF_IDX)
> - io_buffer_unregister_bvec(cmd, buf_idx, issue_flags);
> + io_uring_cmd_buffer_unregister(cmd, buf_idx,
> + issue_flags);
> if (req_op(req) == REQ_OP_ZONE_APPEND)
> req->__sector = addr;
> if (compl)
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 795b846d1e11..fc956f8f7ed2 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -185,10 +185,13 @@ static inline void io_uring_cmd_done32(struct io_uring_cmd *ioucmd, s32 ret,
> return __io_uring_cmd_done(ioucmd, ret, res2, issue_flags, true);
> }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
> - void (*release)(void *), unsigned int index,
> - unsigned int issue_flags);
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
> - unsigned int issue_flags);
> +int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
> + struct request *rq,
> + void (*release)(void *),
> + unsigned int index,
> + unsigned int issue_flags);
> +
> +int io_uring_cmd_buffer_unregister(struct io_uring_cmd *cmd, unsigned int index,
> + unsigned int issue_flags);
>
> #endif /* _LINUX_IO_URING_CMD_H */
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index b6dd62118311..59cafe63d187 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -941,11 +941,10 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
> return ret;
> }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
> - void (*release)(void *), unsigned int index,
> - unsigned int issue_flags)
> +int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
> + void (*release)(void *), unsigned int index,
> + unsigned int issue_flags)
> {
> - struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> struct io_rsrc_data *data = &ctx->buf_table;
> struct req_iterator rq_iter;
> struct io_mapped_ubuf *imu;
> @@ -1003,12 +1002,10 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
> io_ring_submit_unlock(ctx, issue_flags);
> return ret;
> }
> -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
>
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
> - unsigned int issue_flags)
> +int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
> + unsigned int issue_flags)
> {
> - struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> struct io_rsrc_data *data = &ctx->buf_table;
> struct io_rsrc_node *node;
> int ret = 0;
> @@ -1036,7 +1033,6 @@ int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
> io_ring_submit_unlock(ctx, issue_flags);
> return ret;
> }
> -EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
>
> static int validate_fixed_range(u64 buf_addr, size_t len,
> const struct io_mapped_ubuf *imu)
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index 658934f4d3ff..d1ca33f3319a 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -91,6 +91,13 @@ int io_validate_user_buf_range(u64 uaddr, u64 ulen);
> bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
> struct io_imu_folio_data *data);
>
> +int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
> + void (*release)(void *), unsigned int index,
> + unsigned int issue_flags);
> +
> +int io_buffer_unregister(struct io_ring_ctx *ctx, unsigned int index,
> + unsigned int issue_flags);
> +
> static inline struct io_rsrc_node *io_rsrc_node_lookup(struct io_rsrc_data *data,
> int index)
> {
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 3eb10bbba177..3922ac86b481 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -383,6 +383,27 @@ struct io_br_sel io_uring_cmd_buffer_select(struct io_uring_cmd *ioucmd,
> }
> EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_select);
>
> +int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
> + struct request *rq,
> + void (*release)(void *),
> + unsigned int index,
> + unsigned int issue_flags)
> +{
> + struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +
> + return io_buffer_register_request(ctx, rq, release, index, issue_flags);
> +}
> +EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_register_request);
> +
> +int io_uring_cmd_buffer_unregister(struct io_uring_cmd *cmd, unsigned int index,
> + unsigned int issue_flags)
> +{
> + struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +
> + return io_buffer_unregister(ctx, index, issue_flags);
> +}
> +EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_unregister);
It would be nice to avoid these additional function calls that can't
be inlined. I guess we probably don't want to include the
io_uring-internal header io_uring/rsrc.h in the external header
linux/io_uring/cmd.h, which is probably why the functions were
declared in linux/io_uring/cmd.h but defined in io_uring/rsrc.c
previously. Maybe it would make sense to move the definitions of
io_uring_cmd_buffer_register_request() and
io_uring_cmd_buffer_unregister() to io_uring/rsrc.c so
io_buffer_register_request()/io_buffer_unregister() can be inlined
into them?
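Something like this, as a sketch (same signature as in the quoted
diff, just defined in io_uring/rsrc.c instead of io_uring/uring_cmd.c):

	/* io_uring/rsrc.c: lets the compiler inline
	 * io_buffer_register_request() into the exported wrapper
	 */
	int io_uring_cmd_buffer_register_request(struct io_uring_cmd *cmd,
						 struct request *rq,
						 void (*release)(void *),
						 unsigned int index,
						 unsigned int issue_flags)
	{
		struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;

		return io_buffer_register_request(ctx, rq, release, index,
						  issue_flags);
	}
	EXPORT_SYMBOL_GPL(io_uring_cmd_buffer_register_request);

(and likewise for io_uring_cmd_buffer_unregister()).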
Best,
Caleb
> +
> /*
> * Return true if this multishot uring_cmd needs to be completed, otherwise
> * the event CQE is posted successfully.
> --
> 2.47.3
>
* Re: [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic
2025-12-03 0:35 ` [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic Joanne Koong
@ 2025-12-07 8:41 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-07 8:41 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Split the main initialization logic in io_buffer_register_request() into
> a helper function.
>
> This is a preparatory patch for supporting kernel-populated buffers in
> fuse io-uring, which will be reusing this logic.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> io_uring/rsrc.c | 80 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 48 insertions(+), 32 deletions(-)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 59cafe63d187..18abba6f6b86 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -941,63 +941,79 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
> return ret;
> }
>
> -int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
> - void (*release)(void *), unsigned int index,
> - unsigned int issue_flags)
> +static int io_buffer_init(struct io_ring_ctx *ctx, unsigned int nr_bvecs,
Consider adding "kernel" somewhere in the name to distinguish this
from the userspace registered buffer initialization
> + unsigned int total_bytes, u8 dir,
> + void (*release)(void *), void *priv,
> + unsigned int index)
> {
> struct io_rsrc_data *data = &ctx->buf_table;
> - struct req_iterator rq_iter;
> struct io_mapped_ubuf *imu;
> struct io_rsrc_node *node;
> - struct bio_vec bv;
> - unsigned int nr_bvecs = 0;
> - int ret = 0;
>
> - io_ring_submit_lock(ctx, issue_flags);
> - if (index >= data->nr) {
> - ret = -EINVAL;
> - goto unlock;
> - }
> + if (index >= data->nr)
> + return -EINVAL;
> index = array_index_nospec(index, data->nr);
>
> - if (data->nodes[index]) {
> - ret = -EBUSY;
> - goto unlock;
> - }
> + if (data->nodes[index])
> + return -EBUSY;
>
> node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
> - if (!node) {
> - ret = -ENOMEM;
> - goto unlock;
> - }
> + if (!node)
> + return -ENOMEM;
>
> - /*
> - * blk_rq_nr_phys_segments() may overestimate the number of bvecs
> - * but avoids needing to iterate over the bvecs
> - */
> - imu = io_alloc_imu(ctx, blk_rq_nr_phys_segments(rq));
> + imu = io_alloc_imu(ctx, nr_bvecs);
> if (!imu) {
> kfree(node);
> - ret = -ENOMEM;
> - goto unlock;
> + return -ENOMEM;
> }
>
> imu->ubuf = 0;
> - imu->len = blk_rq_bytes(rq);
> + imu->len = total_bytes;
> imu->acct_pages = 0;
> imu->folio_shift = PAGE_SHIFT;
> + imu->nr_bvecs = nr_bvecs;
> refcount_set(&imu->refs, 1);
> imu->release = release;
> - imu->priv = rq;
> + imu->priv = priv;
> imu->is_kbuf = true;
> - imu->dir = 1 << rq_data_dir(rq);
> + imu->dir = 1 << dir;
> +
> + node->buf = imu;
> + data->nodes[index] = node;
> +
> + return 0;
> +}
> +
> +int io_buffer_register_request(struct io_ring_ctx *ctx, struct request *rq,
> + void (*release)(void *), unsigned int index,
> + unsigned int issue_flags)
> +{
> + struct req_iterator rq_iter;
> + struct io_mapped_ubuf *imu;
> + struct bio_vec bv;
> + unsigned int nr_bvecs;
> + unsigned int total_bytes;
> + int ret;
> +
> + io_ring_submit_lock(ctx, issue_flags);
> +
> + /*
> + * blk_rq_nr_phys_segments() may overestimate the number of bvecs
> + * but avoids needing to iterate over the bvecs
> + */
> + nr_bvecs = blk_rq_nr_phys_segments(rq);
> + total_bytes = blk_rq_bytes(rq);
These could be initialized before io_ring_submit_lock()
> + ret = io_buffer_init(ctx, nr_bvecs, total_bytes, rq_data_dir(rq), release, rq,
> + index);
> + if (ret)
> + goto unlock;
>
> + imu = ctx->buf_table.nodes[index]->buf;
It would be nice to avoid all these additional dereferences. Could
io_buffer_init() return the struct io_mapped_ubuf *, using ERR_PTR()
to return any error code?
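i.e. a sketch of what the caller side could then look like, with
io_buffer_init() returning the imu on success and e.g.
ERR_PTR(-EINVAL) / ERR_PTR(-ENOMEM) on failure:

	imu = io_buffer_init(ctx, nr_bvecs, total_bytes, rq_data_dir(rq),
			     release, rq, index);
	if (IS_ERR(imu)) {
		ret = PTR_ERR(imu);
		goto unlock;
	}

which avoids the ctx->buf_table.nodes[index]->buf dereference below.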
Best,
Caleb
> + nr_bvecs = 0;
> rq_for_each_bvec(bv, rq, rq_iter)
> imu->bvec[nr_bvecs++] = bv;
> imu->nr_bvecs = nr_bvecs;
>
> - node->buf = imu;
> - data->nodes[index] = node;
> unlock:
> io_ring_submit_unlock(ctx, issue_flags);
> return ret;
> --
> 2.47.3
>
* Re: [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional
2025-12-03 0:35 ` [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional Joanne Koong
@ 2025-12-07 8:42 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-07 8:42 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Tue, Dec 2, 2025 at 4:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> This is a preparatory patch for supporting kernel-populated buffers in
> fuse io-uring, which does not need a release callback.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
> ---
> io_uring/rsrc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 18abba6f6b86..a5605c35d857 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -149,7 +149,8 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
>
> if (imu->acct_pages)
> io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
> - imu->release(imu->priv);
> + if (imu->release)
> + imu->release(imu->priv);
> io_free_imu(ctx, imu);
> }
>
> --
> 2.47.3
>
* Re: [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning
2025-12-04 20:07 ` Joanne Koong
@ 2025-12-10 3:35 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-10 3:35 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Thu, Dec 4, 2025 at 12:07 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 5:24 PM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Wed, Dec 3, 2025 at 2:52 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Tue, Dec 2, 2025 at 8:49 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > Add kernel APIs to pin and unpin the buffer table for fixed buffers,
> > > > > preventing userspace from unregistering or updating the fixed buffers
> > > > > table while it is pinned by the kernel.
> > > > >
> > > > > This has two advantages:
> > > > > a) Eliminating the overhead of having to fetch and construct an iter for
> > > > > a fixed buffer per every cmd. Instead, the caller can pin the buffer
> > > > > table, fetch/construct the iter once, and use that across cmds for
> > > > > however long it needs to until it is ready to unpin the buffer table.
> > > > >
> > > > > b) Allowing a fixed buffer lookup at any index. The buffer table must be
> > > > > pinned in order to allow this, otherwise we would have to keep track of
> > > > > all the nodes that have been looked up by the io_kiocb so that we can
> > > > > properly adjust the refcounts for those nodes. Ensuring that the buffer
> > > > > table must first be pinned before being able to fetch a buffer at any
> > > > > index makes things logistically a lot neater.
> > > >
> > > > Why is it necessary to pin the entire buffer table rather than
> > > > specific entries? That's the purpose of the existing io_rsrc_node refs
> > > > field.
> > >
> > > How would this work with userspace buffer unregistration (which works
> > > at the table level)? If buffer unregistration should still succeed
> > > then fuse would need a way to be notified that the buffer has been
> > > unregistered since the buffer belongs to userspace (eg it would be
> > > wrong if fuse continues using it even though fuse retains a refcount
> > > on it). If buffer unregistration should fail, then we would need to
> > > track this pinned state inside the node instead of relying just on the
> > > refs field, as buffers can be unregistered even if there are in-flight
> > > refs (eg we would need to differentiate the ref being from a pin vs
> > > from not a pin), and I think this would make unregistration more
> > > cumbersome as well (eg we would have to iterate through all the
> > > entries looking to see if any are pinned before iterating through them
> > > again to do the actual unregistration).
> >
> > Not sure I would say buffer unregistration operates on the table as a
> > whole. Each registered buffer node is unregistered individually and
>
I'm looking at the liburing interface for it and I'm only seeing
io_uring_unregister_buffers() / IORING_UNREGISTER_BUFFERS, which works
on the entire table, so I'm wondering how that interface would work if
pinning/unpinning were at the entry level.
IORING_REGISTER_BUFFERS_UPDATE can be used to update individual
registered buffers. For each updated slot, if there is an existing
buffer, it will be unregistered (decrementing the buffer node's
reference count). A new buffer may or may not be registered in its
place; it can be skipped by specifying a struct iovec with a NULL
iov_base.
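For example (a sketch, assuming an initialized struct io_uring ring,
a new_buf/new_len buffer, and liburing's
io_uring_register_buffers_update_tag() wrapper for
IORING_REGISTER_BUFFERS_UPDATE), this replaces the buffer in slot 3
and clears slot 4 without registering a replacement:

	struct iovec iovs[2] = {
		/* slot 3: old buffer unregistered, new_buf registered */
		{ .iov_base = new_buf, .iov_len = new_len },
		/* slot 4: old buffer unregistered, nothing in its place */
		{ .iov_base = NULL, .iov_len = 0 },
	};
	__u64 tags[2] = { 0, 0 };
	int ret = io_uring_register_buffers_update_tag(&ring, 3, iovs,
						       tags, 2);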
>
> > stores its own reference count. io_put_rsrc_node() will be called on
> > each buffer node in the table. However, io_put_rsrc_node() just
> > removes the one reference from the buffer node. If there are other
> > references on the buffer node (such as an inflight io_uring request
> > using it), io_free_rsrc_node() won't be called to free the buffer node
> > until all those references are dropped too. So fuse holding a
> > reference on the buffer node would allow it to be unregistered, but
> > prevent it from being freed until fuse dropped its reference.
> > I'm not sure I understand the problem with fuse continuing to hold
> > onto a registered buffer node after userspace has unregistered it from
> > the buffer table. (It looks like the buffer node in question is the
>
> For fuse, it holds the reference to the buffer for the lifetime of the
> connection, which could be a very long time. I'm not seeing how we
> could let userspace succeed in unregistering with fuse continuing to
> hold that reference, since as I understand it conceptually,
> unregistering the buffer should give ownership of the buffer
> completely back to userspace.
I'm not quite sure what you mean by "give ownership of the buffer
completely back to userspace". My understanding is that registering a
buffer with io_uring just pins the physical pages as a perf
optimization. I'm not aware of a way for userspace to observe directly
whether or not certain physical pages are pinned. There's already no
guarantee that the physical pages are unpinned as soon as a buffer is
unregistered; if there are any inflight io_uring requests using the
registered buffer, they will continue to hold a reference count on the
buffer, preventing it from being released. The only guarantee is that
the unregistered slot in the buffer table is now empty.
Presumably fuse must release its reference count (or pin) eventually;
otherwise there would be a resource leak? I don't see an issue with
holding references to registered buffers for the lifetime of a fuse
connection, as long as there's a way for the fuse server to tell the
kernel to release those resources once they are no longer needed
(which it sounds like already exists, from your description of
aborting a fuse connection).
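A rough sketch of that lifetime in terms of the node's refs field
(simplified pseudo-flow, not the exact call sites):

	node->refs = 1;	/* registration: the table slot holds one ref */
	node->refs++;	/* fuse takes its own long-lived ref */

	/* userspace unregisters the slot: the table's ref is dropped,
	 * but the node survives because fuse still holds one
	 */
	io_put_rsrc_node(ctx, node);

	/* fuse drops its ref (e.g. on connection abort): the last ref
	 * goes away and io_free_rsrc_node() releases the buffer
	 */
	io_put_rsrc_node(ctx, node);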
>
> > one at FUSE_URING_FIXED_HEADERS_INDEX?) Wouldn't pinning the buffer
>
> Yep you have that right, the buffer node in question is the one at
> FUSE_URING_FIXED_HEADERS_INDEX which is where all the headers for
> requests are placed.
>
> > table present similar issues? How would userspace get fuse to drop its
>
> I don't think pinning the buffer table has a similar issue because we
> disallow unregistration if it's pinned.
It sounds like you're saying that buffer unregistration just isn't
expected to work together with fuse's use of registered buffers, is
that accurate? Does it matter then whether the buffer unregistration
returns -EBUSY because the buffer table is pinned vs. succeeding but
not actually releasing the buffer resource because fuse still holds a
reference on it? My preference would be to just use the existing
reference counting mechanism rather than introducing another mechanism
if both provide safety equally well.
>
> > pin if it wants to modify the buffer registrations? I would imagine
>
> For the fuse use case, the server never really modifies its buffer
> registrations as it sets up everything before initiating the
> connection. But if it wanted to in the future, the server could send a
> fuse notification to the kernel to unpin the buf table.
This seems to assume that all the registered buffers are used
exclusively for fuse purposes. But it seems plausible that a fuse
server might want to use io_uring for other non-fuse purposes and want
to use registered buffers there too as a perf optimization? Then it
would be nice for the process to be able to update the other
registered buffers even while the kernel pins (or holds a reference
count on) the ones used for fuse.
Best,
Caleb
>
> > the code path that calls io_uring_buf_table_unpin() currently could
> > instead call into io_put_rsrc_node() (maybe by completing an io_uring
> > request that has imported the registered buffer) to release its
> > reference on the buffer node. For ublk, userspace can request to stop
> > a ublk device or the kernel will do so automatically if userspace
> > drops its file handle (e.g. if the process exits), which will release
> > any io_uring resources the ublk device is using.
>
> Fuse has something similar where the server can abort the connection,
> and that will release the pin / other io uring resources.
>
> Thanks,
> Joanne
>
> >
> > >
> > > >
> > > > >
* Re: [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index()
2025-12-05 23:28 ` Joanne Koong
@ 2025-12-11 2:57 ` Caleb Sander Mateos
0 siblings, 0 replies; 51+ messages in thread
From: Caleb Sander Mateos @ 2025-12-11 2:57 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, axboe, bschubert, asml.silence, io-uring, xiaobing.li,
linux-fsdevel
On Fri, Dec 5, 2025 at 3:28 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Dec 5, 2025 at 8:56 AM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Thu, Dec 4, 2025 at 10:56 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Dec 3, 2025 at 1:44 PM Caleb Sander Mateos
> > > <csander@purestorage.com> wrote:
> > > >
> > > > On Tue, Dec 2, 2025 at 4:36 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > Add a new helper, io_uring_cmd_import_fixed_index(). This takes in a
> > > > > buffer index. This requires the buffer table to have been pinned
> > > > > beforehand. The caller is responsible for ensuring it does not use the
> > > > > returned iter after the buffer table has been unpinned.
> > > > >
> > > > > This is a preparatory patch needed for fuse-over-io-uring support, as
> > > > > the metadata for fuse requests will be stored at the last index, which
> > > > > will be different from the sqe's buffer index.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > ---
> > > > > include/linux/io_uring/cmd.h | 10 ++++++++++
> > > > > io_uring/rsrc.c | 31 +++++++++++++++++++++++++++++++
> > > > > io_uring/rsrc.h | 2 ++
> > > > > io_uring/uring_cmd.c | 11 +++++++++++
> > > > > 4 files changed, 54 insertions(+)
> > > > >
> > > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > > index 67331cae0a5a..b6dd62118311 100644
> > > > > --- a/io_uring/rsrc.c
> > > > > +++ b/io_uring/rsrc.c
> > > > > @@ -1156,6 +1156,37 @@ int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
> > > > > return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
> > > > > }
> > > > >
> > > > > +int io_import_reg_buf_index(struct io_kiocb *req, struct iov_iter *iter,
> > > > > + u16 buf_index, int ddir, unsigned issue_flags)
> > > > > +{
> > > > > + struct io_ring_ctx *ctx = req->ctx;
> > > > > + struct io_rsrc_node *node;
> > > > > + struct io_mapped_ubuf *imu;
> > > > > +
> > > > > + io_ring_submit_lock(ctx, issue_flags);
> > > > > +
> > > > > + if (buf_index >= req->ctx->buf_table.nr ||
> > > >
> > > > This condition is already checked in io_rsrc_node_lookup() below.
> > >
> > > I think we still need this check here to differentiate between -EINVAL
> > > if buf_index is out of bounds and -EFAULT if the buf index was not out
> > > of bounds but the lookup returned NULL.
> >
> > Is there a reason you prefer EINVAL over EFAULT? EFAULT seems
> > consistent with the errors returned from registered buffer lookups in
> > other cases.
>
> To me -EINVAL makes sense because the error stems from the user
> passing in an invalid argument (e.g. a buffer index that exceeds the
> number of buffers registered to the table). The comment in
> errno-base.h for EINVAL is "Invalid argument". The -EFAULT used in the
> other cases (e.g. io_import_reg_buf()) makes sense because there it
> may be that, for whatever reason, the req->buf_index isn't found in
> the table even though the index itself wasn't invalid.
req->buf_index generally comes from the buf_index field of the
io_uring SQE, so you could make the same argument about EINVAL making
sense for other failed buffer lookups. I don't feel strongly either
way, but it seems a bit more consistent (and less code) to just
propagate EFAULT from io_rsrc_node_lookup().
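i.e. (as a sketch) the index handling could collapse to just:

	node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
	if (!node) {
		io_ring_submit_unlock(ctx, issue_flags);
		/* same error for an out-of-range index and an empty slot */
		return -EFAULT;
	}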
Best,
Caleb
end of thread, other threads: [~2025-12-11 2:58 UTC | newest]
Thread overview: 51+ messages
2025-12-03 0:34 [PATCH v1 00/30] fuse/io-uring: add kernel-managed buffer rings and zero-copy Joanne Koong
2025-12-03 0:34 ` [PATCH v1 01/30] io_uring/kbuf: refactor io_buf_pbuf_register() logic into generic helpers Joanne Koong
2025-12-03 0:34 ` [PATCH v1 02/30] io_uring/kbuf: rename io_unregister_pbuf_ring() to io_unregister_buf_ring() Joanne Koong
2025-12-03 0:34 ` [PATCH v1 03/30] io_uring/kbuf: add support for kernel-managed buffer rings Joanne Koong
2025-12-03 0:34 ` [PATCH v1 04/30] io_uring/kbuf: add mmap " Joanne Koong
2025-12-03 0:35 ` [PATCH v1 05/30] io_uring/kbuf: support kernel-managed buffer rings in buffer selection Joanne Koong
2025-12-03 0:35 ` [PATCH v1 06/30] io_uring/kbuf: add buffer ring pinning/unpinning Joanne Koong
2025-12-03 4:13 ` Caleb Sander Mateos
2025-12-04 18:41 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 07/30] io_uring/rsrc: add fixed buffer table pinning/unpinning Joanne Koong
2025-12-03 4:49 ` Caleb Sander Mateos
2025-12-03 22:52 ` Joanne Koong
2025-12-04 1:24 ` Caleb Sander Mateos
2025-12-04 20:07 ` Joanne Koong
2025-12-10 3:35 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 08/30] io_uring/kbuf: add recycling for pinned kernel managed buffer rings Joanne Koong
2025-12-03 0:35 ` [PATCH v1 09/30] io_uring: add io_uring_cmd_import_fixed_index() Joanne Koong
2025-12-03 21:43 ` Caleb Sander Mateos
2025-12-04 18:56 ` Joanne Koong
2025-12-05 16:56 ` Caleb Sander Mateos
2025-12-05 23:28 ` Joanne Koong
2025-12-11 2:57 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 10/30] io_uring/kbuf: add io_uring_is_kmbuf_ring() Joanne Koong
2025-12-03 0:35 ` [PATCH v1 11/30] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
2025-12-03 21:53 ` Caleb Sander Mateos
2025-12-04 19:22 ` Joanne Koong
2025-12-04 21:57 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 12/30] io_uring/kbuf: export io_ring_buffer_select() Joanne Koong
2025-12-03 0:35 ` [PATCH v1 13/30] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
2025-12-03 0:35 ` [PATCH v1 14/30] io_uring: add release callback for ring death Joanne Koong
2025-12-03 22:25 ` Caleb Sander Mateos
2025-12-03 22:54 ` Joanne Koong
2025-12-03 0:35 ` [PATCH v1 15/30] fuse: refactor io-uring logic for getting next fuse request Joanne Koong
2025-12-03 0:35 ` [PATCH v1 16/30] fuse: refactor io-uring header copying to ring Joanne Koong
2025-12-03 0:35 ` [PATCH v1 17/30] fuse: refactor io-uring header copying from ring Joanne Koong
2025-12-03 0:35 ` [PATCH v1 18/30] fuse: use enum types for header copying Joanne Koong
2025-12-03 0:35 ` [PATCH v1 19/30] fuse: refactor setting up copy state for payload copying Joanne Koong
2025-12-03 0:35 ` [PATCH v1 20/30] fuse: support buffer copying for kernel addresses Joanne Koong
2025-12-03 0:35 ` [PATCH v1 21/30] fuse: add io-uring kernel-managed buffer ring Joanne Koong
2025-12-03 0:35 ` [PATCH v1 22/30] io_uring/rsrc: refactor io_buffer_register_bvec()/io_buffer_unregister_bvec() Joanne Koong
2025-12-07 8:33 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 23/30] io_uring/rsrc: split io_buffer_register_request() logic Joanne Koong
2025-12-07 8:41 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 24/30] io_uring/rsrc: Allow buffer release callback to be optional Joanne Koong
2025-12-07 8:42 ` Caleb Sander Mateos
2025-12-03 0:35 ` [PATCH v1 25/30] io_uring/rsrc: add io_buffer_register_bvec() Joanne Koong
2025-12-03 0:35 ` [PATCH v1 26/30] io_uring/rsrc: export io_buffer_unregister Joanne Koong
2025-12-03 0:35 ` [PATCH v1 27/30] fuse: rename fuse_set_zero_arg0() to fuse_zero_in_arg0() Joanne Koong
2025-12-03 0:35 ` [PATCH v1 28/30] fuse: enforce op header for every payload reply Joanne Koong
2025-12-03 0:35 ` [PATCH v1 29/30] fuse: add zero-copy over io-uring Joanne Koong
2025-12-03 0:35 ` [PATCH v1 30/30] docs: fuse: add io-uring bufring and zero-copy documentation Joanne Koong