[RFC v2 0/5] BPF controlled io

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC v2 0/5] BPF controlled io_uring
@ 2025-06-06 13:57 Pavel Begunkov
  2025-06-06 13:57 ` [RFC v2 1/5] io_uring: add struct for state controlling cqwait Pavel Begunkov
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:57 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

This series adds io_uring BPF struct_ops, which allows processing
events and submitting requests from BPF without returning to user.
There is only one callback for now, it's called from the io_uring
CQ waiting loop when there is an event to be processed. It also
has access to waiting parameters like batching and timeouts.

It's tested with a program that queues a nop request, waits for
its completion and then queues another request, repeating it N
times. The baseline to compare with is traditional io_uring
application doing same without BPF and using 2 requests links,
with the same total number of requests.

# ./link 0 100000000
type 2-LINK, requests to run 100000000
sec 20, total (ms) 20374
# ./link 1 100000000
type BPF, requests to run 100000000
sec 13, total (ms) 13700

The BPF version works ~50% faster on a mitigated kernel, while it's
not even a completely fair comparison as links are restrictive and
can't always be used. Without links the speedup reaches ~80%.

This allows arbitrary relations between requests including using
a result from one request to configure the following one. There are
other use cases in mind that need access to in-kernel resources and
can't be implemented from userspace. On top, it can be extended with
more callbacks to get finer control over task work batching.

It's a prototype, I intend to remake the kfunc helpers, enchance
program verification, and fix some mild io_uring waiting edge
cases.

Kernel branch:
https://github.com/isilence/linux/tree/io-uring-bpf/v2
git https://github.com/isilence/linux.git io-uring-bpf/v2

Liburing + bpf bootsrap examples:
https://github.com/isilence/liburing/tree/bpf-struct-ops-examples
git git@github.com:isilence/liburing.git bpf-struct-ops-examples

Pavel Begunkov (5):
  io_uring: add struct for state controlling cqwait
  io_uring/bpf: add stubs for bpf struct_ops
  io_uring/bpf: implement struct_ops registration
  io_uring/bpf: add handle events callback
  io_uring/bpf: add basic kfunc helpers

 include/linux/io_uring_types.h |   4 +
 io_uring/Kconfig               |   5 +
 io_uring/Makefile              |   1 +
 io_uring/bpf.c                 | 277 +++++++++++++++++++++++++++++++++
 io_uring/bpf.h                 |  45 ++++++
 io_uring/io_uring.c            |  45 ++++--
 io_uring/io_uring.h            |  11 +-
 io_uring/napi.c                |   4 +-
 8 files changed, 376 insertions(+), 16 deletions(-)
 create mode 100644 io_uring/bpf.c
 create mode 100644 io_uring/bpf.h

-- 
2.49.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC v2 1/5] io_uring: add struct for state controlling cqwait
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
@ 2025-06-06 13:57 ` Pavel Begunkov
  2025-06-06 13:57 ` [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops Pavel Begunkov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:57 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

Add struct iou_loop_state and place there parameter controlling the flow
of normal CQ waiting. It will be exposed to BPF for api of the helpers,
and while I could've used struct io_wait_queue, the name is not ideal,
and keeping only necessary bits makes further development a bit cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 20 ++++++++++----------
 io_uring/io_uring.h | 11 ++++++++---
 io_uring/napi.c     |  4 ++--
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5cdccf65c652..9cc4d8f335a1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2404,8 +2404,8 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	struct io_ring_ctx *ctx = iowq->ctx;
 
 	/* no general timeout, or shorter (or equal), we are done */
-	if (iowq->timeout == KTIME_MAX ||
-	    ktime_compare(iowq->min_timeout, iowq->timeout) >= 0)
+	if (iowq->state.timeout == KTIME_MAX ||
+	    ktime_compare(iowq->min_timeout, iowq->state.timeout) >= 0)
 		goto out_wake;
 	/* work we may need to run, wake function will see if we need to wake */
 	if (io_has_work(ctx))
@@ -2431,7 +2431,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	}
 
 	hrtimer_update_function(&iowq->t, io_cqring_timer_wakeup);
-	hrtimer_set_expires(timer, iowq->timeout);
+	hrtimer_set_expires(timer, iowq->state.timeout);
 	return HRTIMER_RESTART;
 out_wake:
 	return io_cqring_timer_wakeup(timer);
@@ -2447,7 +2447,7 @@ static int io_cqring_schedule_timeout(struct io_wait_queue *iowq,
 		hrtimer_setup_on_stack(&iowq->t, io_cqring_min_timer_wakeup, clock_id,
 				       HRTIMER_MODE_ABS);
 	} else {
-		timeout = iowq->timeout;
+		timeout = iowq->state.timeout;
 		hrtimer_setup_on_stack(&iowq->t, io_cqring_timer_wakeup, clock_id,
 				       HRTIMER_MODE_ABS);
 	}
@@ -2488,7 +2488,7 @@ static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	 */
 	if (ext_arg->iowait && current_pending_io())
 		current->in_iowait = 1;
-	if (iowq->timeout != KTIME_MAX || iowq->min_timeout)
+	if (iowq->state.timeout != KTIME_MAX || iowq->min_timeout)
 		ret = io_cqring_schedule_timeout(iowq, ctx->clockid, start_time);
 	else
 		schedule();
@@ -2546,18 +2546,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 	iowq.wq.private = current;
 	INIT_LIST_HEAD(&iowq.wq.entry);
 	iowq.ctx = ctx;
-	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
+	iowq.state.target_cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 	iowq.cq_min_tail = READ_ONCE(ctx->rings->cq.tail);
 	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
 	iowq.hit_timeout = 0;
 	iowq.min_timeout = ext_arg->min_time;
-	iowq.timeout = KTIME_MAX;
+	iowq.state.timeout = KTIME_MAX;
 	start_time = io_get_time(ctx);
 
 	if (ext_arg->ts_set) {
-		iowq.timeout = timespec64_to_ktime(ext_arg->ts);
+		iowq.state.timeout = timespec64_to_ktime(ext_arg->ts);
 		if (!(flags & IORING_ENTER_ABS_TIMER))
-			iowq.timeout = ktime_add(iowq.timeout, start_time);
+			iowq.state.timeout = ktime_add(iowq.state.timeout, start_time);
 	}
 
 	if (ext_arg->sig) {
@@ -2582,7 +2582,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 
 		/* if min timeout has been hit, don't reset wait count */
 		if (!iowq.hit_timeout)
-			nr_wait = (int) iowq.cq_tail -
+			nr_wait = (int) iowq.state.target_cq_tail -
 					READ_ONCE(ctx->rings->cq.tail);
 		else
 			nr_wait = 1;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 0ea7a435d1de..edf698b81a95 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -39,15 +39,19 @@ enum {
 	IOU_REQUEUE		= -3072,
 };
 
+struct iou_loop_state {
+	__u32			target_cq_tail;
+	ktime_t			timeout;
+};
+
 struct io_wait_queue {
+	struct iou_loop_state state;
 	struct wait_queue_entry wq;
 	struct io_ring_ctx *ctx;
-	unsigned cq_tail;
 	unsigned cq_min_tail;
 	unsigned nr_timeouts;
 	int hit_timeout;
 	ktime_t min_timeout;
-	ktime_t timeout;
 	struct hrtimer t;
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -59,7 +63,8 @@ struct io_wait_queue {
 static inline bool io_should_wake(struct io_wait_queue *iowq)
 {
 	struct io_ring_ctx *ctx = iowq->ctx;
-	int dist = READ_ONCE(ctx->rings->cq.tail) - (int) iowq->cq_tail;
+	u32 target = iowq->state.target_cq_tail;
+	int dist = READ_ONCE(ctx->rings->cq.tail) - target;
 
 	/*
 	 * Wake up if we have enough events, or if a timeout occurred since we
diff --git a/io_uring/napi.c b/io_uring/napi.c
index 4a10de03e426..e08bddc1dbd2 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -360,8 +360,8 @@ void __io_napi_busy_loop(struct io_ring_ctx *ctx, struct io_wait_queue *iowq)
 		return;
 
 	iowq->napi_busy_poll_dt = READ_ONCE(ctx->napi_busy_poll_dt);
-	if (iowq->timeout != KTIME_MAX) {
-		ktime_t dt = ktime_sub(iowq->timeout, io_get_time(ctx));
+	if (iowq->state.timeout != KTIME_MAX) {
+		ktime_t dt = ktime_sub(iowq->state.timeout, io_get_time(ctx));
 
 		iowq->napi_busy_poll_dt = min_t(u64, iowq->napi_busy_poll_dt, dt);
 	}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
  2025-06-06 13:57 ` [RFC v2 1/5] io_uring: add struct for state controlling cqwait Pavel Begunkov
@ 2025-06-06 13:57 ` Pavel Begunkov
  2025-06-06 14:25   ` Jens Axboe
  2025-06-06 13:58 ` [RFC v2 3/5] io_uring/bpf: implement struct_ops registration Pavel Begunkov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:57 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

Add some basic helpers and definitions for implementing bpf struct_ops.
There are no callbaack yet, and registration will always fail.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |  4 ++
 io_uring/Kconfig               |  5 ++
 io_uring/Makefile              |  1 +
 io_uring/bpf.c                 | 93 ++++++++++++++++++++++++++++++++++
 io_uring/bpf.h                 | 26 ++++++++++
 io_uring/io_uring.c            |  3 ++
 6 files changed, 132 insertions(+)
 create mode 100644 io_uring/bpf.c
 create mode 100644 io_uring/bpf.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 2922635986f5..26ee1a6f52e7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -8,6 +8,8 @@
 #include <linux/llist.h>
 #include <uapi/linux/io_uring.h>
 
+struct io_uring_ops;
+
 enum {
 	/*
 	 * A hint to not wake right away but delay until there are enough of
@@ -344,6 +346,8 @@ struct io_ring_ctx {
 
 		void			*cq_wait_arg;
 		size_t			cq_wait_size;
+
+		struct io_uring_ops	*bpf_ops;
 	} ____cacheline_aligned_in_smp;
 
 	/*
diff --git a/io_uring/Kconfig b/io_uring/Kconfig
index 4b949c42c0bf..b4dad9b74544 100644
--- a/io_uring/Kconfig
+++ b/io_uring/Kconfig
@@ -9,3 +9,8 @@ config IO_URING_ZCRX
 	depends on PAGE_POOL
 	depends on INET
 	depends on NET_RX_BUSY_POLL
+
+config IO_URING_BPF
+	def_bool y
+	depends on IO_URING
+	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
diff --git a/io_uring/Makefile b/io_uring/Makefile
index d97c6b51d584..58f46c0f9895 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_EPOLL)		+= epoll.o
 obj-$(CONFIG_NET_RX_BUSY_POLL)	+= napi.o
 obj-$(CONFIG_NET) += net.o cmd_net.o
 obj-$(CONFIG_PROC_FS) += fdinfo.o
+obj-$(CONFIG_IO_URING_BPF)	+= bpf.o
diff --git a/io_uring/bpf.c b/io_uring/bpf.c
new file mode 100644
index 000000000000..3096c54e4fb3
--- /dev/null
+++ b/io_uring/bpf.c
@@ -0,0 +1,93 @@
+#include <linux/mutex.h>
+
+#include "bpf.h"
+#include "register.h"
+
+static struct io_uring_ops io_bpf_ops_stubs = {
+};
+
+static bool bpf_io_is_valid_access(int off, int size,
+				    enum bpf_access_type type,
+				    const struct bpf_prog *prog,
+				    struct bpf_insn_access_aux *info)
+{
+	if (type != BPF_READ)
+		return false;
+	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+		return false;
+	if (off % size != 0)
+		return false;
+
+	return btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_io_btf_struct_access(struct bpf_verifier_log *log,
+				    const struct bpf_reg_state *reg, int off,
+				    int size)
+{
+	return -EACCES;
+}
+
+static const struct bpf_verifier_ops bpf_io_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.is_valid_access = bpf_io_is_valid_access,
+	.btf_struct_access = bpf_io_btf_struct_access,
+};
+
+static int bpf_io_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_io_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int bpf_io_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_io_reg(void *kdata, struct bpf_link *link)
+{
+	return -EOPNOTSUPP;
+}
+
+static void bpf_io_unreg(void *kdata, struct bpf_link *link)
+{
+}
+
+void io_unregister_bpf_ops(struct io_ring_ctx *ctx)
+{
+}
+
+static struct bpf_struct_ops bpf_io_uring_ops = {
+	.verifier_ops = &bpf_io_verifier_ops,
+	.reg = bpf_io_reg,
+	.unreg = bpf_io_unreg,
+	.check_member = bpf_io_check_member,
+	.init_member = bpf_io_init_member,
+	.init = bpf_io_init,
+	.cfi_stubs = &io_bpf_ops_stubs,
+	.name = "io_uring_ops",
+	.owner = THIS_MODULE,
+};
+
+static int __init io_uring_bpf_init(void)
+{
+	int ret;
+
+	ret = register_bpf_struct_ops(&bpf_io_uring_ops, io_uring_ops);
+	if (ret) {
+		pr_err("io_uring: Failed to register struct_ops (%d)\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+__initcall(io_uring_bpf_init);
diff --git a/io_uring/bpf.h b/io_uring/bpf.h
new file mode 100644
index 000000000000..a61c489d306b
--- /dev/null
+++ b/io_uring/bpf.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_BPF_H
+#define IOU_BPF_H
+
+#include <linux/io_uring_types.h>
+#include <linux/bpf.h>
+
+#include "io_uring.h"
+
+struct io_uring_ops {
+};
+
+static inline bool io_bpf_attached(struct io_ring_ctx *ctx)
+{
+	return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ops != NULL;
+}
+
+#ifdef CONFIG_BPF
+void io_unregister_bpf_ops(struct io_ring_ctx *ctx);
+#else
+static inline void io_unregister_bpf_ops(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
\ No newline at end of file
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9cc4d8f335a1..8f68e898d60c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -98,6 +98,7 @@
 #include "msg_ring.h"
 #include "memmap.h"
 #include "zcrx.h"
+#include "bpf.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2870,6 +2871,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	struct io_tctx_node *node;
 	int ret;
 
+	io_unregister_bpf_ops(ctx);
+
 	/*
 	 * If we're doing polled IO and end up having requests being
 	 * submitted async (out-of-line), then completions can come in while
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops
  2025-06-06 13:57 ` [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops Pavel Begunkov
@ 2025-06-06 14:25   ` Jens Axboe
  2025-06-06 14:28     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-06-06 14:25 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 7:57 AM, Pavel Begunkov wrote:
> diff --git a/io_uring/bpf.h b/io_uring/bpf.h
> new file mode 100644
> index 000000000000..a61c489d306b
> --- /dev/null
> +++ b/io_uring/bpf.h
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef IOU_BPF_H
> +#define IOU_BPF_H
> +
> +#include <linux/io_uring_types.h>
> +#include <linux/bpf.h>
> +
> +#include "io_uring.h"
> +
> +struct io_uring_ops {
> +};
> +
> +static inline bool io_bpf_attached(struct io_ring_ctx *ctx)
> +{
> +	return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ops != NULL;
> +}
> +
> +#ifdef CONFIG_BPF
> +void io_unregister_bpf_ops(struct io_ring_ctx *ctx);
> +#else
> +static inline void io_unregister_bpf_ops(struct io_ring_ctx *ctx)
> +{
> +}
> +#endif

Should be

#ifdef IO_URING_BPF

here.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops
  2025-06-06 14:25   ` Jens Axboe
@ 2025-06-06 14:28     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-06-06 14:28 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 8:25 AM, Jens Axboe wrote:
> On 6/6/25 7:57 AM, Pavel Begunkov wrote:
>> diff --git a/io_uring/bpf.h b/io_uring/bpf.h
>> new file mode 100644
>> index 000000000000..a61c489d306b
>> --- /dev/null
>> +++ b/io_uring/bpf.h
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#ifndef IOU_BPF_H
>> +#define IOU_BPF_H
>> +
>> +#include <linux/io_uring_types.h>
>> +#include <linux/bpf.h>
>> +
>> +#include "io_uring.h"
>> +
>> +struct io_uring_ops {
>> +};
>> +
>> +static inline bool io_bpf_attached(struct io_ring_ctx *ctx)
>> +{
>> +	return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ops != NULL;
>> +}
>> +
>> +#ifdef CONFIG_BPF
>> +void io_unregister_bpf_ops(struct io_ring_ctx *ctx);
>> +#else
>> +static inline void io_unregister_bpf_ops(struct io_ring_ctx *ctx)
>> +{
>> +}
>> +#endif
> 
> Should be
> 
> #ifdef IO_URING_BPF
> 
> here.

CONFIG_IO_URING_BPF of course...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC v2 3/5] io_uring/bpf: implement struct_ops registration
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
  2025-06-06 13:57 ` [RFC v2 1/5] io_uring: add struct for state controlling cqwait Pavel Begunkov
  2025-06-06 13:57 ` [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops Pavel Begunkov
@ 2025-06-06 13:58 ` Pavel Begunkov
  2025-06-06 14:57   ` Jens Axboe
  2025-06-06 13:58 ` [RFC v2 4/5] io_uring/bpf: add handle events callback Pavel Begunkov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:58 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

Add ring_fd to the struct_ops and implement [un]registration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/bpf.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++-
 io_uring/bpf.h |  3 +++
 2 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index 3096c54e4fb3..0f82acf09959 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -3,6 +3,8 @@
 #include "bpf.h"
 #include "register.h"
 
+DEFINE_MUTEX(io_bpf_ctrl_mutex);
+
 static struct io_uring_ops io_bpf_ops_stubs = {
 };
 
@@ -50,20 +52,83 @@ static int bpf_io_init_member(const struct btf_type *t,
 			       const struct btf_member *member,
 			       void *kdata, const void *udata)
 {
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+	const struct io_uring_ops *uops = udata;
+	struct io_uring_ops *ops = kdata;
+
+	switch (moff) {
+	case offsetof(struct io_uring_ops, ring_fd):
+		ops->ring_fd = uops->ring_fd;
+		return 1;
+	}
+	return 0;
+}
+
+static int io_register_bpf_ops(struct io_ring_ctx *ctx, struct io_uring_ops *ops)
+{
+	if (ctx->bpf_ops)
+		return -EBUSY;
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EOPNOTSUPP;
+
+	percpu_ref_get(&ctx->refs);
+	ops->ctx = ctx;
+	ctx->bpf_ops = ops;
 	return 0;
 }
 
 static int bpf_io_reg(void *kdata, struct bpf_link *link)
 {
-	return -EOPNOTSUPP;
+	struct io_uring_ops *ops = kdata;
+	struct io_ring_ctx *ctx;
+	struct file *file;
+	int ret;
+
+	file = io_uring_register_get_file(ops->ring_fd, false);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ctx = file->private_data;
+	scoped_guard(mutex, &ctx->uring_lock)
+		ret = io_register_bpf_ops(ctx, ops);
+
+	fput(file);
+	return ret;
 }
 
 static void bpf_io_unreg(void *kdata, struct bpf_link *link)
 {
+	struct io_uring_ops *ops = kdata;
+	struct io_ring_ctx *ctx;
+
+	guard(mutex)(&io_bpf_ctrl_mutex);
+
+	ctx = ops->ctx;
+	ops->ctx = NULL;
+
+	if (ctx) {
+		scoped_guard(mutex, &ctx->uring_lock) {
+			if (ctx->bpf_ops == ops)
+				ctx->bpf_ops = NULL;
+		}
+		percpu_ref_put(&ctx->refs);
+	}
 }
 
 void io_unregister_bpf_ops(struct io_ring_ctx *ctx)
 {
+	struct io_uring_ops *ops;
+
+	guard(mutex)(&io_bpf_ctrl_mutex);
+	guard(mutex)(&ctx->uring_lock);
+
+	ops = ctx->bpf_ops;
+	ctx->bpf_ops = NULL;
+
+	if (ops && ops->ctx) {
+		percpu_ref_put(&ctx->refs);
+		ops->ctx = NULL;
+	}
 }
 
 static struct bpf_struct_ops bpf_io_uring_ops = {
diff --git a/io_uring/bpf.h b/io_uring/bpf.h
index a61c489d306b..4b147540d006 100644
--- a/io_uring/bpf.h
+++ b/io_uring/bpf.h
@@ -8,6 +8,9 @@
 #include "io_uring.h"
 
 struct io_uring_ops {
+	__u32 ring_fd;
+
+	struct io_ring_ctx *ctx;
 };
 
 static inline bool io_bpf_attached(struct io_ring_ctx *ctx)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 3/5] io_uring/bpf: implement struct_ops registration
  2025-06-06 13:58 ` [RFC v2 3/5] io_uring/bpf: implement struct_ops registration Pavel Begunkov
@ 2025-06-06 14:57   ` Jens Axboe
  2025-06-06 20:00     ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-06-06 14:57 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 7:58 AM, Pavel Begunkov wrote:
> Add ring_fd to the struct_ops and implement [un]registration.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  io_uring/bpf.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  io_uring/bpf.h |  3 +++
>  2 files changed, 69 insertions(+), 1 deletion(-)
> 
> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> index 3096c54e4fb3..0f82acf09959 100644
> --- a/io_uring/bpf.c
> +++ b/io_uring/bpf.c
> @@ -3,6 +3,8 @@
>  #include "bpf.h"
>  #include "register.h"
>  
> +DEFINE_MUTEX(io_bpf_ctrl_mutex);
> +
>  static struct io_uring_ops io_bpf_ops_stubs = {
>  };
>  
> @@ -50,20 +52,83 @@ static int bpf_io_init_member(const struct btf_type *t,
>  			       const struct btf_member *member,
>  			       void *kdata, const void *udata)
>  {
> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
> +	const struct io_uring_ops *uops = udata;
> +	struct io_uring_ops *ops = kdata;
> +
> +	switch (moff) {
> +	case offsetof(struct io_uring_ops, ring_fd):
> +		ops->ring_fd = uops->ring_fd;
> +		return 1;
> +	}
> +	return 0;

Possible to pass in here whether the ring fd is registered or not? Such
that it can be used in bpf_io_reg() as well.

> +static int io_register_bpf_ops(struct io_ring_ctx *ctx, struct io_uring_ops *ops)
> +{
> +	if (ctx->bpf_ops)
> +		return -EBUSY;
> +	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> +		return -EOPNOTSUPP;
> +
> +	percpu_ref_get(&ctx->refs);
> +	ops->ctx = ctx;
> +	ctx->bpf_ops = ops;
>  	return 0;
>  }

Haven't looked too deeply yet, but what's the dependency with
DEFER_TASKRUN?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 3/5] io_uring/bpf: implement struct_ops registration
  2025-06-06 14:57   ` Jens Axboe
@ 2025-06-06 20:00     ` Pavel Begunkov
  2025-06-06 21:07       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 20:00 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 15:57, Jens Axboe wrote:
...>> @@ -50,20 +52,83 @@ static int bpf_io_init_member(const struct btf_type *t,
>>   			       const struct btf_member *member,
>>   			       void *kdata, const void *udata)
>>   {
>> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
>> +	const struct io_uring_ops *uops = udata;
>> +	struct io_uring_ops *ops = kdata;
>> +
>> +	switch (moff) {
>> +	case offsetof(struct io_uring_ops, ring_fd):
>> +		ops->ring_fd = uops->ring_fd;
>> +		return 1;
>> +	}
>> +	return 0;
> 
> Possible to pass in here whether the ring fd is registered or not? Such
> that it can be used in bpf_io_reg() as well.

That requires registration to be done off the syscall path (e.g. no
workers), which is low risk and I'm pretty sure that's how it's done,
but in either case that's not up to io_uring and should be vetted by
bpf. It's not important to performance, and leaking that to other
syscalls is a bad idea as well, so in the meantime it's just left
unsupported.

>> +static int io_register_bpf_ops(struct io_ring_ctx *ctx, struct io_uring_ops *ops)
>> +{
>> +	if (ctx->bpf_ops)
>> +		return -EBUSY;
>> +	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>> +		return -EOPNOTSUPP;
>> +
>> +	percpu_ref_get(&ctx->refs);
>> +	ops->ctx = ctx;
>> +	ctx->bpf_ops = ops;
>>   	return 0;
>>   }
> 
> Haven't looked too deeply yet, but what's the dependency with
> DEFER_TASKRUN?
Unregistration needs to be sync'ed with waiters, and that can easily
become a problem. Taking the lock like in this set in not necessarily
the right solution. I plan to wait and see where it goes rather
than shooting myself in the leg right away.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 3/5] io_uring/bpf: implement struct_ops registration
  2025-06-06 20:00     ` Pavel Begunkov
@ 2025-06-06 21:07       ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-06-06 21:07 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 2:00 PM, Pavel Begunkov wrote:
> On 6/6/25 15:57, Jens Axboe wrote:
> ...>> @@ -50,20 +52,83 @@ static int bpf_io_init_member(const struct btf_type *t,
>>>                      const struct btf_member *member,
>>>                      void *kdata, const void *udata)
>>>   {
>>> +    u32 moff = __btf_member_bit_offset(t, member) / 8;
>>> +    const struct io_uring_ops *uops = udata;
>>> +    struct io_uring_ops *ops = kdata;
>>> +
>>> +    switch (moff) {
>>> +    case offsetof(struct io_uring_ops, ring_fd):
>>> +        ops->ring_fd = uops->ring_fd;
>>> +        return 1;
>>> +    }
>>> +    return 0;
>>
>> Possible to pass in here whether the ring fd is registered or not? Such
>> that it can be used in bpf_io_reg() as well.
> 
> That requires registration to be done off the syscall path (e.g. no
> workers), which is low risk and I'm pretty sure that's how it's done,
> but in either case that's not up to io_uring and should be vetted by
> bpf. It's not important to performance, and leaking that to other
> syscalls is a bad idea as well, so in the meantime it's just left
> unsupported.

Don't care about the performance as much as it being a weird crinkle.
Obviously not a huge deal, and can always get sorted out down the line.

>>> +static int io_register_bpf_ops(struct io_ring_ctx *ctx, struct io_uring_ops *ops)
>>> +{
>>> +    if (ctx->bpf_ops)
>>> +        return -EBUSY;
>>> +    if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>>> +        return -EOPNOTSUPP;
>>> +
>>> +    percpu_ref_get(&ctx->refs);
>>> +    ops->ctx = ctx;
>>> +    ctx->bpf_ops = ops;
>>>       return 0;
>>>   }
>>
>> Haven't looked too deeply yet, but what's the dependency with
>> DEFER_TASKRUN?
> Unregistration needs to be sync'ed with waiters, and that can easily
> become a problem. Taking the lock like in this set in not necessarily
> the right solution. I plan to wait and see where it goes rather
> than shooting myself in the leg right away.

That's fine, would be nice with a comment or something in the commit
message to that effect at least for the time being.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC v2 4/5] io_uring/bpf: add handle events callback
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-06-06 13:58 ` [RFC v2 3/5] io_uring/bpf: implement struct_ops registration Pavel Begunkov
@ 2025-06-06 13:58 ` Pavel Begunkov
  2025-06-12  2:28   ` Alexei Starovoitov
  2025-06-06 13:58 ` [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers Pavel Begunkov
  2025-06-06 14:38 ` [RFC v2 0/5] BPF controlled io_uring Jens Axboe
  5 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:58 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

Add a struct_ops callback called handle_events, which will be called
off the CQ waiting loop every time there is an event that might be
interesting to the program. The program takes the io_uring ctx and also
a loop state, which it can use to set the number of events it wants to
wait for as well as the timeout value.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/bpf.c      | 33 +++++++++++++++++++++++++++++++++
 io_uring/bpf.h      | 16 ++++++++++++++++
 io_uring/io_uring.c | 22 +++++++++++++++++++++-
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index 0f82acf09959..f86b12f280e8 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -1,11 +1,20 @@
 #include <linux/mutex.h>
+#include <linux/bpf_verifier.h>
 
 #include "bpf.h"
 #include "register.h"
 
+static const struct btf_type *loop_state_type;
 DEFINE_MUTEX(io_bpf_ctrl_mutex);
 
+static int io_bpf_ops__handle_events(struct io_ring_ctx *ctx,
+				     struct iou_loop_state *state)
+{
+	return IOU_EVENTS_STOP;
+}
+
 static struct io_uring_ops io_bpf_ops_stubs = {
+	.handle_events = io_bpf_ops__handle_events,
 };
 
 static bool bpf_io_is_valid_access(int off, int size,
@@ -27,6 +36,16 @@ static int bpf_io_btf_struct_access(struct bpf_verifier_log *log,
 				    const struct bpf_reg_state *reg, int off,
 				    int size)
 {
+	const struct btf_type *t = btf_type_by_id(reg->btf, reg->btf_id);
+
+	if (t == loop_state_type) {
+		if (off >= offsetof(struct iou_loop_state, target_cq_tail) &&
+		    off + size <= offsetofend(struct iou_loop_state, target_cq_tail))
+			return SCALAR_VALUE;
+		if (off >= offsetof(struct iou_loop_state, timeout) &&
+		    off + size <= offsetofend(struct iou_loop_state, timeout))
+			return SCALAR_VALUE;
+	}
 	return -EACCES;
 }
 
@@ -36,8 +55,22 @@ static const struct bpf_verifier_ops bpf_io_verifier_ops = {
 	.btf_struct_access = bpf_io_btf_struct_access,
 };
 
+static const struct btf_type *
+io_lookup_struct_type(struct btf *btf, const char *name)
+{
+	s32 type_id;
+
+	type_id = btf_find_by_name_kind(btf, name, BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return NULL;
+	return btf_type_by_id(btf, type_id);
+}
+
 static int bpf_io_init(struct btf *btf)
 {
+	loop_state_type = io_lookup_struct_type(btf, "iou_loop_state");
+	if (!loop_state_type)
+		return -EINVAL;
 	return 0;
 }
 
diff --git a/io_uring/bpf.h b/io_uring/bpf.h
index 4b147540d006..ac4a9361f9c7 100644
--- a/io_uring/bpf.h
+++ b/io_uring/bpf.h
@@ -7,12 +7,28 @@
 
 #include "io_uring.h"
 
+enum {
+	IOU_EVENTS_WAIT,
+	IOU_EVENTS_STOP,
+};
+
 struct io_uring_ops {
 	__u32 ring_fd;
 
+	int (*handle_events)(struct io_ring_ctx *ctx, struct iou_loop_state *state);
+
 	struct io_ring_ctx *ctx;
 };
 
+static inline int io_run_bpf(struct io_ring_ctx *ctx, struct iou_loop_state *state)
+{
+	scoped_guard(mutex, &ctx->uring_lock) {
+		if (!ctx->bpf_ops)
+			return IOU_EVENTS_STOP;
+		return ctx->bpf_ops->handle_events(ctx, state);
+	}
+}
+
 static inline bool io_bpf_attached(struct io_ring_ctx *ctx)
 {
 	return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ops != NULL;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8f68e898d60c..bf245be0844b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2540,8 +2540,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 
 	if (unlikely(test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)))
 		io_cqring_do_overflow_flush(ctx);
-	if (__io_cqring_events_user(ctx) >= min_events)
+
+	if (io_bpf_attached(ctx)) {
+		if (ext_arg->min_time)
+			return -EINVAL;
+	} else if (__io_cqring_events_user(ctx) >= min_events) {
 		return 0;
+	}
 
 	init_waitqueue_func_entry(&iowq.wq, io_wake_function);
 	iowq.wq.private = current;
@@ -2621,6 +2626,21 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 		if (ret < 0)
 			break;
 
+		if (io_bpf_attached(ctx)) {
+			ret = io_run_bpf(ctx, &iowq.state);
+			if (ret != IOU_EVENTS_WAIT)
+				break;
+
+			if (unlikely(read_thread_flags())) {
+				if (task_sigpending(current)) {
+					ret = -EINTR;
+					break;
+				}
+				cond_resched();
+			}
+			continue;
+		}
+
 		check_cq = READ_ONCE(ctx->check_cq);
 		if (unlikely(check_cq)) {
 			/* let the caller flush overflows, retry */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 4/5] io_uring/bpf: add handle events callback
  2025-06-06 13:58 ` [RFC v2 4/5] io_uring/bpf: add handle events callback Pavel Begunkov
@ 2025-06-12  2:28   ` Alexei Starovoitov
  2025-06-12  9:33     ` Pavel Begunkov
  2025-06-12 14:07     ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2025-06-12  2:28 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> +static inline int io_run_bpf(struct io_ring_ctx *ctx, struct iou_loop_state *state)
> +{
> +       scoped_guard(mutex, &ctx->uring_lock) {
> +               if (!ctx->bpf_ops)
> +                       return IOU_EVENTS_STOP;
> +               return ctx->bpf_ops->handle_events(ctx, state);
> +       }
> +}

you're grabbing the mutex before calling bpf prog and doing
it in a loop million times a second?
Looks like massive overhead for program invocation.
I'm surprised it's fast.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 4/5] io_uring/bpf: add handle events callback
  2025-06-12  2:28   ` Alexei Starovoitov
@ 2025-06-12  9:33     ` Pavel Begunkov
  2025-06-12 14:07     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:33 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On 6/12/25 03:28, Alexei Starovoitov wrote:
> On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> +static inline int io_run_bpf(struct io_ring_ctx *ctx, struct iou_loop_state *state)
>> +{
>> +       scoped_guard(mutex, &ctx->uring_lock) {
>> +               if (!ctx->bpf_ops)
>> +                       return IOU_EVENTS_STOP;
>> +               return ctx->bpf_ops->handle_events(ctx, state);
>> +       }
>> +}
> 
> you're grabbing the mutex before calling bpf prog and doing
> it in a loop million times a second?
> Looks like massive overhead for program invocation.
> I'm surprised it's fast.

You need the lock to submit anything with io_uring, so there is
a parity with how it already is. And the program is just a test
and pretty silly in nature, normally you'd either get higher
batching, and the user (incl bpf) can specifically specify to
wait for more, or it'll be intermingled with sleeping at which
point the mutex is not a problem. I'll write a storage IO
example for the next time.

If there will be a good use case, I can try to relax it for
programs that don't issue requests, but that might make
sync more complicated, especially on the reg/unreg side.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 4/5] io_uring/bpf: add handle events callback
  2025-06-12  2:28   ` Alexei Starovoitov
  2025-06-12  9:33     ` Pavel Begunkov
@ 2025-06-12 14:07     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-06-12 14:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Pavel Begunkov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On 6/11/25 8:28 PM, Alexei Starovoitov wrote:
> On Fri, Jun 6, 2025 at 6:58?AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> +static inline int io_run_bpf(struct io_ring_ctx *ctx, struct iou_loop_state *state)
>> +{
>> +       scoped_guard(mutex, &ctx->uring_lock) {
>> +               if (!ctx->bpf_ops)
>> +                       return IOU_EVENTS_STOP;
>> +               return ctx->bpf_ops->handle_events(ctx, state);
>> +       }
>> +}
> 
> you're grabbing the mutex before calling bpf prog and doing
> it in a loop million times a second?
> Looks like massive overhead for program invocation.
> I'm surprised it's fast.

Grabbing a mutex is only expensive if it's contended, or obviously
if it's already held. Repeatedly grabbing it on submission where
submission is the only one expected to grab it (or off that path, at
least) means it should be very cheap.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-06-06 13:58 ` [RFC v2 4/5] io_uring/bpf: add handle events callback Pavel Begunkov
@ 2025-06-06 13:58 ` Pavel Begunkov
  2025-06-12  2:47   ` Alexei Starovoitov
  2025-06-06 14:38 ` [RFC v2 0/5] BPF controlled io_uring Jens Axboe
  5 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-06 13:58 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Martin KaFai Lau, bpf, linux-kernel

A handle_events program should be able to parse the CQ and submit new
requests, add kfuncs to cover that. The only essential kfunc here is
bpf_io_uring_submit_sqes, and the rest are likely be removed in a
non-RFC version in favour of a more general approach.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/bpf.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index f86b12f280e8..9494e4289605 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -1,12 +1,92 @@
 #include <linux/mutex.h>
 #include <linux/bpf_verifier.h>
 
+#include "io_uring.h"
 #include "bpf.h"
 #include "register.h"
 
 static const struct btf_type *loop_state_type;
 DEFINE_MUTEX(io_bpf_ctrl_mutex);
 
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_io_uring_submit_sqes(struct io_ring_ctx *ctx,
+					 unsigned nr)
+{
+	return io_submit_sqes(ctx, nr);
+}
+
+__bpf_kfunc int bpf_io_uring_post_cqe(struct io_ring_ctx *ctx,
+				      u64 data, u32 res, u32 cflags)
+{
+	bool posted;
+
+	posted = io_post_aux_cqe(ctx, data, res, cflags);
+	return posted ? 0 : -ENOMEM;
+}
+
+__bpf_kfunc int bpf_io_uring_queue_sqe(struct io_ring_ctx *ctx,
+					void *bpf_sqe, int mem__sz)
+{
+	unsigned tail = ctx->rings->sq.tail;
+	struct io_uring_sqe *sqe;
+
+	if (mem__sz != sizeof(*sqe))
+		return -EINVAL;
+
+	ctx->rings->sq.tail++;
+	tail &= (ctx->sq_entries - 1);
+	/* double index for 128-byte SQEs, twice as long */
+	if (ctx->flags & IORING_SETUP_SQE128)
+		tail <<= 1;
+	sqe = &ctx->sq_sqes[tail];
+	memcpy(sqe, bpf_sqe, sizeof(*sqe));
+	return 0;
+}
+
+__bpf_kfunc
+struct io_uring_cqe *bpf_io_uring_get_cqe(struct io_ring_ctx *ctx, u32 idx)
+{
+	unsigned max_entries = ctx->cq_entries;
+	struct io_uring_cqe *cqe_array = ctx->rings->cqes;
+
+	if (ctx->flags & IORING_SETUP_CQE32)
+		max_entries *= 2;
+	return &cqe_array[idx & (max_entries - 1)];
+}
+
+__bpf_kfunc
+struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
+{
+	struct io_rings *rings = ctx->rings;
+	unsigned int mask = ctx->cq_entries - 1;
+	unsigned head = rings->cq.head;
+	struct io_uring_cqe *cqe;
+
+	/* TODO CQE32 */
+	if (head == rings->cq.tail)
+		return NULL;
+
+	cqe = &rings->cqes[head & mask];
+	rings->cq.head++;
+	return cqe;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(io_uring_kfunc_set)
+BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
+BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
+BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
+BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
+BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
+BTF_KFUNCS_END(io_uring_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_io_uring_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set = &io_uring_kfunc_set,
+};
+
 static int io_bpf_ops__handle_events(struct io_ring_ctx *ctx,
 				     struct iou_loop_state *state)
 {
@@ -186,6 +266,12 @@ static int __init io_uring_bpf_init(void)
 		return ret;
 	}
 
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_io_uring_kfunc_set);
+	if (ret) {
+		pr_err("io_uring: Failed to register kfuncs (%d)\n", ret);
+		return ret;
+	}
 	return 0;
 }
 __initcall(io_uring_bpf_init);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-06 13:58 ` [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers Pavel Begunkov
@ 2025-06-12  2:47   ` Alexei Starovoitov
  2025-06-12 13:26     ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2025-06-12  2:47 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> A handle_events program should be able to parse the CQ and submit new
> requests, add kfuncs to cover that. The only essential kfunc here is
> bpf_io_uring_submit_sqes, and the rest are likely be removed in a
> non-RFC version in favour of a more general approach.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  io_uring/bpf.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 86 insertions(+)
>
> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> index f86b12f280e8..9494e4289605 100644
> --- a/io_uring/bpf.c
> +++ b/io_uring/bpf.c
> @@ -1,12 +1,92 @@
>  #include <linux/mutex.h>
>  #include <linux/bpf_verifier.h>
>
> +#include "io_uring.h"
>  #include "bpf.h"
>  #include "register.h"
>
>  static const struct btf_type *loop_state_type;
>  DEFINE_MUTEX(io_bpf_ctrl_mutex);
>
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc int bpf_io_uring_submit_sqes(struct io_ring_ctx *ctx,
> +                                        unsigned nr)
> +{
> +       return io_submit_sqes(ctx, nr);
> +}
> +
> +__bpf_kfunc int bpf_io_uring_post_cqe(struct io_ring_ctx *ctx,
> +                                     u64 data, u32 res, u32 cflags)
> +{
> +       bool posted;
> +
> +       posted = io_post_aux_cqe(ctx, data, res, cflags);
> +       return posted ? 0 : -ENOMEM;
> +}
> +
> +__bpf_kfunc int bpf_io_uring_queue_sqe(struct io_ring_ctx *ctx,
> +                                       void *bpf_sqe, int mem__sz)
> +{
> +       unsigned tail = ctx->rings->sq.tail;
> +       struct io_uring_sqe *sqe;
> +
> +       if (mem__sz != sizeof(*sqe))
> +               return -EINVAL;
> +
> +       ctx->rings->sq.tail++;
> +       tail &= (ctx->sq_entries - 1);
> +       /* double index for 128-byte SQEs, twice as long */
> +       if (ctx->flags & IORING_SETUP_SQE128)
> +               tail <<= 1;
> +       sqe = &ctx->sq_sqes[tail];
> +       memcpy(sqe, bpf_sqe, sizeof(*sqe));
> +       return 0;
> +}
> +
> +__bpf_kfunc
> +struct io_uring_cqe *bpf_io_uring_get_cqe(struct io_ring_ctx *ctx, u32 idx)
> +{
> +       unsigned max_entries = ctx->cq_entries;
> +       struct io_uring_cqe *cqe_array = ctx->rings->cqes;
> +
> +       if (ctx->flags & IORING_SETUP_CQE32)
> +               max_entries *= 2;
> +       return &cqe_array[idx & (max_entries - 1)];
> +}
> +
> +__bpf_kfunc
> +struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
> +{
> +       struct io_rings *rings = ctx->rings;
> +       unsigned int mask = ctx->cq_entries - 1;
> +       unsigned head = rings->cq.head;
> +       struct io_uring_cqe *cqe;
> +
> +       /* TODO CQE32 */
> +       if (head == rings->cq.tail)
> +               return NULL;
> +
> +       cqe = &rings->cqes[head & mask];
> +       rings->cq.head++;
> +       return cqe;
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(io_uring_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
> +BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
> +BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
> +BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
> +BTF_KFUNCS_END(io_uring_kfunc_set)

This is not safe in general.
The verifier doesn't enforce argument safety here.
As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
And once you do that you'll see that the verifier
doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
as trusted.
Looking at your example:
https://github.com/axboe/liburing/commit/706237127f03e15b4cc9c7c31c16d34dbff37cdc
it doesn't care about contents of cqe and doesn't pass it further.
So sort-of ok-ish right now,
but if you need to pass cqe to another kfunc
you would need to add an open coded iterator for cqe-s
with appropriate KF_ITER* flags
or maybe add acquire/release semantics for cqe.
Like, get_cqe will be KF_ACQUIRE, and you'd need
matching KF_RELEASE kfunc,
so that 'cqe' is not lost.
Then 'cqe' will be trusted and you can pass it as actual 'cqe'
into another kfunc.
Without KF_ACQUIRE the verifier sees that get_cqe*() kfuncs
return 'struct io_uring_cqe *' and it's ok for tracing
or passing into kfuncs like bpf_io_uring_queue_sqe()
that don't care about a particular type,
but not ok for full tracking of objects.

For next revision please post all selftest, examples,
and bpf progs on the list,
so people don't need to search github.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-12  2:47   ` Alexei Starovoitov
@ 2025-06-12 13:26     ` Pavel Begunkov
  2025-06-12 14:06       ` Jens Axboe
  2025-06-13  0:25       ` Alexei Starovoitov
  0 siblings, 2 replies; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-12 13:26 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On 6/12/25 03:47, Alexei Starovoitov wrote:
> On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...>> +__bpf_kfunc
>> +struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
>> +{
>> +       struct io_rings *rings = ctx->rings;
>> +       unsigned int mask = ctx->cq_entries - 1;
>> +       unsigned head = rings->cq.head;
>> +       struct io_uring_cqe *cqe;
>> +
>> +       /* TODO CQE32 */
>> +       if (head == rings->cq.tail)
>> +               return NULL;
>> +
>> +       cqe = &rings->cqes[head & mask];
>> +       rings->cq.head++;
>> +       return cqe;
>> +}
>> +
>> +__bpf_kfunc_end_defs();
>> +
>> +BTF_KFUNCS_START(io_uring_kfunc_set)
>> +BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
>> +BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
>> +BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
>> +BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
>> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
>> +BTF_KFUNCS_END(io_uring_kfunc_set)
> 
> This is not safe in general.
> The verifier doesn't enforce argument safety here.
> As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
> And once you do that you'll see that the verifier
> doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
> as trusted.

Thanks, will add it. If I read it right, without the flag the
program can, for example, create a struct io_ring_ctx on stack,
fill it with nonsense and pass to kfuncs. Is that right?

> Looking at your example:
> https://github.com/axboe/liburing/commit/706237127f03e15b4cc9c7c31c16d34dbff37cdc
> it doesn't care about contents of cqe and doesn't pass it further.
> So sort-of ok-ish right now,
> but if you need to pass cqe to another kfunc
> you would need to add an open coded iterator for cqe-s
> with appropriate KF_ITER* flags
> or maybe add acquire/release semantics for cqe.
> Like, get_cqe will be KF_ACQUIRE, and you'd need
> matching KF_RELEASE kfunc,
> so that 'cqe' is not lost.
> Then 'cqe' will be trusted and you can pass it as actual 'cqe'
> into another kfunc.
> Without KF_ACQUIRE the verifier sees that get_cqe*() kfuncs
> return 'struct io_uring_cqe *' and it's ok for tracing
> or passing into kfuncs like bpf_io_uring_queue_sqe()
> that don't care about a particular type,
> but not ok for full tracking of objects.

I don't need type safety for SQEs / CQEs, they're supposed to be simple
memory blobs containing userspace data only. SQ / CQ are shared with
userspace, and the kfuncs can leak the content of passed CQE / SQE to
userspace. But I'd like to find a way to reject programs stashing
kernel pointers / data into them.

BPF_PROG(name, struct io_ring_ctx *io_ring)
{
     struct io_uring_sqe *cqe = ...;
     cqe->user_data = io_ring;
     cqe->res = io_ring->private_field;
}

And I mentioned in the message, I rather want to get rid of half of the
kfuncs, and give BPF direct access to the SQ/CQ instead. Schematically
it should look like this:

BPF_PROG(name, struct io_ring_ctx *ring)
{
     struct io_uring_sqe *sqes = get_SQ(ring);

     sqes[ring->sq_tail]->opcode = OP_NOP;
     bpf_kfunc_submit_sqes(ring, 1);

     struct io_uring_cqe *cqes = get_CQ(ring);
     print_cqe(&cqes[ring->cq_head]);
}

I hacked up RET_PTR_TO_MEM for kfuncs, the diff is below, but it'd be
great to get rid of the constness of the size argument. I need to
digest arenas first as conceptually they look very close.

> For next revision please post all selftest, examples,
> and bpf progs on the list,
> so people don't need to search github.

Did the link in the cover letter not work for you? I'm confused
since it's all in a branch in my tree, but you linked to the same
patches but in Jens' tree, and I have zero clue what they're
doing there or how you found them.


diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index 9494e4289605..400a06a74b5d 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -2,6 +2,7 @@
  #include <linux/bpf_verifier.h>
  
  #include "io_uring.h"
+#include "memmap.h"
  #include "bpf.h"
  #include "register.h"
  
@@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
  	return cqe;
  }
  
+__bpf_kfunc
+void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz)
+{
+	if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT))
+		return NULL;
+	return io_region_get_ptr(&ctx->ring_region);
+}
+
  __bpf_kfunc_end_defs();
  
  BTF_KFUNCS_START(io_uring_kfunc_set)
@@ -80,6 +89,7 @@ BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
  BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
  BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
  BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
+BTF_ID_FLAGS(func, bpf_io_uring_get_region, KF_RET_NULL);
  BTF_KFUNCS_END(io_uring_kfunc_set)
  
  static const struct btf_kfunc_id_set bpf_io_uring_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 54c6953a8b84..ac4803b5933c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -343,6 +343,7 @@ struct bpf_kfunc_call_arg_meta {
  		int uid;
  	} map;
  	u64 mem_size;
+	bool mem_size_found;
  };
  
  struct btf *btf_vmlinux;
@@ -11862,6 +11863,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
  	return btf_param_match_suffix(btf, arg, "__ign");
  }
  
+static bool is_kfunc_arg_ret_size(const struct btf *btf, const struct btf_param *arg)
+{
+	return btf_param_match_suffix(btf, arg, "__retsz");
+}
+
  static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
  {
  	return btf_param_match_suffix(btf, arg, "__map");
@@ -12912,7 +12918,21 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
  				return -EINVAL;
  			}
  
-			if (is_kfunc_arg_constant(meta->btf, &args[i])) {
+			if (is_kfunc_arg_ret_size(btf, &args[i])) {
+				if (!tnum_is_const(reg->var_off)) {
+					verbose(env, "R%d must be a known constant\n", regno);
+					return -EINVAL;
+				}
+				if (meta->mem_size_found) {
+					verbose(env, "Only one return size argument is permitted\n");
+					return -EINVAL;
+				}
+				meta->mem_size = reg->var_off.value;
+				meta->mem_size_found = true;
+				ret = mark_chain_precision(env, regno);
+				if (ret)
+					return ret;
+			} else if (is_kfunc_arg_constant(meta->btf, &args[i])) {
  				if (meta->arg_constant.found) {
  					verbose(env, "verifier internal error: only one constant argument permitted\n");
  					return -EFAULT;
@@ -13816,6 +13836,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
  		} else if (btf_type_is_void(ptr_type)) {
  			/* kfunc returning 'void *' is equivalent to returning scalar */
  			mark_reg_unknown(env, regs, BPF_REG_0);
+
+			if (meta.mem_size_found) {
+				mark_reg_known_zero(env, regs, BPF_REG_0);
+				regs[BPF_REG_0].type = PTR_TO_MEM;
+				regs[BPF_REG_0].mem_size = meta.mem_size;
+			}
  		} else if (!__btf_type_is_struct(ptr_type)) {
  			if (!meta.r0_size) {
  				__u32 sz;

-- 
Pavel Begunkov


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-12 13:26     ` Pavel Begunkov
@ 2025-06-12 14:06       ` Jens Axboe
  2025-06-13  0:25       ` Alexei Starovoitov
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-06-12 14:06 UTC (permalink / raw)
  To: Pavel Begunkov, Alexei Starovoitov; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On 6/12/25 7:26 AM, Pavel Begunkov wrote:
>> For next revision please post all selftest, examples,
>> and bpf progs on the list,
>> so people don't need to search github.
> 
> Did the link in the cover letter not work for you? I'm confused
> since it's all in a branch in my tree, but you linked to the same
> patches but in Jens' tree, and I have zero clue what they're
> doing there or how you found them.

Puzzled me too, but if you go there, github will say:

"This commit does not belong to any branch on this repository, and may
 belong to a fork outside of the repository."

which is exactly because it's not in my tree, but in your fork of
my tree. Pretty wonky GH behavior if you ask me, but there it is.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-12 13:26     ` Pavel Begunkov
  2025-06-12 14:06       ` Jens Axboe
@ 2025-06-13  0:25       ` Alexei Starovoitov
  2025-06-13 16:12         ` Pavel Begunkov
  1 sibling, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2025-06-13  0:25 UTC (permalink / raw)
  To: Pavel Begunkov, Andrii Nakryiko; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On Thu, Jun 12, 2025 at 6:25 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 6/12/25 03:47, Alexei Starovoitov wrote:
> > On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...>> +__bpf_kfunc
> >> +struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
> >> +{
> >> +       struct io_rings *rings = ctx->rings;
> >> +       unsigned int mask = ctx->cq_entries - 1;
> >> +       unsigned head = rings->cq.head;
> >> +       struct io_uring_cqe *cqe;
> >> +
> >> +       /* TODO CQE32 */
> >> +       if (head == rings->cq.tail)
> >> +               return NULL;
> >> +
> >> +       cqe = &rings->cqes[head & mask];
> >> +       rings->cq.head++;
> >> +       return cqe;
> >> +}
> >> +
> >> +__bpf_kfunc_end_defs();
> >> +
> >> +BTF_KFUNCS_START(io_uring_kfunc_set)
> >> +BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
> >> +BTF_KFUNCS_END(io_uring_kfunc_set)
> >
> > This is not safe in general.
> > The verifier doesn't enforce argument safety here.
> > As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
> > And once you do that you'll see that the verifier
> > doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
> > as trusted.
>
> Thanks, will add it. If I read it right, without the flag the
> program can, for example, create a struct io_ring_ctx on stack,
> fill it with nonsense and pass to kfuncs. Is that right?

No. The verifier will only allow a pointer to struct io_ring_ctx
to be passed, but it may not be fully trusted.

The verifier has 3 types of pointers to kernel structures:
1. ptr_to_btf_id
2. ptr_to_btf_id | trusted
3. ptr_to_btf_id | untrusted

1st was added long ago for tracing and gradually got adopted
for non-tracing needs, but it has a foot gun, since
all pointer walks keep ptr_to_btf_id type.
It's fine in some cases to follow pointers, but not in all.
Hence 2nd variant was added and there
foo->bar dereference needs to be explicitly allowed
instead of allowed by default like for 1st kind.

All loads through 1 and 3 are implemented as probe_read_kernel.
while loads from 2 are direct loads.

So kfuncs without KF_TRUSTED_ARGS with struct io_ring_ctx *ctx
argument are likely fine and safe, since it's impossible
to get this io_ring_ctx pointer by dereferencing some other pointer.
But better to tighten safety from the start.
We recommend KF_TRUSTED_ARGS for all kfuncs and
eventually it will be the default.

> > Looking at your example:
> > https://github.com/axboe/liburing/commit/706237127f03e15b4cc9c7c31c16d34dbff37cdc
> > it doesn't care about contents of cqe and doesn't pass it further.
> > So sort-of ok-ish right now,
> > but if you need to pass cqe to another kfunc
> > you would need to add an open coded iterator for cqe-s
> > with appropriate KF_ITER* flags
> > or maybe add acquire/release semantics for cqe.
> > Like, get_cqe will be KF_ACQUIRE, and you'd need
> > matching KF_RELEASE kfunc,
> > so that 'cqe' is not lost.
> > Then 'cqe' will be trusted and you can pass it as actual 'cqe'
> > into another kfunc.
> > Without KF_ACQUIRE the verifier sees that get_cqe*() kfuncs
> > return 'struct io_uring_cqe *' and it's ok for tracing
> > or passing into kfuncs like bpf_io_uring_queue_sqe()
> > that don't care about a particular type,
> > but not ok for full tracking of objects.
>
> I don't need type safety for SQEs / CQEs, they're supposed to be simple
> memory blobs containing userspace data only. SQ / CQ are shared with
> userspace, and the kfuncs can leak the content of passed CQE / SQE to
> userspace. But I'd like to find a way to reject programs stashing
> kernel pointers / data into them.

That's impossible.
If you're worried about bpf prog exposing kernel addresses
to user space then abort the whole thing.
CAP_PERFMON is required for the majority of bpf progs.

>
> BPF_PROG(name, struct io_ring_ctx *io_ring)
> {
>      struct io_uring_sqe *cqe = ...;
>      cqe->user_data = io_ring;
>      cqe->res = io_ring->private_field;
> }
>
> And I mentioned in the message, I rather want to get rid of half of the
> kfuncs, and give BPF direct access to the SQ/CQ instead. Schematically
> it should look like this:
>
> BPF_PROG(name, struct io_ring_ctx *ring)
> {
>      struct io_uring_sqe *sqes = get_SQ(ring);
>
>      sqes[ring->sq_tail]->opcode = OP_NOP;
>      bpf_kfunc_submit_sqes(ring, 1);
>
>      struct io_uring_cqe *cqes = get_CQ(ring);
>      print_cqe(&cqes[ring->cq_head]);
> }
>
> I hacked up RET_PTR_TO_MEM for kfuncs, the diff is below, but it'd be
> great to get rid of the constness of the size argument. I need to
> digest arenas first as conceptually they look very close.

arena is a special memory region where every byte is writeable
by user space.

>
> > For next revision please post all selftest, examples,
> > and bpf progs on the list,
> > so people don't need to search github.
>
> Did the link in the cover letter not work for you? I'm confused
> since it's all in a branch in my tree, but you linked to the same
> patches but in Jens' tree, and I have zero clue what they're
> doing there or how you found them.

External links can disappear. It's not good for reviewers and
for keeping the history of conversation.

>
> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> index 9494e4289605..400a06a74b5d 100644
> --- a/io_uring/bpf.c
> +++ b/io_uring/bpf.c
> @@ -2,6 +2,7 @@
>   #include <linux/bpf_verifier.h>
>
>   #include "io_uring.h"
> +#include "memmap.h"
>   #include "bpf.h"
>   #include "register.h"
>
> @@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
>         return cqe;
>   }
>
> +__bpf_kfunc
> +void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz)
> +{
> +       if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT))
> +               return NULL;
> +       return io_region_get_ptr(&ctx->ring_region);
> +}

and bpf prog should be able to read/write anything in
[ctx->ring_region->ptr, ..ptr + size] region ?

Populating (creating) dynptr is probably better.
See bpf_dynptr_from*()

but what is the lifetime of that memory ?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-13  0:25       ` Alexei Starovoitov
@ 2025-06-13 16:12         ` Pavel Begunkov
  2025-06-13 19:51           ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-13 16:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko; +Cc: io-uring, Martin KaFai Lau, bpf, LKML

On 6/13/25 01:25, Alexei Starovoitov wrote:
> On Thu, Jun 12, 2025 at 6:25 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...>>>> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
>>>> +BTF_KFUNCS_END(io_uring_kfunc_set)
>>>
>>> This is not safe in general.
>>> The verifier doesn't enforce argument safety here.
>>> As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
>>> And once you do that you'll see that the verifier
>>> doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
>>> as trusted.
>>
>> Thanks, will add it. If I read it right, without the flag the
>> program can, for example, create a struct io_ring_ctx on stack,
>> fill it with nonsense and pass to kfuncs. Is that right?
> 
> No. The verifier will only allow a pointer to struct io_ring_ctx
> to be passed, but it may not be fully trusted.
> 
> The verifier has 3 types of pointers to kernel structures:
> 1. ptr_to_btf_id
> 2. ptr_to_btf_id | trusted
> 3. ptr_to_btf_id | untrusted
> 
> 1st was added long ago for tracing and gradually got adopted
> for non-tracing needs, but it has a foot gun, since
> all pointer walks keep ptr_to_btf_id type.
> It's fine in some cases to follow pointers, but not in all.
> Hence 2nd variant was added and there
> foo->bar dereference needs to be explicitly allowed
> instead of allowed by default like for 1st kind.
> 
> All loads through 1 and 3 are implemented as probe_read_kernel.
> while loads from 2 are direct loads.
> 
> So kfuncs without KF_TRUSTED_ARGS with struct io_ring_ctx *ctx
> argument are likely fine and safe, since it's impossible
> to get this io_ring_ctx pointer by dereferencing some other pointer.
> But better to tighten safety from the start.
> We recommend KF_TRUSTED_ARGS for all kfuncs and
> eventually it will be the default.

Sure, I'll add it, thanks for the explanation

...>> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
>> index 9494e4289605..400a06a74b5d 100644
>> --- a/io_uring/bpf.c
>> +++ b/io_uring/bpf.c
>> @@ -2,6 +2,7 @@
>>    #include <linux/bpf_verifier.h>
>>
>>    #include "io_uring.h"
>> +#include "memmap.h"
>>    #include "bpf.h"
>>    #include "register.h"
>>
>> @@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
>>          return cqe;
>>    }
>>
>> +__bpf_kfunc
>> +void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz)
>> +{
>> +       if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT))
>> +               return NULL;
>> +       return io_region_get_ptr(&ctx->ring_region);
>> +}
> 
> and bpf prog should be able to read/write anything in
> [ctx->ring_region->ptr, ..ptr + size] region ?

Right, and it's already rw mmap'ed into the user space.

> Populating (creating) dynptr is probably better.
> See bpf_dynptr_from*()
> 
> but what is the lifetime of that memory ?

It's valid within a single run of the callback but shouldn't cross
into another invocation. Specifically, it's protected by the lock,
but that can be tuned. Does that match with what PTR_TO_MEM expects?

I can add refcounting for longer term pinning, maybe to store it
as a bpf map or whatever is the right way, but I'd rather avoid
anything expensive in the kfunc as that'll likely be called on
every program run.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-13 16:12         ` Pavel Begunkov
@ 2025-06-13 19:51           ` Alexei Starovoitov
  2025-06-16 20:34             ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2025-06-13 19:51 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Andrii Nakryiko, io-uring, Martin KaFai Lau, bpf, LKML

On Fri, Jun 13, 2025 at 9:11 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 6/13/25 01:25, Alexei Starovoitov wrote:
> > On Thu, Jun 12, 2025 at 6:25 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...>>>> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
> >>>> +BTF_KFUNCS_END(io_uring_kfunc_set)
> >>>
> >>> This is not safe in general.
> >>> The verifier doesn't enforce argument safety here.
> >>> As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
> >>> And once you do that you'll see that the verifier
> >>> doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
> >>> as trusted.
> >>
> >> Thanks, will add it. If I read it right, without the flag the
> >> program can, for example, create a struct io_ring_ctx on stack,
> >> fill it with nonsense and pass to kfuncs. Is that right?
> >
> > No. The verifier will only allow a pointer to struct io_ring_ctx
> > to be passed, but it may not be fully trusted.
> >
> > The verifier has 3 types of pointers to kernel structures:
> > 1. ptr_to_btf_id
> > 2. ptr_to_btf_id | trusted
> > 3. ptr_to_btf_id | untrusted
> >
> > 1st was added long ago for tracing and gradually got adopted
> > for non-tracing needs, but it has a foot gun, since
> > all pointer walks keep ptr_to_btf_id type.
> > It's fine in some cases to follow pointers, but not in all.
> > Hence 2nd variant was added and there
> > foo->bar dereference needs to be explicitly allowed
> > instead of allowed by default like for 1st kind.
> >
> > All loads through 1 and 3 are implemented as probe_read_kernel.
> > while loads from 2 are direct loads.
> >
> > So kfuncs without KF_TRUSTED_ARGS with struct io_ring_ctx *ctx
> > argument are likely fine and safe, since it's impossible
> > to get this io_ring_ctx pointer by dereferencing some other pointer.
> > But better to tighten safety from the start.
> > We recommend KF_TRUSTED_ARGS for all kfuncs and
> > eventually it will be the default.
>
> Sure, I'll add it, thanks for the explanation
>
> ...>> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> >> index 9494e4289605..400a06a74b5d 100644
> >> --- a/io_uring/bpf.c
> >> +++ b/io_uring/bpf.c
> >> @@ -2,6 +2,7 @@
> >>    #include <linux/bpf_verifier.h>
> >>
> >>    #include "io_uring.h"
> >> +#include "memmap.h"
> >>    #include "bpf.h"
> >>    #include "register.h"
> >>
> >> @@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
> >>          return cqe;
> >>    }
> >>
> >> +__bpf_kfunc
> >> +void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz)
> >> +{
> >> +       if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT))
> >> +               return NULL;
> >> +       return io_region_get_ptr(&ctx->ring_region);
> >> +}
> >
> > and bpf prog should be able to read/write anything in
> > [ctx->ring_region->ptr, ..ptr + size] region ?
>
> Right, and it's already rw mmap'ed into the user space.
>
> > Populating (creating) dynptr is probably better.
> > See bpf_dynptr_from*()
> >
> > but what is the lifetime of that memory ?
>
> It's valid within a single run of the callback but shouldn't cross
> into another invocation. Specifically, it's protected by the lock,
> but that can be tuned. Does that match with what PTR_TO_MEM expects?

yes. PTR_TO_MEM lasts for duration of the prog.

> I can add refcounting for longer term pinning, maybe to store it
> as a bpf map or whatever is the right way, but I'd rather avoid
> anything expensive in the kfunc as that'll likely be called on
> every program run.

yeah. let's not add any refcounting.

It sounds like you want something similar to
__bpf_kfunc __u8 *
hid_bpf_get_data(struct hid_bpf_ctx *ctx, unsigned int offset, const
size_t rdwr_buf_size)

we have a special hack for it already in the verifier.
The argument need to be called rdwr_buf_size,
then it will be used to establish the range of PTR_TO_MEM.
It has to be run-time constant.

What you're proposing with "__retsz" is a cleaner version of the same.
But consider bpf_dynptr_from_io_uring(struct io_ring_ctx *ctx)
it can create a dynamically sized region,
and later use bpf_dynptr_slice_rdwr() to get writeable chunk of it.

I feel that __retsz approach may actually be a better fit at the end,
if you're ok with constant arg.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers
  2025-06-13 19:51           ` Alexei Starovoitov
@ 2025-06-16 20:34             ` Pavel Begunkov
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2025-06-16 20:34 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Andrii Nakryiko, io-uring, Martin KaFai Lau, bpf, LKML

On 6/13/25 20:51, Alexei Starovoitov wrote:
> On Fri, Jun 13, 2025 at 9:11 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...>>
>> It's valid within a single run of the callback but shouldn't cross
>> into another invocation. Specifically, it's protected by the lock,
>> but that can be tuned. Does that match with what PTR_TO_MEM expects?
> 
> yes. PTR_TO_MEM lasts for duration of the prog.
> 
>> I can add refcounting for longer term pinning, maybe to store it
>> as a bpf map or whatever is the right way, but I'd rather avoid
>> anything expensive in the kfunc as that'll likely be called on
>> every program run.
> 
> yeah. let's not add any refcounting.
> 
> It sounds like you want something similar to
> __bpf_kfunc __u8 *
> hid_bpf_get_data(struct hid_bpf_ctx *ctx, unsigned int offset, const
> size_t rdwr_buf_size)
> 
> we have a special hack for it already in the verifier.
> The argument need to be called rdwr_buf_size,
> then it will be used to establish the range of PTR_TO_MEM.
> It has to be run-time constant.

Great, I can just use that

> What you're proposing with "__retsz" is a cleaner version of the same.
> But consider bpf_dynptr_from_io_uring(struct io_ring_ctx *ctx)
> it can create a dynamically sized region,
> and later use bpf_dynptr_slice_rdwr() to get writeable chunk of it.
> 
> I feel that __retsz approach may actually be a better fit at the end,
> if you're ok with constant arg.
I took a quick look, 16MB sounds a bit restrictive long term. I'll
just go for rdwr_buf_size while experimenting and hopefully will be
able to make a more educated choice later

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC v2 0/5] BPF controlled io_uring
  2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
                   ` (4 preceding siblings ...)
  2025-06-06 13:58 ` [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers Pavel Begunkov
@ 2025-06-06 14:38 ` Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-06-06 14:38 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Martin KaFai Lau, bpf, linux-kernel

On 6/6/25 7:57 AM, Pavel Begunkov wrote:
> This series adds io_uring BPF struct_ops, which allows processing
> events and submitting requests from BPF without returning to user.
> There is only one callback for now, it's called from the io_uring
> CQ waiting loop when there is an event to be processed. It also
> has access to waiting parameters like batching and timeouts.
> 
> It's tested with a program that queues a nop request, waits for
> its completion and then queues another request, repeating it N
> times. The baseline to compare with is traditional io_uring
> application doing same without BPF and using 2 requests links,
> with the same total number of requests.
> 
> # ./link 0 100000000
> type 2-LINK, requests to run 100000000
> sec 20, total (ms) 20374
> # ./link 1 100000000
> type BPF, requests to run 100000000
> sec 13, total (ms) 13700
> 
> The BPF version works ~50% faster on a mitigated kernel, while it's
> not even a completely fair comparison as links are restrictive and
> can't always be used. Without links the speedup reaches ~80%.

Nifty! Great to see the BPF side taking shape, I can think of many cool
things we could do with that. Out of curiosity, tested this on my usual
arm64 vm on the laptop:

axboe@m2max-kvm ~/g/l/examples-bpf (bpf) [1]> ./link 0 100000000
type 2-LINK, requests to run 100000000
sec 13, total (ms) 13868

axboe@m2max-kvm ~/g/l/examples-bpf (bpf)> sudo ./link 1 100000000
type BPF, requests to run 100000000
sec 4, total (ms) 4929

No mitigations or anything configured in this kernel.

I'll take a closer look at the patches.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-06-16 20:33 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-06 13:57 [RFC v2 0/5] BPF controlled io_uring Pavel Begunkov
2025-06-06 13:57 ` [RFC v2 1/5] io_uring: add struct for state controlling cqwait Pavel Begunkov
2025-06-06 13:57 ` [RFC v2 2/5] io_uring/bpf: add stubs for bpf struct_ops Pavel Begunkov
2025-06-06 14:25   ` Jens Axboe
2025-06-06 14:28     ` Jens Axboe
2025-06-06 13:58 ` [RFC v2 3/5] io_uring/bpf: implement struct_ops registration Pavel Begunkov
2025-06-06 14:57   ` Jens Axboe
2025-06-06 20:00     ` Pavel Begunkov
2025-06-06 21:07       ` Jens Axboe
2025-06-06 13:58 ` [RFC v2 4/5] io_uring/bpf: add handle events callback Pavel Begunkov
2025-06-12  2:28   ` Alexei Starovoitov
2025-06-12  9:33     ` Pavel Begunkov
2025-06-12 14:07     ` Jens Axboe
2025-06-06 13:58 ` [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers Pavel Begunkov
2025-06-12  2:47   ` Alexei Starovoitov
2025-06-12 13:26     ` Pavel Begunkov
2025-06-12 14:06       ` Jens Axboe
2025-06-13  0:25       ` Alexei Starovoitov
2025-06-13 16:12         ` Pavel Begunkov
2025-06-13 19:51           ` Alexei Starovoitov
2025-06-16 20:34             ` Pavel Begunkov
2025-06-06 14:38 ` [RFC v2 0/5] BPF controlled io_uring Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).