* [RFC v2 2/3] block: Introduce the UFQ I/O scheduler
2026-05-03 3:56 [RFC v2 0/3] block: Introduce a BPF-based I/O scheduler Kaitao cheng
2026-05-03 3:56 ` [RFC v2 1/3] bpf: Add KF_SPIN_LOCK flag for kfuncs under bpf_spin_lock Kaitao cheng
@ 2026-05-03 3:56 ` Kaitao cheng
2026-05-03 4:45 ` bot+bpf-ci
2026-05-03 3:56 ` [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding Kaitao cheng
2 siblings, 1 reply; 6+ messages in thread
From: Kaitao cheng @ 2026-05-03 3:56 UTC (permalink / raw)
To: axboe, ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, john.fastabend
Cc: bpf, linux-block, linux-kernel, Kaitao Cheng
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Introduce IOSCHED_UFQ, a blk-mq elevator ("ufq: User-programmable
Flexible Queueing") whose policy is supplied by an eBPF program via
struct_ops (insert, dispatch, merge, finish, etc.).
When no eBPF program is attached, the UFQ I/O scheduler uses a simple,
per-ctx queueing policy (similar to the "none" elevator). After an eBPF program is
attached, the user-defined scheduling policy replaces UFQ’s built-in
queueing policy, while per-ctx queues remain available as a fallback
mechanism.
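
For illustration, the BPF side of such a policy is declared with the usual
libbpf struct_ops conventions. The sketch below is not part of the patch and
only shows the wiring: the member names match the ufq_iosched_ops struct added
here, while the program names, the "ufq_sketch" policy name, and the
abbreviated flags type are illustrative assumptions. A usable policy must also
track inserted requests and hand them back from dispatch_req, as the
ufq_simple example in patch 3/3 does.

  /*
   * Minimal wiring sketch (assumption-laden, see above): attach two of the
   * ufq_iosched_ops hooks and register the struct_ops map.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("struct_ops/insert_req")
  int BPF_PROG(sketch_insert, struct request_queue *q, struct request *rq,
               unsigned int flags /* blk_insert_t */)
  {
          /* a real policy would record rq here for a later dispatch_req */
          return 0;
  }

  SEC("struct_ops/has_req")
  bool BPF_PROG(sketch_has_req, struct request_queue *q, int rqs_count)
  {
          return rqs_count > 0;   /* mirror the kernel-side stub */
  }

  SEC(".struct_ops.link")
  struct ufq_iosched_ops sketch_ops = {
          .insert_req = (void *)sketch_insert,
          .has_req    = (void *)sketch_has_req,
          .name       = "ufq_sketch",
  };

  char _license[] SEC("license") = "GPL";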
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
block/Kconfig.iosched | 8 +
block/Makefile | 1 +
block/blk-merge.c | 28 +-
block/blk-mq.c | 8 +-
block/blk-mq.h | 2 +-
block/blk.h | 5 +
block/ufq-bpfops.c | 241 ++++++++++++++++
block/ufq-iosched.c | 640 ++++++++++++++++++++++++++++++++++++++++++
block/ufq-iosched.h | 64 +++++
block/ufq-kfunc.c | 131 +++++++++
10 files changed, 1115 insertions(+), 13 deletions(-)
create mode 100644 block/ufq-bpfops.c
create mode 100644 block/ufq-iosched.c
create mode 100644 block/ufq-iosched.h
create mode 100644 block/ufq-kfunc.c
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 27f11320b8d1..56afc425cc52 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -44,4 +44,12 @@ config BFQ_CGROUP_DEBUG
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
+config IOSCHED_UFQ
+ tristate "UFQ I/O scheduler"
+ default y
+ help
+ The UFQ I/O scheduler is a programmable I/O scheduler. When
+ enabled, an eBPF scheduler loaded from user space can attach to
+ its struct_ops hooks and redefine the I/O scheduling policy;
+ without one attached, UFQ falls back to simple per-ctx queueing.
endmenu
diff --git a/block/Makefile b/block/Makefile
index 7dce2e44276c..a58ea7384b7a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
obj-$(CONFIG_IOSCHED_BFQ) += bfq.o
+obj-$(CONFIG_IOSCHED_UFQ) += ufq-iosched.o ufq-bpfops.o ufq-kfunc.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o \
bio-integrity-auto.o bio-integrity-fs.o
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fcf09325b22e..7a98dc75a06f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -774,8 +774,8 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
* For non-mq, this has to be called with the request spinlock acquired.
* For mq with scheduling, the appropriate queue wide lock should be held.
*/
-static struct request *attempt_merge(struct request_queue *q,
- struct request *req, struct request *next)
+static struct request *attempt_merge(struct request_queue *q, struct request *req,
+ struct request *next, bool nohash)
{
if (!rq_mergeable(req) || !rq_mergeable(next))
return NULL;
@@ -842,7 +842,7 @@ static struct request *attempt_merge(struct request_queue *q,
req->__data_len += blk_rq_bytes(next);
- if (!blk_discard_mergable(req))
+ if (!nohash && !blk_discard_mergable(req))
elv_merge_requests(q, req, next);
blk_crypto_rq_put_keyslot(next);
@@ -868,7 +868,7 @@ static struct request *attempt_back_merge(struct request_queue *q,
struct request *next = elv_latter_request(q, rq);
if (next)
- return attempt_merge(q, rq, next);
+ return attempt_merge(q, rq, next, false);
return NULL;
}
@@ -879,11 +879,17 @@ static struct request *attempt_front_merge(struct request_queue *q,
struct request *prev = elv_former_request(q, rq);
if (prev)
- return attempt_merge(q, prev, rq);
+ return attempt_merge(q, prev, rq, false);
return NULL;
}
+struct request *bpf_attempt_merge(struct request_queue *q, struct request *rq,
+ struct request *next)
+{
+ return attempt_merge(q, rq, next, true);
+}
+
/*
* Try to merge 'next' into 'rq'. Return true if the merge happened, false
* otherwise. The caller is responsible for freeing 'next' if the merge
@@ -892,7 +898,7 @@ static struct request *attempt_front_merge(struct request_queue *q,
bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
struct request *next)
{
- return attempt_merge(q, rq, next);
+ return attempt_merge(q, rq, next, false);
}
bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
@@ -1035,11 +1041,11 @@ static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
return BIO_MERGE_FAILED;
}
-static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
- struct request *rq,
- struct bio *bio,
- unsigned int nr_segs,
- bool sched_allow_merge)
+enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
+ struct request *rq,
+ struct bio *bio,
+ unsigned int nr_segs,
+ bool sched_allow_merge)
{
if (!blk_rq_merge_ok(rq, bio))
return BIO_MERGE_NONE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..bebc1306d8fd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -796,7 +796,7 @@ static void blk_mq_finish_request(struct request *rq)
}
}
-static void __blk_mq_free_request(struct request *rq)
+void __blk_mq_free_request(struct request *rq)
{
struct request_queue *q = rq->q;
struct blk_mq_ctx *ctx = rq->mq_ctx;
@@ -1844,6 +1844,12 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
if (list_empty(&ctx->rq_lists[type]))
sbitmap_clear_bit(sb, bitnr);
}
+
+ if (dispatch_data->rq) {
+ dispatch_data->rq->rq_flags |= RQF_STARTED;
+ if (hctx->queue->last_merge == dispatch_data->rq)
+ hctx->queue->last_merge = NULL;
+ }
spin_unlock(&ctx->lock);
return !dispatch_data->rq;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index aa15d31aaae9..3f85cae7bf57 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -56,7 +56,7 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *start);
void blk_mq_put_rq_ref(struct request *rq);
-
+void __blk_mq_free_request(struct request *rq);
/*
* Internal helpers for allocating/freeing the request map
*/
diff --git a/block/blk.h b/block/blk.h
index ec4674cdf2ea..51adc4cdcee4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -313,6 +313,9 @@ enum bio_merge_status {
enum bio_merge_status bio_attempt_back_merge(struct request *req,
struct bio *bio, unsigned int nr_segs);
+enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
+ struct request *rq, struct bio *bio, unsigned int nr_segs,
+ bool sched_allow_merge);
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs);
bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
@@ -444,6 +447,8 @@ static inline unsigned get_max_segment_size(const struct queue_limits *lim,
int ll_back_merge_fn(struct request *req, struct bio *bio,
unsigned int nr_segs);
+struct request *bpf_attempt_merge(struct request_queue *q, struct request *rq,
+ struct request *next);
bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
struct request *next);
unsigned int blk_recalc_rq_segments(struct request *rq);
diff --git a/block/ufq-bpfops.c b/block/ufq-bpfops.c
new file mode 100644
index 000000000000..1c3c62e6c47e
--- /dev/null
+++ b/block/ufq-bpfops.c
@@ -0,0 +1,241 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <linux/string.h>
+#include <linux/wait.h>
+#include <linux/rcupdate.h>
+#include "ufq-iosched.h"
+
+struct ufq_iosched_ops ufq_ops;
+static atomic_t ufq_bpfops_enabled;
+static atomic_t ufq_bpfops_users;
+static DECLARE_WAIT_QUEUE_HEAD(ufq_bpfops_wq);
+
+const struct ufq_iosched_ops *ufq_bpfops_tryget(void)
+{
+ if (!atomic_read(&ufq_bpfops_enabled))
+ return NULL;
+
+ atomic_inc(&ufq_bpfops_users);
+ /*
+ * Pairs with disable path flipping ufq_bpfops_enabled to make sure no
+ * callback runs after teardown starts.
+ */
+ smp_mb__after_atomic();
+
+ if (unlikely(!atomic_read(&ufq_bpfops_enabled))) {
+ if (atomic_dec_and_test(&ufq_bpfops_users))
+ wake_up_all(&ufq_bpfops_wq);
+ return NULL;
+ }
+
+ return &ufq_ops;
+}
+
+void ufq_bpfops_put(void)
+{
+ if (atomic_dec_and_test(&ufq_bpfops_users))
+ wake_up_all(&ufq_bpfops_wq);
+}
+
+static const struct bpf_func_proto *
+bpf_ufq_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static bool bpf_ufq_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (type != BPF_READ)
+ return false;
+ if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+ return false;
+ if (off % size != 0)
+ return false;
+
+ /*
+ * btf_ctx_access() treats pointers that are not "pointer to struct"
+ * as scalars (no reg_type), so loading pointers like merge_req()'s
+ * int *type or merge_bio()'s bool *merged from ctx leaves a SCALAR
+ * and stores through them fail verification. Model both as writable
+ * buffers.
+ */
+ if (size == sizeof(__u64) && prog->aux->attach_func_name &&
+ ((!strcmp(prog->aux->attach_func_name, "merge_req") && off == 16) ||
+ (!strcmp(prog->aux->attach_func_name, "merge_bio") && off == 24))) {
+ if (!btf_ctx_access(off, size, type, prog, info))
+ return false;
+ info->reg_type = PTR_TO_BUF;
+ return true;
+ }
+
+ return btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_ufq_verifier_ops = {
+ .get_func_proto = bpf_ufq_get_func_proto,
+ .is_valid_access = bpf_ufq_is_valid_access,
+};
+
+static int bpf_ufq_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct ufq_iosched_ops *uops = udata;
+ struct ufq_iosched_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+ int ret;
+
+ switch (moff) {
+ case offsetof(struct ufq_iosched_ops, name):
+ ret = bpf_obj_name_cpy(ops->name, uops->name,
+ sizeof(ops->name));
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ return -EINVAL;
+ return 1;
+ /* other members are handled here as they are added */
+ }
+
+ return 0;
+}
+
+static int bpf_ufq_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ return 0;
+}
+
+static int bpf_ufq_enable(void *ops)
+{
+ ufq_ops = *(struct ufq_iosched_ops *)ops;
+ atomic_set(&ufq_bpfops_enabled, 1);
+ return 0;
+}
+
+static void bpf_ufq_disable(struct ufq_iosched_ops *ops)
+{
+ atomic_set(&ufq_bpfops_enabled, 0);
+ wait_event(ufq_bpfops_wq, !atomic_read(&ufq_bpfops_users));
+ memset(&ufq_ops, 0, sizeof(ufq_ops));
+}
+
+static int bpf_ufq_reg(void *kdata, struct bpf_link *link)
+{
+ return ufq_prepare_bpf_attach(bpf_ufq_enable, kdata);
+}
+
+static void bpf_ufq_unreg(void *kdata, struct bpf_link *link)
+{
+ bpf_ufq_disable(kdata);
+ ufq_kick_all_hw_queues();
+}
+
+static int bpf_ufq_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_ufq_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ /*
+ * UFQ does not support live-updating an already-attached BPF scheduler:
+ * partial failure during callback setup (e.g. init_sched) would be hard
+ * to reason about, and update can race with unregister/teardown.
+ */
+ return -EOPNOTSUPP;
+}
+
+static int bpf_ufq_validate(void *kdata)
+{
+ return 0;
+}
+
+static int init_sched_stub(struct request_queue *q)
+{
+ return -EPERM;
+}
+
+static int exit_sched_stub(struct request_queue *q)
+{
+ return -EPERM;
+}
+
+static int insert_req_stub(struct request_queue *q, struct request *rq,
+ blk_insert_t flags)
+{
+ return 0;
+}
+
+static struct request *dispatch_req_stub(struct request_queue *q)
+{
+ return NULL;
+}
+
+static bool has_req_stub(struct request_queue *q, int rqs_count)
+{
+ return rqs_count > 0;
+}
+
+static void finish_req_stub(struct request *rq)
+{
+}
+
+static struct request *merge_req_stub(struct request_queue *q, struct request *rq,
+ int *type)
+{
+ *type = ELEVATOR_NO_MERGE;
+ return NULL;
+}
+
+static struct request *merge_bio_stub(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs, bool *merged)
+{
+ if (merged)
+ *merged = false;
+
+ return NULL;
+}
+
+static struct ufq_iosched_ops __bpf_ops_ufq_ops = {
+ .init_sched = init_sched_stub,
+ .exit_sched = exit_sched_stub,
+ .insert_req = insert_req_stub,
+ .dispatch_req = dispatch_req_stub,
+ .has_req = has_req_stub,
+ .merge_req = merge_req_stub,
+ .finish_req = finish_req_stub,
+ .merge_bio = merge_bio_stub,
+};
+
+static struct bpf_struct_ops bpf_iosched_ufq_ops = {
+ .verifier_ops = &bpf_ufq_verifier_ops,
+ .reg = bpf_ufq_reg,
+ .unreg = bpf_ufq_unreg,
+ .check_member = bpf_ufq_check_member,
+ .init_member = bpf_ufq_init_member,
+ .init = bpf_ufq_init,
+ .update = bpf_ufq_update,
+ .validate = bpf_ufq_validate,
+ .name = "ufq_iosched_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &__bpf_ops_ufq_ops
+};
+
+int bpf_ufq_ops_init(void)
+{
+ return register_bpf_struct_ops(&bpf_iosched_ufq_ops, ufq_iosched_ops);
+}
diff --git a/block/ufq-iosched.c b/block/ufq-iosched.c
new file mode 100644
index 000000000000..ebbb63e0ef51
--- /dev/null
+++ b/block/ufq-iosched.c
@@ -0,0 +1,640 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/sbitmap.h>
+#include <linux/workqueue.h>
+
+#include <trace/events/block.h>
+
+#include "elevator.h"
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
+#include "ufq-iosched.h"
+
+static DEFINE_MUTEX(ufq_active_queues_lock);
+static LIST_HEAD(ufq_active_queues);
+
+enum ufq_priv_state {
+ UFQ_PRIV_NOT_IN_SCHED = 0,
+ UFQ_PRIV_IN_BPF = 1,
+ UFQ_PRIV_IN_UFQ = 2,
+ UFQ_PRIV_IN_SCHED = 3,
+};
+
+static struct request *ufq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct ufq_data *ufq = hctx->queue->elevator->elevator_data;
+ const struct ufq_iosched_ops *ops;
+ struct blk_mq_ctx *ctx;
+ struct request *rq = NULL;
+ unsigned short idx;
+
+ ops = ufq_bpfops_tryget();
+ if (ops && ops->dispatch_req) {
+ rq = ops->dispatch_req(hctx->queue);
+ if (!rq) {
+ atomic_inc(&ufq->ops_stats.dispatch_null_count);
+ ufq_bpfops_put();
+ return NULL;
+ }
+ ufq_bpfops_put();
+
+ /* The BPF insert_req callback bumps the request's reference
+ * count; dispatch_req returns that same request with an extra
+ * reference held. The kernel must put that reference here,
+ * and the request's refcount is always greater than zero at
+ * this point.
+ */
+ if (WARN_ON_ONCE(req_ref_put_and_test(rq))) {
+ __blk_mq_free_request(rq);
+ return NULL;
+ }
+
+ ctx = rq->mq_ctx;
+ spin_lock(&ctx->lock);
+ if (unlikely(blk_mq_rq_state(rq) != MQ_RQ_IDLE ||
+ (rq->rq_flags & RQF_STARTED) ||
+ list_empty(&rq->queuelist))) {
+ spin_unlock(&ctx->lock);
+ return NULL;
+ }
+ list_del_init(&rq->queuelist);
+ rq->rq_flags |= RQF_STARTED;
+ if (hctx->queue->last_merge == rq)
+ hctx->queue->last_merge = NULL;
+ if (list_empty(&ctx->rq_lists[rq->mq_hctx->type]))
+ sbitmap_clear_bit(&rq->mq_hctx->ctx_map,
+ ctx->index_hw[rq->mq_hctx->type]);
+ spin_unlock(&ctx->lock);
+ atomic_inc(&ufq->ops_stats.dispatch_ok_count);
+ atomic64_add(blk_rq_sectors(rq), &ufq->ops_stats.dispatch_ok_sectors);
+ rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
+ & ~UFQ_PRIV_IN_UFQ);
+ } else {
+ if (ops)
+ ufq_bpfops_put();
+ ctx = READ_ONCE(hctx->dispatch_from);
+ rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+ if (rq) {
+ idx = rq->mq_ctx->index_hw[hctx->type];
+ if (++idx == hctx->nr_ctx)
+ idx = 0;
+ WRITE_ONCE(hctx->dispatch_from, hctx->ctxs[idx]);
+ }
+ }
+
+ if (rq)
+ atomic_dec(&ufq->rqs_count);
+ return rq;
+}
+
+/*
+ * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
+ * function is used by __blk_mq_get_tag().
+ */
+static void ufq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
+{
+ struct ufq_data *ufq = data->q->elevator->elevator_data;
+
+ /* Do not throttle synchronous reads. */
+ if (op_is_sync(opf) && !op_is_write(opf))
+ return;
+
+ /*
+ * Throttle asynchronous requests and writes such that these requests
+ * do not block the allocation of synchronous requests.
+ */
+ data->shallow_depth = ufq->async_depth;
+}
+
+static void ufq_depth_updated(struct request_queue *q)
+{
+ struct ufq_data *ufq = q->elevator->elevator_data;
+
+ ufq->async_depth = q->nr_requests;
+ q->async_depth = q->nr_requests;
+ blk_mq_set_min_shallow_depth(q, 1);
+}
+
+static int ufq_init_sched(struct request_queue *q, struct elevator_queue *eq)
+{
+ const struct ufq_iosched_ops *ops;
+ struct ufq_data *ufq;
+
+ ufq = kzalloc_node(sizeof(*ufq), GFP_KERNEL, q->node);
+ if (!ufq)
+ return -ENOMEM;
+
+ eq->elevator_data = ufq;
+ ufq->q = q;
+ INIT_LIST_HEAD(&ufq->active_node);
+
+ blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
+ q->elevator = eq;
+
+ q->async_depth = q->nr_requests;
+ ufq->async_depth = q->nr_requests;
+
+ ops = ufq_bpfops_tryget();
+ if (ops) {
+ if (ops->init_sched)
+ ops->init_sched(q);
+ ufq_bpfops_put();
+ }
+
+ mutex_lock(&ufq_active_queues_lock);
+ list_add_tail(&ufq->active_node, &ufq_active_queues);
+ mutex_unlock(&ufq_active_queues_lock);
+
+ ufq_depth_updated(q);
+ return 0;
+}
+
+static void ufq_exit_sched(struct elevator_queue *e)
+{
+ const struct ufq_iosched_ops *ops;
+ struct ufq_data *ufq = e->elevator_data;
+
+ ops = ufq_bpfops_tryget();
+ if (ops) {
+ if (ops->exit_sched)
+ ops->exit_sched(ufq->q);
+ ufq_bpfops_put();
+ }
+
+ mutex_lock(&ufq_active_queues_lock);
+ if (!list_empty(&ufq->active_node))
+ list_del_init(&ufq->active_node);
+ mutex_unlock(&ufq_active_queues_lock);
+
+ WARN_ON_ONCE(atomic_read(&ufq->rqs_count));
+
+ kfree(ufq);
+ e->elevator_data = NULL;
+}
+
+void ufq_kick_all_hw_queues(void)
+{
+ struct ufq_data *ufq;
+
+ mutex_lock(&ufq_active_queues_lock);
+ list_for_each_entry(ufq, &ufq_active_queues, active_node)
+ blk_mq_run_hw_queues(ufq->q, true);
+ mutex_unlock(&ufq_active_queues_lock);
+}
+
+static int ufq_drain_ctx_rqs(struct ufq_data *ufq)
+{
+ struct request_queue *q = ufq->q;
+ unsigned long deadline = jiffies + 8 * HZ;
+
+ while (atomic_read(&ufq->rqs_count) > 0 && time_before(jiffies, deadline)) {
+ blk_mq_run_hw_queues(q, false);
+ if (atomic_read(&ufq->rqs_count) > 0) {
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ flush_delayed_work(&hctx->run_work);
+ flush_delayed_work(&q->requeue_work);
+ }
+ cond_resched();
+ }
+
+ if (atomic_read(&ufq->rqs_count) > 0) {
+ pr_warn_ratelimited("ufq: drain timeout (%d rqs) before BPF attach\n",
+ atomic_read(&ufq->rqs_count));
+ return -EBUSY;
+ }
+ return 0;
+}
+
+/*
+ * Mirror elevator_change(): freeze each queue, cancel mq dispatch work,
+ * then drain software-ctx requests while BPF callbacks are still off.
+ * @enable runs with all those queues still frozen so new ctx backlog cannot
+ * race ahead of turning BPF dispatch on.
+ */
+int ufq_prepare_bpf_attach(int (*enable)(void *kdata), void *kdata)
+{
+ struct ufq_data *ufq;
+ unsigned int memflags;
+ int frozen = 0, ret = 0;
+
+ mutex_lock(&ufq_active_queues_lock);
+ if (list_empty(&ufq_active_queues)) {
+ mutex_unlock(&ufq_active_queues_lock);
+ return enable(kdata);
+ }
+
+ memflags = memalloc_noio_save();
+ list_for_each_entry(ufq, &ufq_active_queues, active_node) {
+ blk_mq_freeze_queue_nomemsave(ufq->q);
+ blk_mq_cancel_work_sync(ufq->q);
+ frozen++;
+ ret = ufq_drain_ctx_rqs(ufq);
+ if (ret)
+ goto unfreeze;
+ }
+
+ ret = enable(kdata);
+unfreeze:
+ list_for_each_entry(ufq, &ufq_active_queues, active_node) {
+ if (!frozen--)
+ break;
+ blk_mq_unfreeze_queue_nomemrestore(ufq->q);
+ }
+ memalloc_noio_restore(memflags);
+ mutex_unlock(&ufq_active_queues_lock);
+ return ret;
+}
+
+static bool ufq_bio_merge(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs)
+{
+ struct ufq_data *ufq = q->elevator->elevator_data;
+ const struct ufq_iosched_ops *ops;
+ struct request *rq = NULL, *last;
+ enum bio_merge_status mstat;
+ struct blk_mq_ctx *ctx;
+ bool ret = false;
+
+ /*
+ * Levels of merges:
+ * nomerges: No merges at all attempted
+ * noxmerges: Only simple one-hit cache try
+ * merges: All merge tries attempted
+ */
+ if (blk_queue_nomerges(q) || !bio_mergeable(bio))
+ return false;
+
+ last = q->last_merge;
+ if (last) {
+ ctx = last->mq_ctx;
+ spin_lock(&ctx->lock);
+ if (last == q->last_merge && !list_empty(&last->queuelist)
+ && elv_bio_merge_ok(last, bio)) {
+ mstat = blk_attempt_bio_merge(q, last, bio, nr_segs, true);
+ if (mstat == BIO_MERGE_OK) {
+ spin_unlock(&ctx->lock);
+ atomic_inc(&ufq->ops_stats.merge_bio_ok_count);
+ atomic64_add(bio->bi_iter.bi_size >> SECTOR_SHIFT,
+ &ufq->ops_stats.merge_bio_ok_sectors);
+ return true;
+ }
+ if (mstat == BIO_MERGE_FAILED) {
+ spin_unlock(&ctx->lock);
+ return false;
+ }
+ }
+ spin_unlock(&ctx->lock);
+ }
+
+ if (blk_queue_noxmerges(q))
+ return false;
+
+ ops = ufq_bpfops_tryget();
+ if (ops) {
+ if (ops->merge_bio) {
+ rq = ops->merge_bio(q, bio, nr_segs, &ret);
+ if (ret) {
+ atomic_inc(&ufq->ops_stats.merge_bio_ok_count);
+ atomic64_add(bio->bi_iter.bi_size >> SECTOR_SHIFT,
+ &ufq->ops_stats.merge_bio_ok_sectors);
+ } else {
+ ufq_bpfops_put();
+ return false;
+ }
+
+ if (rq) {
+ ufq_bpfops_put();
+ spin_lock(&rq->mq_ctx->lock);
+ if (!list_empty(&rq->queuelist)) {
+ list_del_init(&rq->queuelist);
+ atomic_dec(&ufq->rqs_count);
+ }
+ spin_unlock(&rq->mq_ctx->lock);
+ blk_mq_free_request(rq);
+ atomic_inc(&ufq->ops_stats.merge_request_ok_count);
+ atomic64_add(bio->bi_iter.bi_size >> SECTOR_SHIFT,
+ &ufq->ops_stats.merge_request_ok_sectors);
+ return ret;
+ }
+ }
+ ufq_bpfops_put();
+ }
+
+ return ret;
+}
+
+static enum elv_merge ufq_try_insert_merge(struct request_queue *q,
+ struct request **new)
+{
+ const struct ufq_iosched_ops *ops;
+ struct request *target = NULL, *free = NULL, *last, *rq = *new;
+ struct ufq_data *ufq = q->elevator->elevator_data;
+ enum elv_merge type = ELEVATOR_NO_MERGE;
+ int merge_type = ELEVATOR_NO_MERGE;
+
+ if (!rq_mergeable(rq))
+ return ELEVATOR_NO_MERGE;
+
+ if (blk_queue_nomerges(q))
+ return ELEVATOR_NO_MERGE;
+
+ last = q->last_merge;
+ if (last) {
+ spin_lock(&last->mq_ctx->lock);
+ if (last == q->last_merge && !list_empty(&last->queuelist)
+ && bpf_attempt_merge(q, last, rq)) {
+ spin_unlock(&last->mq_ctx->lock);
+ type = ELEVATOR_BACK_MERGE;
+ free = rq;
+ *new = NULL;
+ goto end;
+ }
+ spin_unlock(&last->mq_ctx->lock);
+ }
+
+ if (blk_queue_noxmerges(q))
+ return ELEVATOR_NO_MERGE;
+
+ ops = ufq_bpfops_tryget();
+ if (ops && ops->merge_req) {
+ target = ops->merge_req(q, rq, &merge_type);
+ type = (enum elv_merge)merge_type;
+ }
+
+ if (target && WARN_ON_ONCE(req_ref_put_and_test(target))) {
+ __blk_mq_free_request(target);
+ ufq_bpfops_put();
+ return ELEVATOR_NO_MERGE;
+ }
+
+ if (type == ELEVATOR_NO_MERGE || !target) {
+ if (ops)
+ ufq_bpfops_put();
+ return ELEVATOR_NO_MERGE;
+ } else if (type == ELEVATOR_FRONT_MERGE) {
+ if (rq->mq_ctx != target->mq_ctx || rq->mq_hctx != target->mq_hctx)
+ goto rollback;
+ spin_lock(&target->mq_ctx->lock);
+ free = bpf_attempt_merge(q, rq, target);
+ if (!free) {
+ spin_unlock(&target->mq_ctx->lock);
+ pr_err("ufq-iosched: front merge failed\n");
+ goto rollback;
+ }
+ rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
+ | UFQ_PRIV_IN_UFQ);
+ list_replace_init(&target->queuelist, &rq->queuelist);
+ rq->fifo_time = target->fifo_time;
+ q->last_merge = rq;
+ } else if (type == ELEVATOR_BACK_MERGE) {
+ spin_lock(&target->mq_ctx->lock);
+ free = bpf_attempt_merge(q, target, rq);
+ if (!free) {
+ spin_unlock(&target->mq_ctx->lock);
+ pr_err("ufq-iosched: back merge failed\n");
+ goto rollback;
+ }
+ *new = target;
+ q->last_merge = target;
+ }
+
+ spin_unlock(&target->mq_ctx->lock);
+ if (ops)
+ ufq_bpfops_put();
+end:
+ atomic_inc(&ufq->ops_stats.merge_request_ok_count);
+ atomic64_add(blk_rq_sectors(free), &ufq->ops_stats.merge_request_ok_sectors);
+ blk_mq_free_request(free);
+ return type;
+
+rollback:
+ if (ops) {
+ if (ops->insert_req && ops->insert_req(q, target, 0)) {
+ atomic_inc(&ufq->ops_stats.insert_err_count);
+ pr_err("ufq-iosched: rollback insert_req error\n");
+ }
+ ufq_bpfops_put();
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+static void ufq_insert_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *list,
+ blk_insert_t flags)
+{
+ struct request_queue *q = hctx->queue;
+ struct ufq_data *ufq = q->elevator->elevator_data;
+ const struct ufq_iosched_ops *ops;
+ struct blk_mq_ctx *ctx;
+ enum elv_merge type;
+ int bit, ret = 0;
+
+ ops = ufq_bpfops_tryget();
+
+ while (!list_empty(list)) {
+ struct request *rq;
+
+ rq = list_first_entry(list, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+
+ type = ufq_try_insert_merge(q, &rq);
+ if (type == ELEVATOR_NO_MERGE) {
+ rq->fifo_time = jiffies;
+ ctx = rq->mq_ctx;
+ rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
+ | UFQ_PRIV_IN_UFQ);
+ spin_lock(&ctx->lock);
+ if (flags & BLK_MQ_INSERT_AT_HEAD)
+ list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else
+ list_add_tail(&rq->queuelist,
+ &ctx->rq_lists[hctx->type]);
+
+ bit = ctx->index_hw[hctx->type];
+ if (!sbitmap_test_bit(&hctx->ctx_map, bit))
+ sbitmap_set_bit(&hctx->ctx_map, bit);
+ q->last_merge = rq;
+ spin_unlock(&ctx->lock);
+ atomic_inc(&ufq->rqs_count);
+ }
+
+ if (ops && rq && ops->insert_req) {
+ rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
+ | UFQ_PRIV_IN_BPF);
+ ret = ops->insert_req(q, rq, flags);
+ if (ret) {
+ atomic_inc(&ufq->ops_stats.insert_err_count);
+ pr_err("ufq-iosched: bpf insert_req error (%d)\n", ret);
+ } else {
+ atomic_inc(&ufq->ops_stats.insert_ok_count);
+ atomic64_add(blk_rq_sectors(rq), &ufq->ops_stats.insert_ok_sectors);
+ }
+ }
+ }
+
+ if (ops)
+ ufq_bpfops_put();
+}
+
+static void ufq_prepare_request(struct request *rq)
+{
+ rq->elv.priv[0] = (void *)(uintptr_t)UFQ_PRIV_NOT_IN_SCHED;
+}
+
+static void ufq_finish_request(struct request *rq)
+{
+ const struct ufq_iosched_ops *ops;
+ struct ufq_data *ufq = rq->q->elevator->elevator_data;
+
+ /*
+ * The block layer core may call ufq_finish_request() without having
+ * called ufq_insert_requests(). Skip requests that bypassed I/O
+ * scheduling.
+ */
+ if (!((uintptr_t)rq->elv.priv[0] & UFQ_PRIV_IN_BPF))
+ return;
+
+ ops = ufq_bpfops_tryget();
+ if (ops) {
+ if (ops->finish_req)
+ ops->finish_req(rq);
+ ufq_bpfops_put();
+ }
+
+ atomic_inc(&ufq->ops_stats.finish_ok_count);
+ atomic64_add(blk_rq_stats_sectors(rq), &ufq->ops_stats.finish_ok_sectors);
+}
+
+static bool ufq_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ const struct ufq_iosched_ops *ops;
+ struct ufq_data *ufq = hctx->queue->elevator->elevator_data;
+ int rqs_count = atomic_read(&ufq->rqs_count);
+
+ ops = ufq_bpfops_tryget();
+ if (!ops)
+ return rqs_count > 0;
+
+ if (ops->has_req)
+ rqs_count = ops->has_req(hctx->queue, rqs_count);
+ ufq_bpfops_put();
+ return rqs_count > 0;
+}
+
+#ifdef CONFIG_BLK_DEBUG_FS
+static int ufq_ops_stats_show(void *data, struct seq_file *m)
+{
+ struct request_queue *q = data;
+ struct ufq_data *ufq = q->elevator->elevator_data;
+ struct ufq_ops_stats *s = &ufq->ops_stats;
+
+ /* for debug */
+ seq_printf(m, "dispatch_ok_count %d\n",
+ atomic_read(&s->dispatch_ok_count));
+ seq_printf(m, "dispatch_ok_sectors %lld\n",
+ (long long)atomic64_read(&s->dispatch_ok_sectors));
+ seq_printf(m, "dispatch_null_count %d\n",
+ atomic_read(&s->dispatch_null_count));
+ seq_printf(m, "insert_ok_count %d\n",
+ atomic_read(&s->insert_ok_count));
+ seq_printf(m, "insert_ok_sectors %lld\n",
+ (long long)atomic64_read(&s->insert_ok_sectors));
+ seq_printf(m, "insert_err_count %d\n",
+ atomic_read(&s->insert_err_count));
+ seq_printf(m, "merge_req_ok_count %d\n",
+ atomic_read(&s->merge_request_ok_count));
+ seq_printf(m, "merge_req_ok_sectors %lld\n",
+ (long long)atomic64_read(&s->merge_request_ok_sectors));
+ seq_printf(m, "merge_bio_ok_count %d\n",
+ atomic_read(&s->merge_bio_ok_count));
+ seq_printf(m, "merge_bio_ok_sectors %lld\n",
+ (long long)atomic64_read(&s->merge_bio_ok_sectors));
+ seq_printf(m, "finish_ok_count %d\n",
+ atomic_read(&s->finish_ok_count));
+ seq_printf(m, "finish_ok_sectors %lld\n",
+ (long long)atomic64_read(&s->finish_ok_sectors));
+ return 0;
+}
+
+static const struct blk_mq_debugfs_attr ufq_iosched_debugfs_attrs[] = {
+ {"ops_stats", 0400, ufq_ops_stats_show},
+ {},
+};
+#endif
+
+static struct elevator_type ufq_iosched_mq = {
+ .ops = {
+ .depth_updated = ufq_depth_updated,
+ .limit_depth = ufq_limit_depth,
+ .insert_requests = ufq_insert_requests,
+ .dispatch_request = ufq_dispatch_request,
+ .prepare_request = ufq_prepare_request,
+ .finish_request = ufq_finish_request,
+ .bio_merge = ufq_bio_merge,
+ .has_work = ufq_has_work,
+ .init_sched = ufq_init_sched,
+ .exit_sched = ufq_exit_sched,
+ },
+
+#ifdef CONFIG_BLK_DEBUG_FS
+ .queue_debugfs_attrs = ufq_iosched_debugfs_attrs,
+#endif
+ .elevator_name = "ufq",
+ .elevator_alias = "ufq_iosched",
+ .elevator_owner = THIS_MODULE,
+};
+MODULE_ALIAS("ufq-iosched");
+
+static int __init ufq_init(void)
+{
+ int ret;
+
+ ret = elv_register(&ufq_iosched_mq);
+ if (ret)
+ return ret;
+
+ ret = bpf_ufq_kfunc_init();
+ if (ret) {
+ pr_err("ufq-iosched: Failed to register kfunc sets (%d)\n", ret);
+ elv_unregister(&ufq_iosched_mq);
+ return ret;
+ }
+
+ ret = bpf_ufq_ops_init();
+ if (ret) {
+ pr_err("ufq-iosched: Failed to register struct_ops (%d)\n", ret);
+ elv_unregister(&ufq_iosched_mq);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void __exit ufq_exit(void)
+{
+ elv_unregister(&ufq_iosched_mq);
+}
+
+module_init(ufq_init);
+module_exit(ufq_exit);
+
+MODULE_AUTHOR("Kaitao Cheng <chengkaitao@kylinos.cn>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("User-programmable Flexible Queueing");
diff --git a/block/ufq-iosched.h b/block/ufq-iosched.h
new file mode 100644
index 000000000000..26a9c1708e8b
--- /dev/null
+++ b/block/ufq-iosched.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#ifndef _BLOCK_UFQ_IOSCHED_H
+#define _BLOCK_UFQ_IOSCHED_H
+
+#include <linux/types.h>
+#include "elevator.h"
+#include "blk-mq.h"
+
+#ifndef BPF_IOSCHED_NAME_MAX
+#define BPF_IOSCHED_NAME_MAX 16
+#endif
+
+/* For testing and debugging */
+struct ufq_ops_stats {
+ atomic_t dispatch_ok_count;
+ atomic64_t dispatch_ok_sectors;
+ atomic_t dispatch_null_count;
+ atomic_t insert_ok_count;
+ atomic64_t insert_ok_sectors;
+ atomic_t insert_err_count;
+ atomic_t merge_request_ok_count;
+ atomic64_t merge_request_ok_sectors;
+ atomic_t merge_bio_ok_count;
+ atomic64_t merge_bio_ok_sectors;
+ atomic_t finish_ok_count;
+ atomic64_t finish_ok_sectors;
+};
+
+struct ufq_iosched_ops {
+ int (*init_sched)(struct request_queue *q);
+ int (*exit_sched)(struct request_queue *q);
+ bool (*has_req)(struct request_queue *q, int rqs_count);
+ int (*insert_req)(struct request_queue *q, struct request *rq,
+ blk_insert_t flags);
+ void (*finish_req)(struct request *rq);
+ struct request *(*merge_req)(struct request_queue *q, struct request *rq,
+ int *type);
+ struct request *(*merge_bio)(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs, bool *merged);
+ struct request *(*dispatch_req)(struct request_queue *q);
+ char name[BPF_IOSCHED_NAME_MAX];
+};
+
+struct ufq_data {
+ struct request_queue *q;
+ u32 async_depth;
+ atomic_t rqs_count;
+ struct list_head active_node;
+ struct ufq_ops_stats ops_stats;
+};
+
+const struct ufq_iosched_ops *ufq_bpfops_tryget(void);
+void ufq_bpfops_put(void);
+void ufq_kick_all_hw_queues(void);
+int ufq_prepare_bpf_attach(int (*enable)(void *kdata), void *kdata);
+
+int bpf_ufq_ops_init(void);
+int bpf_ufq_kfunc_init(void);
+
+#endif /* _BLOCK_UFQ_IOSCHED_H */
diff --git a/block/ufq-kfunc.c b/block/ufq-kfunc.c
new file mode 100644
index 000000000000..c799eeffcba0
--- /dev/null
+++ b/block/ufq-kfunc.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <trace/events/block.h>
+#include "blk.h"
+#include "ufq-iosched.h"
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc struct request *bpf_request_acquire(struct request *rq)
+{
+ if (req_ref_inc_not_zero(rq))
+ return rq;
+ return NULL;
+}
+
+__bpf_kfunc void bpf_request_release(struct request *rq)
+{
+ if (req_ref_put_and_test(rq))
+ __blk_mq_free_request(rq);
+}
+
+__bpf_kfunc bool bpf_request_bio_try_merge(struct request *rq, struct bio *bio,
+ unsigned int nr_segs)
+{
+ struct blk_mq_ctx *ctx;
+ bool merged;
+
+ if (!rq || !bio)
+ return false;
+
+ ctx = rq->mq_ctx;
+ if (!ctx || !rq->q || !bio->bi_bdev || !bio->bi_bdev->bd_disk ||
+ bio->bi_bdev->bd_disk->queue != rq->q)
+ return false;
+
+ spin_lock(&ctx->lock);
+ merged = blk_attempt_bio_merge(rq->q, rq, bio, nr_segs, true) == BIO_MERGE_OK;
+ spin_unlock(&ctx->lock);
+
+ return merged;
+}
+
+__bpf_kfunc struct request *bpf_request_try_merge(struct request *rq, struct request *next)
+{
+ struct blk_mq_ctx *ctx;
+ struct ufq_data *ufq;
+ struct request *free;
+
+ if (!rq || !next || !rq->q || rq->q != next->q)
+ return NULL;
+
+ ufq = rq->q->elevator->elevator_data;
+ if (!ufq)
+ return NULL;
+
+ if (rq->mq_ctx != next->mq_ctx || rq->mq_hctx != next->mq_hctx)
+ return NULL;
+
+ ctx = rq->mq_ctx;
+ if (!ctx)
+ return NULL;
+
+ spin_lock(&ctx->lock);
+ free = bpf_attempt_merge(rq->q, rq, next);
+ if (free) {
+ if (rq->q->last_merge == free)
+ rq->q->last_merge = NULL;
+ list_del_init(&free->queuelist);
+ atomic_dec(&ufq->rqs_count);
+ }
+ spin_unlock(&ctx->lock);
+
+ return free;
+}
+
+__bpf_kfunc_end_defs();
+
+#if defined(CONFIG_X86_KERNEL_IBT)
+static const void * const __used __section(".discard.ibt_endbr_noseal")
+__ibt_noseal_bpf_request_release = (void *)bpf_request_release;
+#endif
+
+BTF_KFUNCS_START(ufq_kfunc_set_ops)
+BTF_ID_FLAGS(func, bpf_request_acquire, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_request_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_request_bio_try_merge, KF_SPIN_LOCK)
+BTF_ID_FLAGS(func, bpf_request_try_merge, KF_SPIN_LOCK)
+BTF_KFUNCS_END(ufq_kfunc_set_ops)
+
+static const struct btf_kfunc_id_set bpf_ufq_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &ufq_kfunc_set_ops,
+};
+
+BTF_ID_LIST(bpf_ufq_dtor_kfunc_ids)
+BTF_ID(struct, request)
+BTF_ID(func, bpf_request_release)
+
+int bpf_ufq_kfunc_init(void)
+{
+ int ret;
+ const struct btf_id_dtor_kfunc bpf_ufq_dtor_kfunc[] = {
+ {
+ .btf_id = bpf_ufq_dtor_kfunc_ids[0],
+ .kfunc_btf_id = bpf_ufq_dtor_kfunc_ids[1]
+ },
+ };
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_ufq_kfunc_set);
+ if (ret)
+ return ret;
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_ufq_kfunc_set);
+ if (ret)
+ return ret;
+ ret = register_btf_id_dtor_kfuncs(bpf_ufq_dtor_kfunc,
+ ARRAY_SIZE(bpf_ufq_dtor_kfunc),
+ THIS_MODULE);
+ if (ret)
+ return ret;
+
+ return 0;
+}
--
2.50.1 (Apple Git-155)
* [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding
2026-05-03 3:56 [RFC v2 0/3] block: Introduce a BPF-based I/O scheduler Kaitao cheng
2026-05-03 3:56 ` [RFC v2 1/3] bpf: Add KF_SPIN_LOCK flag for kfuncs under bpf_spin_lock Kaitao cheng
2026-05-03 3:56 ` [RFC v2 2/3] block: Introduce the UFQ I/O scheduler Kaitao cheng
@ 2026-05-03 3:56 ` Kaitao cheng
2026-05-03 4:44 ` bot+bpf-ci
2 siblings, 1 reply; 6+ messages in thread
From: Kaitao cheng @ 2026-05-03 3:56 UTC (permalink / raw)
To: axboe, ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, john.fastabend
Cc: bpf, linux-block, linux-kernel, Kaitao Cheng
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Add ufq_iosched as a simple example for the UFQ block I/O scheduler.
In the ufq_simple example, we implement the eBPF struct_ops hooks the
kernel exposes so we can exercise and validate the behavior and stability
of the kernel UFQ scheduling framework. The Makefile and directory
layout are modeled after sched_ext.
This mirrors the sched_ext examples pattern so developers can experiment
with user-defined queueing policies on top of IOSCHED_UFQ.
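
The userspace loader follows the standard libbpf struct_ops flow through the
UFQ_OPS_OPEN()/UFQ_OPS_LOAD()/UFQ_OPS_ATTACH() helpers added in
include/ufq/common.h. A rough sketch of that flow is shown below; it assumes
the struct_ops map is named simple_ops (it must match the UFQ_OPS_DEFINE()
name in ufq_simple.bpf.c), and the real ufq_simple.c in this patch may differ
in detail.

  /*
   * Loader sketch only; "simple_ops" is an assumed map name.  The skeleton
   * header ufq_simple.bpf.skel.h is generated by the Makefile in this patch.
   */
  #include <unistd.h>
  #include <string.h>
  #include <bpf/libbpf.h>
  #include "ufq/common.h"
  #include "ufq_simple.bpf.skel.h"

  int main(void)
  {
          struct ufq_simple *skel;
          struct bpf_link *link;

          skel = UFQ_OPS_OPEN(simple_ops, ufq_simple);
          UFQ_OPS_LOAD(skel, simple_ops, ufq_simple);
          link = UFQ_OPS_ATTACH(skel, simple_ops, ufq_simple);

          /* The BPF policy stays registered while the struct_ops link is held. */
          pause();

          bpf_link__destroy(link);
          ufq_simple__destroy(skel);
          return 0;
  }

For the attached policy to see any I/O, the target device must be using the
ufq elevator, e.g. by writing "ufq" to /sys/block/<dev>/queue/scheduler.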
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
tools/ufq_iosched/.gitignore | 2 +
tools/ufq_iosched/Makefile | 262 ++++++++
tools/ufq_iosched/README.md | 145 +++++
.../include/bpf-compat/gnu/stubs.h | 12 +
tools/ufq_iosched/include/ufq/common.bpf.h | 75 +++
tools/ufq_iosched/include/ufq/common.h | 90 +++
tools/ufq_iosched/include/ufq/simple_stat.h | 23 +
tools/ufq_iosched/ufq_simple.bpf.c | 604 ++++++++++++++++++
tools/ufq_iosched/ufq_simple.c | 120 ++++
9 files changed, 1333 insertions(+)
create mode 100644 tools/ufq_iosched/.gitignore
create mode 100644 tools/ufq_iosched/Makefile
create mode 100644 tools/ufq_iosched/README.md
create mode 100644 tools/ufq_iosched/include/bpf-compat/gnu/stubs.h
create mode 100644 tools/ufq_iosched/include/ufq/common.bpf.h
create mode 100644 tools/ufq_iosched/include/ufq/common.h
create mode 100644 tools/ufq_iosched/include/ufq/simple_stat.h
create mode 100644 tools/ufq_iosched/ufq_simple.bpf.c
create mode 100644 tools/ufq_iosched/ufq_simple.c
diff --git a/tools/ufq_iosched/.gitignore b/tools/ufq_iosched/.gitignore
new file mode 100644
index 000000000000..d6264fe1c8cd
--- /dev/null
+++ b/tools/ufq_iosched/.gitignore
@@ -0,0 +1,2 @@
+tools/
+build/
diff --git a/tools/ufq_iosched/Makefile b/tools/ufq_iosched/Makefile
new file mode 100644
index 000000000000..7dc37d9172aa
--- /dev/null
+++ b/tools/ufq_iosched/Makefile
@@ -0,0 +1,262 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 KylinSoft Corporation.
+# Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+include ../build/Build.include
+include ../scripts/Makefile.arch
+include ../scripts/Makefile.include
+
+all: all_targets
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi
+CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu
+CLANG_TARGET_FLAGS_hexagon := hexagon-linux-musl
+CLANG_TARGET_FLAGS_m68k := m68k-linux-gnu
+CLANG_TARGET_FLAGS_mips := mipsel-linux-gnu
+CLANG_TARGET_FLAGS_powerpc := powerpc64le-linux-gnu
+CLANG_TARGET_FLAGS_riscv := riscv64-linux-gnu
+CLANG_TARGET_FLAGS_s390 := s390x-linux-gnu
+CLANG_TARGET_FLAGS_x86 := x86_64-linux-gnu
+CLANG_TARGET_FLAGS := $(CLANG_TARGET_FLAGS_$(ARCH))
+
+ifeq ($(CROSS_COMPILE),)
+ifeq ($(CLANG_TARGET_FLAGS),)
+$(error Specify CROSS_COMPILE or add '--target=' option to this Makefile)
+else
+CLANG_FLAGS += --target=$(CLANG_TARGET_FLAGS)
+endif # CLANG_TARGET_FLAGS
+else
+CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
+endif # CROSS_COMPILE
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := $(CROSS_COMPILE)gcc
+endif # LLVM
+
+CURDIR := $(abspath .)
+TOOLSDIR := $(abspath ..)
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(abspath ../../include/generated)
+GENHDR := $(GENDIR)/autoconf.h
+
+ifeq ($(O),)
+OUTPUT_DIR := $(CURDIR)/build
+else
+OUTPUT_DIR := $(O)/build
+endif # O
+OBJ_DIR := $(OUTPUT_DIR)/obj
+INCLUDE_DIR := $(OUTPUT_DIR)/include
+BPFOBJ_DIR := $(OBJ_DIR)/libbpf
+UFQOBJ_DIR := $(OBJ_DIR)/ufq_iosched
+BINDIR := $(OUTPUT_DIR)/bin
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+ifneq ($(CROSS_COMPILE),)
+HOST_BUILD_DIR := $(OBJ_DIR)/host/obj
+HOST_OUTPUT_DIR := $(OBJ_DIR)/host
+HOST_INCLUDE_DIR := $(HOST_OUTPUT_DIR)/include
+else
+HOST_BUILD_DIR := $(OBJ_DIR)
+HOST_OUTPUT_DIR := $(OUTPUT_DIR)
+HOST_INCLUDE_DIR := $(INCLUDE_DIR)
+endif
+HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a
+RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids
+DEFAULT_BPFTOOL := $(HOST_OUTPUT_DIR)/sbin/bpftool
+
+VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \
+ $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \
+ ../../vmlinux \
+ /sys/kernel/btf/vmlinux \
+ /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+ GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+ -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+ -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS += -lelf -lz -lpthread
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+ grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+ | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+ $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+ -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \
+ -I$(INCLUDE_DIR) -I$(APIDIR) \
+ -I../../include \
+ $(call get_sys_includes,$(CLANG)) \
+ -Wall -Wno-compare-distinct-pointer-types \
+ -Wno-microsoft-anon-tag \
+ -fms-extensions \
+ -O2 -mcpu=v3
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
+ $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids \
+ $(INCLUDE_DIR) $(UFQOBJ_DIR) $(BINDIR))
+
+$(MAKE_DIRS):
+ $(call msg,MKDIR,,$@)
+ $(Q)mkdir -p $@
+
+ifneq ($(CROSS_COMPILE),)
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+ $(APIDIR)/linux/bpf.h \
+ | $(OBJ_DIR)/libbpf
+ $(Q)$(MAKE) $(submake_extras) CROSS_COMPILE=$(CROSS_COMPILE) \
+ -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \
+ EXTRA_CFLAGS='-g -O0 -fPIC' \
+ LDFLAGS="$(LDFLAGS)" \
+ DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
+endif
+
+$(HOST_BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+ $(APIDIR)/linux/bpf.h \
+ | $(HOST_BUILD_DIR)/libbpf
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) \
+ OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \
+ ARCH= CROSS_COMPILE= CC="$(HOSTCC)" LD=$(HOSTLD) \
+ EXTRA_CFLAGS='-g -O0 -fPIC' \
+ DESTDIR=$(HOST_OUTPUT_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+ $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool
+ $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+ ARCH= CROSS_COMPILE= CC="$(HOSTCC)" LD=$(HOSTLD) \
+ EXTRA_CFLAGS='-g -O0' \
+ OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \
+ LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \
+ LIBBPF_DESTDIR=$(HOST_OUTPUT_DIR)/ \
+ prefix= DESTDIR=$(HOST_OUTPUT_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+ $(call msg,GEN,,$@)
+ $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+ $(call msg,CP,,$@)
+ $(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(UFQOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h include/ufq/*.h \
+ | $(BPFOBJ) $(UFQOBJ_DIR)
+ $(call msg,CLNG-BPF,,$(notdir $@))
+ $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
+$(INCLUDE_DIR)/%.bpf.skel.h: $(UFQOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL)
+ $(eval sched=$(notdir $@))
+ $(call msg,GEN-SKEL,,$(sched))
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+ $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+ $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+ $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
+ $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
+
+UFQ_COMMON_DEPS := include/ufq/common.h include/ufq/simple_stat.h | $(BINDIR)
+
+c-sched-targets = ufq_simple
+
+$(addprefix $(BINDIR)/,$(c-sched-targets)): \
+ $(BINDIR)/%: \
+ $(filter-out %.bpf.c,%.c) \
+ $(INCLUDE_DIR)/%.bpf.skel.h \
+ $(UFQ_COMMON_DEPS)
+ $(eval sched=$(notdir $@))
+ $(CC) $(CFLAGS) -c $(sched).c -o $(UFQOBJ_DIR)/$(sched).o
+ $(CC) -o $@ $(UFQOBJ_DIR)/$(sched).o $(BPFOBJ) $(LDFLAGS)
+
+$(c-sched-targets): %: $(BINDIR)/%
+
+install: all
+ $(Q)mkdir -p $(DESTDIR)/usr/local/bin/
+ $(Q)cp $(BINDIR)/* $(DESTDIR)/usr/local/bin/
+
+clean:
+ rm -rf $(OUTPUT_DIR) $(HOST_OUTPUT_DIR)
+ rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
+ rm -f $(c-sched-targets)
+
+help:
+ @echo 'Building targets'
+ @echo '================'
+ @echo ''
+ @echo ' all - Compile all schedulers'
+ @echo ''
+ @echo 'Alternatively, you may compile individual schedulers:'
+ @echo ''
+ @printf ' %s\n' $(c-sched-targets)
+ @echo ''
+ @echo 'For any scheduler build target, you may specify an alternative'
+ @echo 'build output path with the O= environment variable. For example:'
+ @echo ''
+ @echo ' O=/tmp/ufq_iosched make all'
+ @echo ''
+ @echo 'will compile all schedulers, and emit the build artifacts to'
+ @echo '/tmp/ufq_iosched/build.'
+ @echo ''
+ @echo ''
+ @echo 'Installing targets'
+ @echo '=================='
+ @echo ''
+ @echo ' install - Compile and install all schedulers to /usr/local/bin.'
+ @echo ' You may specify the DESTDIR= environment variable'
+ @echo ' to indicate a prefix for /usr/local/bin. For example:'
+ @echo ''
+ @echo ' DESTDIR=/tmp/ufq_iosched make install'
+ @echo ''
+ @echo ' will build the schedulers in CWD/build, and'
+ @echo ' install the schedulers to /tmp/ufq_iosched/usr/local/bin.'
+ @echo ''
+ @echo ''
+ @echo 'Cleaning targets'
+ @echo '================'
+ @echo ''
+ @echo ' clean - Remove all generated files'
+
+all_targets: $(c-sched-targets)
+
+.PHONY: all all_targets $(c-sched-targets) clean help
+
+# delete failed targets
+.DELETE_ON_ERROR:
+
+# keep intermediate (.bpf.skel.h, .bpf.o, etc) targets
+.SECONDARY:
diff --git a/tools/ufq_iosched/README.md b/tools/ufq_iosched/README.md
new file mode 100644
index 000000000000..1fa305fb576e
--- /dev/null
+++ b/tools/ufq_iosched/README.md
@@ -0,0 +1,145 @@
+UFQ IOSCHED EXAMPLE SCHEDULERS
+==============================
+
+# Introduction
+
+This directory contains a simple example of the ufq IO scheduler. It is meant
+to illustrate the different kinds of IO schedulers you can build with ufq;
+new schedulers will be added as the project evolves across releases.
+
+# Compiling the examples
+
+There are a few toolchain dependencies for compiling the example schedulers.
+
+## Toolchain dependencies
+
+1. clang >= 16.0.0
+
+The schedulers are BPF programs, and therefore must be compiled with clang. gcc
+is actively working on adding a BPF backend compiler as well, but it is still
+missing some features such as BTF type tags, which are necessary for using
+kptrs.
+
+2. pahole >= 1.25
+
+You may need pahole in order to generate BTF from DWARF.
+
+3. rust >= 1.85.0
+
+Rust schedulers use features present in the Rust toolchain >= 1.85.0. You
+should be able to use the stable build from rustup.
+
+There are other requirements as well, such as make, but these are the main /
+non-trivial ones.
+
+## Compiling the kernel
+
+In order to run a ufq scheduler, you'll have to run a kernel compiled
+with the patches in this repository, and with a minimum set of necessary
+Kconfig options:
+
+```
+CONFIG_BPF=y
+CONFIG_IOSCHED_UFQ=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_JIT=y
+CONFIG_DEBUG_INFO_BTF=y
+```
+
+It's also recommended that you include the following Kconfig options:
+
+```
+CONFIG_BPF_JIT_ALWAYS_ON=y
+CONFIG_BPF_JIT_DEFAULT_ON=y
+CONFIG_PAHOLE_HAS_SPLIT_BTF=y
+CONFIG_PAHOLE_HAS_BTF_TAG=y
+```
+
+There is a `Kconfig` file in this directory whose contents you can append to
+your local `.config` file, as long as there are no conflicts with any existing
+options in the file.
+
+## Getting a vmlinux.h file
+
+You may notice that most of the example schedulers include a "vmlinux.h" file.
+This is a large, auto-generated header file that contains all of the types
+defined in some vmlinux binary that was compiled with
+[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
+options specified above).
+
+The header file is created using `bpftool`, by passing it a vmlinux binary
+compiled with BTF as follows:
+
+```bash
+$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
+```
+
+`bpftool` analyzes all of the BTF encodings in the binary, and produces a
+header file that can be included by BPF programs to access those types. For
+example, using vmlinux.h allows a scheduler to access fields defined directly
+in vmlinux.
+
+The scheduler build system will generate this vmlinux.h file as part of the
+scheduler build pipeline. It looks for a vmlinux file in the following
+dependency order:
+
+1. If the O= environment variable is defined, at `$O/vmlinux`
+2. If the KBUILD_OUTPUT= environment variable is defined, at
+ `$KBUILD_OUTPUT/vmlinux`
+3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
+ compiling the schedulers)
+4. `/sys/kernel/btf/vmlinux`
+5. `/boot/vmlinux-$(uname -r)`
+
+In other words, if you have compiled a kernel in your local repo, its vmlinux
+file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
+the kernel you're currently running on. This means that if you're running on a
+kernel with ufq support, you may not need to compile a local kernel at
+all.
+
+### Aside on CO-RE
+
+One of the cooler features of BPF is that it supports
+[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
+Everywhere). This feature allows you to reference fields inside of structs with
+types defined internal to the kernel, and not have to recompile if you load the
+BPF program on a different kernel with the field at a different offset.
+
+## Compiling the schedulers
+
+Once you have your toolchain setup, and a vmlinux that can be used to generate
+a full vmlinux.h file, you can compile the schedulers using `make`. The build
+vendors libbpf and bpftool under `build/`, then compiles BPF objects with
+`clang` and the userspace loader (for example `ufq_simple.c`) with `gcc`.
+
+```bash
+$ make -j$(nproc)
+```
+
+## ufq_simple
+
+A simple IO scheduler that provides an example of a minimal ufq scheduler.
+It populates the commonly used kernel-exposed BPF interfaces so the UFQ
+scheduling framework in the kernel can be exercised and validated.
+
+### bpftool feature detection (`llvm`, `libcap`, `libbfd`)
+
+While building bpftool, you may see lines similar to:
+
+```
+Auto-detecting system features:
+... clang-bpf-co-re: [ on ]
+... llvm: [ OFF ]
+... libcap: [ OFF ]
+... libbfd: [ OFF ]
+```
+
+This matches a typical minimal build environment: CO-RE support is on, while
+bpftool's optional links to LLVM (for some disassembly paths), libcap, and
+libbfd are off. **`llvm: [ OFF ]` is not a build failure** for these examples;
+the scheduler build can still complete (libbpf static library, bpftool,
+`vmlinux.h` generation, BPF `.o` files, and final `ufq_simple` link with `gcc`).
+
+If `libcap` or `libbfd` show `[ OFF ]`, you only miss optional bpftool features
+that depend on those libraries; you do not need them for compiling and loading
+the schedulers in this directory.
diff --git a/tools/ufq_iosched/include/bpf-compat/gnu/stubs.h b/tools/ufq_iosched/include/bpf-compat/gnu/stubs.h
new file mode 100644
index 000000000000..f200ac0f6ccd
--- /dev/null
+++ b/tools/ufq_iosched/include/bpf-compat/gnu/stubs.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Dummy gnu/stubs.h. clang can end up including /usr/include/gnu/stubs.h when
+ * compiling BPF files although its content doesn't play any role. The file in
+ * turn includes stubs-64.h or stubs-32.h depending on whether __x86_64__ is
+ * defined. When compiling a BPF source, __x86_64__ isn't set and thus
+ * stubs-32.h is selected. However, the file is not there if the system doesn't
+ * have 32bit glibc devel package installed leading to a build failure.
+ *
+ * The problem is worked around by making this file available in the include
+ * search paths before the system one when building BPF.
+ */
diff --git a/tools/ufq_iosched/include/ufq/common.bpf.h b/tools/ufq_iosched/include/ufq/common.bpf.h
new file mode 100644
index 000000000000..515821ba5b01
--- /dev/null
+++ b/tools/ufq_iosched/include/ufq/common.bpf.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#ifndef __UFQ_COMMON_BPF_H
+#define __UFQ_COMMON_BPF_H
+
+#ifdef LSP
+#define __bpf__
+#include "../vmlinux/vmlinux.h"
+#else
+#include "vmlinux.h"
+#endif
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include <asm-generic/errno.h>
+#include "simple_stat.h"
+
+#define BPF_STRUCT_OPS(name, args...) \
+ SEC("struct_ops/" #name) BPF_PROG(name, ##args)
+
+/* Define struct ufq_iosched_ops for .struct_ops.link in the BPF object */
+#define UFQ_OPS_DEFINE(__name, ...) \
+ SEC(".struct_ops.link") \
+ struct ufq_iosched_ops __name = { \
+ __VA_ARGS__, \
+ }
+
+/* list and rbtree */
+#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node)))
+
+struct request *bpf_request_acquire(struct request *rq) __ksym;
+void bpf_request_release(struct request *rq) __ksym;
+bool bpf_request_bio_try_merge(struct request *rq, struct bio *bio,
+ unsigned int nr_segs) __ksym;
+struct request *bpf_request_try_merge(struct request *rq, struct request *next) __ksym;
+
+void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
+void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
+
+#define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))
+#define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
+
+int bpf_list_push_front_impl(struct bpf_list_head *head,
+ struct bpf_list_node *node,
+ void *meta, __u64 off) __ksym;
+#define bpf_list_push_front(head, node) bpf_list_push_front_impl(head, node, NULL, 0)
+
+int bpf_list_push_back_impl(struct bpf_list_head *head,
+ struct bpf_list_node *node,
+ void *meta, __u64 off) __ksym;
+#define bpf_list_push_back(head, node) bpf_list_push_back_impl(head, node, NULL, 0)
+
+struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+bool bpf_list_empty(struct bpf_list_head *head) __ksym;
+struct bpf_list_node *bpf_list_del(struct bpf_list_head *head,
+ struct bpf_list_node *node) __ksym;
+
+struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
+ struct bpf_rb_node *node) __ksym;
+int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
+ bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
+ void *meta, __u64 off) __ksym;
+#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
+
+struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
+
+void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
+#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL)
+
+#endif /* __UFQ_COMMON_BPF_H */
diff --git a/tools/ufq_iosched/include/ufq/common.h b/tools/ufq_iosched/include/ufq/common.h
new file mode 100644
index 000000000000..a1e3a07844c4
--- /dev/null
+++ b/tools/ufq_iosched/include/ufq/common.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#ifndef __UFQ_IOSCHED_COMMON_H
+#define __UFQ_IOSCHED_COMMON_H
+
+#ifdef __KERNEL__
+#error "Should not be included by BPF programs"
+#endif
+
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <errno.h>
+#include <bpf/bpf.h>
+#include "simple_stat.h"
+
+typedef uint8_t u8;
+typedef uint16_t u16;
+typedef uint32_t u32;
+typedef uint64_t u64;
+typedef int8_t s8;
+typedef int16_t s16;
+typedef int32_t s32;
+typedef int64_t s64;
+
+#define UFQ_ERR(__fmt, ...) \
+ do { \
+ fprintf(stderr, "[UFQ_ERR] %s:%d", __FILE__, __LINE__); \
+ if (errno) \
+ fprintf(stderr, " (%s)\n", strerror(errno)); \
+ else \
+ fprintf(stderr, "\n"); \
+ fprintf(stderr, __fmt __VA_OPT__(,) __VA_ARGS__); \
+ fprintf(stderr, "\n"); \
+ \
+ exit(EXIT_FAILURE); \
+ } while (0)
+
+#define UFQ_ERR_IF(__cond, __fmt, ...) \
+ do { \
+ if (__cond) \
+ UFQ_ERR((__fmt) __VA_OPT__(,) __VA_ARGS__); \
+ } while (0)
+
+/*
+ * struct ufq_iosched_ops can grow over time. Define the ops with
+ * common.bpf.h::UFQ_OPS_DEFINE() and register them with UFQ_OPS_LOAD() /
+ * UFQ_OPS_ATTACH() below; libbpf then handles the struct_ops attachment.
+ */
+#define UFQ_OPS_OPEN(__ops_name, __ufq_name) ({ \
+ struct __ufq_name *__skel; \
+ \
+ __skel = __ufq_name##__open(); \
+ UFQ_ERR_IF(!__skel, "Could not open " #__ufq_name); \
+ (void)__skel->maps.__ops_name; \
+ __skel; \
+})
+
+#define UFQ_OPS_LOAD(__skel, __ops_name, __ufq_name) ({ \
+ (void)(__skel)->maps.__ops_name; \
+ UFQ_ERR_IF(__ufq_name##__load((__skel)), "Failed to load skel"); \
+})
+
+/*
+ * Newer bpftool emits additional link placeholders for BPF maps and sets up
+ * the skeleton so that libbpf v1.5+ can auto-attach them; older libbpf
+ * simply ignores those links. Disable auto-attach on newer libbpf so the
+ * struct_ops map is not attached twice when we attach it explicitly in
+ * UFQ_OPS_ATTACH().
+ */
+#if LIBBPF_MAJOR_VERSION > 1 || \
+ (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 5)
+#define __UFQ_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) \
+ bpf_map__set_autoattach((__skel)->maps.__ops_name, false)
+#else
+#define __UFQ_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) do {} while (0)
+#endif
+
+#define UFQ_OPS_ATTACH(__skel, __ops_name, __ufq_name) ({ \
+ struct bpf_link *__link; \
+ __UFQ_OPS_DISABLE_AUTOATTACH(__skel, __ops_name); \
+ UFQ_ERR_IF(__ufq_name##__attach((__skel)), "Failed to attach skel"); \
+ __link = bpf_map__attach_struct_ops((__skel)->maps.__ops_name); \
+ UFQ_ERR_IF(!__link, "Failed to attach struct_ops"); \
+ __link; \
+})
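+
+/*
+ * Illustrative call sequence (mirrors ufq_simple.c in this series; names
+ * other than the UFQ_OPS_* macros are placeholders):
+ *
+ *	skel = UFQ_OPS_OPEN(my_ops, my_skel);
+ *	UFQ_OPS_LOAD(skel, my_ops, my_skel);
+ *	link = UFQ_OPS_ATTACH(skel, my_ops, my_skel);
+ *	...
+ *	bpf_link__destroy(link);
+ *	my_skel__destroy(skel);
+ */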
+
+#endif /* __UFQ_IOSCHED_COMMON_H */
diff --git a/tools/ufq_iosched/include/ufq/simple_stat.h b/tools/ufq_iosched/include/ufq/simple_stat.h
new file mode 100644
index 000000000000..c26d0ea57a69
--- /dev/null
+++ b/tools/ufq_iosched/include/ufq/simple_stat.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#ifndef __UFQ_SIMPLE_STAT_H
+#define __UFQ_SIMPLE_STAT_H
+
+enum ufq_simp_stat_index {
+ UFQ_SIMP_INSERT_CNT,
+ UFQ_SIMP_INSERT_SIZE,
+ UFQ_SIMP_DISPATCH_CNT,
+ UFQ_SIMP_DISPATCH_SIZE,
+ UFQ_SIMP_RQMERGE_CNT,
+ UFQ_SIMP_RQMERGE_SIZE,
+ UFQ_SIMP_BIOMERGE_CNT,
+ UFQ_SIMP_BIOMERGE_SIZE,
+ UFQ_SIMP_FINISH_CNT,
+ UFQ_SIMP_FINISH_SIZE,
+ UFQ_SIMP_STAT_MAX,
+};
+
+#endif /* __UFQ_SIMPLE_STAT_H */
diff --git a/tools/ufq_iosched/ufq_simple.bpf.c b/tools/ufq_iosched/ufq_simple.bpf.c
new file mode 100644
index 000000000000..81954e73068a
--- /dev/null
+++ b/tools/ufq_iosched/ufq_simple.bpf.c
@@ -0,0 +1,604 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#include <ufq/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
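+/*
+ * UFQ_DISK_SUM and UFQ_LOOP_MAX are local tunables; the remaining macros
+ * mirror kernel definitions that vmlinux.h does not carry.
+ */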
+#define UFQ_DISK_SUM 20
+#define BLK_MQ_INSERT_AT_HEAD 0x01
+#define REQ_OP_MASK ((1 << 8) - 1)
+#define SECTOR_SHIFT 9
+#define UFQ_LOOP_MAX 100
+
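+/*
+ * Per-CPU event counters; the userspace loader aggregates them across CPUs
+ * (see read_stats() in ufq_simple.c).
+ */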
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, UFQ_SIMP_STAT_MAX);
+} stats SEC(".maps");
+
+enum ufq_simp_data_dir {
+ UFQ_SIMP_READ,
+ UFQ_SIMP_WRITE,
+ UFQ_SIMP_DIR_COUNT
+};
+
+struct queue_list_node {
+ struct bpf_list_node node;
+	struct request __kptr *req;
+};
+
+struct sort_tree_node {
+ struct bpf_refcount ref;
+ struct bpf_rb_node rb_node;
+ u64 key;
+	struct request __kptr *req;
+};
+
+struct ufq_simple_data {
+ struct bpf_spin_lock lock;
+ struct bpf_rb_root sort_tree_read __contains(sort_tree_node, rb_node);
+ struct bpf_rb_root sort_tree_write __contains(sort_tree_node, rb_node);
+ struct bpf_list_head dispatch __contains(queue_list_node, node);
+};
+
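+/*
+ * Per-request_queue scheduler state, keyed by q->id: a dispatch FIFO for
+ * head insertions plus one sector-sorted rbtree per data direction, all
+ * protected by the embedded bpf_spin_lock.
+ */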
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, UFQ_DISK_SUM);
+ __type(key, s32);
+ __type(value, struct ufq_simple_data);
+} ufq_map SEC(".maps");
+
+static void stat_add(u32 idx, u32 val)
+{
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+
+ if (cnt_p)
+ (*cnt_p) += val;
+}
+
+static void stat_sub(u32 idx, u32 val)
+{
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+
+ if (cnt_p)
+ (*cnt_p) -= val;
+}
+
+static bool sort_tree_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct sort_tree_node *node_a, *node_b;
+
+ node_a = container_of(a, struct sort_tree_node, rb_node);
+ node_b = container_of(b, struct sort_tree_node, rb_node);
+
+ return node_a->key < node_b->key;
+}
+
+static struct ufq_simple_data *dd_init_sched(struct request_queue *q)
+{
+ struct ufq_simple_data ufq_sd = {}, *ufq_sp;
+ int ret, id = q->id;
+
+ bpf_printk("ufq_simple init sched!");
+ ret = bpf_map_update_elem(&ufq_map, &id, &ufq_sd, BPF_NOEXIST);
+ if (ret && ret != -EEXIST) {
+ bpf_printk("ufq_simple/init_sched: update ufq_map err %d", ret);
+ return NULL;
+ }
+
+ ufq_sp = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sp) {
+ bpf_printk("ufq_simple/init_sched: lookup queue id %d in ufq_map failed", id);
+ return NULL;
+ }
+
+ return ufq_sp;
+}
+
+int BPF_STRUCT_OPS(ufq_simple_init_sched, struct request_queue *q)
+{
+ if (dd_init_sched(q))
+ return 0;
+ else
+ return -EPERM;
+}
+
+int BPF_STRUCT_OPS(ufq_simple_exit_sched, struct request_queue *q)
+{
+ int id = q->id;
+
+ bpf_printk("ufq_simple exit sched!");
+ bpf_map_delete_elem(&ufq_map, &id);
+ return 0;
+}
+
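+/*
+ * Insertion policy: requests flagged BLK_MQ_INSERT_AT_HEAD go straight to
+ * the dispatch FIFO; everything else is keyed by its starting sector and
+ * added to the per-direction rbtree (bit 0 of the request op distinguishes
+ * writes, REQ_OP_WRITE == 1, from reads).
+ */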
+int BPF_STRUCT_OPS(ufq_simple_insert_req, struct request_queue *q,
+ struct request *rq, blk_insert_t flags,
+ struct list_head *freeq)
+{
+ struct ufq_simple_data *ufq_sd;
+ struct queue_list_node *qnode;
+ struct sort_tree_node *snode;
+ int id = q->id, ret = 0;
+ struct request *acquired, *old;
+ enum ufq_simp_data_dir dir = ((rq->cmd_flags & REQ_OP_MASK) & 1) ?
+ UFQ_SIMP_WRITE : UFQ_SIMP_READ;
+
+ ufq_sd = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sd) {
+ ufq_sd = dd_init_sched(q);
+ if (!ufq_sd) {
+ bpf_printk("ufq_simple/insert_req: dd_init_sched failed");
+ return -EPERM;
+ }
+ }
+
+ if (flags & BLK_MQ_INSERT_AT_HEAD) {
+ /* create queue_list_node */
+ qnode = bpf_obj_new(typeof(*qnode));
+ if (!qnode) {
+ bpf_printk("ufq_simple/insert_req: qnode alloc failed");
+ return -ENOMEM;
+ }
+
+ acquired = bpf_request_acquire(rq);
+ if (!acquired) {
+ bpf_obj_drop(qnode);
+ bpf_printk("ufq_simple/head-insert_req: request_acquire failed");
+ return -EPERM;
+ }
+
+ /* Set request for queue_list_node */
+ old = bpf_kptr_xchg(&qnode->req, acquired);
+ if (old)
+ bpf_request_release(old);
+
+ /* Add queue_list_node to dispatch list */
+ bpf_spin_lock(&ufq_sd->lock);
+ ret = bpf_list_push_back(&ufq_sd->dispatch, &qnode->node);
+ bpf_spin_unlock(&ufq_sd->lock);
+ } else {
+ /* create sort_tree_node */
+ snode = bpf_obj_new(typeof(*snode));
+ if (!snode) {
+ bpf_printk("ufq_simple/insert_req: sort_tree_node alloc failed");
+ return -ENOMEM;
+ }
+
+ /* Use request's starting sector as sort key */
+ snode->key = rq->__sector;
+
+ /*
+ * Acquire request reference again for sort_tree_node (each node
+ * needs independent reference)
+ */
+ acquired = bpf_request_acquire(rq);
+ if (!acquired) {
+ bpf_obj_drop(snode);
+ bpf_printk("ufq_simple/insert_req: bpf_request_acquire failed");
+ return -EPERM;
+ }
+
+ /* Set request for sort_tree_node */
+ old = bpf_kptr_xchg(&snode->req, acquired);
+ if (old)
+ bpf_request_release(old);
+
+ /* Add sort_tree_node to red-black tree */
+ bpf_spin_lock(&ufq_sd->lock);
+ if (dir == UFQ_SIMP_READ)
+ bpf_rbtree_add(&ufq_sd->sort_tree_read, &snode->rb_node, sort_tree_less);
+ else
+ bpf_rbtree_add(&ufq_sd->sort_tree_write, &snode->rb_node, sort_tree_less);
+ bpf_spin_unlock(&ufq_sd->lock);
+ }
+
+ if (!ret) {
+ stat_add(UFQ_SIMP_INSERT_CNT, 1);
+ stat_add(UFQ_SIMP_INSERT_SIZE, rq->__data_len);
+ }
+ return ret;
+}
+
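+/*
+ * Dispatch order: drain the head-insert FIFO first, then the read rbtree in
+ * ascending sector order, then the write rbtree.
+ */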
+struct request *BPF_STRUCT_OPS(ufq_simple_dispatch_req, struct request_queue *q)
+{
+ struct request *rq = NULL;
+ struct bpf_list_node *list_node;
+ struct bpf_rb_node *rb_node = NULL;
+ struct queue_list_node *qnode;
+ struct sort_tree_node *snode;
+ struct ufq_simple_data *ufq_sd;
+ int id = q->id;
+
+ ufq_sd = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sd) {
+ bpf_printk("ufq_simple/dispatch_req: ufq_map lookup %d failed", id);
+ return NULL;
+ }
+
+ bpf_spin_lock(&ufq_sd->lock);
+ list_node = bpf_list_pop_front(&ufq_sd->dispatch);
+
+ if (list_node) {
+ qnode = container_of(list_node, struct queue_list_node, node);
+ rq = bpf_kptr_xchg(&qnode->req, NULL);
+ bpf_spin_unlock(&ufq_sd->lock);
+ bpf_obj_drop(qnode);
+ } else {
+ rb_node = bpf_rbtree_first(&ufq_sd->sort_tree_read);
+ if (rb_node) {
+ rb_node = bpf_rbtree_remove(&ufq_sd->sort_tree_read, rb_node);
+ } else {
+ rb_node = bpf_rbtree_first(&ufq_sd->sort_tree_write);
+ if (rb_node)
+ rb_node = bpf_rbtree_remove(&ufq_sd->sort_tree_write, rb_node);
+ }
+
+ if (!rb_node) {
+ bpf_spin_unlock(&ufq_sd->lock);
+ goto out;
+ }
+
+ snode = container_of(rb_node, struct sort_tree_node, rb_node);
+
+ /* Get request from sort_tree_node (this will be returned) */
+ rq = bpf_kptr_xchg(&snode->req, NULL);
+ bpf_spin_unlock(&ufq_sd->lock);
+ bpf_obj_drop(snode);
+ }
+ if (!rq)
+ bpf_printk("ufq_simple/dispatch_req: no request to dispatch");
+
+out:
+ if (rq) {
+ stat_add(UFQ_SIMP_DISPATCH_CNT, 1);
+ stat_add(UFQ_SIMP_DISPATCH_SIZE, rq->__data_len);
+ }
+
+ return rq;
+}
+
+bool BPF_STRUCT_OPS(ufq_simple_has_req, struct request_queue *q, int rqs_count)
+{
+ struct ufq_simple_data *ufq_sd;
+ bool has;
+ int id = q->id;
+
+ ufq_sd = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sd) {
+ bpf_printk("ufq_simple/has_req: ufq_map lookup %d failed", id);
+ return false;
+ }
+
+ bpf_spin_lock(&ufq_sd->lock);
+ has = !bpf_list_empty(&ufq_sd->dispatch) ||
+ bpf_rbtree_root(&ufq_sd->sort_tree_read) ||
+ bpf_rbtree_root(&ufq_sd->sort_tree_write);
+ bpf_spin_unlock(&ufq_sd->lock);
+
+ return has;
+}
+
+void BPF_STRUCT_OPS(ufq_simple_finish_req, struct request *rq)
+{
+ if (rq) {
+ stat_add(UFQ_SIMP_FINISH_CNT, 1);
+ stat_add(UFQ_SIMP_FINISH_SIZE, (u64)rq->stats_sectors << SECTOR_SHIFT);
+ }
+}
+
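+/*
+ * Request-request merge: walk the matching rbtree from its root, steering
+ * left/right by sector, and report a front/back merge candidate via *type.
+ * The candidate is unlinked from the tree and returned with its reference
+ * transferred to the caller.
+ */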
+struct request *BPF_STRUCT_OPS(ufq_simple_merge_req, struct request_queue *q,
+ struct request *rq, int *type)
+{
+ struct sort_tree_node *snode = NULL;
+ sector_t rq_start, rq_end, other_start, other_end;
+ enum elv_merge mt = ELEVATOR_NO_MERGE;
+ struct bpf_rb_node *rb_node = NULL;
+ struct blk_mq_ctx *targ_mq_ctx;
+ struct blk_mq_hw_ctx *targ_mq_hctx;
+ struct ufq_simple_data *ufq_sd;
+ struct request *targ = NULL;
+ enum ufq_simp_data_dir dir;
+ struct bpf_rb_root *tree;
+ int id = q->id;
+ int count = 0;
+
+ *type = ELEVATOR_NO_MERGE;
+ dir = ((rq->cmd_flags & REQ_OP_MASK) & 1) ? UFQ_SIMP_WRITE : UFQ_SIMP_READ;
+ ufq_sd = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sd)
+ return NULL;
+
+ /* Calculate current request position and end */
+ rq_start = rq->__sector;
+ rq_end = rq_start + (rq->__data_len >> SECTOR_SHIFT);
+
+ if (dir == UFQ_SIMP_READ)
+ tree = &ufq_sd->sort_tree_read;
+ else
+ tree = &ufq_sd->sort_tree_write;
+
+ bpf_spin_lock(&ufq_sd->lock);
+ rb_node = bpf_rbtree_root(tree);
+ if (!rb_node) {
+ bpf_spin_unlock(&ufq_sd->lock);
+ return NULL;
+ }
+
+ while (mt == ELEVATOR_NO_MERGE && rb_node && count < UFQ_LOOP_MAX) {
+ count++;
+ snode = container_of(rb_node, struct sort_tree_node, rb_node);
+ targ = bpf_kptr_xchg(&snode->req, NULL);
+ if (!targ)
+ break;
+
+ other_start = targ->__sector;
+ other_end = other_start + (targ->__data_len >> SECTOR_SHIFT);
+ targ_mq_ctx = targ->mq_ctx;
+ targ_mq_hctx = targ->mq_hctx;
+
+ targ = bpf_kptr_xchg(&snode->req, targ);
+ if (targ) {
+ bpf_spin_unlock(&ufq_sd->lock);
+ bpf_request_release(targ);
+ return NULL;
+ }
+
+ if (rq_start > other_end)
+ rb_node = bpf_rbtree_right(tree, rb_node);
+ else if (rq_end < other_start)
+ rb_node = bpf_rbtree_left(tree, rb_node);
+ else if (rq_end == other_start)
+ mt = ELEVATOR_FRONT_MERGE;
+ else if (other_end == rq_start)
+ mt = ELEVATOR_BACK_MERGE;
+ else
+ break;
+
+ if (mt) {
+ if (rq->mq_ctx != targ_mq_ctx || rq->mq_hctx != targ_mq_hctx) {
+ mt = ELEVATOR_NO_MERGE;
+ break;
+ }
+
+ rb_node = bpf_rbtree_remove(tree, rb_node);
+ if (rb_node) {
+ snode = container_of(rb_node,
+ struct sort_tree_node, rb_node);
+ targ = bpf_kptr_xchg(&snode->req, NULL);
+ bpf_spin_unlock(&ufq_sd->lock);
+ if (targ) {
+ *type = mt;
+ stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
+ stat_add(UFQ_SIMP_RQMERGE_SIZE, targ->__data_len);
+ stat_sub(UFQ_SIMP_INSERT_CNT, 1);
+ stat_sub(UFQ_SIMP_INSERT_SIZE, targ->__data_len);
+ }
+
+ bpf_obj_drop(snode);
+ } else {
+ bpf_spin_unlock(&ufq_sd->lock);
+ *type = ELEVATOR_NO_MERGE;
+ }
+ return targ;
+ }
+ }
+ bpf_spin_unlock(&ufq_sd->lock);
+
+ return NULL;
+}
+
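+/*
+ * After a front bio merge into @cand, check whether @cand now abuts its
+ * left (lower-sector) neighbour and, if so, merge the two requests. Called
+ * with ufq_sd->lock held; drops the lock on every return path.
+ */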
+static struct request *merge_bio_left_unlock(struct ufq_simple_data *ufq_sd,
+ struct bpf_rb_root *tree,
+ struct sort_tree_node *snode,
+ struct request *cand)
+{
+ sector_t cand_start, left_start, left_end;
+ struct request *free = NULL;
+ struct request *left_rq = NULL;
+ struct sort_tree_node *left_node = NULL;
+ struct bpf_rb_node *tmp, *removed = NULL;
+
+ cand_start = cand->__sector;
+ tmp = bpf_rbtree_left(tree, &snode->rb_node);
+ if (!tmp)
+ goto end;
+
+ left_node = container_of(tmp, struct sort_tree_node, rb_node);
+ if (!left_node)
+ goto end;
+
+ left_rq = bpf_kptr_xchg(&left_node->req, NULL);
+ if (!left_rq)
+ goto end;
+
+ left_start = left_rq->__sector;
+ left_end = left_start + (left_rq->__data_len >> SECTOR_SHIFT);
+
+ if (left_end == cand_start)
+ free = bpf_request_try_merge(left_rq, cand);
+
+ if (free == cand) {
+ removed = bpf_rbtree_remove(tree, &snode->rb_node);
+ left_rq = bpf_kptr_xchg(&left_node->req, left_rq);
+ bpf_spin_unlock(&ufq_sd->lock);
+ if (removed) {
+ struct sort_tree_node *drop = container_of(removed,
+ struct sort_tree_node, rb_node);
+ bpf_obj_drop(drop);
+ }
+ stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
+ stat_add(UFQ_SIMP_RQMERGE_SIZE, free->__data_len);
+
+ if (left_rq)
+ bpf_request_release(left_rq);
+
+ return free;
+ }
+
+ left_rq = bpf_kptr_xchg(&left_node->req, left_rq);
+
+end:
+ cand = bpf_kptr_xchg(&snode->req, cand);
+ bpf_spin_unlock(&ufq_sd->lock);
+ if (left_rq)
+ bpf_request_release(left_rq);
+
+ if (cand)
+ bpf_request_release(cand);
+
+ return NULL;
+}
+
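+/*
+ * Counterpart of merge_bio_left_unlock() for a back bio merge: probe the
+ * right (higher-sector) neighbour of @cand. Also called with ufq_sd->lock
+ * held and drops it on every return path.
+ */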
+static struct request *merge_bio_right_unlock(struct ufq_simple_data *ufq_sd,
+ struct bpf_rb_root *tree,
+ struct sort_tree_node *snode,
+ struct request *cand)
+{
+ sector_t cand_end, right_start;
+ struct request *free = NULL;
+ struct request *right_rq = NULL;
+ struct sort_tree_node *right_node = NULL;
+ struct bpf_rb_node *right_rb, *removed = NULL;
+
+ cand_end = cand->__sector + (cand->__data_len >> SECTOR_SHIFT);
+
+ right_rb = bpf_rbtree_right(tree, &snode->rb_node);
+ if (!right_rb)
+ goto end;
+
+ right_node = container_of(right_rb, struct sort_tree_node, rb_node);
+ if (!right_node)
+ goto end;
+
+ right_rq = bpf_kptr_xchg(&right_node->req, NULL);
+ if (!right_rq)
+ goto end;
+
+ right_start = right_rq->__sector;
+ if (cand_end == right_start)
+ free = bpf_request_try_merge(cand, right_rq);
+
+ if (free == right_rq) {
+ removed = bpf_rbtree_remove(tree, right_rb);
+ cand = bpf_kptr_xchg(&snode->req, cand);
+ bpf_spin_unlock(&ufq_sd->lock);
+ if (removed) {
+ struct sort_tree_node *drop = container_of(removed,
+ struct sort_tree_node, rb_node);
+ bpf_obj_drop(drop);
+ }
+ stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
+ stat_add(UFQ_SIMP_RQMERGE_SIZE, free->__data_len);
+
+ if (cand)
+ bpf_request_release(cand);
+
+ return free;
+ }
+
+ right_rq = bpf_kptr_xchg(&right_node->req, right_rq);
+
+end:
+ cand = bpf_kptr_xchg(&snode->req, cand);
+ bpf_spin_unlock(&ufq_sd->lock);
+ if (right_rq)
+ bpf_request_release(right_rq);
+
+ if (cand)
+ bpf_request_release(cand);
+
+ return NULL;
+}
+
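+/*
+ * Bio merge: find a request adjacent to the bio in the sector-sorted tree
+ * and try to merge the bio into it. On success, the neighbour on the grown
+ * side is probed for a follow-up request-request merge; a request emptied
+ * by that merge is returned so the core can free it.
+ */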
+struct request *BPF_STRUCT_OPS(ufq_simple_merge_bio,
+ struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs, bool *merged)
+{
+ sector_t start, end, cand_start, cand_end;
+ struct request *cand = NULL, *old, *free = NULL;
+ struct sort_tree_node *snode = NULL;
+ struct bpf_rb_node *rb_node = NULL;
+ struct ufq_simple_data *ufq_sd;
+ enum ufq_simp_data_dir dir;
+ int id = q->id, count = 0;
+ struct bpf_rb_root *tree;
+
+	if (!merged)
+		return NULL;
+	*merged = false;
+ start = bio->bi_iter.bi_sector;
+ end = start + (bio->bi_iter.bi_size >> SECTOR_SHIFT);
+ dir = ((bio->bi_opf & REQ_OP_MASK) & 1) ? UFQ_SIMP_WRITE : UFQ_SIMP_READ;
+ ufq_sd = bpf_map_lookup_elem(&ufq_map, &id);
+ if (!ufq_sd)
+ return NULL;
+
+ bpf_spin_lock(&ufq_sd->lock);
+ if (dir == UFQ_SIMP_READ)
+ tree = &ufq_sd->sort_tree_read;
+ else
+ tree = &ufq_sd->sort_tree_write;
+
+ rb_node = bpf_rbtree_root(tree);
+ while (rb_node && count < UFQ_LOOP_MAX) {
+ count++;
+ snode = container_of(rb_node, struct sort_tree_node, rb_node);
+ cand = bpf_kptr_xchg(&snode->req, NULL);
+ if (!cand)
+ break;
+
+ cand_start = cand->__sector;
+ cand_end = cand_start + (cand->__data_len >> SECTOR_SHIFT);
+
+ if (end < cand_start) {
+ rb_node = bpf_rbtree_left(tree, rb_node);
+ } else if (start > cand_end) {
+ rb_node = bpf_rbtree_right(tree, rb_node);
+ } else if (cand_start == end) {
+ if (bpf_request_bio_try_merge(cand, bio, nr_segs)) {
+ *merged = true;
+ free = merge_bio_left_unlock(ufq_sd, tree, snode, cand);
+ stat_add(UFQ_SIMP_BIOMERGE_CNT, 1);
+ stat_add(UFQ_SIMP_BIOMERGE_SIZE, bio->bi_iter.bi_size);
+ return free;
+ }
+ rb_node = NULL;
+ } else if (cand_end == start) {
+ if (bpf_request_bio_try_merge(cand, bio, nr_segs)) {
+ *merged = true;
+ free = merge_bio_right_unlock(ufq_sd, tree, snode, cand);
+ stat_add(UFQ_SIMP_BIOMERGE_CNT, 1);
+ stat_add(UFQ_SIMP_BIOMERGE_SIZE, bio->bi_iter.bi_size);
+ return free;
+ }
+ rb_node = NULL;
+ } else {
+ rb_node = NULL;
+ }
+
+ old = bpf_kptr_xchg(&snode->req, cand);
+ if (old) {
+ bpf_spin_unlock(&ufq_sd->lock);
+ bpf_request_release(old);
+ return NULL;
+ }
+ }
+
+ bpf_spin_unlock(&ufq_sd->lock);
+ return NULL;
+}
+
+UFQ_OPS_DEFINE(ufq_simple_ops,
+ .init_sched = (void *)ufq_simple_init_sched,
+ .exit_sched = (void *)ufq_simple_exit_sched,
+ .insert_req = (void *)ufq_simple_insert_req,
+ .dispatch_req = (void *)ufq_simple_dispatch_req,
+ .has_req = (void *)ufq_simple_has_req,
+ .finish_req = (void *)ufq_simple_finish_req,
+ .merge_req = (void *)ufq_simple_merge_req,
+ .merge_bio = (void *)ufq_simple_merge_bio,
+ .name = "ufq_simple");
diff --git a/tools/ufq_iosched/ufq_simple.c b/tools/ufq_iosched/ufq_simple.c
new file mode 100644
index 000000000000..0e7b018699a2
--- /dev/null
+++ b/tools/ufq_iosched/ufq_simple.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 KylinSoft Corporation.
+ * Copyright (c) 2026 Kaitao Cheng <chengkaitao@kylinos.cn>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <ufq/common.h>
+#include "ufq_simple.bpf.skel.h"
+
+const char help_fmt[] =
+"A simple ufq scheduler.\n"
+"\n"
+"Usage: %s [-v] [-d] [-h]\n"
+"\n"
+" -v Print version\n"
+" -d Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
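+/*
+ * Example session (illustrative only; assumes the elevator is registered
+ * as "ufq" and that nvme0n1 is the target disk):
+ *
+ *	# echo ufq > /sys/block/nvme0n1/queue/scheduler
+ *	# ./ufq_simple -d
+ */
+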
+#define UFQ_SIMPLE_VERSION "0.1.0"
+#define TIME_INTERVAL 3
+static bool verbose;
+static volatile int exit_req;
+__u64 old_stats[UFQ_SIMP_STAT_MAX];
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int sig)
+{
+ exit_req = 1;
+}
+
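+/* Sum the per-CPU slots of each stats counter into the flat @stats array. */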
+static void read_stats(struct ufq_simple *skel, __u64 *stats)
+{
+ int nr_cpus = libbpf_num_possible_cpus();
+ __u64 cnts[UFQ_SIMP_STAT_MAX][nr_cpus];
+ __u32 idx;
+
+ memset(stats, 0, sizeof(stats[0]) * UFQ_SIMP_STAT_MAX);
+
+ for (idx = 0; idx < UFQ_SIMP_STAT_MAX; idx++) {
+ int ret, cpu;
+
+ ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+ &idx, cnts[idx]);
+ if (ret < 0)
+ continue;
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ stats[idx] += cnts[idx][cpu];
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct ufq_simple *skel;
+ struct bpf_link *link;
+	int opt;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+
+ skel = UFQ_OPS_OPEN(ufq_simple_ops, ufq_simple);
+
+ while ((opt = getopt(argc, argv, "vdh")) != -1) {
+ switch (opt) {
+ case 'v':
+ printf("ufq_simple version: %s\n", UFQ_SIMPLE_VERSION);
+ return 0;
+ case 'd':
+ verbose = true;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ UFQ_OPS_LOAD(skel, ufq_simple_ops, ufq_simple);
+ link = UFQ_OPS_ATTACH(skel, ufq_simple_ops, ufq_simple);
+
+ printf("ufq_simple loop ...\n");
+ while (!exit_req) {
+ __u64 stats[UFQ_SIMP_STAT_MAX];
+
+ printf("--------------------------------\n");
+ read_stats(skel, stats);
+		printf("bw:%lluKiB/s iops:%llu\n",
+ (stats[UFQ_SIMP_FINISH_SIZE] -
+ old_stats[UFQ_SIMP_FINISH_SIZE]) / 1024 / TIME_INTERVAL,
+ (stats[UFQ_SIMP_FINISH_CNT] -
+ old_stats[UFQ_SIMP_FINISH_CNT]) / TIME_INTERVAL);
+ printf("(insert: cnt=%llu size=%llu)\n",
+ stats[UFQ_SIMP_INSERT_CNT], stats[UFQ_SIMP_INSERT_SIZE]);
+ printf("(rqmerge: cnt=%llu size=%llu) (biomerge: cnt=%llu size=%llu)\n",
+ stats[UFQ_SIMP_RQMERGE_CNT], stats[UFQ_SIMP_RQMERGE_SIZE],
+ stats[UFQ_SIMP_BIOMERGE_CNT], stats[UFQ_SIMP_BIOMERGE_SIZE]);
+ printf("(dispatch: cnt=%llu size=%llu) (finish: cnt=%llu size=%llu)\n",
+ stats[UFQ_SIMP_DISPATCH_CNT], stats[UFQ_SIMP_DISPATCH_SIZE],
+ stats[UFQ_SIMP_FINISH_CNT], stats[UFQ_SIMP_FINISH_SIZE]);
+ memcpy(old_stats, stats, sizeof(old_stats));
+ sleep(TIME_INTERVAL);
+ }
+
+ printf("ufq_simple loop exit ...\n");
+ bpf_link__destroy(link);
+ ufq_simple__destroy(skel);
+
+ return 0;
+}
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 6+ messages in thread