From: sashiko-bot@kernel.org
To: "Kaitao cheng" <kaitao.cheng@linux.dev>
Cc: bpf@vger.kernel.org
Subject: Re: [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding
Date: Sun, 03 May 2026 05:18:33 +0000	[thread overview]
Message-ID: <20260503051833.E4773C2BCB4@smtp.kernel.org> (raw)
In-Reply-To: <20260503035623.28771-4-kaitao.cheng@linux.dev>

Thank you for your contribution! Sashiko AI review found 9 potential issue(s) to consider:
- [High] Hallucinated BPF rbtree traversal functions
- [High] Wait-context violation and lock inversion via `bpf_request_bio_try_merge`
- [High] Verifier rejection due to unreleased acquired request references
- [High] Verifier rejection from accessing non-owning reference after tree mutation
- [High] Target request permanently orphaned after a merge operation
- [High] Red-black tree corruption due to mutable traversal keys
- [High] Use-After-Free when accessing `free->__data_len`
- [Medium] Incorrect queue insertion order for `BLK_MQ_INSERT_AT_HEAD`
- [Medium] Non-atomic updates to per-CPU array statistics
--

commit 837eafbebf3f48dff3cc402191f742e1f891291d
Author: Kaitao Cheng <chengkaitao@kylinos.cn>
tools/ufq_iosched: add BPF example scheduler and build scaffolding

Add ufq_iosched as a simple example for the UFQ block I/O scheduler. It
implements the eBPF struct_ops hooks the kernel exposes to exercise the
behavior and stability of the kernel UFQ scheduling framework.

> diff --git a/tools/ufq_iosched/ufq_simple.bpf.c b/tools/ufq_iosched/ufq_simple.bpf.c
> new file mode 100644
> index 000000000000..81954e73068a
> --- /dev/null
> +++ b/tools/ufq_iosched/ufq_simple.bpf.c

[ ... ]

> +static void stat_add(u32 idx, u32 val)
> +{
> +	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
> +
> +	if (cnt_p)
> +		(*cnt_p) += val;
> +}

Since operations like insert_req can run in process context while finish_req
might run in softirq on the same CPU, could these non-atomic updates result
in lost statistics? Should this use __sync_fetch_and_add() instead?
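
A minimal sketch of the atomic variant, reusing the map and names from the
quoted code, could look like:

static void stat_add(u32 idx, u32 val)
{
	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);

	/* atomic RMW so a softirq completion on the same CPU cannot
	 * lose an update racing with process context
	 */
	if (cnt_p)
		__sync_fetch_and_add(cnt_p, val);
}

with stat_sub() updated the same way using __sync_fetch_and_sub().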

[ ... ]

> +int BPF_STRUCT_OPS(ufq_simple_insert_req, struct request_queue *q,
> +		   struct request *rq, blk_insert_t flags,
> +		   struct list_head *freeq)
> +{
[ ... ]
> +	if (flags & BLK_MQ_INSERT_AT_HEAD) {
[ ... ]
> +		/* Add queue_list_node to dispatch list */
> +		bpf_spin_lock(&ufq_sd->lock);
> +		ret = bpf_list_push_back(&ufq_sd->dispatch, &qnode->node);
> +		bpf_spin_unlock(&ufq_sd->lock);
> +	} else {

Does using bpf_list_push_back() here place the request at the tail of the
dispatch list rather than the head? Since dispatch_req pops from the front,
this seems to violate the requested BLK_MQ_INSERT_AT_HEAD semantics.
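
For illustration only, keeping the field names from the quoted code, the
flag could select the push direction instead:

	bpf_spin_lock(&ufq_sd->lock);
	if (flags & BLK_MQ_INSERT_AT_HEAD)
		/* head insertion: dispatch_req pops from the front */
		ret = bpf_list_push_front(&ufq_sd->dispatch, &qnode->node);
	else
		ret = bpf_list_push_back(&ufq_sd->dispatch, &qnode->node);
	bpf_spin_unlock(&ufq_sd->lock);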

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_dispatch_req, struct request_queue *q)
> +{
[ ... ]
> +		snode = container_of(rb_node, struct sort_tree_node, rb_node);
> +
> +		/* Get request from sort_tree_node (this will be returned) */
> +		rq = bpf_kptr_xchg(&snode->req, NULL);
> +		bpf_spin_unlock(&ufq_sd->lock);
> +		bpf_obj_drop(snode);
> +	}
> +	if (!rq)
> +		bpf_printk("ufq_simple/dispatch_req: no request to dispatch");
> +
> +out:
> +	if (rq) {
> +		stat_add(UFQ_SIMP_DISPATCH_CNT, 1);
> +		stat_add(UFQ_SIMP_DISPATCH_SIZE, rq->__data_len);
> +	}
> +
> +	return rq;
> +}

Will the verifier reject this program because it returns an acquired reference
from bpf_kptr_xchg without calling bpf_request_release()? Also, if the
verifier did allow this, would the request reference permanently leak since
the block layer doesn't consume BPF kptr references?

> +bool BPF_STRUCT_OPS(ufq_simple_has_req, struct request_queue *q, int rqs_count)
> +{
[ ... ]
> +	bpf_spin_lock(&ufq_sd->lock);
> +	has = !bpf_list_empty(&ufq_sd->dispatch) ||
> +	      bpf_rbtree_root(&ufq_sd->sort_tree_read) ||
> +	      bpf_rbtree_root(&ufq_sd->sort_tree_write);

The functions bpf_rbtree_root(), bpf_rbtree_left(), and bpf_rbtree_right()
are used throughout this file, but they do not appear to exist in the kernel's
BPF kfunc API; the documented rbtree kfuncs are bpf_rbtree_add(),
bpf_rbtree_remove(), and bpf_rbtree_first(). Won't these calls fail to
resolve, so the program either fails to build or is rejected at load time?
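
For the has_req check specifically, a sketch that uses only the documented
kfuncs (keeping the quoted bpf_list_empty() call as-is) might be:

	bpf_spin_lock(&ufq_sd->lock);
	has = !bpf_list_empty(&ufq_sd->dispatch) ||
	      bpf_rbtree_first(&ufq_sd->sort_tree_read) != NULL ||
	      bpf_rbtree_first(&ufq_sd->sort_tree_write) != NULL;
	bpf_spin_unlock(&ufq_sd->lock);

bpf_rbtree_first() returns a non-owning node pointer that is only valid
under the lock, which is all an emptiness check needs.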

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_merge_req, struct request_queue *q,
> +				struct request *rq, int *type)
> +{
[ ... ]
> +		if (mt) {
> +			if (rq->mq_ctx != targ_mq_ctx || rq->mq_hctx != targ_mq_hctx) {
> +				mt = ELEVATOR_NO_MERGE;
> +				break;
> +			}
> +
> +			rb_node = bpf_rbtree_remove(tree, rb_node);
> +			if (rb_node) {
> +				snode = container_of(rb_node,
> +					struct sort_tree_node, rb_node);
> +				targ = bpf_kptr_xchg(&snode->req, NULL);
> +				bpf_spin_unlock(&ufq_sd->lock);
> +				if (targ) {
> +					*type = mt;
> +					stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
> +					stat_add(UFQ_SIMP_RQMERGE_SIZE, targ->__data_len);
> +					stat_sub(UFQ_SIMP_INSERT_CNT, 1);
> +					stat_sub(UFQ_SIMP_INSERT_SIZE, targ->__data_len);
> +				}
> +
> +				bpf_obj_drop(snode);
> +			} else {
> +				bpf_spin_unlock(&ufq_sd->lock);
> +				*type = ELEVATOR_NO_MERGE;
> +			}
> +			return targ;
> +		}
> +	}
> +	bpf_spin_unlock(&ufq_sd->lock);
> +
> +	return NULL;
> +}

If a candidate request is found for merging, it gets permanently removed from
the scheduler's rbtree and its node is dropped. Since dispatch_req will now
never see it, does this result in a permanent I/O hang for the merged target
request?

> +static struct request *merge_bio_left_unlock(struct ufq_simple_data *ufq_sd,
> +					     struct bpf_rb_root *tree,
> +					     struct sort_tree_node *snode,
> +					     struct request *cand)
> +{
[ ... ]
> +	if (left_end == cand_start)
> +		free = bpf_request_try_merge(left_rq, cand);
> +
> +	if (free == cand) {
> +		removed = bpf_rbtree_remove(tree, &snode->rb_node);
> +		left_rq = bpf_kptr_xchg(&left_node->req, left_rq);

According to BPF safety rules, removing a node from a collection invalidates
all non-owning references to that collection. Will the verifier reject this
program because it accesses left_node right after calling bpf_rbtree_remove()?
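
A sketch of the safer ordering, reusing the quoted names: restore the kptr
while left_node is still a valid non-owning reference, and only then mutate
the tree,

	if (free == cand) {
		/* left_node is still valid: no tree mutation yet */
		left_rq = bpf_kptr_xchg(&left_node->req, left_rq);
		/* this invalidates every non-owning ref into the tree */
		removed = bpf_rbtree_remove(tree, &snode->rb_node);
	}

with the unlock and bpf_obj_drop() of the removed node following as in the
quoted code below.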

> +		bpf_spin_unlock(&ufq_sd->lock);
> +		if (removed) {
> +			struct sort_tree_node *drop = container_of(removed,
> +					struct sort_tree_node, rb_node);
> +			bpf_obj_drop(drop);
> +		}
> +		stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
> +		stat_add(UFQ_SIMP_RQMERGE_SIZE, free->__data_len);

If bpf_request_try_merge() is successful, it merges and frees the candidate
request. By passing free->__data_len to stat_add, does this cause a
use-after-free read on a request that was just returned to the block layer?
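
A hedged sketch of one way around that, using the names from the quoted
helper: snapshot the size before the merge and account with the local copy,

	u32 cand_len = cand->__data_len;	/* snapshot before any merge */

	if (left_end == cand_start)
		free = bpf_request_try_merge(left_rq, cand);

	if (free == cand) {
		/* tree removal and unlock as in the patch, then: */
		stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
		stat_add(UFQ_SIMP_RQMERGE_SIZE, cand_len);
	}

so nothing dereferences the candidate after the merge may have freed it.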

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_merge_bio,
> +			       struct request_queue *q, struct bio *bio,
> +			       unsigned int nr_segs, bool *merged)
> +{
[ ... ]
> +	bpf_spin_lock(&ufq_sd->lock);
[ ... ]
> +	while (rb_node && count < UFQ_LOOP_MAX) {
> +		count++;
> +		snode = container_of(rb_node, struct sort_tree_node, rb_node);
> +		cand = bpf_kptr_xchg(&snode->req, NULL);
> +		if (!cand)
> +			break;
> +
> +		cand_start = cand->__sector;
> +		cand_end = cand_start + (cand->__data_len >> SECTOR_SHIFT);
> +
> +		if (end < cand_start) {
> +			rb_node = bpf_rbtree_left(tree, rb_node);
> +		} else if (start > cand_end) {
> +			rb_node = bpf_rbtree_right(tree, rb_node);
> +		} else if (cand_start == end) {
> +			if (bpf_request_bio_try_merge(cand, bio, nr_segs)) {

The rbtree traversal here compares against cand->__sector. If a request was
previously front-merged, its __sector would be modified by the block layer.
Since the node isn't re-inserted to update its sorting key, could subsequent
traversals navigate down the wrong tree branches and corrupt the red-black
tree?
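
If this design is kept, one hedged sketch of a fix for the stale key (the
lock-context issue below is a separate problem) is to re-position the node
after a successful front merge, assuming the program's existing less()
callback is available (called snode_less here purely for illustration):

	if (cand_start == end &&
	    bpf_request_bio_try_merge(cand, bio, nr_segs)) {
		struct bpf_rb_node *owned;

		/* put the request back while snode is still a valid ref */
		cand = bpf_kptr_xchg(&snode->req, cand);
		/* the front merge lowered __sector, so re-insert the node
		 * to keep the rbtree ordering consistent;
		 * snode_less: the program's less() callback (name assumed)
		 */
		owned = bpf_rbtree_remove(tree, &snode->rb_node);
		if (owned) {
			snode = container_of(owned,
					struct sort_tree_node, rb_node);
			bpf_rbtree_add(tree, &snode->rb_node, snode_less);
		}
	}

Whether the verifier accepts this exact sequence depends on the lock and
reference rules the series relies on.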

Additionally, the code holds a bpf_spin_lock while calling the kfunc
bpf_request_bio_try_merge().

Looking at bpf_request_bio_try_merge() in block/ufq-kfunc.c, it internally
acquires a spin_lock:

block/ufq-kfunc.c:bpf_request_bio_try_merge() {
    ...
	spin_lock(&ctx->lock);
	merged = blk_attempt_bio_merge(rq->q, rq, bio, nr_segs, true) == BIO_MERGE_OK;
	spin_unlock(&ctx->lock);
    ...
}

Since bpf_spin_lock() maps to a raw_spinlock_t, and spin_lock() becomes a
sleeping lock on PREEMPT_RT kernels, won't this be a sleep-in-atomic /
invalid-wait-context violation that lockdep reports as a BUG and that can
crash the system?
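
Given that patch 1/3 adds KF_SPIN_LOCK precisely so kfuncs can run under
bpf_spin_lock, a hedged kfunc-side sketch would be to make the inner lock a
raw spinlock. This assumes ctx->lock has no user that needs a sleeping lock
and that blk_attempt_bio_merge() is audited to be safe in a non-preemptible
section:

-	spinlock_t lock;
+	raw_spinlock_t lock;
 ...
-	spin_lock(&ctx->lock);
+	raw_spin_lock(&ctx->lock);
 	merged = blk_attempt_bio_merge(rq->q, rq, bio, nr_segs, true) == BIO_MERGE_OK;
-	spin_unlock(&ctx->lock);
+	raw_spin_unlock(&ctx->lock);

Alternatively, the BPF program could drop its bpf_spin_lock before calling
the kfunc: cand is an acquired reference after bpf_kptr_xchg(), so it stays
valid after unlock, at the cost of not continuing the tree traversal.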

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260503035623.28771-1-kaitao.cheng@linux.dev?part=3

Thread overview: 8+ messages
2026-05-03  3:56 [RFC v2 0/3] block: Introduce a BPF-based I/O scheduler Kaitao cheng
2026-05-03  3:56 ` [RFC v2 1/3] bpf: Add KF_SPIN_LOCK flag for kfuncs under bpf_spin_lock Kaitao cheng
2026-05-03  3:56 ` [RFC v2 2/3] block: Introduce the UFQ I/O scheduler Kaitao cheng
2026-05-03  4:45   ` bot+bpf-ci
2026-05-03  4:57   ` sashiko-bot
2026-05-03  3:56 ` [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding Kaitao cheng
2026-05-03  4:44   ` bot+bpf-ci
2026-05-03  5:18   ` sashiko-bot [this message]
