Re: [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Kaitao cheng" <kaitao.cheng@linux.dev>
Cc: bpf@vger.kernel.org
Subject: Re: [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding
Date: Sun, 03 May 2026 05:18:33 +0000	[thread overview]
Message-ID: <20260503051833.E4773C2BCB4@smtp.kernel.org> (raw)
In-Reply-To: <20260503035623.28771-4-kaitao.cheng@linux.dev>

Thank you for your contribution! Sashiko AI review found 9 potential issue(s) to consider:
- [High] Hallucinated BPF rbtree traversal functions
- [High] Wait-context violation and lock inversion via `bpf_request_bio_try_merge`
- [High] Verifier rejection due to unreleased acquired request references
- [High] Verifier rejection from accessing non-owning reference after tree mutation
- [High] Target request permanently orphaned after a merge operation
- [High] Red-black tree corruption due to mutable traversal keys
- [High] Use-After-Free when accessing `free->__data_len`
- [Medium] Incorrect queue insertion order for `BLK_MQ_INSERT_AT_HEAD`
- [Medium] Non-atomic updates to per-CPU array statistics
--

commit 837eafbebf3f48dff3cc402191f742e1f891291d
Author: Kaitao Cheng <chengkaitao@kylinos.cn>
tools/ufq_iosched: add BPF example scheduler and build scaffolding

Add ufq_iosched as a simple example for the UFQ block I/O scheduler. It
implements the eBPF struct_ops hooks the kernel exposes to exercise the
behavior and stability of the kernel UFQ scheduling framework.

> diff --git a/tools/ufq_iosched/ufq_simple.bpf.c b/tools/ufq_iosched/ufq_simple.bpf.c
> new file mode 100644
> index 000000000000..81954e73068a
> --- /dev/null
> +++ b/tools/ufq_iosched/ufq_simple.bpf.c

[ ... ]

> +static void stat_add(u32 idx, u32 val)
> +{
> +	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
> +
> +	if (cnt_p)
> +		(*cnt_p) += val;
> +}

Since operations like insert_req can run in process context while finish_req
might run in softirq on the same CPU, could these non-atomic updates result
in lost statistics? Should this use __sync_fetch_and_add() instead?

[ ... ]

> +int BPF_STRUCT_OPS(ufq_simple_insert_req, struct request_queue *q,
> +		   struct request *rq, blk_insert_t flags,
> +		   struct list_head *freeq)
> +{
[ ... ]
> +	if (flags & BLK_MQ_INSERT_AT_HEAD) {
[ ... ]
> +		/* Add queue_list_node to dispatch list */
> +		bpf_spin_lock(&ufq_sd->lock);
> +		ret = bpf_list_push_back(&ufq_sd->dispatch, &qnode->node);
> +		bpf_spin_unlock(&ufq_sd->lock);
> +	} else {

Does using bpf_list_push_back() here place the request at the tail of the
dispatch list rather than the head? Since dispatch_req pops from the front,
this seems to violate the requested BLK_MQ_INSERT_AT_HEAD semantics.

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_dispatch_req, struct request_queue *q)
> +{
[ ... ]
> +		snode = container_of(rb_node, struct sort_tree_node, rb_node);
> +
> +		/* Get request from sort_tree_node (this will be returned) */
> +		rq = bpf_kptr_xchg(&snode->req, NULL);
> +		bpf_spin_unlock(&ufq_sd->lock);
> +		bpf_obj_drop(snode);
> +	}
> +	if (!rq)
> +		bpf_printk("ufq_simple/dispatch_req: no request to dispatch");
> +
> +out:
> +	if (rq) {
> +		stat_add(UFQ_SIMP_DISPATCH_CNT, 1);
> +		stat_add(UFQ_SIMP_DISPATCH_SIZE, rq->__data_len);
> +	}
> +
> +	return rq;
> +}

Will the verifier reject this program because it returns an acquired reference
from bpf_kptr_xchg without calling bpf_request_release()? Also, if the
verifier did allow this, would the request reference permanently leak since
the block layer doesn't consume BPF kptr references?

> +bool BPF_STRUCT_OPS(ufq_simple_has_req, struct request_queue *q, int rqs_count)
> +{
[ ... ]
> +	bpf_spin_lock(&ufq_sd->lock);
> +	has = !bpf_list_empty(&ufq_sd->dispatch) ||
> +	      bpf_rbtree_root(&ufq_sd->sort_tree_read) ||
> +	      bpf_rbtree_root(&ufq_sd->sort_tree_write);

The functions bpf_rbtree_root(), bpf_rbtree_left(), and bpf_rbtree_right()
are used throughout this file, but they do not appear to exist in the BPF
API. The API only exposes bpf_rbtree_add(), bpf_rbtree_remove(), and
bpf_rbtree_first(). Will this code fail to compile and load?

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_merge_req, struct request_queue *q,
> +				struct request *rq, int *type)
> +{
[ ... ]
> +		if (mt) {
> +			if (rq->mq_ctx != targ_mq_ctx || rq->mq_hctx != targ_mq_hctx) {
> +				mt = ELEVATOR_NO_MERGE;
> +				break;
> +			}
> +
> +			rb_node = bpf_rbtree_remove(tree, rb_node);
> +			if (rb_node) {
> +				snode = container_of(rb_node,
> +					struct sort_tree_node, rb_node);
> +				targ = bpf_kptr_xchg(&snode->req, NULL);
> +				bpf_spin_unlock(&ufq_sd->lock);
> +				if (targ) {
> +					*type = mt;
> +					stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
> +					stat_add(UFQ_SIMP_RQMERGE_SIZE, targ->__data_len);
> +					stat_sub(UFQ_SIMP_INSERT_CNT, 1);
> +					stat_sub(UFQ_SIMP_INSERT_SIZE, targ->__data_len);
> +				}
> +
> +				bpf_obj_drop(snode);
> +			} else {
> +				bpf_spin_unlock(&ufq_sd->lock);
> +				*type = ELEVATOR_NO_MERGE;
> +			}
> +			return targ;
> +		}
> +	}
> +	bpf_spin_unlock(&ufq_sd->lock);
> +
> +	return NULL;
> +}

If a candidate request is found for merging, it gets permanently removed from
the scheduler's rbtree and its node is dropped. Since dispatch_req will now
never see it, does this result in a permanent I/O hang for the merged target
request?

> +static struct request *merge_bio_left_unlock(struct ufq_simple_data *ufq_sd,
> +					     struct bpf_rb_root *tree,
> +					     struct sort_tree_node *snode,
> +					     struct request *cand)
> +{
[ ... ]
> +	if (left_end == cand_start)
> +		free = bpf_request_try_merge(left_rq, cand);
> +
> +	if (free == cand) {
> +		removed = bpf_rbtree_remove(tree, &snode->rb_node);
> +		left_rq = bpf_kptr_xchg(&left_node->req, left_rq);

According to BPF safety rules, removing a node from a collection invalidates
all non-owning references to that collection. Will the verifier reject this
program because it accesses left_node right after calling bpf_rbtree_remove()?

> +		bpf_spin_unlock(&ufq_sd->lock);
> +		if (removed) {
> +			struct sort_tree_node *drop = container_of(removed,
> +					struct sort_tree_node, rb_node);
> +			bpf_obj_drop(drop);
> +		}
> +		stat_add(UFQ_SIMP_RQMERGE_CNT, 1);
> +		stat_add(UFQ_SIMP_RQMERGE_SIZE, free->__data_len);

If bpf_request_try_merge() is successful, it merges and frees the candidate
request. By passing free->__data_len to stat_add, does this cause a
use-after-free read on a request that was just returned to the block layer?

[ ... ]

> +struct request *BPF_STRUCT_OPS(ufq_simple_merge_bio,
> +			       struct request_queue *q, struct bio *bio,
> +			       unsigned int nr_segs, bool *merged)
> +{
[ ... ]
> +	bpf_spin_lock(&ufq_sd->lock);
[ ... ]
> +	while (rb_node && count < UFQ_LOOP_MAX) {
> +		count++;
> +		snode = container_of(rb_node, struct sort_tree_node, rb_node);
> +		cand = bpf_kptr_xchg(&snode->req, NULL);
> +		if (!cand)
> +			break;
> +
> +		cand_start = cand->__sector;
> +		cand_end = cand_start + (cand->__data_len >> SECTOR_SHIFT);
> +
> +		if (end < cand_start) {
> +			rb_node = bpf_rbtree_left(tree, rb_node);
> +		} else if (start > cand_end) {
> +			rb_node = bpf_rbtree_right(tree, rb_node);
> +		} else if (cand_start == end) {
> +			if (bpf_request_bio_try_merge(cand, bio, nr_segs)) {

The rbtree traversal here compares against cand->__sector. If a request was
previously front-merged, its __sector would be modified by the block layer.
Since the node isn't re-inserted to update its sorting key, could subsequent
traversals navigate down the wrong tree branches and corrupt the red-black
tree?

Additionally, the code holds a bpf_spin_lock while calling the kfunc
bpf_request_bio_try_merge().

Looking at bpf_request_bio_try_merge() in block/ufq-kfunc.c, it internally
acquires a spin_lock:

block/ufq-kfunc.c:bpf_request_bio_try_merge() {
    ...
	spin_lock(&ctx->lock);
	merged = blk_attempt_bio_merge(rq->q, rq, bio, nr_segs, true) == BIO_MERGE_OK;
	spin_unlock(&ctx->lock);
    ...
}

Since bpf_spin_lock() maps to a raw_spinlock_t, and spin_lock() is sleepable
on PREEMPT_RT kernels, will this create a sleep-in-atomic context that
triggers a lockdep BUG and system panic?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260503035623.28771-1-kaitao.cheng@linux.dev?part=3

     prev parent reply	other threads:[~2026-05-03  5:18 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-03  3:56 [RFC v2 0/3] block: Introduce a BPF-based I/O scheduler Kaitao cheng
2026-05-03  3:56 ` [RFC v2 1/3] bpf: Add KF_SPIN_LOCK flag for kfuncs under bpf_spin_lock Kaitao cheng
2026-05-03  3:56 ` [RFC v2 2/3] block: Introduce the UFQ I/O scheduler Kaitao cheng
2026-05-03  4:45   ` bot+bpf-ci
2026-05-03  4:57   ` sashiko-bot
2026-05-03  3:56 ` [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding Kaitao cheng
2026-05-03  4:44   ` bot+bpf-ci
2026-05-03  5:18   ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260503051833.E4773C2BCB4@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kaitao.cheng@linux.dev \
    --cc=sashiko@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.