Re: [RFC v2 2/3] block: Introduce the UFQ I/O scheduler

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Kaitao cheng" <kaitao.cheng@linux.dev>
Cc: bpf@vger.kernel.org
Subject: Re: [RFC v2 2/3] block: Introduce the UFQ I/O scheduler
Date: Sun, 03 May 2026 04:57:18 +0000	[thread overview]
Message-ID: <20260503045719.320E3C2BCB4@smtp.kernel.org> (raw)
In-Reply-To: <20260503035623.28771-3-kaitao.cheng@linux.dev>

Thank you for your contribution! Sashiko AI review found 11 potential issue(s) to consider:
- [Critical] Arbitrary kernel stack read/write due to unvalidated PTR_TO_BUF context arguments.
- [Critical] `dispatch_req` accepts requests from foreign block queues, corrupting hardware context maps.
- [High] `ufq_bio_merge` incorrectly destroys the request when a bio is successfully merged.
- [High] BPF kfuncs and callbacks modify request state without verifying they are idle, causing DMA corruption.
- [High] Incorrect request handling after successful `merge_req` causes duplicate BPF insertions and ghost requests.
- [High] The `bpf_request_try_merge` kfunc permanently leaks the merged request and hardware tag.
- [High] Exposing the request to the dispatch list before BPF insertion causes a Write-After-Free.
- [High] Missing memory barrier in `bpf_ufq_disable` allows `ufq_ops` to be destroyed while still in use.
- [High] Annotating blocking kfuncs with `KF_SPIN_LOCK` triggers a PREEMPT_RT panic.
- [High] `bpf_request_try_merge` blindly casts elevator private data, causing memory corruption across I/O schedulers.
- [High] Failed BPF request insertions still receive a `finish_req` callback, corrupting BPF state.
--

commit 3bcfa14cede6b5991bb234bc371f06d27de520a5
Author: Kaitao Cheng <chengkaitao@kylinos.cn>

block: Introduce the UFQ I/O scheduler

Introduce IOSCHED_UFQ, a blk-mq elevator whose policy is supplied by an eBPF
program via struct_ops. When no program is attached, it uses a simple
per-ctx queueing policy as a fallback mechanism.

> diff --git a/block/ufq-bpfops.c b/block/ufq-bpfops.c
> --- /dev/null
> +++ b/block/ufq-bpfops.c
[ ... ]
> +static bool bpf_ufq_is_valid_access(int off, int size,
> +				    enum bpf_access_type type,
> +				    const struct bpf_prog *prog,
> +				    struct bpf_insn_access_aux *info)
> +{
[ ... ]
> +	if (size == sizeof(__u64) && prog->aux->attach_func_name &&
> +	    ((!strcmp(prog->aux->attach_func_name, "merge_req") && off == 16) ||
> +	     (!strcmp(prog->aux->attach_func_name, "merge_bio") && off == 24))) {
> +		if (!btf_ctx_access(off, size, type, prog, info))
> +			return false;
> +		info->reg_type = PTR_TO_BUF;
> +		return true;
> +	}
> +
> +	return btf_ctx_access(off, size, type, prog, info);
> +}

Does this allow arbitrary memory access?

The verifier handles PTR_TO_BUF by recording the maximum accessed offset in
env->prog->aux->max_rdwr_access. Since UFQ uses BPF_PROG_TYPE_STRUCT_OPS,
max_rdwr_access is ignored during program load.

Can a BPF program use large offsets to read or write arbitrary kernel stack
memory?

[ ... ]
> +static void bpf_ufq_disable(struct ufq_iosched_ops *ops)
> +{
> +	atomic_set(&ufq_bpfops_enabled, 0);
> +	wait_event(ufq_bpfops_wq, !atomic_read(&ufq_bpfops_users));
> +	memset(&ufq_ops, 0, sizeof(ufq_ops));
> +}

Since atomic_set() is relaxed and the initial check in wait_event() evaluates
locklessly without a full memory barrier, could the read of ufq_bpfops_users
be reordered before the store to ufq_bpfops_enabled?

If a concurrent ufq_bpfops_tryget() reads enabled == 1 while the disable
path observes users == 0, would it proceed to clear ufq_ops while it is still
in use?

> diff --git a/block/ufq-iosched.c b/block/ufq-iosched.c
> --- /dev/null
> +++ b/block/ufq-iosched.c
[ ... ]
> +static struct request *ufq_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct ufq_data *ufq = hctx->queue->elevator->elevator_data;
> +	const struct ufq_iosched_ops *ops;
> +	struct blk_mq_ctx *ctx;
> +	struct request *rq = NULL;
> +	unsigned short idx;
> +
> +	ops = ufq_bpfops_tryget();
> +	if (ops && ops->dispatch_req) {
> +		rq = ops->dispatch_req(hctx->queue);

Is there anything preventing the BPF program from returning a request that
belongs to a different block device's queue?

If the BPF program stores a request from Queue A and returns it during the
dispatch cycle of Queue B, would this clear the wrong ctx_map bits below
and dispatch the request to the wrong driver?

[ ... ]
> +static bool ufq_bio_merge(struct request_queue *q, struct bio *bio,
> +			  unsigned int nr_segs)
> +{
[ ... ]
> +	ops = ufq_bpfops_tryget();
> +	if (ops) {
> +		if (ops->merge_bio) {
> +			rq = ops->merge_bio(q, bio, nr_segs, &ret);
[ ... ]
> +			if (rq) {
> +				ufq_bpfops_put();
> +				spin_lock(&rq->mq_ctx->lock);
> +				if (!list_empty(&rq->queuelist)) {
> +					list_del_init(&rq->queuelist);
> +					atomic_dec(&ufq->rqs_count);
> +				}
> +				spin_unlock(&rq->mq_ctx->lock);
> +				blk_mq_free_request(rq);

When merge_bio() returns a request, doesn't this indicate the bio was
successfully appended to that existing request?

If so, why is blk_mq_free_request() called here? This seems to abort the
pending I/O and destroy the request while it is still waiting to be
dispatched.

[ ... ]
> +static enum elv_merge ufq_try_insert_merge(struct request_queue *q,
> +					   struct request **new)
> +{
[ ... ]
> +	if (type == ELEVATOR_NO_MERGE || !target) {
> +		if (ops)
> +			ufq_bpfops_put();
> +		return ELEVATOR_NO_MERGE;
> +	} else if (type == ELEVATOR_FRONT_MERGE) {
[ ... ]
> +	} else if (type == ELEVATOR_BACK_MERGE) {
> +		spin_lock(&target->mq_ctx->lock);
> +		free = bpf_attempt_merge(q, target, rq);
> +		if (!free) {
> +			spin_unlock(&target->mq_ctx->lock);
> +			pr_err("ufq-iosched: back merge failed\n");
> +			goto rollback;
> +		}
> +		*new = target;

If ops->merge_req() returns ELEVATOR_BACK_MERGE, *new is set to target.
In ufq_insert_requests(), this request is then passed to ops->insert_req().

Doesn't this cause a duplicate insert_req() call for a request the BPF
program is already tracking?

Also, if ELEVATOR_FRONT_MERGE is returned, the kernel replaces and frees
the target request, but is the BPF program ever notified about the freed
request?

[ ... ]
> +static void ufq_insert_requests(struct blk_mq_hw_ctx *hctx,
> +			       struct list_head *list,
> +			       blk_insert_t flags)
> +{
[ ... ]
> +		type = ufq_try_insert_merge(q, &rq);
> +		if (type == ELEVATOR_NO_MERGE) {
> +			rq->fifo_time = jiffies;
> +			ctx = rq->mq_ctx;
> +			rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
> +					  | UFQ_PRIV_IN_UFQ);
> +			spin_lock(&ctx->lock);
> +			if (flags & BLK_MQ_INSERT_AT_HEAD)
> +				list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
> +			else
> +				list_add_tail(&rq->queuelist,
> +					&ctx->rq_lists[hctx->type]);
> +
> +			bit = ctx->index_hw[hctx->type];
> +			if (!sbitmap_test_bit(&hctx->ctx_map, bit))
> +				sbitmap_set_bit(&hctx->ctx_map, bit);
> +			q->last_merge = rq;
> +			spin_unlock(&ctx->lock);
> +			atomic_inc(&ufq->rqs_count);
> +		}
> +
> +		if (ops && rq && ops->insert_req) {
> +			rq->elv.priv[0] = (void *)((uintptr_t)rq->elv.priv[0]
> +				  | UFQ_PRIV_IN_BPF);

Once ctx->lock is dropped above, the request is globally visible and can be
immediately dispatched and freed by a concurrent hardware queue run.

Could accessing rq->elv.priv[0] and passing rq to ops->insert_req() below
lead to a write-after-free and a dangling pointer being sent to the BPF
program?

> +			ret = ops->insert_req(q, rq, flags);
> +			if (ret) {
> +				atomic_inc(&ufq->ops_stats.insert_err_count);
> +				pr_err("ufq-iosched: bpf insert_req error (%d)\n", ret);
> +			} else {

If ops->insert_req() returns an error here, the request remains in the
software queue but the UFQ_PRIV_IN_BPF flag is never cleared.

When the request completes, won't ufq_finish_request() see this flag and
incorrectly call ops->finish_req() for a request the BPF program explicitly
rejected?

> diff --git a/block/ufq-kfunc.c b/block/ufq-kfunc.c
> --- /dev/null
> +++ b/block/ufq-kfunc.c
[ ... ]
> +__bpf_kfunc bool bpf_request_bio_try_merge(struct request *rq, struct bio *bio,
> +					   unsigned int nr_segs)
> +{
[ ... ]
> +	spin_lock(&ctx->lock);
> +	merged = blk_attempt_bio_merge(rq->q, rq, bio, nr_segs, true) == BIO_MERGE_OK;
> +	spin_unlock(&ctx->lock);
> +
> +	return merged;
> +}

Is it safe to merge bios into this request without checking if it is idle?

If the target request has already been dispatched to the hardware
(rq->rq_flags & RQF_STARTED), could modifying the bio lists and segment
counts while the hardware is actively executing the I/O cause DMA corruption?

[ ... ]
> +__bpf_kfunc struct request *bpf_request_try_merge(struct request *rq, struct request *next)
> +{
> +	struct blk_mq_ctx *ctx;
> +	struct ufq_data *ufq;
> +	struct request *free;
> +
> +	if (!rq || !next || !rq->q || rq->q != next->q)
> +		return NULL;
> +
> +	ufq = rq->q->elevator->elevator_data;

Because this kfunc is also registered for BPF_PROG_TYPE_SYSCALL, a BPF
program could pass a request from a block device running a completely
different I/O scheduler.

Wouldn't assuming rq->q->elevator->elevator_data is a struct ufq_data and
decrementing ufq->rqs_count below corrupt the private data of that other
scheduler?

[ ... ]
> +	spin_lock(&ctx->lock);
> +	free = bpf_attempt_merge(rq->q, rq, next);
> +	if (free) {
> +		if (rq->q->last_merge == free)
> +			rq->q->last_merge = NULL;
> +		list_del_init(&free->queuelist);
> +		atomic_dec(&ufq->rqs_count);
> +	}
> +	spin_unlock(&ctx->lock);
> +
> +	return free;
> +}

When bpf_attempt_merge() successfully merges next into rq, it returns the
hollowed-out next request.

Since this kfunc returns it directly to the BPF program without the
KF_ACQUIRE flag, the verifier treats it as an untracked pointer. Does this
prevent the BPF program from calling bpf_request_release() on it,
permanently leaking the request and its hardware tag?

[ ... ]
> +BTF_KFUNCS_START(ufq_kfunc_set_ops)
> +BTF_ID_FLAGS(func, bpf_request_acquire, KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_request_release, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_request_bio_try_merge, KF_SPIN_LOCK)
> +BTF_ID_FLAGS(func, bpf_request_try_merge, KF_SPIN_LOCK)
> +BTF_KFUNCS_END(ufq_kfunc_set_ops)

These kfuncs are registered with KF_SPIN_LOCK, meaning they can be called
inside a bpf_spin_lock critical section where preemption and interrupts are
disabled.

Since they call spin_lock(&ctx->lock) internally, won't this cause a
scheduling while atomic panic on PREEMPT_RT kernels where spinlock_t is a
sleeping rt_mutex?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260503035623.28771-1-kaitao.cheng@linux.dev?part=2

next prev parent reply	other threads:[~2026-05-03  4:57 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-03  3:56 [RFC v2 0/3] block: Introduce a BPF-based I/O scheduler Kaitao cheng
2026-05-03  3:56 ` [RFC v2 1/3] bpf: Add KF_SPIN_LOCK flag for kfuncs under bpf_spin_lock Kaitao cheng
2026-05-03  3:56 ` [RFC v2 2/3] block: Introduce the UFQ I/O scheduler Kaitao cheng
2026-05-03  4:45   ` bot+bpf-ci
2026-05-03  4:57   ` sashiko-bot [this message]
2026-05-03  3:56 ` [RFC v2 3/3] tools/ufq_iosched: add BPF example scheduler and build scaffolding Kaitao cheng
2026-05-03  4:44   ` bot+bpf-ci
2026-05-03  5:18   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260503045719.320E3C2BCB4@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kaitao.cheng@linux.dev \
    --cc=sashiko@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.