Sched_ext development
 help / color / mirror / Atom feed
From: sashiko-bot@kernel.org
To: "Andrea Righi" <arighi@nvidia.com>
Cc: sched-ext@lists.linux.dev
Subject: Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
Date: Tue, 12 May 2026 00:44:36 +0000	[thread overview]
Message-ID: <20260512004437.A23D7C2BCB0@smtp.kernel.org> (raw)
In-Reply-To: <20260510224332.2011982-1-arighi@nvidia.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] Dangling pointer in scx_alloc_and_add_sched() leads to Use-After-Free.
- [High] Spurious scx_error() triggers Denial of Service in multi-sched environments.
- [High] Early return in bpf_scx_unreg() introduces a permanent resource leak and wedges the scheduler state.
- [High] Fallback to scx_root in scx_prog_sched() breaks isolation and allows cross-scheduler state corruption.
- [High] Using ops->priv to track the scheduler instance inherently breaks sub-schedulers.
--

commit 13d28bde689d42f58394dbdb0585d40676beacf4
Author: Andrea Righi <arighi@nvidia.com>

sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This commit fixes it by falling back
to the scx_root path in scx_prog_sched() when ops->priv is NULL, and
returning early in bpf_scx_unreg() if the scheduler pointer is NULL.

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 345aa11b84b28..11998530416fb 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7493,6 +7493,14 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct sched_ext_ops *ops = kdata;
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>  
> +	/*
> +	 * ops->priv can be NULL if scx_alloc_and_add_sched() failed before
> +	 * assigning it, or if bpf_scx_unreg() somehow re-entered. There's
> +	 * nothing to tear down in either case.
> +	 */
> +	if (!sch)
> +		return;
> +

Does this lockless check introduce a permanent resource leak when multiple
links are created for the same map?

Looking at bpf_struct_ops_map_link_dealloc(), it does not hold locks
preventing concurrent reg operations. If Thread 1 initiates unreg for a
detached link and executes scx_disable(), but is preempted before executing
RCU_INIT_POINTER(ops->priv, NULL):

Thread 1:
scx_disable()
... preempted ...

Thread 2 could successfully attach a new link for the same map, assigning its
new scheduler sch2 to ops->priv.

When Thread 1 resumes and executes RCU_INIT_POINTER(ops->priv, NULL), it
clears the pointer to Thread 2's active scheduler.

When Thread 2 subsequently detaches, would bpf_scx_unreg() read ops->priv,
see NULL, and return early here?

This would completely skip scx_disable() and kobject_put(), permanently
leaking sch2 and leaving scx_enable_state stuck at SCX_ENABLED. Would this
prevent new schedulers from being loaded until the system is rebooted?

Can a dangling pointer in scx_alloc_and_add_sched() lead to a use-after-free
with this ops->priv logic?

While not introduced by this patch, in scx_alloc_and_add_sched(),
rcu_assign_pointer(ops->priv, sch) is called before several fallible
allocations.

If a subsequent step like kobject_init_and_add() fails, the function jumps to
the error path and frees sch via kfree(), but fails to reset ops->priv to
NULL.

If a map was previously attached, ran, and armed a stray bpf_timer, could a
subsequent failed re-attach attempt leave ops->priv pointing to the freed sch?

When the stray timer fires, would scx_prog_sched() dereference ops->priv and
return the freed pointer to BPF kfuncs?

> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index a075732d4430d..e468a7401ed83 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1433,11 +1433,21 @@ static inline bool scx_task_on_sched(struct scx_sched *sch,
>  static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux)
>  {
>  	struct sched_ext_ops *ops;
> -	struct scx_sched *root;
> +	struct scx_sched *root, *sch;
>  
>  	ops = bpf_prog_get_assoc_struct_ops(aux);
> -	if (likely(ops))
> -		return rcu_dereference_all(ops->priv);
> +	if (likely(ops)) {
> +		sch = rcu_dereference_all(ops->priv);
> +		if (likely(sch))
> +			return sch;
> +		/*
> +		 * @aux is associated with @ops but @ops->priv is NULL. This can
> +		 * be observed transiently under concurrent attach/detach (e.g.
> +		 * bpf_scx_unreg() clears @ops->priv before kdata is freed).
> +		 * Continue with the scx_root path so single-sched users keep
> +		 * working, sub-sched users see no scheduler.
> +		 */
> +	}

Does this fallback to scx_root trigger a denial of service in multi-sched
environments?

If root->ops.sub_attach is true, the fallback bypasses the return root; path
and instead hits scx_error(root, "Unassociated program executing with
sub-schedulers").

Because scx_error() is a macro that disables the BPF scheduler and forces
a system-wide fallback to CFS, could encountering this transient
ops->priv == NULL state instantly crash the root scheduler?

Can this fallback to scx_root break isolation and allow cross-scheduler state
corruption?

When a scheduler is detached, bpf_scx_unreg() clears ops->priv to NULL.
However, asynchronous execution contexts like bpf_timer or workqueues armed
by the detached scheduler can still fire.

If this stray program calls a kfunc, it observes ops->priv == NULL and now
falls back to returning the globally active scx_root.

If an entirely different scheduler has since been attached, would the stray
program bypass isolation boundaries and directly manipulate the new
scheduler's state? Prior to this patch, it safely returned NULL.

This isn't a bug introduced by this patch, but does using ops->priv to track
the scheduler instance inherently break sub-schedulers?

Because ops->priv is a single pointer on the shared struct_ops map (kdata),
attaching the same map to multiple cgroups will sequentially overwrite
ops->priv. It will only track the most recently attached scheduler.

If a BPF program executes for an older cgroup's scheduler, will it read
ops->priv, receive the newer cgroup's scheduler, and operate on the wrong
state?

>  
>  	root = rcu_dereference_all(scx_root);
>  	if (root) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260510224332.2011982-1-arighi@nvidia.com?part=1

      parent reply	other threads:[~2026-05-12  0:44 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-10 22:43 [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg() Andrea Righi
2026-05-11  2:55 ` Tejun Heo
2026-05-11  5:41   ` Andrea Righi
2026-05-12  0:44 ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260512004437.A23D7C2BCB0@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=arighi@nvidia.com \
    --cc=sashiko@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox