* Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
2026-05-10 22:43 [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg() Andrea Righi
@ 2026-05-11 2:55 ` Tejun Heo
2026-05-11 5:41 ` Andrea Righi
2026-05-12 0:44 ` sashiko-bot
1 sibling, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2026-05-11 2:55 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Emil Tsalapatis, sched-ext,
linux-kernel
Hello, Andrea.
I traced reload_loop with per-CPU ring probes around all @ops->priv
and scx_root assign/clear sites. The race is a stomp:
T2 unreg(K)                            T1 reg(K)
-----------                            ---------
sch = ops->priv = sch_b800
scx_disable; flush_disable_work
  [scx_root_disable: scx_root=NULL,
   mutex_unlock, state=DISABLED]
                                       mutex_lock; state ok
                                       scx_alloc_and_add_sched:
                                         ops->priv = sch_a800
                                       scx_root = sch_a800; init=0
                                       state=ENABLED; mutex_unlock
[flush returns]
RCU_INIT_POINTER(ops->priv, NULL)      <-- clobbers sch_a800
kobject_put(sch_b800)
Reachable because the unreg waits on sch->helper while the next reg
runs on the global scx_enable_helper, and scx_enable_mutex is released
inside scx_root_disable() well before bpf_scx_unreg() reaches its
RCU_INIT_POINTER. My trace caught 11us between PRIV_SET sch_a800 and
the clobber; nothing bounds it.
The posted patch suppresses the deref but leaves the stomp. Each
stomp leaks one sch (the "sch's base reference will be put by
bpf_scx_unreg()" contract assumes ops->priv still points at it), and
in the case I caught, sch_a800 is already SCX_ENABLED with scx_root
pointing at it - the bpf_link is gone but state stays ENABLED, so all
future attaches fail with -EBUSY permanently.
Suggestion: make @ops->priv the lifecycle binding. In
scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
existing state check and still under scx_enable_mutex, refuse with
-EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
ordering.
One question: are there other paths that write or clear @ops->priv?
I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
RCU_INIT_POINTER(NULL) in bpf_scx_unreg().
Thanks.
--
tejun
^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
From: sashiko-bot @ 2026-05-12 0:44 UTC (permalink / raw)
To: Andrea Righi; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 5 potential issues to consider:
- [Critical] Dangling pointer in scx_alloc_and_add_sched() leads to Use-After-Free.
- [High] Spurious scx_error() triggers Denial of Service in multi-sched environments.
- [High] Early return in bpf_scx_unreg() introduces a permanent resource leak and wedges the scheduler state.
- [High] Fallback to scx_root in scx_prog_sched() breaks isolation and allows cross-scheduler state corruption.
- [High] Using ops->priv to track the scheduler instance inherently breaks sub-schedulers.
--
commit 13d28bde689d42f58394dbdb0585d40676beacf4
Author: Andrea Righi <arighi@nvidia.com>
sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This commit fixes it by falling back
to the scx_root path in scx_prog_sched() when ops->priv is NULL, and
returning early in bpf_scx_unreg() if the scheduler pointer is NULL.
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 345aa11b84b28..11998530416fb 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7493,6 +7493,14 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> struct sched_ext_ops *ops = kdata;
> struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
> + /*
> + * ops->priv can be NULL if scx_alloc_and_add_sched() failed before
> + * assigning it, or if bpf_scx_unreg() somehow re-entered. There's
> + * nothing to tear down in either case.
> + */
> + if (!sch)
> + return;
> +
Does this lockless check introduce a permanent resource leak when multiple
links are created for the same map?
Looking at bpf_struct_ops_map_link_dealloc(), it does not hold locks
preventing concurrent reg operations. If Thread 1 initiates unreg for a
detached link and executes scx_disable(), but is preempted before executing
RCU_INIT_POINTER(ops->priv, NULL):
Thread 1:
    scx_disable()
    ... preempted ...
Thread 2 could successfully attach a new link for the same map, assigning its
new scheduler sch2 to ops->priv.
When Thread 1 resumes and executes RCU_INIT_POINTER(ops->priv, NULL), it
clears the pointer to Thread 2's active scheduler.
When Thread 2 subsequently detaches, would bpf_scx_unreg() read ops->priv,
see NULL, and return early here?
This would completely skip scx_disable() and kobject_put(), permanently
leaking sch2 and leaving scx_enable_state stuck at SCX_ENABLED. Would this
prevent new schedulers from being loaded until the system is rebooted?
Can a dangling pointer in scx_alloc_and_add_sched() lead to a use-after-free
with this ops->priv logic?
While not introduced by this patch, in scx_alloc_and_add_sched(),
rcu_assign_pointer(ops->priv, sch) is called before several fallible
allocations.
If a subsequent step like kobject_init_and_add() fails, the function jumps to
the error path and frees sch via kfree(), but fails to reset ops->priv to
NULL.
If a map was previously attached, ran, and armed a stray bpf_timer, could a
subsequent failed re-attach attempt leave ops->priv pointing to the freed sch?
When the stray timer fires, would scx_prog_sched() dereference ops->priv and
return the freed pointer to BPF kfuncs?
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index a075732d4430d..e468a7401ed83 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1433,11 +1433,21 @@ static inline bool scx_task_on_sched(struct scx_sched *sch,
> static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux)
> {
> struct sched_ext_ops *ops;
> - struct scx_sched *root;
> + struct scx_sched *root, *sch;
>
> ops = bpf_prog_get_assoc_struct_ops(aux);
> - if (likely(ops))
> - return rcu_dereference_all(ops->priv);
> + if (likely(ops)) {
> + sch = rcu_dereference_all(ops->priv);
> + if (likely(sch))
> + return sch;
> + /*
> + * @aux is associated with @ops but @ops->priv is NULL. This can
> + * be observed transiently under concurrent attach/detach (e.g.
> + * bpf_scx_unreg() clears @ops->priv before kdata is freed).
> + * Continue with the scx_root path so single-sched users keep
> + * working, sub-sched users see no scheduler.
> + */
> + }
Does this fallback to scx_root trigger a denial of service in multi-sched
environments?
If root->ops.sub_attach is true, the fallback bypasses the return root; path
and instead hits scx_error(root, "Unassociated program executing with
sub-schedulers").
Because scx_error() is a macro that disables the BPF scheduler and forces
a system-wide fallback to CFS, could encountering this transient
ops->priv == NULL state instantly crash the root scheduler?
Can this fallback to scx_root break isolation and allow cross-scheduler state
corruption?
When a scheduler is detached, bpf_scx_unreg() clears ops->priv to NULL.
However, asynchronous execution contexts like bpf_timer or workqueues armed
by the detached scheduler can still fire.
If this stray program calls a kfunc, it observes ops->priv == NULL and now
falls back to returning the globally active scx_root.
If an entirely different scheduler has since been attached, would the stray
program bypass isolation boundaries and directly manipulate the new
scheduler's state? Prior to this patch, it safely returned NULL.
This isn't a bug introduced by this patch, but does using ops->priv to track
the scheduler instance inherently break sub-schedulers?
Because ops->priv is a single pointer on the shared struct_ops map (kdata),
attaching the same map to multiple cgroups will sequentially overwrite
ops->priv. It will only track the most recently attached scheduler.
If a BPF program executes for an older cgroup's scheduler, will it read
ops->priv, receive the newer cgroup's scheduler, and operate on the wrong
state?
>
> root = rcu_dereference_all(scx_root);
> if (root) {
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260510224332.2011982-1-arighi@nvidia.com?part=1