Sched_ext development
* [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
@ 2026-05-10 22:43 Andrea Righi
  2026-05-11  2:55 ` Tejun Heo
  2026-05-12  0:44 ` sashiko-bot
  0 siblings, 2 replies; 4+ messages in thread
From: Andrea Righi @ 2026-05-10 22:43 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min; +Cc: sched-ext, linux-kernel

Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced by running
the reload_loop kselftest inside a virtme-ng session:

 $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
 ...
 BUG: kernel NULL pointer dereference, address: 0000000000000400
 ...
 RIP: 0010:scx_claim_exit+0x3b/0x120
 Call Trace:
  <TASK>
  bpf_scx_unreg+0x45/0xb0
  bpf_struct_ops_map_link_dealloc+0x39/0x50
  bpf_link_release+0x18/0x20
  __fput+0x10b/0x2e0
  __x64_sys_close+0x47/0xa0

This was introduced by commit 105dcd005be2 ("sched_ext: Introduce
scx_prog_sched()"), which:

  - Made kfuncs look up the scheduler via scx_prog_sched(aux), which
    resolves aux -> struct_ops -> ops->priv.

  - Added RCU_INIT_POINTER(ops->priv, NULL) to bpf_scx_unreg() before
    dropping the kobject reference.

Under concurrent attach/detach of the same struct_ops program, the BPF
program's aux->struct_ops association can resolve to a struct_ops whose
->priv was just cleared by a concurrent bpf_scx_unreg(), or to one where
scx_alloc_and_add_sched() has not yet completed rcu_assign_pointer().

When scx_prog_sched() observes this transient ops->priv == NULL, it
returns NULL; kfuncs like scx_bpf_create_dsq() then return -ENODEV,
causing ops.init() to fail. The failed attach enters the disable path,
and the subsequent bpf_scx_unreg() reads NULL from ops->priv and
dereferences it in scx_claim_exit().

Fix it in two places:

  - scx_prog_sched(): when ops is found but ops->priv is NULL, fall
    through to the scx_root path instead of returning NULL. For
    single-sched (the only currently supported configuration), this
    restores the previous behavior; for sub-sched-aware schedulers, the
    existing !root->ops.sub_attach guard keeps the fallback off,
    preserving multi-sched semantics.

  - bpf_scx_unreg(): guard against ops->priv == NULL so the function
    becomes a no-op instead of calling scx_disable(NULL, ...), which
    dereferences the NULL pointer in scx_claim_exit().

Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c          |  8 ++++++++
 kernel/sched/ext_internal.h | 16 +++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4efe0099f79af..6c476ec5dcbe1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7608,6 +7608,14 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
 	struct sched_ext_ops *ops = kdata;
 	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
 
+	/*
+	 * ops->priv can be NULL if scx_alloc_and_add_sched() failed before
+	 * assigning it, or if bpf_scx_unreg() somehow re-entered. There's
+	 * nothing to tear down in either case.
+	 */
+	if (!sch)
+		return;
+
 	scx_disable(sch, SCX_EXIT_UNREG);
 	scx_flush_disable_work(sch);
 	RCU_INIT_POINTER(ops->priv, NULL);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index a075732d4430d..e468a7401ed83 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1433,11 +1433,21 @@ static inline bool scx_task_on_sched(struct scx_sched *sch,
 static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux)
 {
 	struct sched_ext_ops *ops;
-	struct scx_sched *root;
+	struct scx_sched *root, *sch;
 
 	ops = bpf_prog_get_assoc_struct_ops(aux);
-	if (likely(ops))
-		return rcu_dereference_all(ops->priv);
+	if (likely(ops)) {
+		sch = rcu_dereference_all(ops->priv);
+		if (likely(sch))
+			return sch;
+		/*
+		 * @aux is associated with @ops but @ops->priv is NULL. This can
+		 * be observed transiently under concurrent attach/detach (e.g.
+		 * bpf_scx_unreg() clears @ops->priv before kdata is freed).
+		 * Continue with the scx_root path so single-sched users keep
+		 * working; sub-sched users see no scheduler.
+		 */
+	}
 
 	root = rcu_dereference_all(scx_root);
 	if (root) {
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
  2026-05-10 22:43 [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg() Andrea Righi
@ 2026-05-11  2:55 ` Tejun Heo
  2026-05-11  5:41   ` Andrea Righi
  2026-05-12  0:44 ` sashiko-bot
  1 sibling, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2026-05-11  2:55 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, sched-ext,
	linux-kernel

Hello, Andrea.

I traced reload_loop with per-CPU ring probes around all @ops->priv
and scx_root assign/clear sites. The race is a stomp:

  T2 unreg(K)                       T1 reg(K)
  -----------                       ---------
  sch = ops->priv = sch_b800
  scx_disable; flush_disable_work
    [scx_root_disable: scx_root=NULL,
     mutex_unlock, state=DISABLED]
                                    mutex_lock; state ok
                                    scx_alloc_and_add_sched:
                                      ops->priv = sch_a800
                                    scx_root = sch_a800; init=0
                                    state=ENABLED; mutex_unlock
    [flush returns]
  RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
  kobject_put(sch_b800)

The race is reachable because the unreg side waits on sch->helper while
the next reg runs on the global scx_enable_helper, and scx_enable_mutex
is released inside scx_root_disable() well before bpf_scx_unreg()
reaches its RCU_INIT_POINTER(). My trace caught 11us between PRIV_SET
sch_a800 and the clobber; nothing bounds that window.

The posted patch suppresses the deref but leaves the stomp. Each
stomp leaks one sch (the "sch's base reference will be put by
bpf_scx_unreg()" contract assumes ops->priv still points at it), and
in the case I caught, sch_a800 is already SCX_ENABLED with scx_root
pointing at it - the bpf_link is gone but state stays ENABLED, so all
future attaches fail with -EBUSY permanently.

Suggestion: make @ops->priv the lifecycle binding. In
scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
existing state check and still under scx_enable_mutex, refuse with
-EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
ordering.

One question: are there other paths that write or clear @ops->priv?
I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
RCU_INIT_POINTER(NULL) in bpf_scx_unreg().

Thanks.

--
tejun


* Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
  2026-05-11  2:55 ` Tejun Heo
@ 2026-05-11  5:41   ` Andrea Righi
  0 siblings, 0 replies; 4+ messages in thread
From: Andrea Righi @ 2026-05-11  5:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, sched-ext,
	linux-kernel

Hi Tejun,

On Sun, May 10, 2026 at 04:55:30PM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> I traced reload_loop with per-CPU ring probes around all @ops->priv
> and scx_root assign/clear sites. The race is a stomp:
> 
>   T2 unreg(K)                       T1 reg(K)
>   -----------                       ---------
>   sch = ops->priv = sch_b800
>   scx_disable; flush_disable_work
>     [scx_root_disable: scx_root=NULL,
>      mutex_unlock, state=DISABLED]
>                                     mutex_lock; state ok
>                                     scx_alloc_and_add_sched:
>                                       ops->priv = sch_a800
>                                     scx_root = sch_a800; init=0
>                                     state=ENABLED; mutex_unlock
>     [flush returns]
>   RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
>   kobject_put(sch_b800)

Ah, that makes sense! Yes, that's the case.

> 
> Reachable because the unreg waits on sch->helper while the next reg
> runs on the global scx_enable_helper, and scx_enable_mutex is released
> inside scx_root_disable() well before bpf_scx_unreg() reaches its
> RCU_INIT_POINTER. My trace caught 11us between PRIV_SET sch_a800 and
> the clobber; nothing bounds it.
> 
> The posted patch suppresses the deref but leaves the stomp. Each
> stomp leaks one sch (the "sch's base reference will be put by
> bpf_scx_unreg()" contract assumes ops->priv still points at it), and
> in the case I caught, sch_a800 is already SCX_ENABLED with scx_root
> pointing at it - the bpf_link is gone but state stays ENABLED, so all
> future attaches fail with -EBUSY permanently.
> 
> Suggestion: make @ops->priv the lifecycle binding. In
> scx_root_enable_workfn() (and scx_sub_enable_workfn()), after the
> existing state check and still under scx_enable_mutex, refuse with
> -EBUSY if @ops->priv is non-NULL. Unreg side keeps its current
> ordering.

I'll send a new version implementing this.

> 
> One question: are there other paths that write or clear @ops->priv?
> I only see the rcu_assign_pointer in scx_alloc_and_add_sched and the
> RCU_INIT_POINTER(NULL) in bpf_scx_unreg().

AFAICS there's only the rcu_assign_pointer() in scx_alloc_and_add_sched() and
RCU_INIT_POINTER(NULL) in bpf_scx_unreg(), no other writers/clearers. So the
-EBUSY check should be sufficient to close all the races.

Thanks,
-Andrea


* Re: [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()
  2026-05-10 22:43 [PATCH sched_ext/for-7.1-fixes] sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg() Andrea Righi
  2026-05-11  2:55 ` Tejun Heo
@ 2026-05-12  0:44 ` sashiko-bot
  1 sibling, 0 replies; 4+ messages in thread
From: sashiko-bot @ 2026-05-12  0:44 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Critical] Dangling pointer in scx_alloc_and_add_sched() leads to Use-After-Free.
- [High] Spurious scx_error() triggers Denial of Service in multi-sched environments.
- [High] Early return in bpf_scx_unreg() introduces a permanent resource leak and wedges the scheduler state.
- [High] Fallback to scx_root in scx_prog_sched() breaks isolation and allows cross-scheduler state corruption.
- [High] Using ops->priv to track the scheduler instance inherently breaks sub-schedulers.
--

commit 13d28bde689d42f58394dbdb0585d40676beacf4
Author: Andrea Righi <arighi@nvidia.com>

sched_ext: Fix ops->priv NULL pointer deref in bpf_scx_unreg()

Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This commit fixes it by falling back
to the scx_root path in scx_prog_sched() when ops->priv is NULL, and
returning early in bpf_scx_unreg() if the scheduler pointer is NULL.

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 345aa11b84b28..11998530416fb 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7493,6 +7493,14 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct sched_ext_ops *ops = kdata;
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>  
> +	/*
> +	 * ops->priv can be NULL if scx_alloc_and_add_sched() failed before
> +	 * assigning it, or if bpf_scx_unreg() somehow re-entered. There's
> +	 * nothing to tear down in either case.
> +	 */
> +	if (!sch)
> +		return;
> +

Does this lockless check introduce a permanent resource leak when multiple
links are created for the same map?

Looking at bpf_struct_ops_map_link_dealloc(), it does not hold locks
preventing concurrent reg operations. If Thread 1 initiates unreg for a
detached link and executes scx_disable(), but is preempted before executing
RCU_INIT_POINTER(ops->priv, NULL):

Thread 1:
scx_disable()
... preempted ...

Thread 2 could successfully attach a new link for the same map, assigning its
new scheduler sch2 to ops->priv.

When Thread 1 resumes and executes RCU_INIT_POINTER(ops->priv, NULL), it
clears the pointer to Thread 2's active scheduler.

When Thread 2 subsequently detaches, would bpf_scx_unreg() read ops->priv,
see NULL, and return early here?

This would completely skip scx_disable() and kobject_put(), permanently
leaking sch2 and leaving scx_enable_state stuck at SCX_ENABLED. Would this
prevent new schedulers from being loaded until the system is rebooted?

Can a dangling pointer in scx_alloc_and_add_sched() lead to a use-after-free
with this ops->priv logic?

While not introduced by this patch, in scx_alloc_and_add_sched(),
rcu_assign_pointer(ops->priv, sch) is called before several fallible
allocations.

If a subsequent step like kobject_init_and_add() fails, the function jumps to
the error path and frees sch via kfree(), but fails to reset ops->priv to
NULL.

If a map was previously attached, ran, and armed a stray bpf_timer, could a
subsequent failed re-attach attempt leave ops->priv pointing to the freed sch?

When the stray timer fires, would scx_prog_sched() dereference ops->priv and
return the freed pointer to BPF kfuncs?

> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index a075732d4430d..e468a7401ed83 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1433,11 +1433,21 @@ static inline bool scx_task_on_sched(struct scx_sched *sch,
>  static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux)
>  {
>  	struct sched_ext_ops *ops;
> -	struct scx_sched *root;
> +	struct scx_sched *root, *sch;
>  
>  	ops = bpf_prog_get_assoc_struct_ops(aux);
> -	if (likely(ops))
> -		return rcu_dereference_all(ops->priv);
> +	if (likely(ops)) {
> +		sch = rcu_dereference_all(ops->priv);
> +		if (likely(sch))
> +			return sch;
> +		/*
> +		 * @aux is associated with @ops but @ops->priv is NULL. This can
> +		 * be observed transiently under concurrent attach/detach (e.g.
> +		 * bpf_scx_unreg() clears @ops->priv before kdata is freed).
> +		 * Continue with the scx_root path so single-sched users keep
> +		 * working, sub-sched users see no scheduler.
> +		 */
> +	}

Does this fallback to scx_root trigger a denial of service in multi-sched
environments?

If root->ops.sub_attach is true, the fallback bypasses the return root; path
and instead hits scx_error(root, "Unassociated program executing with
sub-schedulers").

Because scx_error() is a macro that disables the BPF scheduler and forces
a system-wide fallback to CFS, could encountering this transient
ops->priv == NULL state instantly crash the root scheduler?

Can this fallback to scx_root break isolation and allow cross-scheduler state
corruption?

When a scheduler is detached, bpf_scx_unreg() clears ops->priv to NULL.
However, asynchronous execution contexts like bpf_timer or workqueues armed
by the detached scheduler can still fire.

If this stray program calls a kfunc, it observes ops->priv == NULL and now
falls back to returning the globally active scx_root.

If an entirely different scheduler has since been attached, would the stray
program bypass isolation boundaries and directly manipulate the new
scheduler's state? Prior to this patch, it safely returned NULL.

This isn't a bug introduced by this patch, but does using ops->priv to track
the scheduler instance inherently break sub-schedulers?

Because ops->priv is a single pointer on the shared struct_ops map (kdata),
attaching the same map to multiple cgroups will sequentially overwrite
ops->priv. It will only track the most recently attached scheduler.

If a BPF program executes for an older cgroup's scheduler, will it read
ops->priv, receive the newer cgroup's scheduler, and operate on the wrong
state?

>  
>  	root = rcu_dereference_all(scx_root);
>  	if (root) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260510224332.2011982-1-arighi@nvidia.com?part=1

