* [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
@ 2026-03-26  2:28 zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
                    ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The reload_loop selftest triggers a KASAN null-ptr-deref at
scx_claim_exit+0x83 when two threads concurrently attach and
destroy BPF schedulers using the same ops map.

The race occurs between bpf_scx_unreg() and a concurrent reg():

1. Thread A's bpf_scx_unreg() calls scx_disable() then
   kthread_flush_work(), which blocks until disable completes
   and transitions state back to SCX_DISABLED.

2. With state SCX_DISABLED, a concurrent reg() allocates a
   new sch_B and sets ops->priv = sch_B under scx_enable_mutex.

3. Thread A's bpf_scx_unreg() then executes
   RCU_INIT_POINTER(ops->priv, NULL), overwriting sch_B.

4. When Thread B's link is destroyed, bpf_scx_unreg() reads
   ops->priv == NULL and passes it to scx_disable(), which
   calls scx_claim_exit(NULL), crashing at NULL+0x310.
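
As a timeline (everything except the ops->priv accesses elided):

  Thread A (unreg, owns sch_A)        Thread B (reg, creates sch_B)
  ----------------------------        -----------------------------
  sch = ops->priv;     /* sch_A */
  scx_disable(sch);
  kthread_flush_work(...);
    /* state -> SCX_DISABLED */
                                      mutex_lock(&scx_enable_mutex);
                                      ops->priv = sch_B;
                                      mutex_unlock(&scx_enable_mutex);
  RCU_INIT_POINTER(ops->priv, NULL);  /* sch_B silently dropped */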

Fix by adding a NULL guard for the case where ops->priv was
never set, and by acquiring scx_enable_mutex before clearing
ops->priv, so that the check-and-clear is atomic with respect
to reg(), which also sets ops->priv under scx_enable_mutex.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/ext.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 551bfb99157d..01077cc2eb62 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7372,9 +7372,22 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
 	struct sched_ext_ops *ops = kdata;
 	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
 
+	if (!sch)
+		return;
+
 	scx_disable(sch, SCX_EXIT_UNREG);
 	kthread_flush_work(&sch->disable_work);
-	RCU_INIT_POINTER(ops->priv, NULL);
+
+	/*
+	 * A concurrent reg() may have already installed a new scheduler into
+	 * ops->priv by the time disable completes. Clear ops->priv only if it
+	 * still holds our sch.
+	 */
+	mutex_lock(&scx_enable_mutex);
+	if (rcu_access_pointer(ops->priv) == sch)
+		RCU_INIT_POINTER(ops->priv, NULL);
+	mutex_unlock(&scx_enable_mutex);
+
 	kobject_put(&sch->kobj);
 }
 
--
2.43.0

* [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
@ 2026-03-26  2:28 ` zhidao su
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
  2 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The dsq_reenq test was using NUM_WORKERS=4 on a 4-CPU system, so
each worker ran on its own CPU. Workers sleep 500us and yield,
getting dispatched from USER_DSQ almost immediately. By the time
the BPF timer's deferred scx_bpf_dsq_reenq() fires, USER_DSQ is
empty and no reenqueue event is observed (nr_reenq_kfunc stays 0).

Fix by increasing NUM_WORKERS to 16 (>4 CPUs). With more runnable
tasks than CPUs, USER_DSQ always has a backlog when the timer
fires, so scx_bpf_dsq_reenq() reliably finds tasks to reenqueue.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 tools/testing/selftests/sched_ext/dsq_reenq.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/dsq_reenq.c b/tools/testing/selftests/sched_ext/dsq_reenq.c
index b0d99f9c9a9a..fe18c6af3877 100644
--- a/tools/testing/selftests/sched_ext/dsq_reenq.c
+++ b/tools/testing/selftests/sched_ext/dsq_reenq.c
@@ -15,7 +15,12 @@
 #include "dsq_reenq.bpf.skel.h"
 #include "scx_test.h"
 
-#define NUM_WORKERS 4
+/*
+ * Use more workers than CPUs so USER_DSQ is always backlogged.
+ * With tasks queued faster than dispatch can consume them, the
+ * periodic timer will always find tasks to reenqueue.
+ */
+#define NUM_WORKERS 16
 #define TEST_DURATION_SEC 3
 
 static volatile bool stop_workers;
-- 
2.43.0
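
The worker behavior the commit message relies on ("sleep 500us and
yield") is roughly the pattern below. worker_fn itself is outside the
hunk, so this is a sketch of the described behavior rather than the
file's exact code:

	static void *worker_fn(void *arg)
	{
		while (!stop_workers) {
			usleep(500);	/* sleep 500us ... */
			sched_yield();	/* ... then yield, cycling back through USER_DSQ */
		}
		return NULL;
	}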

* [PATCH 3/3] selftests/sched_ext: Fix consume_immed test reliability
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
@ 2026-03-26  2:28 ` zhidao su
  2026-03-26  7:06 ` Andrea Righi
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
  2 siblings, 1 reply; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The consume_immed test was failing because nr_consume_immed_reenq
stayed 0. Two issues:

1. Workers were spread across CPUs, so CPU 0's local DSQ rarely
   accumulated multiple tasks. Fix: pin all workers to CPU 0 and
   increase NUM_WORKERS to 8 to ensure USER_DSQ is always backlogged.

2. ops.dispatch() called scx_bpf_dsq_move_to_local() only once, so
   CPU 0's local DSQ would contain exactly 1 task after dispatch.
   The IMMED slow path requires dsq->nr > 1 at the time of insertion
   (in dsq_inc_nr()), so a single dispatch call never triggers it.

   Fix: call scx_bpf_dsq_move_to_local() in a loop (up to 4 times)
   within a single ops.dispatch() invocation. The second call finds
   dsq->nr already 1, so dsq->nr increments to 2, triggering
   schedule_reenq_local() and the IMMED slow path.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 .../selftests/sched_ext/consume_immed.bpf.c | 25 ++++++++++++++-----
 .../selftests/sched_ext/consume_immed.c     | 14 ++++++++++-
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/sched_ext/consume_immed.bpf.c b/tools/testing/selftests/sched_ext/consume_immed.bpf.c
index 9c7808f5abe1..e99bea0b2c24 100644
--- a/tools/testing/selftests/sched_ext/consume_immed.bpf.c
+++ b/tools/testing/selftests/sched_ext/consume_immed.bpf.c
@@ -11,9 +11,9 @@
  * explicit SCX_ENQ_IMMED in enq_flags (requires v2 kfunc)
  *
  * Worker threads belonging to test_tgid are inserted into USER_DSQ.
- * ops.dispatch() on CPU 0 consumes from USER_DSQ with SCX_ENQ_IMMED.
- * With multiple workers competing for CPU 0, dsq->nr > 1 triggers the
- * IMMED slow path (reenqueue with SCX_TASK_REENQ_IMMED).
+ * ops.dispatch() on CPU 0 consumes multiple tasks from USER_DSQ with
+ * SCX_ENQ_IMMED in a single dispatch call, causing dsq->nr to exceed 1
+ * and triggering the IMMED slow path (reenqueue with SCX_TASK_REENQ_IMMED).
  *
  * Requires scx_bpf_dsq_move_to_local___v2() (v7.1+) for enq_flags support.
  */
@@ -55,10 +55,23 @@ void BPF_STRUCT_OPS(consume_immed_enqueue, struct task_struct *p,
 
 void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
 {
-	if (cpu == 0)
-		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
-	else
+	int i;
+
+	if (cpu != 0) {
 		scx_bpf_dsq_move_to_local(SCX_DSQ_GLOBAL, 0);
+		return;
+	}
+
+	/*
+	 * Move multiple tasks into CPU 0's local DSQ with SCX_ENQ_IMMED in a
+	 * single dispatch call. When the second task is inserted (dsq->nr > 1),
+	 * dsq_inc_nr() triggers the IMMED slow path via schedule_reenq_local(),
+	 * which calls ops.enqueue() with SCX_ENQ_REENQ | SCX_TASK_REENQ_IMMED.
+	 */
+	for (i = 0; i < 4; i++) {
+		if (!scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED))
+			break;
+	}
 }
 
 s32 BPF_STRUCT_OPS_SLEEPABLE(consume_immed_init)

diff --git a/tools/testing/selftests/sched_ext/consume_immed.c b/tools/testing/selftests/sched_ext/consume_immed.c
index 7f9594cfa9cb..61cbc0fe3663 100644
--- a/tools/testing/selftests/sched_ext/consume_immed.c
+++ b/tools/testing/selftests/sched_ext/consume_immed.c
@@ -12,18 +12,30 @@
 #include <stdio.h>
 #include <unistd.h>
 #include <pthread.h>
+#include <sched.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "consume_immed.bpf.skel.h"
 #include "scx_test.h"
 
-#define NUM_WORKERS 4
+/*
+ * Use more workers than CPUs, all pinned to CPU 0, so CPU 0's local DSQ
+ * accumulates multiple IMMED tasks at once, reliably triggering the slow path.
+ */
+#define NUM_WORKERS 8
 #define TEST_DURATION_SEC 3
 
 static volatile bool stop_workers;
 
 static void *worker_fn(void *arg)
 {
+	cpu_set_t cpuset;
+
+	/* Pin to CPU 0 to saturate its local DSQ */
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
 	while (!stop_workers) {
 		volatile unsigned long i;
 
-- 
2.43.0
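
The dsq->nr > 1 condition the commit message points at reduces,
conceptually, to the shape below. This is a sketch of the described
behavior, not the actual kernel implementation of dsq_inc_nr():

	static void dsq_inc_nr(struct scx_dispatch_q *dsq)
	{
		dsq->nr++;
		if (dsq->nr > 1)	/* second insertion into this DSQ */
			schedule_reenq_local(dsq);	/* IMMED slow path */
	}

With a single scx_bpf_dsq_move_to_local() per dispatch, dsq->nr never
gets past 1, which is why the test needs consecutive moves.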

* Re: [PATCH 3/3] selftests/sched_ext: Fix consume_immed test reliability
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
@ 2026-03-26  7:06 ` Andrea Righi
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Righi @ 2026-03-26  7:06 UTC (permalink / raw)
To: zhidao su
Cc: sched-ext, linux-kernel, tj, void, changwoo, peterz, mingo, zhidao su

Hi zhidao,

On Thu, Mar 26, 2026 at 10:28:27AM +0800, zhidao su wrote:
> The consume_immed test was failing because nr_consume_immed_reenq
> stayed 0. Two issues:
[...]
> @@ -55,10 +55,23 @@ void BPF_STRUCT_OPS(consume_immed_enqueue, struct task_struct *p,

We should define a custom ops.select_cpu() to make sure all tasks are
bounced to ops.enqueue(), in order to have more inserts into USER_DSQ,
something like:

s32 BPF_STRUCT_OPS(consume_immed_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	return prev_cpu;
}

> 
>  void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
>  {
> -	if (cpu == 0)
> -		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
> -	else
> +	int i;
> +
> +	if (cpu != 0) {
>  		scx_bpf_dsq_move_to_local(SCX_DSQ_GLOBAL, 0);

Hm.. this should trigger an error, you can't use
scx_bpf_dsq_move_to_local() with SCX_DSQ_GLOBAL.

> +		return;
> +	}
> +
[...]
> +	for (i = 0; i < 4; i++) {
> +		if (!scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED))
> +			break;
> +	}

Why 4? Two consecutive scx_bpf_dsq_move_to_local() should be enough,
right?

> }
> 
> s32 BPF_STRUCT_OPS_SLEEPABLE(consume_immed_init)

We don't see this from the context, but to check if SCX_ENQ_IMMED is
available can we just check if it's != 0, instead of checking the
_scx_bpf_dsq_move_to_local___v2 symbol? Something like:

	if (SCX_ENQ_IMMED == 0) {
		scx_bpf_error("SCX_ENQ_IMMED not available");
		return -EOPNOTSUPP;
	}

And remove the same check in consume_immed.c, because we're already
checking it in the BPF part.

Thanks,
-Andrea
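
Folding both review points in, the CPU 0 dispatch path would reduce to
something like this (untested sketch; how tasks on the remaining CPUs
get dispatched once the SCX_DSQ_GLOBAL move is dropped is left open):

	void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
	{
		if (cpu != 0)
			return;

		/*
		 * Two consecutive moves: the second finds dsq->nr == 1 left
		 * by the first and bumps it to 2, which is all the IMMED
		 * slow path needs.
		 */
		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
	}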

* Re: [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
@ 2026-03-26  2:45 ` Tejun Heo
  2026-03-26  5:13 ` zhidao su
  2 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-03-26  2:45 UTC (permalink / raw)
To: zhidao su
Cc: sched-ext, linux-kernel, void, arighi, changwoo, peterz, mingo, zhidao su

On Thu, Mar 26, 2026 at 10:28:25AM +0800, zhidao su wrote:
> The reload_loop selftest triggers a KASAN null-ptr-deref at
> scx_claim_exit+0x83 when two threads concurrently attach and
> destroy BPF schedulers using the same ops map.
[...]
> Fix by adding a NULL guard for the case where ops->priv was
> never set, and by acquiring scx_enable_mutex before clearing
> ops->priv so that the check-and-clear is atomic with respect
> to reg() which also sets ops->priv under scx_enable_mutex.

Can you reproduce this? How do you trigger enable on the same ops that
has already been enabled?

Thanks.

-- 
tejun

* Re: [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
@ 2026-03-26  5:13 ` zhidao su
  0 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  5:13 UTC (permalink / raw)
To: tj
Cc: sched-ext, linux-kernel, void, arighi, changwoo, peterz, mingo, suzhidao

On Thu, Mar 26, 2026 at 10:55:08AM +0800, Tejun Heo wrote:
> On Thu, Mar 26, 2026 at 10:28:25AM +0800, zhidao su wrote:
> > The reload_loop selftest triggers a KASAN null-ptr-deref at
> > scx_claim_exit+0x83 when two threads concurrently attach and
> > destroy BPF schedulers using the same ops map.
> ...
> Can you reproduce this? How do you trigger enable on the same ops that
> has already been enabled?

I investigated further and the analysis in patch 1/3 was wrong. Please
do not merge patches 1/3 and 2/3 from this series. Patch 3/3 is still
valid and can be applied independently.

Patch 1/3 (NULL deref): The race described in the commit message cannot
occur. Both bpf_struct_ops_link_create() and
bpf_struct_ops_map_link_detach() hold update_mutex when calling
reg()/unreg(), so concurrent reg and unreg on the same ops map are
serialized. I ran 20 rounds of reload_loop under KASAN with no crashes.
The bug was never real. I wrote the patch based on code analysis alone
without first obtaining KASAN output confirming the crash. That was
wrong.

Patch 2/3 (dsq_reenq reliability): I cannot reproduce a failure with
the original test (NUM_WORKERS=4). Running 10 iterations locally, all
pass. The "reliability fix" had no verified failure to fix and should
not be merged.

Patch 3/3 (consume_immed reliability): The original test fails
consistently because ops.dispatch() moves only one task per call, so
dsq->nr never exceeds 1 and the IMMED slow path in dsq_inc_nr() is
never triggered. This is a real bug in the test. Patch 3/3 is valid
and can be considered independently.

Sorry for the noise.

zhidao
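
The serialization described above, in outline (simplified from the
message text; not the exact kernel source):

	/*
	 * bpf_struct_ops link attach and detach both serialize on
	 * update_mutex, so reg()/unreg() on one ops map cannot race.
	 */
	mutex_lock(&update_mutex);
	err = st_ops->reg(kdata, link);		/* link create path */
	mutex_unlock(&update_mutex);

	mutex_lock(&update_mutex);
	st_ops->unreg(kdata, link);		/* link detach path */
	mutex_unlock(&update_mutex);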