* [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
@ 2026-03-26  2:28 zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
                    ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The reload_loop selftest triggers a KASAN null-ptr-deref at
scx_claim_exit+0x83 when two threads concurrently attach and
destroy BPF schedulers using the same ops map.

The race occurs between bpf_scx_unreg() and a concurrent reg():

1. Thread A's bpf_scx_unreg() calls scx_disable() then
   kthread_flush_work(), which blocks until disable completes
   and transitions state back to SCX_DISABLED.

2. With state SCX_DISABLED, a concurrent reg() allocates a
   new sch_B and sets ops->priv = sch_B under scx_enable_mutex.

3. Thread A's bpf_scx_unreg() then executes
   RCU_INIT_POINTER(ops->priv, NULL), overwriting sch_B.

4. When Thread B's link is destroyed, bpf_scx_unreg() reads
   ops->priv == NULL and passes it to scx_disable(), which
   calls scx_claim_exit(NULL), crashing at NULL+0x310.
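
As a timeline (everything except the ops->priv accesses elided):

  Thread A (unreg, owns sch_A)        Thread B (reg, creates sch_B)
  ----------------------------        -----------------------------
  sch = ops->priv;     /* sch_A */
  scx_disable(sch);
  kthread_flush_work(...);
    /* state -> SCX_DISABLED */
                                      mutex_lock(&scx_enable_mutex);
                                      ops->priv = sch_B;
                                      mutex_unlock(&scx_enable_mutex);
  RCU_INIT_POINTER(ops->priv, NULL);  /* sch_B silently dropped */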

Fix by adding a NULL guard for the case where ops->priv was
never set, and by acquiring scx_enable_mutex before clearing
ops->priv, so that the check-and-clear is atomic with respect
to reg(), which also sets ops->priv under scx_enable_mutex.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/ext.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 551bfb99157d..01077cc2eb62 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7372,9 +7372,22 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
 	struct sched_ext_ops *ops = kdata;
 	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
 
+	if (!sch)
+		return;
+
 	scx_disable(sch, SCX_EXIT_UNREG);
 	kthread_flush_work(&sch->disable_work);
-	RCU_INIT_POINTER(ops->priv, NULL);
+
+	/*
+	 * A concurrent reg() may have already installed a new scheduler into
+	 * ops->priv by the time disable completes. Clear ops->priv only if it
+	 * still holds our sch.
+	 */
+	mutex_lock(&scx_enable_mutex);
+	if (rcu_access_pointer(ops->priv) == sch)
+		RCU_INIT_POINTER(ops->priv, NULL);
+	mutex_unlock(&scx_enable_mutex);
+
 	kobject_put(&sch->kobj);
 }
 
--
2.43.0

* [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
@ 2026-03-26  2:28 ` zhidao su
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
  2 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The dsq_reenq test was using NUM_WORKERS=4 on a 4-CPU system, so
each worker ran on its own CPU. Workers sleep 500us and yield,
getting dispatched from USER_DSQ almost immediately. By the time
the BPF timer's deferred scx_bpf_dsq_reenq() fires, USER_DSQ is
empty and no reenqueue event is observed (nr_reenq_kfunc stays 0).

Fix by increasing NUM_WORKERS to 16 (>4 CPUs). With more runnable
tasks than CPUs, USER_DSQ always has a backlog when the timer
fires, so scx_bpf_dsq_reenq() reliably finds tasks to reenqueue.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 tools/testing/selftests/sched_ext/dsq_reenq.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/dsq_reenq.c b/tools/testing/selftests/sched_ext/dsq_reenq.c
index b0d99f9c9a9a..fe18c6af3877 100644
--- a/tools/testing/selftests/sched_ext/dsq_reenq.c
+++ b/tools/testing/selftests/sched_ext/dsq_reenq.c
@@ -15,7 +15,12 @@
 #include "dsq_reenq.bpf.skel.h"
 #include "scx_test.h"
 
-#define NUM_WORKERS 4
+/*
+ * Use more workers than CPUs so USER_DSQ is always backlogged.
+ * With tasks queued faster than dispatch can consume them, the
+ * periodic timer will always find tasks to reenqueue.
+ */
+#define NUM_WORKERS 16
 #define TEST_DURATION_SEC 3
 
 static volatile bool stop_workers;
-- 
2.43.0
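
The worker behavior the commit message relies on ("sleep 500us and
yield") is roughly the pattern below. worker_fn itself is outside the
hunk, so this is a sketch of the described behavior rather than the
file's exact code:

	static void *worker_fn(void *arg)
	{
		while (!stop_workers) {
			usleep(500);	/* sleep 500us ... */
			sched_yield();	/* ... then yield, cycling back through USER_DSQ */
		}
		return NULL;
	}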

* [PATCH 3/3] selftests/sched_ext: Fix consume_immed test reliability
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
@ 2026-03-26  2:28 ` zhidao su
  2026-03-26  7:06 ` Andrea Righi
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
  2 siblings, 1 reply; 6+ messages in thread
From: zhidao su @ 2026-03-26  2:28 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, tj, void, arighi, changwoo, peterz, mingo, zhidao su

The consume_immed test was failing because nr_consume_immed_reenq
stayed 0. Two issues:

1. Workers were spread across CPUs, so CPU 0's local DSQ rarely
   accumulated multiple tasks. Fix: pin all workers to CPU 0 and
   increase NUM_WORKERS to 8 to ensure USER_DSQ is always backlogged.

2. ops.dispatch() called scx_bpf_dsq_move_to_local() only once, so
   CPU 0's local DSQ would contain exactly 1 task after dispatch.
   The IMMED slow path requires dsq->nr > 1 at the time of insertion
   (in dsq_inc_nr()), so a single dispatch call never triggers it.

   Fix: call scx_bpf_dsq_move_to_local() in a loop (up to 4 times)
   within a single ops.dispatch() invocation. The second call finds
   dsq->nr already 1, so dsq->nr increments to 2, triggering
   schedule_reenq_local() and the IMMED slow path.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 .../selftests/sched_ext/consume_immed.bpf.c | 25 ++++++++++++++-----
 .../selftests/sched_ext/consume_immed.c     | 14 ++++++++++-
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/sched_ext/consume_immed.bpf.c b/tools/testing/selftests/sched_ext/consume_immed.bpf.c
index 9c7808f5abe1..e99bea0b2c24 100644
--- a/tools/testing/selftests/sched_ext/consume_immed.bpf.c
+++ b/tools/testing/selftests/sched_ext/consume_immed.bpf.c
@@ -11,9 +11,9 @@
  * explicit SCX_ENQ_IMMED in enq_flags (requires v2 kfunc)
  *
  * Worker threads belonging to test_tgid are inserted into USER_DSQ.
- * ops.dispatch() on CPU 0 consumes from USER_DSQ with SCX_ENQ_IMMED.
- * With multiple workers competing for CPU 0, dsq->nr > 1 triggers the
- * IMMED slow path (reenqueue with SCX_TASK_REENQ_IMMED).
+ * ops.dispatch() on CPU 0 consumes multiple tasks from USER_DSQ with
+ * SCX_ENQ_IMMED in a single dispatch call, causing dsq->nr to exceed 1
+ * and triggering the IMMED slow path (reenqueue with SCX_TASK_REENQ_IMMED).
  *
  * Requires scx_bpf_dsq_move_to_local___v2() (v7.1+) for enq_flags support.
  */
@@ -55,10 +55,23 @@ void BPF_STRUCT_OPS(consume_immed_enqueue, struct task_struct *p,
 
 void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
 {
-	if (cpu == 0)
-		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
-	else
+	int i;
+
+	if (cpu != 0) {
 		scx_bpf_dsq_move_to_local(SCX_DSQ_GLOBAL, 0);
+		return;
+	}
+
+	/*
+	 * Move multiple tasks into CPU 0's local DSQ with SCX_ENQ_IMMED in a
+	 * single dispatch call. When the second task is inserted (dsq->nr > 1),
+	 * dsq_inc_nr() triggers the IMMED slow path via schedule_reenq_local(),
+	 * which calls ops.enqueue() with SCX_ENQ_REENQ | SCX_TASK_REENQ_IMMED.
+	 */
+	for (i = 0; i < 4; i++) {
+		if (!scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED))
+			break;
+	}
 }
 
 s32 BPF_STRUCT_OPS_SLEEPABLE(consume_immed_init)

diff --git a/tools/testing/selftests/sched_ext/consume_immed.c b/tools/testing/selftests/sched_ext/consume_immed.c
index 7f9594cfa9cb..61cbc0fe3663 100644
--- a/tools/testing/selftests/sched_ext/consume_immed.c
+++ b/tools/testing/selftests/sched_ext/consume_immed.c
@@ -12,18 +12,30 @@
 #include <stdio.h>
 #include <unistd.h>
 #include <pthread.h>
+#include <sched.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "consume_immed.bpf.skel.h"
 #include "scx_test.h"
 
-#define NUM_WORKERS 4
+/*
+ * Use more workers than CPUs, all pinned to CPU 0, so CPU 0's local DSQ
+ * accumulates multiple IMMED tasks at once, reliably triggering the slow path.
+ */
+#define NUM_WORKERS 8
 #define TEST_DURATION_SEC 3
 
 static volatile bool stop_workers;
 
 static void *worker_fn(void *arg)
 {
+	cpu_set_t cpuset;
+
+	/* Pin to CPU 0 to saturate its local DSQ */
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
 	while (!stop_workers) {
 		volatile unsigned long i;
 
-- 
2.43.0
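
The dsq->nr > 1 condition the commit message points at reduces,
conceptually, to the shape below. This is a sketch of the described
behavior, not the actual kernel implementation of dsq_inc_nr():

	static void dsq_inc_nr(struct scx_dispatch_q *dsq)
	{
		dsq->nr++;
		if (dsq->nr > 1)	/* second insertion into this DSQ */
			schedule_reenq_local(dsq);	/* IMMED slow path */
	}

With a single scx_bpf_dsq_move_to_local() per dispatch, dsq->nr never
gets past 1, which is why the test needs consecutive moves.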

* Re: [PATCH 3/3] selftests/sched_ext: Fix consume_immed test reliability
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
@ 2026-03-26  7:06 ` Andrea Righi
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Righi @ 2026-03-26  7:06 UTC (permalink / raw)
To: zhidao su
Cc: sched-ext, linux-kernel, tj, void, changwoo, peterz, mingo, zhidao su

Hi zhidao,

On Thu, Mar 26, 2026 at 10:28:27AM +0800, zhidao su wrote:
> The consume_immed test was failing because nr_consume_immed_reenq
> stayed 0. Two issues:
[...]
> @@ -55,10 +55,23 @@ void BPF_STRUCT_OPS(consume_immed_enqueue, struct task_struct *p,

We should define a custom ops.select_cpu() to make sure all tasks are
bounced to ops.enqueue(), in order to have more inserts into USER_DSQ,
something like:

s32 BPF_STRUCT_OPS(consume_immed_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	return prev_cpu;
}

> 
>  void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
>  {
> -	if (cpu == 0)
> -		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
> -	else
> +	int i;
> +
> +	if (cpu != 0) {
>  		scx_bpf_dsq_move_to_local(SCX_DSQ_GLOBAL, 0);

Hm.. this should trigger an error, you can't use
scx_bpf_dsq_move_to_local() with SCX_DSQ_GLOBAL.

> +		return;
> +	}
> +
[...]
> +	for (i = 0; i < 4; i++) {
> +		if (!scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED))
> +			break;
> +	}

Why 4? Two consecutive scx_bpf_dsq_move_to_local() should be enough,
right?

> }
> 
> s32 BPF_STRUCT_OPS_SLEEPABLE(consume_immed_init)

We don't see this from the context, but to check if SCX_ENQ_IMMED is
available can we just check if it's != 0, instead of checking the
_scx_bpf_dsq_move_to_local___v2 symbol? Something like:

	if (SCX_ENQ_IMMED == 0) {
		scx_bpf_error("SCX_ENQ_IMMED not available");
		return -EOPNOTSUPP;
	}

And remove the same check in consume_immed.c, because we're already
checking it in the BPF part.

Thanks,
-Andrea
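
Folding both review points in, the CPU 0 dispatch path would reduce to
something like this (untested sketch; how tasks on the remaining CPUs
get dispatched once the SCX_DSQ_GLOBAL move is dropped is left open):

	void BPF_STRUCT_OPS(consume_immed_dispatch, s32 cpu, struct task_struct *prev)
	{
		if (cpu != 0)
			return;

		/*
		 * Two consecutive moves: the second finds dsq->nr == 1 left
		 * by the first and bumps it to 2, which is all the IMMED
		 * slow path needs.
		 */
		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
		scx_bpf_dsq_move_to_local(USER_DSQ, SCX_ENQ_IMMED);
	}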

* Re: [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
  2026-03-26  2:28 [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race zhidao su
  2026-03-26  2:28 ` [PATCH 2/3] selftests/sched_ext: Fix dsq_reenq test reliability zhidao su
  2026-03-26  2:28 ` [PATCH 3/3] selftests/sched_ext: Fix consume_immed " zhidao su
@ 2026-03-26  2:45 ` Tejun Heo
  2026-03-26  5:13 ` zhidao su
  2 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-03-26  2:45 UTC (permalink / raw)
To: zhidao su
Cc: sched-ext, linux-kernel, void, arighi, changwoo, peterz, mingo, zhidao su

On Thu, Mar 26, 2026 at 10:28:25AM +0800, zhidao su wrote:
> The reload_loop selftest triggers a KASAN null-ptr-deref at
> scx_claim_exit+0x83 when two threads concurrently attach and
> destroy BPF schedulers using the same ops map.
[...]
> Fix by adding a NULL guard for the case where ops->priv was
> never set, and by acquiring scx_enable_mutex before clearing
> ops->priv so that the check-and-clear is atomic with respect
> to reg() which also sets ops->priv under scx_enable_mutex.

Can you reproduce this? How do you trigger enable on the same ops that
has already been enabled?

Thanks.

-- 
tejun

* Re: [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race
  2026-03-26  2:45 ` [PATCH 1/3] sched_ext: fix NULL deref in bpf_scx_unreg() due to ops->priv race Tejun Heo
@ 2026-03-26  5:13 ` zhidao su
  0 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-03-26  5:13 UTC (permalink / raw)
To: tj
Cc: sched-ext, linux-kernel, void, arighi, changwoo, peterz, mingo, suzhidao

On Thu, Mar 26, 2026 at 10:55:08AM +0800, Tejun Heo wrote:
> On Thu, Mar 26, 2026 at 10:28:25AM +0800, zhidao su wrote:
> > The reload_loop selftest triggers a KASAN null-ptr-deref at
> > scx_claim_exit+0x83 when two threads concurrently attach and
> > destroy BPF schedulers using the same ops map.
> ...
> Can you reproduce this? How do you trigger enable on the same ops that
> has already been enabled?

I investigated further and the analysis in patch 1/3 was wrong. Please
do not merge patches 1/3 and 2/3 from this series. Patch 3/3 is still
valid and can be applied independently.

Patch 1/3 (NULL deref): The race described in the commit message cannot
occur. Both bpf_struct_ops_link_create() and
bpf_struct_ops_map_link_detach() hold update_mutex when calling
reg()/unreg(), so concurrent reg and unreg on the same ops map are
serialized. I ran 20 rounds of reload_loop under KASAN with no crashes.
The bug was never real. I wrote the patch based on code analysis alone
without first obtaining KASAN output confirming the crash. That was
wrong.

Patch 2/3 (dsq_reenq reliability): I cannot reproduce a failure with
the original test (NUM_WORKERS=4). Running 10 iterations locally, all
pass. The "reliability fix" had no verified failure to fix and should
not be merged.

Patch 3/3 (consume_immed reliability): The original test fails
consistently because ops.dispatch() moves only one task per call, so
dsq->nr never exceeds 1 and the IMMED slow path in dsq_inc_nr() is
never triggered. This is a real bug in the test. Patch 3/3 is valid
and can be considered independently.

Sorry for the noise.

zhidao
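
The serialization described above, in outline (simplified from the
message text; not the exact kernel source):

	/*
	 * bpf_struct_ops link attach and detach both serialize on
	 * update_mutex, so reg()/unreg() on one ops map cannot race.
	 */
	mutex_lock(&update_mutex);
	err = st_ops->reg(kdata, link);		/* link create path */
	mutex_unlock(&update_mutex);

	mutex_lock(&update_mutex);
	st_ops->unreg(kdata, link);		/* link detach path */
	mutex_unlock(&update_mutex);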