Sched_ext development
* [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
@ 2026-04-29 13:31 zhidao su
  2026-05-04 20:31 ` Tejun Heo
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
  0 siblings, 2 replies; 6+ messages in thread
From: zhidao su @ 2026-04-29 13:31 UTC (permalink / raw)
  To: tj; +Cc: void, arighi, changwoo, sched-ext, linux-kernel, zhidao su

In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root.  scx_root_enable_workfn() initializes tasks in two steps:

  scx_init_task(sch, p, false)   /* state=INIT, p->scx.sched still NULL */
  scx_set_task_sched(p, sch)     /* p->scx.sched = sch                  */

Between these two steps, a concurrent sched_ext_dead() can call
scx_task_sched(p), get NULL, and pass it to scx_disable_and_exit_task()
which crashes in SCX_HAS_OP(NULL, ...).
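
Sketch of one losing interleaving:

  enable workfn                        sched_ext_dead()
  -------------                        ----------------
  scx_init_task(sch, p, false)
    ops.init_task()
    state = INIT
                                       scx_task_sched(p) -> NULL
                                       scx_disable_and_exit_task(NULL, p)
                                       /* crash in SCX_HAS_OP(NULL, ...) */
  scx_set_task_sched(p, sch)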

In sched_ext_dead(), skip scx_disable_and_exit_task() when state=INIT and
p->scx.sched is still NULL, and reset state to NONE.  In
scx_root_enable_workfn(), after scx_init_task() returns, skip
scx_set_task_sched() if sched_ext_dead() has already removed @p from
scx_tasks.

Fixes: 073d4f0667b0 ("sched_ext: Refactor task init/exit helpers")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
 kernel/sched/ext.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..1872af65b103 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3887,9 +3887,21 @@ void sched_ext_dead(struct task_struct *p)
 		struct rq_flags rf;
 		struct rq *rq;
 
-		rq = task_rq_lock(p, &rf);
-		scx_disable_and_exit_task(scx_task_sched(p), p);
-		task_rq_unlock(rq, p, &rf);
+		/*
+		 * scx_root_enable_workfn() may be concurrently initializing @p.
+		 * scx_init_task() sets state=INIT before scx_set_task_sched()
+		 * sets p->scx.sched.  If we race that window, p->scx.sched is
+		 * still NULL; skip scx_disable_and_exit_task() and reset state
+		 * to NONE so @p leaves SCX cleanly.
+		 */
+		if (scx_get_task_state(p) == SCX_TASK_INIT &&
+		    !rcu_access_pointer(p->scx.sched)) {
+			scx_set_task_state(p, SCX_TASK_NONE);
+		} else {
+			rq = task_rq_lock(p, &rf);
+			scx_disable_and_exit_task(scx_task_sched(p), p);
+			task_rq_unlock(rq, p, &rf);
+		}
 	}
 }
 
@@ -6937,6 +6949,17 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
+		/*
+		 * sched_ext_dead() may have already cleaned up @p while locks
+		 * were dropped in scx_task_iter_unlock(); skip it.
+		 */
+		scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+			if (list_empty(&p->scx.tasks_node)) {
+				put_task_struct(p);
+				continue;
+			}
+		}
+
 		scx_set_task_sched(p, sch);
 		scx_set_task_state(p, SCX_TASK_READY);
 
-- 
2.43.0



* Re: [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
@ 2026-05-04 20:31 ` Tejun Heo
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-04 20:31 UTC (permalink / raw)
  To: zhidao su
  Cc: zhidao su, David Vernet, Andrea Righi, Changwoo Min, sched-ext,
	linux-kernel

Hello,

The race seems real, thanks for catching it, but I'm not sure the
reader-side fix is the right shape. The new branch in sched_ext_dead()
resets state to NONE without a matching ops.exit_task(cancelled=true),
leaking what ops.init_task() set up; and the list_empty() gate sits
before scx_set_task_sched(), so a sched_ext_dead() that races after
sch is installed but before state goes READY would still flip state
to NONE under us.

Worth exploring on the writer side instead: reorder so p->scx.sched
is installed before state transitions off NONE. That restores the
"state != NONE -> p->scx.sched != NULL" invariant and the existing
sched_ext_dead() handles the rest. I haven't fully traced this
through - there may still be a residual window between INIT and the
workfn's READY write - but it seems like a more promising direction.
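
IOW, something like this on the init side (untested sketch, exact
placement TBD):

	scx_set_task_sched(p, sch);		/* install pointer first */
	scx_set_task_state(p, SCX_TASK_INIT);	/* only then leave NONE */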

Thanks.

--
tejun


* [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
  2026-05-04 20:31 ` Tejun Heo
@ 2026-05-06  5:40 ` zhidao su
  2026-05-06  6:16   ` sashiko-bot
  2026-05-10 13:55   ` Tejun Heo
  1 sibling, 2 replies; 6+ messages in thread
From: zhidao su @ 2026-05-06  5:40 UTC (permalink / raw)
  To: tj; +Cc: void, arighi, changwoo, sched-ext, linux-kernel

In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root.  scx_root_enable_workfn() iterates all tasks and, for each,
releases scx_tasks_lock via scx_task_iter_unlock() before calling
scx_init_task().  A concurrent sched_ext_dead() can race in this window.

Two bugs:

1. NULL deref: If sched_ext_dead() runs after scx_init_task() sets
   state=INIT but before the callsite sets p->scx.sched, the invariant
   "state != NONE => p->scx.sched != NULL" is broken.  sched_ext_dead()
   calls scx_disable_and_exit_task(scx_task_sched(p)=NULL, p), which
   crashes in SCX_HAS_OP(NULL, ...).

2. Resource leak: If sched_ext_dead() runs before scx_init_task() when
   state=NONE, it skips scx_disable_and_exit_task() (state check
   fails).  scx_init_task() then calls ops.init_task() and sets
   state=INIT.  The enable loop never calls ops.exit_task(), leaking
   whatever ops.init_task() allocated.
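
Bug 2 as an interleaving sketch:

  enable workfn                        sched_ext_dead()
  -------------                        ----------------
                                       state == NONE, skips
                                       scx_disable_and_exit_task()
  scx_init_task(sch, p, false)
    ops.init_task()
    state = INIT
  /* nothing ever calls ops.exit_task() for @p */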

Fix both:

- Move scx_set_task_sched(p, sch) into scx_init_task(), before the
  state transition off NONE.  This restores the invariant so
  sched_ext_dead() always finds a valid scheduler pointer (fixes
  bug 1).

- After scx_init_task() returns, check under scx_tasks_lock whether
  @p is still on scx_tasks.  If not, sched_ext_dead() raced us.
  If state != NONE, sched_ext_dead() saw state=NONE before
  ops.init_task() ran, so call scx_disable_and_exit_task() with
  cancelled=true to release the resources (fixes bug 2).  If
  state=NONE, sched_ext_dead() already cleaned up.

Fixes: 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
v2: Rewrite as writer-side fix per Tejun's review:
 - Move scx_set_task_sched(p, sch) into scx_init_task() before the state
   transition off NONE, restoring the "state!=NONE => p->scx.sched!=NULL"
   invariant.  Bug 1 (NULL deref) is fixed without touching sched_ext_dead().
 - Handle bug 2 (resource leak) in the workfn's list_empty() path by
   calling scx_disable_and_exit_task() when state!=NONE, instead of the
   v1 reader-side branch in sched_ext_dead() that leaked resources.
 - Update Fixes: to 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
   which is when scx_task_sched(p) started dereferencing p->scx.sched.


 kernel/sched/ext.c | 59 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..99560f77af81 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3583,7 +3583,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
 		/*
 		 * While @p's rq is not locked. @p is not visible to the rest of
 		 * SCX yet and it's safe to update the flags and state.
+		 *
+		 * Install p->scx.sched before transitioning state off NONE so
+		 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
+		 * soon as state becomes observable.  A concurrent sched_ext_dead()
+		 * that races the INIT window will then always find a valid
+		 * scheduler pointer and can call scx_disable_and_exit_task()
+		 * to release resources allocated by ops.init_task().
 		 */
+		scx_set_task_sched(p, sch);
 		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		scx_set_task_state(p, SCX_TASK_INIT);
 	}
@@ -3769,8 +3777,6 @@ void scx_pre_fork(struct task_struct *p)
 
 int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
-	s32 ret;
-
 	percpu_rwsem_assert_held(&scx_fork_rwsem);
 
 	p->scx.tid = scx_alloc_tid();
@@ -3781,10 +3787,7 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
-		ret = scx_init_task(sch, p, true);
-		if (!ret)
-			scx_set_task_sched(p, sch);
-		return ret;
+		return scx_init_task(sch, p, true);
 	}
 
 	return 0;
@@ -6937,7 +6940,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
-		scx_set_task_sched(p, sch);
+		/*
+		 * sched_ext_dead() may have raced while locks were dropped in
+		 * scx_task_iter_unlock().  Two cases:
+		 *
+		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
+		 *     it called scx_disable_and_exit_task() (cancelled=true) and
+		 *     reset state to NONE.  ops.exit_task() already ran; skip.
+		 *
+		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
+		 *     the time): it skipped scx_disable_and_exit_task() because
+		 *     state was NONE.  scx_init_task() subsequently called
+		 *     ops.init_task() and set state=INIT, leaving allocated
+		 *     resources with no owner.  We must call
+		 *     scx_disable_and_exit_task() here to release them.
+		 *
+		 * Distinguish case (a) from (b) by reading state: (a) leaves
+		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
+		 * state=INIT (set by scx_init_task, never reset).
+		 */
+		{
+			bool p_dead = false, need_exit = false;
+
+			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+				if (list_empty(&p->scx.tasks_node)) {
+					p_dead = true;
+					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
+				}
+			}
+
+			if (p_dead) {
+				if (need_exit) {
+					struct rq_flags rf;
+					struct rq *rq;
+
+					rq = task_rq_lock(p, &rf);
+					scx_disable_and_exit_task(sch, p);
+					task_rq_unlock(rq, p, &rf);
+				}
+				put_task_struct(p);
+				continue;
+			}
+		}
+
 		scx_set_task_state(p, SCX_TASK_READY);
 
 		/*
-- 
2.43.0



* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
@ 2026-05-06  6:16   ` sashiko-bot
  2026-05-07  2:32     ` zhidao su
  2026-05-10 13:55   ` Tejun Heo
  1 sibling, 1 reply; 6+ messages in thread
From: sashiko-bot @ 2026-05-06  6:16 UTC (permalink / raw)
  To: zhidao su; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU) data race against `sched_ext_dead()`.
- [High] A memory ordering vulnerability defeats the bug 1 fix on weakly ordered architectures, allowing the NULL pointer dereference to still occur.
- [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the `scx_root_enable_workfn()` cleanup path exposes dead tasks to cgroup iterators, causing sub-scheduler enables to abort.
--

commit 1d291365affa3415c03f2031e9198ebe7208a245
Author: zhidao su <soolaugust@gmail.com>

sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root. scx_root_enable_workfn() iterates all tasks and for each
releases scx_tasks_lock via scx_task_iter_unlock() before calling
scx_init_task().

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5600618dcef8e..98505577d56ca 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3690,7 +3690,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
>  		/*
>  		 * While @p's rq is not locked. @p is not visible to the rest of
>  		 * SCX yet and it's safe to update the flags and state.
> +		 *
> +		 * Install p->scx.sched before transitioning state off NONE so
> +		 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
> +		 * soon as state becomes observable.  A concurrent sched_ext_dead()
> +		 * that races the INIT window will then always find a valid
> +		 * scheduler pointer and can call scx_disable_and_exit_task()
> +		 * to release resources allocated by ops.init_task().
>  		 */
> +		scx_set_task_sched(p, sch);
>  		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
>  		scx_set_task_state(p, SCX_TASK_INIT);
>  	}

Can this result in a memory ordering vulnerability on weakly ordered
architectures?

In scx_init_task(), scx_set_task_sched() uses rcu_assign_pointer(), which
relies on smp_store_release(). While smp_store_release() guarantees prior
operations are visible, it does not prevent subsequent memory operations
from being reordered before the release.

Could the CPU commit the store updating p->scx.flags to SCX_TASK_INIT
before the store to p->scx.sched is completed? If a concurrent
sched_ext_dead() observes state == SCX_TASK_INIT, branches into the cleanup
path, and reads a NULL scheduler pointer, it would pass it to
scx_disable_and_exit_task() which crashes in SCX_HAS_OP().

Does this require a full write memory barrier (e.g., smp_wmb()) between
setting the scheduler pointer and transitioning the state?
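
For concreteness - and only as a sketch, since whether any barrier is
needed depends on the intended ordering model for p->scx.flags - the
placement in question would be:

	scx_set_task_sched(p, sch);
	smp_wmb();	/* order the sched store before the state store */
	scx_set_task_state(p, SCX_TASK_INIT);

paired with an smp_rmb() between the state read and the p->scx.sched
read on the sched_ext_dead() side.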

[ ... ]

> @@ -7130,7 +7133,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  			goto err_disable_unlock_all;
>  		}
>  
> -		scx_set_task_sched(p, sch);
> +		/*
> +		 * sched_ext_dead() may have raced while locks were dropped in
> +		 * scx_task_iter_unlock().  Two cases:
> +		 *
> +		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
> +		 *     it called scx_disable_and_exit_task() (cancelled=true) and
> +		 *     reset state to NONE.  ops.exit_task() already ran; skip.
> +		 *
> +		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
> +		 *     the time): it skipped scx_disable_and_exit_task() because
> +		 *     state was NONE.  scx_init_task() subsequently called
> +		 *     ops.init_task() and set state=INIT, leaving allocated
> +		 *     resources with no owner.  We must call
> +		 *     scx_disable_and_exit_task() here to release them.
> +		 *
> +		 * Distinguish case (a) from (b) by reading state: (a) leaves
> +		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
> +		 * state=INIT (set by scx_init_task, never reset).
> +		 */
> +		{
> +			bool p_dead = false, need_exit = false;
> +
> +			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
> +				if (list_empty(&p->scx.tasks_node)) {
> +					p_dead = true;
> +					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
> +				}
> +			}
> +
> +			if (p_dead) {
> +				if (need_exit) {
> +					struct rq_flags rf;
> +					struct rq *rq;
> +
> +					rq = task_rq_lock(p, &rf);
> +					scx_disable_and_exit_task(sch, p);
> +					task_rq_unlock(rq, p, &rf);

Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

If sched_ext_dead() runs before scx_init_task() completes, it sees state
NONE and skips setting SCX_TASK_OFF_TASKS. When this cleanup path calls
scx_disable_and_exit_task() under task_rq_lock(), the SCX_TASK_OFF_TASKS
flag is not set before the runqueue is unlocked.

Because the flag is missing, if a cgroup iterator traverses the cgroup
before cgroup_task_dead() removes the task, could the iterator erroneously
yield the dead task with state NONE, hit assert_task_ready_or_enabled(p),
and abort the sub-scheduler enable?

> +				}
> +				put_task_struct(p);
> +				continue;
> +			}
> +		}
> +
>  		scx_set_task_state(p, SCX_TASK_READY);

Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
sched_ext_dead()?

The code checks list_empty(&p->scx.tasks_node) under scx_tasks_lock to
determine if a concurrent sched_ext_dead() has started processing.

However, once scx_tasks_lock is dropped, sched_ext_dead() can acquire it,
remove the task, and proceed to exit the task, eventually doing
p->scx.flags |= SCX_TASK_OFF_TASKS under task_rq_lock(p).

Meanwhile, scx_root_enable_workfn() continues past the lock and
unconditionally calls scx_set_task_state(p, SCX_TASK_READY) without holding
task_rq_lock(p). Because scx_set_task_state() performs a non-atomic
read-modify-write on p->scx.flags, could this overwrite and permanently
lose the SCX_TASK_OFF_TASKS flag?

Also, could this incorrectly transition a dead task to SCX_TASK_READY after
its scheduler resources were torn down, triggering a warning?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260506054001.1105522-1-suzhidao@xiaomi.com?part=1


* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  6:16   ` sashiko-bot
@ 2026-05-07  2:32     ` zhidao su
  0 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-05-07  2:32 UTC (permalink / raw)
  To: sashiko; +Cc: sched-ext, Tejun Heo

On Wed, May 06 2026, sashiko-bot@kernel.org wrote:

> - [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding
>  `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU)
>  data race against `sched_ext_dead()`.
> [...]
> Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
> sched_ext_dead()?

When p_dead is true, the code does `continue`, so scx_set_task_state(p,
SCX_TASK_READY) is never reached for dead tasks.  The TOCTOU window
doesn't exist.
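
I.e. the relevant control flow in the v2 workfn is:

	if (p_dead) {
		/* ... scx_disable_and_exit_task() if need_exit ... */
		put_task_struct(p);
		continue;	/* never falls through to the READY write */
	}
	scx_set_task_state(p, SCX_TASK_READY);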

> - [High] A memory ordering vulnerability defeats the bug 1 fix on weakly
>  ordered architectures, allowing the NULL pointer dereference to still occur.
> [...]
> Does this require a full write memory barrier (e.g., smp_wmb()) between
> setting the scheduler pointer and transitioning the state?

I'm not confident whether the store ordering between scx_set_task_sched()
and the state transition needs an explicit barrier here.

Tejun, what's the intended model for scx flags access in this path?

> - [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the scx_root_enable_workfn()
>  cleanup path exposes dead tasks to cgroup iterators.
> [...]
> Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

Not needed here. scx_task_iter_next_locked() checks SCX_TASK_OFF_TASKS
only for cgroup-scoped iteration, which is only reachable from
scx_sub_enable_workfn(). That function requires scx_enabled() at entry;
scx_root_enable_workfn() has not called scx_ops_enable() yet at this
point, so no sub-sched enable can be running concurrently.



* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
  2026-05-06  6:16   ` sashiko-bot
@ 2026-05-10 13:55   ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-10 13:55 UTC (permalink / raw)
  To: suzhidao; +Cc: void, arighi, changwoo, emil, sched-ext, linux-kernel, Tejun Heo

Hello,

Thanks for the report and the patches. The same race window also
affects the analogous sub-sched paths, and the wrapper-disable paths
trip on the NONE state that scx_fail_parent() leaves behind, so I
ended up taking a more invasive route - extending the task state
machine with SCX_TASK_INIT_BEGIN and SCX_TASK_DEAD - rather than
continuing with your localized fix.
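
As a rough sketch of the direction (names from the series; exact
semantics and ordering are in the patches):

	enum scx_task_state {
		SCX_TASK_NONE,
		SCX_TASK_INIT_BEGIN,	/* init started, sched may not be set yet */
		SCX_TASK_INIT,
		SCX_TASK_READY,
		SCX_TASK_ENABLED,
		SCX_TASK_DEAD,		/* sched_ext_dead() has run */
	};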

Posted as a 6-patch series:

  https://lore.kernel.org/all/20260510074113.2049514-1-tj@kernel.org/

Thanks.

--
tejun

