* [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
@ 2026-04-29 13:31 zhidao su
2026-05-04 20:31 ` Tejun Heo
2026-05-06 5:40 ` [PATCH v2] " zhidao su
0 siblings, 2 replies; 6+ messages in thread
From: zhidao su @ 2026-04-29 13:31 UTC (permalink / raw)
To: tj; +Cc: void, arighi, changwoo, sched-ext, linux-kernel, zhidao su
In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root. scx_root_enable_workfn() initializes tasks in two steps:

	scx_init_task(sch, p, false);	/* state=INIT, p->scx.sched still NULL */
	scx_set_task_sched(p, sch);	/* p->scx.sched = sch */

Between these two steps, a concurrent sched_ext_dead() can call
scx_task_sched(p), get NULL, and pass it to scx_disable_and_exit_task(),
which crashes in SCX_HAS_OP(NULL, ...).

In sched_ext_dead(), skip scx_disable_and_exit_task() when state=INIT and
p->scx.sched is still NULL, and reset state to NONE. In
scx_root_enable_workfn(), after scx_init_task() returns, skip
scx_set_task_sched() if sched_ext_dead() has already removed @p from
scx_tasks.
Fixes: 073d4f0667b0 ("sched_ext: Refactor task init/exit helpers")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
kernel/sched/ext.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..1872af65b103 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3887,9 +3887,21 @@ void sched_ext_dead(struct task_struct *p)
 		struct rq_flags rf;
 		struct rq *rq;
 
-		rq = task_rq_lock(p, &rf);
-		scx_disable_and_exit_task(scx_task_sched(p), p);
-		task_rq_unlock(rq, p, &rf);
+		/*
+		 * scx_root_enable_workfn() may be concurrently initializing @p.
+		 * scx_init_task() sets state=INIT before scx_set_task_sched()
+		 * sets p->scx.sched. If we race that window, p->scx.sched is
+		 * still NULL; skip scx_disable_and_exit_task() and reset state
+		 * to NONE so @p leaves SCX cleanly.
+		 */
+		if (scx_get_task_state(p) == SCX_TASK_INIT &&
+		    !rcu_access_pointer(p->scx.sched)) {
+			scx_set_task_state(p, SCX_TASK_NONE);
+		} else {
+			rq = task_rq_lock(p, &rf);
+			scx_disable_and_exit_task(scx_task_sched(p), p);
+			task_rq_unlock(rq, p, &rf);
+		}
 	}
 }
@@ -6937,6 +6949,17 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
+		/*
+		 * sched_ext_dead() may have already cleaned up @p while locks
+		 * were dropped in scx_task_iter_unlock(); skip it.
+		 */
+		scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+			if (list_empty(&p->scx.tasks_node)) {
+				put_task_struct(p);
+				continue;
+			}
+		}
+
 		scx_set_task_sched(p, sch);
 		scx_set_task_state(p, SCX_TASK_READY);
--
2.43.0
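The window this patch guards against, and the reader-side check it adds, can be modeled in a small standalone program. All names here (`fake_task`, `fake_init_task`, etc.) are illustrative stand-ins, not the kernel APIs; the model only reproduces the ordering of the two enable steps and the guard's decision.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone model of the race; names are illustrative, not kernel code. */
enum fake_state { FAKE_NONE, FAKE_INIT, FAKE_READY };

struct fake_task {
	enum fake_state state;
	void *sched;		/* stands in for p->scx.sched */
};

/* Enable path, step 1: state leaves NONE while sched is still NULL. */
static void fake_init_task(struct fake_task *p)
{
	p->state = FAKE_INIT;
}

/* Enable path, step 2: the scheduler pointer is installed afterwards. */
static void fake_set_task_sched(struct fake_task *p, void *sch)
{
	p->sched = sch;
}

/*
 * Dead path with the guard this patch adds: when the racy window
 * (state == INIT, sched == NULL) is observed, skip the exit call and
 * reset state to NONE; otherwise the pointer is safe to use.
 * Returns 1 when no NULL dereference would occur.
 */
static int fake_dead_is_safe(struct fake_task *p)
{
	if (p->state == FAKE_INIT && !p->sched) {
		p->state = FAKE_NONE;	/* leave SCX cleanly */
		return 1;
	}
	return p->sched != NULL;	/* NULL here is the v0 crash */
}
```

Without the guard, a dead-path call landing between the two enable steps hands a NULL scheduler pointer to the exit path; with it, the window degrades to a state reset.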
^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
@ 2026-05-04 20:31 ` Tejun Heo
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-04 20:31 UTC (permalink / raw)
  To: zhidao su
  Cc: zhidao su, David Vernet, Andrea Righi, Changwoo Min, sched-ext,
	linux-kernel

Hello,

The race seems real, thanks for catching it, but I'm not sure the
reader-side fix is the right shape.

The new branch in sched_ext_dead() resets state to NONE without a matching
ops.exit_task(cancelled=true), leaking what ops.init_task() set up; and
the list_empty() gate sits before scx_set_task_sched(), so a
sched_ext_dead() that races after sch is installed but before state goes
READY would still flip state to NONE under us.

Worth exploring on the writer side instead: reorder so p->scx.sched is
installed before state transitions off NONE. That restores the
"state != NONE -> p->scx.sched != NULL" invariant and the existing
sched_ext_dead() handles the rest.

I haven't fully traced this through - there may still be a residual window
between INIT and the workfn's READY write - but it seems like a more
promising direction.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread
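The writer-side reordering suggested in the review can be sketched in the same standalone style (illustrative names, not the kernel functions): installing the pointer before the state transition means any state other than NONE is accompanied by a non-NULL scheduler pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone sketch of the suggested reorder; names are illustrative. */
enum fake_state { FAKE_NONE, FAKE_INIT, FAKE_READY };

struct fake_task {
	enum fake_state state;
	void *sched;		/* stands in for p->scx.sched */
};

/* Writer side: install the pointer first, then leave NONE. */
static void fake_init_task_reordered(struct fake_task *p, void *sch)
{
	p->sched = sch;		/* pointer is set before... */
	p->state = FAKE_INIT;	/* ...state moves off NONE */
}

/* The invariant the reorder restores: state != NONE => sched != NULL. */
static int fake_invariant_holds(const struct fake_task *p)
{
	return p->state == FAKE_NONE || p->sched != NULL;
}
```

On a single CPU this program-order change is enough by construction; whether the two stores also need explicit ordering on weakly ordered SMP is exactly the residual question raised later in the thread.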
* [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
  2026-05-04 20:31 ` Tejun Heo
@ 2026-05-06  5:40 ` zhidao su
  2026-05-06  6:16   ` sashiko-bot
  2026-05-10 13:55   ` Tejun Heo
  1 sibling, 2 replies; 6+ messages in thread
From: zhidao su @ 2026-05-06 5:40 UTC (permalink / raw)
  To: tj; +Cc: void, arighi, changwoo, sched-ext, linux-kernel

In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root. scx_root_enable_workfn() iterates all tasks and for each
releases scx_tasks_lock via scx_task_iter_unlock() before calling
scx_init_task(). A concurrent sched_ext_dead() can race in this window.

Two bugs:

1. NULL deref: If sched_ext_dead() runs after scx_init_task() sets
   state=INIT but before the callsite sets p->scx.sched, the invariant
   "state != NONE => p->scx.sched != NULL" is broken. sched_ext_dead()
   calls scx_disable_and_exit_task(scx_task_sched(p)=NULL, p), which
   crashes in SCX_HAS_OP(NULL, ...).

2. Resource leak: If sched_ext_dead() runs before scx_init_task() when
   state=NONE, it skips scx_disable_and_exit_task() (state check fails).
   scx_init_task() then calls ops.init_task() and sets state=INIT. The
   enable loop never calls ops.exit_task(), leaking whatever
   ops.init_task() allocated.

Fix both:

- Move scx_set_task_sched(p, sch) into scx_init_task(), before the state
  transition off NONE. This restores the invariant so sched_ext_dead()
  always finds a valid scheduler pointer (fixes bug 1).

- After scx_init_task() returns, check under scx_tasks_lock whether @p
  is still on scx_tasks. If not, sched_ext_dead() raced us. If
  state != NONE, ops.init_task() ran before sched_ext_dead() saw
  state=NONE, so call scx_disable_and_exit_task() with cancelled=true to
  release the resources (fixes bug 2). If state=NONE, sched_ext_dead()
  already cleaned up.

Fixes: 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
v2: Rewrite as writer-side fix per Tejun's review:
- Move scx_set_task_sched(p, sch) into scx_init_task() before the state
  transition off NONE, restoring the "state!=NONE => p->scx.sched!=NULL"
  invariant. Bug 1 (NULL deref) is fixed without touching
  sched_ext_dead().
- Handle bug 2 (resource leak) in the workfn's list_empty() path by
  calling scx_disable_and_exit_task() when state!=NONE, instead of the
  v1 reader-side branch in sched_ext_dead() that leaked resources.
- Update Fixes: to 88234b075c3f ("sched_ext: Introduce
  scx_task_sched[_rcu]()") which is when scx_task_sched(p) started
  dereferencing p->scx.sched.

 kernel/sched/ext.c | 59 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..99560f77af81 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3583,7 +3583,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
 	/*
 	 * While @p's rq is not locked. @p is not visible to the rest of
 	 * SCX yet and it's safe to update the flags and state.
+	 *
+	 * Install p->scx.sched before transitioning state off NONE so
+	 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
+	 * soon as state becomes observable. A concurrent sched_ext_dead()
+	 * that races the INIT window will then always find a valid
+	 * scheduler pointer and can call scx_disable_and_exit_task()
+	 * to release resources allocated by ops.init_task().
 	 */
+	scx_set_task_sched(p, sch);
 	p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 	scx_set_task_state(p, SCX_TASK_INIT);
 }
@@ -3769,8 +3777,6 @@ void scx_pre_fork(struct task_struct *p)
 
 int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
-	s32 ret;
-
 	percpu_rwsem_assert_held(&scx_fork_rwsem);
 
 	p->scx.tid = scx_alloc_tid();
@@ -3781,10 +3787,7 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
-		ret = scx_init_task(sch, p, true);
-		if (!ret)
-			scx_set_task_sched(p, sch);
-		return ret;
+		return scx_init_task(sch, p, true);
 	}
 
 	return 0;
@@ -6937,7 +6940,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
-		scx_set_task_sched(p, sch);
+		/*
+		 * sched_ext_dead() may have raced while locks were dropped in
+		 * scx_task_iter_unlock(). Two cases:
+		 *
+		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
+		 *     it called scx_disable_and_exit_task() (cancelled=true) and
+		 *     reset state to NONE. ops.exit_task() already ran; skip.
+		 *
+		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
+		 *     the time): it skipped scx_disable_and_exit_task() because
+		 *     state was NONE. scx_init_task() subsequently called
+		 *     ops.init_task() and set state=INIT, leaving allocated
+		 *     resources with no owner. We must call
+		 *     scx_disable_and_exit_task() here to release them.
+		 *
+		 * Distinguish case (a) from (b) by reading state: (a) leaves
+		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
+		 * state=INIT (set by scx_init_task, never reset).
+		 */
+		{
+			bool p_dead = false, need_exit = false;
+
+			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+				if (list_empty(&p->scx.tasks_node)) {
+					p_dead = true;
+					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
+				}
+			}
+
+			if (p_dead) {
+				if (need_exit) {
+					struct rq_flags rf;
+					struct rq *rq;
+
+					rq = task_rq_lock(p, &rf);
+					scx_disable_and_exit_task(sch, p);
+					task_rq_unlock(rq, p, &rf);
+				}
+				put_task_struct(p);
+				continue;
+			}
+		}
+
 		scx_set_task_state(p, SCX_TASK_READY);
 
 		/*
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
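The case (a)/(b) disambiguation in the v2 workfn reduces to a small decision table, sketched here with stand-in types (not the kernel's): given whether a racing dead path already unlinked the task and what state it left behind, decide whether the enable loop must run the cancelled exit call itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the v2 decision, not kernel code. */
enum fake_state { FAKE_NONE, FAKE_INIT, FAKE_READY };

/*
 * Returns true when the enable loop must run the cancelled exit call:
 * case (b), where the dead path skipped cleanup because it observed
 * state == NONE, and init_task() later allocated resources anyway.
 * Case (a) leaves state == NONE because the dead path already exited
 * the task, so nothing is owed.
 */
static bool fake_need_exit(bool off_list, enum fake_state state)
{
	return off_list && state != FAKE_NONE;
}

/* Unlinked tasks never reach the READY transition in the loop. */
static bool fake_skip_ready(bool off_list)
{
	return off_list;
}
```

The two predicates together encode the `p_dead`/`need_exit` pair computed under `scx_tasks_lock` in the patch.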
* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
@ 2026-05-06  6:16   ` sashiko-bot
  2026-05-07  2:32     ` zhidao su
  2026-05-10 13:55   ` Tejun Heo
  1 sibling, 1 reply; 6+ messages in thread
From: sashiko-bot @ 2026-05-06 6:16 UTC (permalink / raw)
  To: zhidao su; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential
issue(s) to consider:

- [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding
  `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU)
  data race against `sched_ext_dead()`.
- [High] A memory ordering vulnerability defeats the bug 1 fix on weakly
  ordered architectures, allowing the NULL pointer dereference to still
  occur.
- [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the
  `scx_root_enable_workfn()` cleanup path exposes dead tasks to cgroup
  iterators, causing sub-scheduler enables to abort.

--

commit 1d291365affa3415c03f2031e9198ebe7208a245
Author: zhidao su <soolaugust@gmail.com>

    sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

    In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
    of scx_root. scx_root_enable_workfn() iterates all tasks and for each
    releases scx_tasks_lock via scx_task_iter_unlock() before calling
    scx_init_task().

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5600618dcef8e..98505577d56ca 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3690,7 +3690,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
>  	/*
>  	 * While @p's rq is not locked. @p is not visible to the rest of
>  	 * SCX yet and it's safe to update the flags and state.
> +	 *
> +	 * Install p->scx.sched before transitioning state off NONE so
> +	 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
> +	 * soon as state becomes observable. A concurrent sched_ext_dead()
> +	 * that races the INIT window will then always find a valid
> +	 * scheduler pointer and can call scx_disable_and_exit_task()
> +	 * to release resources allocated by ops.init_task().
>  	 */
> +	scx_set_task_sched(p, sch);
>  	p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
>  	scx_set_task_state(p, SCX_TASK_INIT);
>  }

Can this result in a memory ordering vulnerability on weakly ordered
architectures?

In scx_init_task(), scx_set_task_sched() uses rcu_assign_pointer(), which
relies on smp_store_release(). While smp_store_release() guarantees prior
operations are visible, it does not prevent subsequent memory operations
from being reordered before the release.

Could the CPU commit the store updating p->scx.flags to SCX_TASK_INIT
before the store to p->scx.sched is completed? If a concurrent
sched_ext_dead() observes state == SCX_TASK_INIT, branches into the
cleanup path, and reads a NULL scheduler pointer, it would pass it to
scx_disable_and_exit_task() which crashes in SCX_HAS_OP().

Does this require a full write memory barrier (e.g., smp_wmb()) between
setting the scheduler pointer and transitioning the state?

[ ... ]

> @@ -7130,7 +7133,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  			goto err_disable_unlock_all;
>  		}
> 
> -		scx_set_task_sched(p, sch);
> +		/*
> +		 * sched_ext_dead() may have raced while locks were dropped in
> +		 * scx_task_iter_unlock(). Two cases:
> +		 *
> +		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
> +		 *     it called scx_disable_and_exit_task() (cancelled=true) and
> +		 *     reset state to NONE. ops.exit_task() already ran; skip.
> +		 *
> +		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
> +		 *     the time): it skipped scx_disable_and_exit_task() because
> +		 *     state was NONE. scx_init_task() subsequently called
> +		 *     ops.init_task() and set state=INIT, leaving allocated
> +		 *     resources with no owner. We must call
> +		 *     scx_disable_and_exit_task() here to release them.
> +		 *
> +		 * Distinguish case (a) from (b) by reading state: (a) leaves
> +		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
> +		 * state=INIT (set by scx_init_task, never reset).
> +		 */
> +		{
> +			bool p_dead = false, need_exit = false;
> +
> +			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
> +				if (list_empty(&p->scx.tasks_node)) {
> +					p_dead = true;
> +					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
> +				}
> +			}
> +
> +			if (p_dead) {
> +				if (need_exit) {
> +					struct rq_flags rf;
> +					struct rq *rq;
> +
> +					rq = task_rq_lock(p, &rf);
> +					scx_disable_and_exit_task(sch, p);
> +					task_rq_unlock(rq, p, &rf);

Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

If sched_ext_dead() runs before scx_init_task() completes, it sees state
NONE and skips setting SCX_TASK_OFF_TASKS. When this cleanup path calls
scx_disable_and_exit_task() under task_rq_lock(), the SCX_TASK_OFF_TASKS
flag is not set before the runqueue is unlocked.

Because the flag is missing, if a cgroup iterator traverses the cgroup
before cgroup_task_dead() removes the task, could the iterator erroneously
yield the dead task with state NONE, hit assert_task_ready_or_enabled(p),
and abort the sub-scheduler enable?

> +				}
> +				put_task_struct(p);
> +				continue;
> +			}
> +		}
> +
>  		scx_set_task_state(p, SCX_TASK_READY);

Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
sched_ext_dead()?

The code checks list_empty(&p->scx.tasks_node) under scx_tasks_lock to
determine if a concurrent sched_ext_dead() has started processing.

However, once scx_tasks_lock is dropped, sched_ext_dead() can acquire it,
remove the task, and proceed to exit the task, eventually doing
p->scx.flags |= SCX_TASK_OFF_TASKS under task_rq_lock(p).

Meanwhile, scx_root_enable_workfn() continues past the lock and
unconditionally calls scx_set_task_state(p, SCX_TASK_READY) without
holding task_rq_lock(p). Because scx_set_task_state() performs a
non-atomic read-modify-write on p->scx.flags, could this overwrite and
permanently lose the SCX_TASK_OFF_TASKS flag?

Also, could this incorrectly transition a dead task to SCX_TASK_READY
after its scheduler resources were torn down, triggering a warning?

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260506054001.1105522-1-suzhidao@xiaomi.com?part=1

^ permalink raw reply	[flat|nested] 6+ messages in thread
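The ordering question the review raises maps onto the C11 release/acquire pattern. The sketch below uses stand-in variables, not the kernel's primitives, and illustrates one possible pairing: the release is placed on the *state* store (the later of the two writes), with a matching acquire on the reader's state load, so that observing state != NONE also guarantees the pointer store is visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative stand-ins for p->scx.sched and the task state. */
enum fake_state { FAKE_NONE, FAKE_INIT };

static _Atomic(void *) fake_sched;
static _Atomic int fake_state_var = FAKE_NONE;

/* Writer: the pointer store itself may be relaxed as long as the state
 * store that publishes it is a release. */
static void fake_publish(void *sch)
{
	atomic_store_explicit(&fake_sched, sch, memory_order_relaxed);
	atomic_store_explicit(&fake_state_var, FAKE_INIT, memory_order_release);
}

/* Reader: an acquire load of state pairs with the release above, so a
 * non-NONE state implies the pointer write is visible. */
static void *fake_observe(void)
{
	if (atomic_load_explicit(&fake_state_var, memory_order_acquire) == FAKE_INIT)
		return atomic_load_explicit(&fake_sched, memory_order_relaxed);
	return NULL;
}
```

A release on the pointer store alone (the rcu_assign_pointer() shape) orders earlier writes before the pointer, not the pointer before later writes, which is the gap the review points at; whether the kernel path actually needs this pairing is the open question put to the maintainer below.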
* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  6:16 ` sashiko-bot
@ 2026-05-07  2:32   ` zhidao su
  0 siblings, 0 replies; 6+ messages in thread
From: zhidao su @ 2026-05-07 2:32 UTC (permalink / raw)
  To: sashiko; +Cc: sched-ext, Tejun Heo

On Wed, May 06 2026, sashiko-bot@kernel.org wrote:

> - [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding
>   `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU)
>   data race against `sched_ext_dead()`.
> [...]
> Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
> sched_ext_dead()?

When p_dead is true the code does `continue`, so scx_set_task_state(p,
SCX_TASK_READY) is never reached for dead tasks. The TOCTOU window
doesn't exist.

> - [High] A memory ordering vulnerability defeats the bug 1 fix on weakly
>   ordered architectures, allowing the NULL pointer dereference to still
>   occur.
> [...]
> Does this require a full write memory barrier (e.g., smp_wmb()) between
> setting the scheduler pointer and transitioning the state?

I'm not confident whether the store ordering between scx_set_task_sched()
and the state transition needs an explicit barrier here. Tejun, what's
the intended model for scx flags access in this path?

> - [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the
>   scx_root_enable_workfn() cleanup path exposes dead tasks to cgroup
>   iterators.
> [...]
> Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

Not needed here. scx_task_iter_next_locked() checks SCX_TASK_OFF_TASKS
only for cgroup-scoped iteration, which is only reachable from
scx_sub_enable_workfn(). That function requires scx_enabled() at entry;
scx_root_enable_workfn() has not called scx_ops_enable() yet at this
point, so no sub-sched enable can be running concurrently.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
  2026-05-06  5:40 ` [PATCH v2] " zhidao su
  2026-05-06  6:16   ` sashiko-bot
@ 2026-05-10 13:55   ` Tejun Heo
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-10 13:55 UTC (permalink / raw)
  To: suzhidao
  Cc: void, arighi, changwoo, emil, sched-ext, linux-kernel, Tejun Heo

Hello,

Thanks for the report and the patches. The same race window also affects
the analogous sub-sched paths, and the wrapper-disable paths trip on the
NONE state that scx_fail_parent() leaves behind, so I ended up taking a
more invasive route - extending the task state machine with
SCX_TASK_INIT_BEGIN and SCX_TASK_DEAD - rather than continuing with your
localized fix. Posted as a 6-patch series:

  https://lore.kernel.org/all/20260510074113.2049514-1-tj@kernel.org/

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-05-10 13:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
2026-05-04 20:31 ` Tejun Heo
2026-05-06  5:40 ` [PATCH v2] " zhidao su
2026-05-06  6:16   ` sashiko-bot
2026-05-07  2:32     ` zhidao su
2026-05-10 13:55   ` Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox