* Re: [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
[not found] <20260429133155.3825247-1-suzhidao@xiaomi.com>
@ 2026-05-04 20:31 ` Tejun Heo
2026-05-06 5:40 ` [PATCH v2] " zhidao su
1 sibling, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2026-05-04 20:31 UTC (permalink / raw)
To: zhidao su
Cc: zhidao su, David Vernet, Andrea Righi, Changwoo Min, sched-ext,
linux-kernel
Hello,
The race seems real - thanks for catching it - but I'm not sure the
reader-side fix is the right shape. The new branch in sched_ext_dead()
resets state to NONE without a matching ops.exit_task(cancelled=true)
call, leaking whatever ops.init_task() set up; and the list_empty()
gate sits before scx_set_task_sched(), so a sched_ext_dead() that
races after sch is installed but before state goes READY would still
flip state to NONE under us.
Worth exploring on the writer side instead: reorder so p->scx.sched
is installed before state transitions off NONE. That restores the
"state != NONE -> p->scx.sched != NULL" invariant and the existing
sched_ext_dead() handles the rest. I haven't fully traced this
through - there may still be a residual window between INIT and the
workfn's READY write - but it seems like a more promising direction.
Thanks.
--
tejun
* [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
[not found] <20260429133155.3825247-1-suzhidao@xiaomi.com>
2026-05-04 20:31 ` [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() Tejun Heo
@ 2026-05-06 5:40 ` zhidao su
2026-05-10 13:55 ` Tejun Heo
1 sibling, 1 reply; 3+ messages in thread
From: zhidao su @ 2026-05-06 5:40 UTC (permalink / raw)
To: tj; +Cc: void, arighi, changwoo, sched-ext, linux-kernel
With CONFIG_EXT_SUB_SCHED enabled, scx_task_sched(p) returns p->scx.sched instead
of scx_root. scx_root_enable_workfn() iterates all tasks and for each
releases scx_tasks_lock via scx_task_iter_unlock() before calling
scx_init_task(). A concurrent sched_ext_dead() can race in this window.
Two bugs:
1. NULL deref: If sched_ext_dead() runs after scx_init_task() sets
state=INIT but before the callsite sets p->scx.sched, the invariant
"state != NONE => p->scx.sched != NULL" is broken. sched_ext_dead()
then calls scx_disable_and_exit_task() with scx_task_sched(p) == NULL
and crashes in SCX_HAS_OP(NULL, ...).
2. Resource leak: If sched_ext_dead() runs before scx_init_task() when
state=NONE, it skips scx_disable_and_exit_task() (state check
fails). scx_init_task() then calls ops.init_task() and sets
state=INIT. The enable loop never calls ops.exit_task(), leaking
whatever ops.init_task() allocated.
Fix both:
- Move scx_set_task_sched(p, sch) into scx_init_task(), before the
state transition off NONE. This restores the invariant so
sched_ext_dead() always finds a valid scheduler pointer (fixes
bug 1).
- After scx_init_task() returns, check under scx_tasks_lock whether
@p is still on scx_tasks. If not, sched_ext_dead() raced us.
If state != NONE, ops.init_task() ran after sched_ext_dead() had
already seen state=NONE and skipped cleanup, so call
scx_disable_and_exit_task() with cancelled=true to release the
resources (fixes bug 2). If state == NONE, sched_ext_dead() already
cleaned up.
Fixes: 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
---
v2: Rewrite as writer-side fix per Tejun's review:
- Move scx_set_task_sched(p, sch) into scx_init_task() before the state
transition off NONE, restoring the "state!=NONE => p->scx.sched!=NULL"
invariant. Bug 1 (NULL deref) is fixed without touching sched_ext_dead().
- Handle bug 2 (resource leak) in the workfn's list_empty() path by
calling scx_disable_and_exit_task() when state!=NONE, instead of the
v1 reader-side branch in sched_ext_dead() that leaked resources.
- Update Fixes: to 88234b075c3f ("sched_ext: Introduce scx_task_sched[_rcu]()")
which is when scx_task_sched(p) started dereferencing p->scx.sched.
kernel/sched/ext.c | 59 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 52 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7b1b16e81a5..99560f77af81 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3583,7 +3583,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
/*
* While @p's rq is not locked. @p is not visible to the rest of
* SCX yet and it's safe to update the flags and state.
+ *
+ * Install p->scx.sched before transitioning state off NONE so
+ * that the invariant state!=NONE => p->scx.sched!=NULL holds as
+ * soon as state becomes observable. A concurrent sched_ext_dead()
+ * that races the INIT window will then always find a valid
+ * scheduler pointer and can call scx_disable_and_exit_task()
+ * to release resources allocated by ops.init_task().
*/
+ scx_set_task_sched(p, sch);
p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
scx_set_task_state(p, SCX_TASK_INIT);
}
@@ -3769,8 +3777,6 @@ void scx_pre_fork(struct task_struct *p)
int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
{
- s32 ret;
-
percpu_rwsem_assert_held(&scx_fork_rwsem);
p->scx.tid = scx_alloc_tid();
@@ -3781,10 +3787,7 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
#else
struct scx_sched *sch = scx_root;
#endif
- ret = scx_init_task(sch, p, true);
- if (!ret)
- scx_set_task_sched(p, sch);
- return ret;
+ return scx_init_task(sch, p, true);
}
return 0;
@@ -6937,7 +6940,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
goto err_disable_unlock_all;
}
- scx_set_task_sched(p, sch);
+ /*
+ * sched_ext_dead() may have raced while locks were dropped in
+ * scx_task_iter_unlock(). Two cases:
+ *
+ * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
+ * it called scx_disable_and_exit_task() (cancelled=true) and
+ * reset state to NONE. ops.exit_task() already ran; skip.
+ *
+ * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
+ * the time): it skipped scx_disable_and_exit_task() because
+ * state was NONE. scx_init_task() subsequently called
+ * ops.init_task() and set state=INIT, leaving allocated
+ * resources with no owner. We must call
+ * scx_disable_and_exit_task() here to release them.
+ *
+ * Distinguish case (a) from (b) by reading state: (a) leaves
+ * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
+ * state=INIT (set by scx_init_task, never reset).
+ */
+ {
+ bool p_dead = false, need_exit = false;
+
+ scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
+ if (list_empty(&p->scx.tasks_node)) {
+ p_dead = true;
+ need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
+ }
+ }
+
+ if (p_dead) {
+ if (need_exit) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ scx_disable_and_exit_task(sch, p);
+ task_rq_unlock(rq, p, &rf);
+ }
+ put_task_struct(p);
+ continue;
+ }
+ }
+
scx_set_task_state(p, SCX_TASK_READY);
/*
--
2.43.0
* Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
2026-05-06 5:40 ` [PATCH v2] " zhidao su
@ 2026-05-10 13:55 ` Tejun Heo
0 siblings, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2026-05-10 13:55 UTC (permalink / raw)
To: suzhidao; +Cc: void, arighi, changwoo, emil, sched-ext, linux-kernel, Tejun Heo
Hello,
Thanks for the report and the patches. The same race window also
affects the analogous sub-sched paths, and the wrapper-disable paths
trip on the NONE state that scx_fail_parent() leaves behind, so I
ended up taking a more invasive route - extending the task state
machine with SCX_TASK_INIT_BEGIN and SCX_TASK_DEAD - rather than
continuing with your localized fix.
Posted as a 6-patch series:
https://lore.kernel.org/all/20260510074113.2049514-1-tj@kernel.org/
Thanks.
--
tejun