The Linux Kernel Mailing List
* [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization
@ 2026-05-10  7:41 Tejun Heo
  2026-05-10  7:41 ` [PATCH 1/6] sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work Tejun Heo
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

Hello,

zhidao su reported a NULL deref and an ops.init_task() leak when
sched_ext_dead() races scx_root_enable_workfn() in CONFIG_EXT_SUB_SCHED
kernels [1]. The same race window also affects the analogous sub-sched paths
(scx_sub_enable_workfn()'s per-task init pass and scx_sub_disable()'s
migration loop), and the wrapper-disable paths trip on the NONE state that
scx_fail_parent() leaves behind. Closing all of these calls for a
state-machine extension rather than a localized fix.

The series introduces SCX_TASK_INIT_BEGIN as an explicit intermediate state
between NONE and INIT, and replaces the SCX_TASK_OFF_TASKS marker flag with
a real SCX_TASK_DEAD terminal state. With the state machine in place, every
init path uses the same handshake: write INIT_BEGIN under rq lock, init
outside the lock, recheck DEAD under rq lock, unwind via
scx_sub_init_cancel_task() on hit. The wrapper-disable and
switched_from_scx() paths get NONE early-returns to handle the
scx_fail_parent() residue.
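
Condensed, the root-enable flavor of the handshake (the real code takes the
iterator's rq lock for the INIT_BEGIN store and also handles init failure;
see patch 4):

  scx_set_task_state(p, SCX_TASK_INIT_BEGIN);     /* under @p's rq lock */
  /* rq lock dropped */
  ret = __scx_init_task(sch, p, false);           /* calls ops.init_task() */
  /* rq lock retaken */
  if (scx_get_task_state(p) == SCX_TASK_DEAD)
          scx_sub_init_cancel_task(sch, p);       /* dead() won; undo init */
  else
          scx_set_task_state(p, SCX_TASK_INIT);   /* then on to READY */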

This series is more invasive than zhidao's patches but covers the related races
uniformly and avoids the implicit list_empty() check his approach relies
on. Credit to him for finding and reporting the bug.

 0001 sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
 0002 sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
 0003 sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
 0004 sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
 0005 sched_ext: Close sub-sched init race with post-init DEAD recheck
 0006 sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths

Based on sched_ext/for-7.1-fixes (ab28a0673daa).

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-7.1-fixes-dead-race

Verified with a debug patch that widens the unlocked init windows on the
root and sub-sched paths and counts post-init DEAD-recheck hits.
Reproducers exercise each of the original races plus the scx_fail_parent
NONE-state regression, followed by a multi-iteration stress under fork
churn. Counters show the windows are hit and no
BUG/WARNING/Oops/Invalid-task-state appears.
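
For illustration, the shape of the debug instrumentation, with hypothetical
names (the actual debug patch isn't posted here):

  ret = __scx_init_task(sch, p, false);
  if (debug_widen_init_window)              /* hypothetical knob */
          msleep(10);                       /* widen the race window */

  rq = task_rq_lock(p, &rf);
  if (scx_get_task_state(p) == SCX_TASK_DEAD)
          atomic_inc(&dead_recheck_hits);   /* hypothetical counter */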

[1] https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/

 include/linux/sched/ext.h |  17 ++--
 kernel/sched/ext.c        | 221 +++++++++++++++++++++++++++++++---------------
 2 files changed, 162 insertions(+), 76 deletions(-)

Thanks.

--
tejun


* [PATCH 1/6] sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10  7:41 ` [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() Tejun Heo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

Cleanups in preparation for the state-machine work that follows:

- Convert three sub-sched call sites that open-code
  rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched().

- Move scx_get_task_state()/scx_set_task_state() above the SCX task iter
  section so scx_task_iter_next_locked() can use them without a forward
  declaration.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 76 +++++++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index df305712a2d4..10c6e0261f11 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -711,6 +711,41 @@ struct bpf_iter_scx_dsq {
 } __attribute__((aligned(8)));
 
 
+static u32 scx_get_task_state(const struct task_struct *p)
+{
+	return p->scx.flags & SCX_TASK_STATE_MASK;
+}
+
+static void scx_set_task_state(struct task_struct *p, u32 state)
+{
+	u32 prev_state = scx_get_task_state(p);
+	bool warn = false;
+
+	switch (state) {
+	case SCX_TASK_NONE:
+		break;
+	case SCX_TASK_INIT:
+		warn = prev_state != SCX_TASK_NONE;
+		break;
+	case SCX_TASK_READY:
+		warn = prev_state == SCX_TASK_NONE;
+		break;
+	case SCX_TASK_ENABLED:
+		warn = prev_state != SCX_TASK_READY;
+		break;
+	default:
+		WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]",
+			  prev_state, state, p->comm, p->pid);
+		return;
+	}
+
+	WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]",
+		  prev_state, state, p->comm, p->pid);
+
+	p->scx.flags &= ~SCX_TASK_STATE_MASK;
+	p->scx.flags |= state;
+}
+
 /*
  * SCX task iterator.
  */
@@ -3499,41 +3534,6 @@ static struct cgroup *tg_cgrp(struct task_group *tg)
 
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 
-static u32 scx_get_task_state(const struct task_struct *p)
-{
-	return p->scx.flags & SCX_TASK_STATE_MASK;
-}
-
-static void scx_set_task_state(struct task_struct *p, u32 state)
-{
-	u32 prev_state = scx_get_task_state(p);
-	bool warn = false;
-
-	switch (state) {
-	case SCX_TASK_NONE:
-		break;
-	case SCX_TASK_INIT:
-		warn = prev_state != SCX_TASK_NONE;
-		break;
-	case SCX_TASK_READY:
-		warn = prev_state == SCX_TASK_NONE;
-		break;
-	case SCX_TASK_ENABLED:
-		warn = prev_state != SCX_TASK_READY;
-		break;
-	default:
-		WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]",
-			  prev_state, state, p->comm, p->pid);
-		return;
-	}
-
-	WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]",
-		  prev_state, state, p->comm, p->pid);
-
-	p->scx.flags &= ~SCX_TASK_STATE_MASK;
-	p->scx.flags |= state;
-}
-
 static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
 {
 	int ret;
@@ -5696,7 +5696,7 @@ static void scx_fail_parent(struct scx_sched *sch,
 
 		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
 			scx_disable_and_exit_task(sch, p);
-			rcu_assign_pointer(p->scx.sched, parent);
+			scx_set_task_sched(p, parent);
 		}
 	}
 	scx_task_iter_stop(&sti);
@@ -5783,7 +5783,7 @@ static void scx_sub_disable(struct scx_sched *sch)
 			 */
 			scx_disable_and_exit_task(sch, p);
 			scx_set_task_state(p, SCX_TASK_INIT);
-			rcu_assign_pointer(p->scx.sched, parent);
+			scx_set_task_sched(p, parent);
 			scx_set_task_state(p, SCX_TASK_READY);
 			scx_enable_task(parent, p);
 		}
@@ -7212,7 +7212,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 			 * $p is now only initialized for @sch and READY, which
 			 * is what we want. Assign it to @sch and enable.
 			 */
-			rcu_assign_pointer(p->scx.sched, sch);
+			scx_set_task_sched(p, sch);
 			scx_enable_task(sch, p);
 
 			p->scx.flags &= ~SCX_TASK_SUB_INIT;
-- 
2.54.0



* [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
  2026-05-10  7:41 ` [PATCH 1/6] sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10 17:20   ` Andrea Righi
  2026-05-10 20:04   ` [PATCH v2 " Tejun Heo
  2026-05-10  7:41 ` [PATCH 3/6] sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state Tejun Heo
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.

As a side effect, scx_sub_disable() migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 10c6e0261f11..81841277a54f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -726,6 +726,7 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 		break;
 	case SCX_TASK_INIT:
 		warn = prev_state != SCX_TASK_NONE;
+		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		break;
 	case SCX_TASK_READY:
 		warn = prev_state == SCX_TASK_NONE;
@@ -3585,22 +3586,6 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
 	return 0;
 }
 
-static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
-{
-	int ret;
-
-	ret = __scx_init_task(sch, p, fork);
-	if (!ret) {
-		/*
-		 * While @p's rq is not locked. @p is not visible to the rest of
-		 * SCX yet and it's safe to update the flags and state.
-		 */
-		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
-		scx_set_task_state(p, SCX_TASK_INIT);
-	}
-	return ret;
-}
-
 static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
@@ -3763,10 +3748,11 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
-		ret = scx_init_task(sch, p, true);
-		if (!ret)
-			scx_set_task_sched(p, sch);
-		return ret;
+		ret = __scx_init_task(sch, p, true);
+		if (unlikely(ret))
+			return ret;
+		scx_set_task_state(p, SCX_TASK_INIT);
+		scx_set_task_sched(p, sch);
 	}
 
 	return 0;
@@ -6897,8 +6883,8 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 		scx_task_iter_unlock(&sti);
 
-		ret = scx_init_task(sch, p, false);
-		if (ret) {
+		ret = __scx_init_task(sch, p, false);
+		if (unlikely(ret)) {
 			put_task_struct(p);
 			scx_task_iter_stop(&sti);
 			scx_error(sch, "ops.init_task() failed (%d) for %s[%d]",
@@ -6906,6 +6892,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
+		scx_set_task_state(p, SCX_TASK_INIT);
 		scx_set_task_sched(p, sch);
 		scx_set_task_state(p, SCX_TASK_READY);
 
-- 
2.54.0



* [PATCH 3/6] sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
  2026-05-10  7:41 ` [PATCH 1/6] sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work Tejun Heo
  2026-05-10  7:41 ` [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10  7:41 ` [PATCH 4/6] sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN Tejun Heo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup
task iteration would skip them. This can be expressed better with a task
state. Replace the flag with SCX_TASK_DEAD.

scx_disable_and_exit_task() resets state to NONE on its way out, so
sched_ext_dead() now sets DEAD after the wrapper returns. The validation
matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's
predecessor to INIT or ENABLED so the new DEAD value cannot silently
transition to READY.
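
Summarized from the updated scx_set_task_state() switch, the accepted
transitions are now (anything else trips the WARN):

  NONE     <-  any state except DEAD
  INIT     <-  NONE
  READY    <-  INIT, ENABLED
  ENABLED  <-  READY
  DEAD     <-  NONE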

Prepares for the following enable vs dead race fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h |  9 +++++----
 kernel/sched/ext.c        | 17 +++++++++++------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index adb9a4de068a..9f1a326ad03e 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -101,24 +101,25 @@ enum scx_ent_flags {
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 	SCX_TASK_SUB_INIT	= 1 << 4, /* task being initialized for a sub sched */
 	SCX_TASK_IMMED		= 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */
-	SCX_TASK_OFF_TASKS	= 1 << 6, /* removed from scx_tasks by sched_ext_dead() */
 
 	/*
-	 * Bits 8 and 9 are used to carry task state:
+	 * Bits 8 to 10 are used to carry task state:
 	 *
 	 * NONE		ops.init_task() not called yet
 	 * INIT		ops.init_task() succeeded, but task can be cancelled
 	 * READY	fully initialized, but not in sched_ext
 	 * ENABLED	fully initialized and in sched_ext
+	 * DEAD		terminal state set by sched_ext_dead()
 	 */
-	SCX_TASK_STATE_SHIFT	= 8,	  /* bits 8 and 9 are used to carry task state */
-	SCX_TASK_STATE_BITS	= 2,
+	SCX_TASK_STATE_SHIFT	= 8,
+	SCX_TASK_STATE_BITS	= 3,
 	SCX_TASK_STATE_MASK	= ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,
 
 	SCX_TASK_NONE		= 0 << SCX_TASK_STATE_SHIFT,
 	SCX_TASK_INIT		= 1 << SCX_TASK_STATE_SHIFT,
 	SCX_TASK_READY		= 2 << SCX_TASK_STATE_SHIFT,
 	SCX_TASK_ENABLED	= 3 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_DEAD		= 4 << SCX_TASK_STATE_SHIFT,
 
 	/*
 	 * Bits 12 and 13 are used to carry reenqueue reason. In addition to
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 81841277a54f..2fc4a12711f9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -723,17 +723,22 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 
 	switch (state) {
 	case SCX_TASK_NONE:
+		warn = prev_state == SCX_TASK_DEAD;
 		break;
 	case SCX_TASK_INIT:
 		warn = prev_state != SCX_TASK_NONE;
 		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		break;
 	case SCX_TASK_READY:
-		warn = prev_state == SCX_TASK_NONE;
+		warn = !(prev_state == SCX_TASK_INIT ||
+			 prev_state == SCX_TASK_ENABLED);
 		break;
 	case SCX_TASK_ENABLED:
 		warn = prev_state != SCX_TASK_READY;
 		break;
+	case SCX_TASK_DEAD:
+		warn = prev_state != SCX_TASK_NONE;
+		break;
 	default:
 		WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]",
 			  prev_state, state, p->comm, p->pid);
@@ -972,11 +977,11 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 		/*
 		 * cgroup_task_dead() removes the dead tasks from cset->tasks
 		 * after sched_ext_dead() and cgroup iteration may see tasks
-		 * which already finished sched_ext_dead(). %SCX_TASK_OFF_TASKS
-		 * is set by sched_ext_dead() under @p's rq lock. Test it to
+		 * which already finished sched_ext_dead(). %SCX_TASK_DEAD is
+		 * set by sched_ext_dead() under @p's rq lock. Test it to
 		 * avoid visiting tasks which are already dead from SCX POV.
 		 */
-		if (p->scx.flags & SCX_TASK_OFF_TASKS) {
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
 			__scx_task_iter_rq_unlock(iter);
 			continue;
 		}
@@ -3847,7 +3852,7 @@ void sched_ext_dead(struct task_struct *p)
 	 * @p is off scx_tasks and wholly ours. scx_root_enable()'s READY ->
 	 * ENABLED transitions can't race us. Disable ops for @p.
 	 *
-	 * %SCX_TASK_OFF_TASKS synchronizes against cgroup task iteration - see
+	 * %SCX_TASK_DEAD synchronizes against cgroup task iteration - see
 	 * scx_task_iter_next_locked(). NONE tasks need no marking: cgroup
 	 * iteration is only used from sub-sched paths, which require root
 	 * enabled. Root enable transitions every live task to at least READY.
@@ -3858,7 +3863,7 @@ void sched_ext_dead(struct task_struct *p)
 
 		rq = task_rq_lock(p, &rf);
 		scx_disable_and_exit_task(scx_task_sched(p), p);
-		p->scx.flags |= SCX_TASK_OFF_TASKS;
+		scx_set_task_state(p, SCX_TASK_DEAD);
 		task_rq_unlock(rq, p, &rf);
 	}
 }
-- 
2.54.0



* [PATCH 4/6] sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
                   ` (2 preceding siblings ...)
  2026-05-10  7:41 ` [PATCH 3/6] sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10  7:41 ` [PATCH 5/6] sched_ext: Close sub-sched init race with post-init DEAD recheck Tejun Heo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a
TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits
when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before
@p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL,
exit_task), or observes SCX_TASK_NONE during the unlocked init window and
skips cleanup so exit_task() never runs.

Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the
iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN ->
INIT -> READY. A sched_ext_dead() that wins the rq-lock race observes
INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck
unwinds via scx_sub_init_cancel_task().
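
When sched_ext_dead() wins the rq-lock race, the interleaving becomes:

  scx_root_enable_workfn()           sched_ext_dead()

  iter rq lock
    NONE -> INIT_BEGIN
  iter rq unlock
  ops.init_task()                    task_rq_lock()
                                       sees INIT_BEGIN, skips ops
                                       INIT_BEGIN -> DEAD
                                     task_rq_unlock()
  task_rq_lock()
    sees DEAD
    scx_sub_init_cancel_task()
  task_rq_unlock()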

scx_fork() runs single-threaded against sched_ext_dead() (the task is not on
scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk
needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure.

The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD
edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s
migration writes INIT_BEGIN as a synthetic predecessor to satisfy the
tightened validation.

The sub-sched paths still race with sched_ext_dead() during the unlocked
init window. This will be fixed by the next patch.

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h | 10 ++++---
 kernel/sched/ext.c        | 56 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 9f1a326ad03e..2129e18ada58 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -106,6 +106,7 @@ enum scx_ent_flags {
 	 * Bits 8 to 10 are used to carry task state:
 	 *
 	 * NONE		ops.init_task() not called yet
+	 * INIT_BEGIN	ops.init_task() in flight; see sched_ext_dead()
 	 * INIT		ops.init_task() succeeded, but task can be cancelled
 	 * READY	fully initialized, but not in sched_ext
 	 * ENABLED	fully initialized and in sched_ext
@@ -116,10 +117,11 @@ enum scx_ent_flags {
 	SCX_TASK_STATE_MASK	= ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,
 
 	SCX_TASK_NONE		= 0 << SCX_TASK_STATE_SHIFT,
-	SCX_TASK_INIT		= 1 << SCX_TASK_STATE_SHIFT,
-	SCX_TASK_READY		= 2 << SCX_TASK_STATE_SHIFT,
-	SCX_TASK_ENABLED	= 3 << SCX_TASK_STATE_SHIFT,
-	SCX_TASK_DEAD		= 4 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_INIT_BEGIN	= 1 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_INIT		= 2 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_READY		= 3 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_ENABLED	= 4 << SCX_TASK_STATE_SHIFT,
+	SCX_TASK_DEAD		= 5 << SCX_TASK_STATE_SHIFT,
 
 	/*
 	 * Bits 12 and 13 are used to carry reenqueue reason. In addition to
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2fc4a12711f9..29fa9ffe7c7b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -725,8 +725,11 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 	case SCX_TASK_NONE:
 		warn = prev_state == SCX_TASK_DEAD;
 		break;
-	case SCX_TASK_INIT:
+	case SCX_TASK_INIT_BEGIN:
 		warn = prev_state != SCX_TASK_NONE;
+		break;
+	case SCX_TASK_INIT:
+		warn = prev_state != SCX_TASK_INIT_BEGIN;
 		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		break;
 	case SCX_TASK_READY:
@@ -737,7 +740,8 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 		warn = prev_state != SCX_TASK_READY;
 		break;
 	case SCX_TASK_DEAD:
-		warn = prev_state != SCX_TASK_NONE;
+		warn = !(prev_state == SCX_TASK_NONE ||
+			 prev_state == SCX_TASK_INIT_BEGIN);
 		break;
 	default:
 		WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]",
@@ -3753,9 +3757,12 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
+		scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
 		ret = __scx_init_task(sch, p, true);
-		if (unlikely(ret))
+		if (unlikely(ret)) {
+			scx_set_task_state(p, SCX_TASK_NONE);
 			return ret;
+		}
 		scx_set_task_state(p, SCX_TASK_INIT);
 		scx_set_task_sched(p, sch);
 	}
@@ -3856,13 +3863,18 @@ void sched_ext_dead(struct task_struct *p)
 	 * scx_task_iter_next_locked(). NONE tasks need no marking: cgroup
 	 * iteration is only used from sub-sched paths, which require root
 	 * enabled. Root enable transitions every live task to at least READY.
+	 *
+	 * %INIT_BEGIN means ops.init_task() is running for @p. Don't call
+	 * into ops; transition to %DEAD so the post-init recheck unwinds
+	 * via scx_sub_init_cancel_task().
 	 */
 	if (scx_get_task_state(p) != SCX_TASK_NONE) {
 		struct rq_flags rf;
 		struct rq *rq;
 
 		rq = task_rq_lock(p, &rf);
-		scx_disable_and_exit_task(scx_task_sched(p), p);
+		if (scx_get_task_state(p) != SCX_TASK_INIT_BEGIN)
+			scx_disable_and_exit_task(scx_task_sched(p), p);
 		scx_set_task_state(p, SCX_TASK_DEAD);
 		task_rq_unlock(rq, p, &rf);
 	}
@@ -5773,6 +5785,7 @@ static void scx_sub_disable(struct scx_sched *sch)
 			 * $p having already been initialized, and then enable.
 			 */
 			scx_disable_and_exit_task(sch, p);
+			scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
 			scx_set_task_state(p, SCX_TASK_INIT);
 			scx_set_task_sched(p, parent);
 			scx_set_task_state(p, SCX_TASK_READY);
@@ -6878,6 +6891,9 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 	scx_task_iter_start(&sti, NULL);
 	while ((p = scx_task_iter_next_locked(&sti))) {
+		struct rq_flags rf;
+		struct rq *rq;
+
 		/*
 		 * @p may already be dead, have lost all its usages counts and
 		 * be waiting for RCU grace period before being freed. @p can't
@@ -6886,10 +6902,26 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		if (!tryget_task_struct(p))
 			continue;
 
+		/*
+		 * Set %INIT_BEGIN under the iter's rq lock so that a concurrent
+		 * sched_ext_dead() does not call ops.exit_task() on @p while
+		 * ops.init_task() is running. If sched_ext_dead() runs before
+		 * this store, it has already removed @p from scx_tasks and the
+		 * iter won't visit @p; if it runs after, it observes
+		 * %INIT_BEGIN and transitions to %DEAD without calling ops,
+		 * leaving the post-init recheck below to unwind.
+		 */
+		scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
 		scx_task_iter_unlock(&sti);
 
 		ret = __scx_init_task(sch, p, false);
+
+		rq = task_rq_lock(p, &rf);
+
 		if (unlikely(ret)) {
+			if (scx_get_task_state(p) != SCX_TASK_DEAD)
+				scx_set_task_state(p, SCX_TASK_NONE);
+			task_rq_unlock(rq, p, &rf);
 			put_task_struct(p);
 			scx_task_iter_stop(&sti);
 			scx_error(sch, "ops.init_task() failed (%d) for %s[%d]",
@@ -6897,10 +6929,20 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
-		scx_set_task_state(p, SCX_TASK_INIT);
-		scx_set_task_sched(p, sch);
-		scx_set_task_state(p, SCX_TASK_READY);
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+			/*
+			 * sched_ext_dead() observed %INIT_BEGIN and set %DEAD.
+			 * ops.exit_task() is owed to the sched __scx_init_task()
+			 * ran against; call it now.
+			 */
+			scx_sub_init_cancel_task(sch, p);
+		} else {
+			scx_set_task_state(p, SCX_TASK_INIT);
+			scx_set_task_sched(p, sch);
+			scx_set_task_state(p, SCX_TASK_READY);
+		}
 
+		task_rq_unlock(rq, p, &rf);
 		put_task_struct(p);
 	}
 	scx_task_iter_stop(&sti);
-- 
2.54.0



* [PATCH 5/6] sched_ext: Close sub-sched init race with post-init DEAD recheck
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
                   ` (3 preceding siblings ...)
  2026-05-10  7:41 ` [PATCH 4/6] sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10  7:41 ` [PATCH 6/6] sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths Tejun Heo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both
drop the rq lock to call __scx_init_task() against the other sched. A
TASK_DEAD @p can fall through sched_ext_dead() in that window.
sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not
on the sched whose init just completed, so the new allocation leaks.

Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task()
returns, take task_rq_lock(p) and check for DEAD; on hit, call
scx_sub_init_cancel_task() against the sub sched the init ran for and drop
@p; on miss, proceed as before.
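
Schematically, the losing interleaving and its unwind on the sub-enable side:

  scx_sub_enable_workfn()            sched_ext_dead()

  __scx_init_task(sch, p)            task_rq_lock()
  (rq lock not held)                   exit_task() on parent, the sched
                                       @p is attached to
                                       state -> DEAD
                                     task_rq_unlock()
  task_rq_lock()
    sees DEAD
    scx_sub_init_cancel_task(sch, p)
  task_rq_unlock()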

Reported-by: zhidao su <suzhidao@xiaomi.com>
Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 29fa9ffe7c7b..6fbe3160eccd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5777,6 +5777,21 @@ static void scx_sub_disable(struct scx_sched *sch)
 		}
 
 		rq = task_rq_lock(p, &rf);
+
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+			/*
+			 * sched_ext_dead() raced us between __scx_init_task()
+			 * and this rq lock and ran exit_task() on @sch (the
+			 * sched @p was on at that point), not on $parent.
+			 * $parent's just-completed init is owed an exit_task()
+			 * and we issue it here.
+			 */
+			scx_sub_init_cancel_task(parent, p);
+			task_rq_unlock(rq, p, &rf);
+			put_task_struct(p);
+			continue;
+		}
+
 		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
 			/*
 			 * $p is initialized for $parent and still attached to
@@ -5791,8 +5806,8 @@ static void scx_sub_disable(struct scx_sched *sch)
 			scx_set_task_state(p, SCX_TASK_READY);
 			scx_enable_task(parent, p);
 		}
-		task_rq_unlock(rq, p, &rf);
 
+		task_rq_unlock(rq, p, &rf);
 		put_task_struct(p);
 	}
 	scx_task_iter_stop(&sti);
@@ -7212,6 +7227,21 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 			goto abort;
 
 		rq = task_rq_lock(p, &rf);
+
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+			/*
+			 * sched_ext_dead() raced us between __scx_init_task()
+			 * and this rq lock and ran exit_task() on $parent (the
+			 * sched @p was on at that point), not on @sch. @sch's
+			 * just-completed init is owed an exit_task() and we
+			 * issue it here.
+			 */
+			scx_sub_init_cancel_task(sch, p);
+			task_rq_unlock(rq, p, &rf);
+			put_task_struct(p);
+			continue;
+		}
+
 		p->scx.flags |= SCX_TASK_SUB_INIT;
 		task_rq_unlock(rq, p, &rf);
 
-- 
2.54.0



* [PATCH 6/6] sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
                   ` (4 preceding siblings ...)
  2026-05-10  7:41 ` [PATCH 5/6] sched_ext: Close sub-sched init race with post-init DEAD recheck Tejun Heo
@ 2026-05-10  7:41 ` Tejun Heo
  2026-05-10 17:47 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Andrea Righi
  2026-05-10 21:55 ` Tejun Heo
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10  7:41 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent,
sched_class=ext) until the parent itself is torn down by the scx_error() it
raised. When the later root_disable iterates them, two paths trip on NONE.

scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch
returns early but the trailing scx_set_task_sched(p, NULL) clobbers the
parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE)
wastes a write on an already-NONE task. switched_from_scx() then calls
scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY,
producing a NONE -> READY transition the validation matrix rejects.

Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the
top of scx_disable_and_exit_task() and a parallel NONE check in
switched_from_scx() next to task_dead_and_done().
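
Schematically, the residue and the two trips before this patch:

  scx_fail_parent(sch)
    scx_disable_and_exit_task(sch, p)    state -> NONE
    scx_set_task_sched(p, parent)        sched -> parent
    (@p stays on the ext sched_class until the raised scx_error()
     tears the parent down)

  later root disable:
    scx_disable_and_exit_task()          trailing scx_set_task_sched(p, NULL)
                                         clobbers the parent sched
    switched_from_scx()                  scx_disable_task() WARNs and attempts
                                         NONE -> READY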

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6fbe3160eccd..4efe0099f79a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3703,6 +3703,15 @@ static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *
 static void scx_disable_and_exit_task(struct scx_sched *sch,
 				      struct task_struct *p)
 {
+	/*
+	 * %NONE means @p is already detached at the SCX level (e.g. handed
+	 * back to the parent by scx_fail_parent() with no init to undo).
+	 * Skip to avoid clobbering scx_task_sched() and writing %NONE again
+	 * on a state that's already %NONE.
+	 */
+	if (scx_get_task_state(p) == SCX_TASK_NONE)
+		return;
+
 	__scx_disable_and_exit_task(sch, p);
 
 	/*
@@ -3921,6 +3930,16 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p)
 	if (task_dead_and_done(p))
 		return;
 
+	/*
+	 * %NONE means SCX is no longer tracking @p at the task level (e.g.
+	 * scx_fail_parent() handed @p back to the parent at NONE pending the
+	 * parent's own teardown). There is nothing to disable; calling
+	 * scx_disable_task() would WARN on the non-%ENABLED state and trigger a
+	 * NONE -> READY validation failure.
+	 */
+	if (scx_get_task_state(p) == SCX_TASK_NONE)
+		return;
+
 	scx_disable_task(scx_task_sched(p), p);
 }
 
-- 
2.54.0



* Re: [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  2026-05-10  7:41 ` [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() Tejun Heo
@ 2026-05-10 17:20   ` Andrea Righi
  2026-05-10 20:04   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-05-10 17:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, changwoo, emil, suzhidao, sched-ext, linux-kernel

Hi Tejun,

On Sat, May 09, 2026 at 09:41:09PM -1000, Tejun Heo wrote:
> Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
> scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
> scx_set_task_state() on the INIT transition (it was set unconditionally at
> every INIT site through the scx_init_task() helper), inline scx_init_task()
> into scx_fork() and scx_root_enable_workfn(), and drop the helper.
> 
> As a side effect, scx_sub_disable()'s migration sequence now also sets
> RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
> scx_init_task()). The flag triggers a runnable_at reset on the next
> set_task_runnable(), which is harmless on a task that has just been moved
> between scheds.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 31 +++++++++----------------------
>  1 file changed, 9 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 10c6e0261f11..81841277a54f 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -726,6 +726,7 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
>  		break;
>  	case SCX_TASK_INIT:
>  		warn = prev_state != SCX_TASK_NONE;
> +		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;

IIUC the enable path may touch p->scx.flags without the task's rq lock held,
while struct sched_ext_entity documents flags as rq-lock protected. But this
would be fixed later by the follow-up INIT_BEGIN work on the root-enable path,
right?

Maybe we can add a small comment in the patch description (something along these
lines)?

Note: root enable may touch p->scx.flags without the task's rq lock; this is
addressed by follow-up patches (SCX_TASK_INIT_BEGIN).

Thanks,
-Andrea

>  		break;
>  	case SCX_TASK_READY:
>  		warn = prev_state == SCX_TASK_NONE;
> @@ -3585,22 +3586,6 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
>  	return 0;
>  }
>  
> -static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
> -{
> -	int ret;
> -
> -	ret = __scx_init_task(sch, p, fork);
> -	if (!ret) {
> -		/*
> -		 * While @p's rq is not locked. @p is not visible to the rest of
> -		 * SCX yet and it's safe to update the flags and state.
> -		 */
> -		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
> -		scx_set_task_state(p, SCX_TASK_INIT);
> -	}
> -	return ret;
> -}
> -
>  static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
>  {
>  	struct rq *rq = task_rq(p);
> @@ -3763,10 +3748,11 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
>  #else
>  		struct scx_sched *sch = scx_root;
>  #endif
> -		ret = scx_init_task(sch, p, true);
> -		if (!ret)
> -			scx_set_task_sched(p, sch);
> -		return ret;
> +		ret = __scx_init_task(sch, p, true);
> +		if (unlikely(ret))
> +			return ret;
> +		scx_set_task_state(p, SCX_TASK_INIT);
> +		scx_set_task_sched(p, sch);
>  	}
>  
>  	return 0;
> @@ -6897,8 +6883,8 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  
>  		scx_task_iter_unlock(&sti);
>  
> -		ret = scx_init_task(sch, p, false);
> -		if (ret) {
> +		ret = __scx_init_task(sch, p, false);
> +		if (unlikely(ret)) {
>  			put_task_struct(p);
>  			scx_task_iter_stop(&sti);
>  			scx_error(sch, "ops.init_task() failed (%d) for %s[%d]",
> @@ -6906,6 +6892,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  			goto err_disable_unlock_all;
>  		}
>  
> +		scx_set_task_state(p, SCX_TASK_INIT);
>  		scx_set_task_sched(p, sch);
>  		scx_set_task_state(p, SCX_TASK_READY);
>  
> -- 
> 2.54.0
> 


* Re: [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
                   ` (5 preceding siblings ...)
  2026-05-10  7:41 ` [PATCH 6/6] sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths Tejun Heo
@ 2026-05-10 17:47 ` Andrea Righi
  2026-05-10 21:55 ` Tejun Heo
  7 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-05-10 17:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, changwoo, emil, suzhidao, sched-ext, linux-kernel

Hi Tejun,

On Sat, May 09, 2026 at 09:41:07PM -1000, Tejun Heo wrote:
> Hello,
> 
> zhidao su reported a NULL deref and an ops.init_task() leak when
> sched_ext_dead() races scx_root_enable_workfn() in CONFIG_EXT_SUB_SCHED
> kernels [1]. The same race window also affects the analogous sub-sched paths
> (scx_sub_enable_workfn()'s per-task init pass and scx_sub_disable()'s
> migration loop), and the wrapper-disable paths trip on the NONE state that
> scx_fail_parent() leaves behind. Closing all of these calls for a
> state-machine extension rather than a localized fix.
> 
> The series introduces SCX_TASK_INIT_BEGIN as an explicit intermediate state
> between NONE and INIT, and replaces the SCX_TASK_OFF_TASKS marker flag with
> a real SCX_TASK_DEAD terminal state. With the state machine in place, every
> init path uses the same handshake: write INIT_BEGIN under rq lock, init
> outside the lock, recheck DEAD under rq lock, unwind via
> scx_sub_init_cancel_task() on hit. The wrapper-disable and
> switched_from_scx() paths get NONE early-returns to handle the
> scx_fail_parent() residue.
> 
> This series is more invasive than zhidao's patches but covers the related races
> uniformly and avoids the implicit list_empty() check his approach relies
> on. Credit to him for finding and reporting the bug.
> 
>  0001 sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
>  0002 sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
>  0003 sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
>  0004 sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
>  0005 sched_ext: Close sub-sched init race with post-init DEAD recheck
>  0006 sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths

Apart from a small comment about PATCH 2/6, I haven't found any issues with this
series. Looks good to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
> Based on sched_ext/for-7.1-fixes (ab28a0673daa).
> 
> Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-7.1-fixes-dead-race
> 
> Verified with a debug patch that widens the unlocked init windows on the
> root and sub-sched paths and counts post-init DEAD-recheck hits.
> Reproducers exercise each of the original races plus the scx_fail_parent
> NONE-state regression, followed by a multi-iteration stress under fork
> churn. Counters show the windows are hit and no
> BUG/WARNING/Oops/Invalid-task-state appears.
> 
> [1] https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/
> 
>  include/linux/sched/ext.h |  17 ++--
>  kernel/sched/ext.c        | 221 +++++++++++++++++++++++++++++++---------------
>  2 files changed, 162 insertions(+), 76 deletions(-)
> 
> Thanks.
> 
> --
> tejun


* [PATCH v2 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
  2026-05-10  7:41 ` [PATCH 2/6] sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() Tejun Heo
  2026-05-10 17:20   ` Andrea Righi
@ 2026-05-10 20:04   ` Tejun Heo
  1 sibling, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10 20:04 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.

As a side effect, scx_sub_disable()'s migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.

On root-enable, p->scx.flags is written without the task's rq lock. The task
isn't visible to scx yet, and a follow-up patch restores the lock-held
write.

v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 10c6e0261f11..81841277a54f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -726,6 +726,7 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 		break;
 	case SCX_TASK_INIT:
 		warn = prev_state != SCX_TASK_NONE;
+		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
 		break;
 	case SCX_TASK_READY:
 		warn = prev_state == SCX_TASK_NONE;
@@ -3585,22 +3586,6 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
 	return 0;
 }
 
-static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
-{
-	int ret;
-
-	ret = __scx_init_task(sch, p, fork);
-	if (!ret) {
-		/*
-		 * While @p's rq is not locked. @p is not visible to the rest of
-		 * SCX yet and it's safe to update the flags and state.
-		 */
-		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
-		scx_set_task_state(p, SCX_TASK_INIT);
-	}
-	return ret;
-}
-
 static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
@@ -3763,10 +3748,11 @@ int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 #else
 		struct scx_sched *sch = scx_root;
 #endif
-		ret = scx_init_task(sch, p, true);
-		if (!ret)
-			scx_set_task_sched(p, sch);
-		return ret;
+		ret = __scx_init_task(sch, p, true);
+		if (unlikely(ret))
+			return ret;
+		scx_set_task_state(p, SCX_TASK_INIT);
+		scx_set_task_sched(p, sch);
 	}
 
 	return 0;
@@ -6897,8 +6883,8 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 		scx_task_iter_unlock(&sti);
 
-		ret = scx_init_task(sch, p, false);
-		if (ret) {
+		ret = __scx_init_task(sch, p, false);
+		if (unlikely(ret)) {
 			put_task_struct(p);
 			scx_task_iter_stop(&sti);
 			scx_error(sch, "ops.init_task() failed (%d) for %s[%d]",
@@ -6906,6 +6892,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			goto err_disable_unlock_all;
 		}
 
+		scx_set_task_state(p, SCX_TASK_INIT);
 		scx_set_task_sched(p, sch);
 		scx_set_task_state(p, SCX_TASK_READY);
 
-- 
2.54.0



* Re: [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization
  2026-05-10  7:41 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Tejun Heo
                   ` (6 preceding siblings ...)
  2026-05-10 17:47 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix sched_ext_dead() races with task initialization Andrea Righi
@ 2026-05-10 21:55 ` Tejun Heo
  7 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-05-10 21:55 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: emil, suzhidao, sched-ext, linux-kernel, Tejun Heo

Hello,

Applied 1-6 to sched_ext/for-7.1-fixes. The merge into for-7.2 has a
lock-ordering complication around the post-init block; I'll address it with
a prep patch on for-7.2 soon.

Thanks.

--
tejun

