* [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks
@ 2026-04-28  0:16 Tejun Heo
  2026-04-28  0:16 ` [PATCH 1/2] sched_ext: Include exiting tasks in cgroup iter Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Tejun Heo @ 2026-04-28  0:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, sched-ext, linux-kernel

Hello,

a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks. That broke scx_task_iter's
cgroup-scoped mode: it now silently skips tasks that are still on
scx_tasks but past exit_signals(), so the abort path in
scx_sub_enable_workfn() can miss SCX_TASK_SUB_INIT-marked exiting tasks
and leak __scx_init_task() state.

Restoring iter coverage exposes a separate latent issue: cgroup
iteration can return tasks whose sched_ext_dead() has already torn down
their per-task SCX state (cgroup_task_dead() runs after sched_ext_dead()
in finish_task_switch() and is irq-work deferred on PREEMPT_RT). Callers
trip WARN_ON_ONCE() / fail assertions when they see such a task.

This pair fixes both:

 0001 sched_ext: Include exiting tasks in cgroup iter
      Adds CSS_TASK_ITER_WITH_DEAD; scx_task_iter opts in.

 0002 sched_ext: Skip past-sched_ext_dead() tasks in
      scx_task_iter_next_locked()
      Adds SCX_TASK_OFF_TASKS, set in sched_ext_dead() under the rq
      lock; scx_task_iter_next_locked() skips flagged tasks under the
      same lock.
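
As a rough userspace illustration of the opt-out in 0001 (not the
kernel code), the skip test in css_task_iter_advance() can be modeled
as below. All model_* names and MODEL_ITER_WITH_DEAD are hypothetical
stand-ins introduced here; only the shape of the condition is taken
from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CSS_TASK_ITER_WITH_DEAD; not the kernel enum. */
enum model_iter_flags {
	MODEL_ITER_WITH_DEAD = 1U << 2,
};

/* Hypothetical, heavily reduced model of a task. */
struct model_task {
	const char *name;
	bool exiting;	/* models task->flags & PF_EXITING */
	int live;	/* models atomic_read(&task->signal->live) */
};

/* True if the iterator would skip @t for the given iterator @flags. */
bool model_should_skip(const struct model_task *t, unsigned int flags)
{
	return !(flags & MODEL_ITER_WITH_DEAD) && t->exiting && !t->live;
}

/* Count how many of the @n tasks an iteration with @flags would visit. */
int model_count_visited(const struct model_task *tasks, size_t n,
			unsigned int flags)
{
	int visited = 0;

	for (size_t i = 0; i < n; i++)
		if (!model_should_skip(&tasks[i], flags))
			visited++;
	return visited;
}

/* Self-check: default iteration hides the fully exiting task only. */
void model_self_check(void)
{
	const struct model_task tasks[] = {
		{ "running", false, 1 },
		{ "zombie-leader-with-live-threads", true, 1 },
		{ "exiting-last-thread", true, 0 },
	};

	assert(model_count_visited(tasks, 3, 0) == 2);
	assert(model_count_visited(tasks, 3, MODEL_ITER_WITH_DEAD) == 3);
}
```

In this model the default walk visits 2 of the 3 tasks and the
opted-in walk visits all 3, which is the coverage difference between
the cgroup-scoped and global modes that 0001 closes.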

Verified with a stress harness that runs a 4-deep nested sub-sched
hierarchy with continuous fork/switch workers and random sub-sched
restarts at 5s intervals. Baseline (without the patches) wedged a
192-CPU bare-metal box in 66s and oopsed a 24-thread bare-metal box at
227s. The patched kernel ran clean for 30min on both, plus an 8-vCPU
vng, with 0 WARN/BUG/lockdep reports across ~1000 sub-restarts.

Based on sched_ext/for-7.1-fixes (deb7b2f93d01).

 include/linux/cgroup.h    |  1 +
 include/linux/sched/ext.h |  1 +
 kernel/cgroup/cgroup.c    |  8 +++++---
 kernel/sched/ext.c        | 39 +++++++++++++++++++++++++++++----------
 4 files changed, 36 insertions(+), 13 deletions(-)

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git iter-include-dead-v1

Thanks.

--
tejun


* [PATCH 1/2] sched_ext: Include exiting tasks in cgroup iter
  2026-04-28  0:16 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks Tejun Heo
@ 2026-04-28  0:16 ` Tejun Heo
  2026-04-28  0:16 ` [PATCH 2/2] sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() Tejun Heo
  2026-05-04 19:10 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks Tejun Heo
  2 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2026-04-28  0:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, sched-ext, linux-kernel

a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.

scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.

scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter, leaking
__scx_init_task() state. Other iterations share the same gap.

Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter_start() and scx_task_iter_next().

Fixes: b0e4c2f8a0f0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup/cgroup.c | 8 +++++---
 kernel/sched/ext.c     | 6 ++++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e52160e85af4..f6d037a30fd8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -53,6 +53,7 @@ struct kernel_clone_args;
 enum css_task_iter_flags {
 	CSS_TASK_ITER_PROCS    = (1U << 0),  /* walk only threadgroup leaders */
 	CSS_TASK_ITER_THREADED = (1U << 1),  /* walk all threaded css_sets in the domain */
+	CSS_TASK_ITER_WITH_DEAD = (1U << 2),  /* include exiting tasks */
 	CSS_TASK_ITER_SKIPPED  = (1U << 16), /* internal flags */
 };
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1f084ee71443..e51ce4cd3739 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5059,10 +5059,12 @@ static void css_task_iter_advance(struct css_task_iter *it)
 
 	task = list_entry(it->task_pos, struct task_struct, cg_list);
 	/*
-	 * Hide tasks that are exiting but not yet removed. Keep zombie
-	 * leaders with live threads visible.
+	 * Hide tasks that are exiting but not yet removed by default. Keep
+	 * zombie leaders with live threads visible. Usages that need to walk
+	 * every existing task can opt out via CSS_TASK_ITER_WITH_DEAD.
 	 */
-	if ((task->flags & PF_EXITING) && !atomic_read(&task->signal->live))
+	if (!(it->flags & CSS_TASK_ITER_WITH_DEAD) &&
+	    (task->flags & PF_EXITING) && !atomic_read(&task->signal->live))
 		goto repeat;
 
 	if (it->flags & CSS_TASK_ITER_PROCS) {
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9eda20e5fdb8..cf43be8ac1aa 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -766,7 +766,8 @@ static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
 		lockdep_assert_held(&cgroup_mutex);
 		iter->cgrp = cgrp;
 		iter->css_pos = css_next_descendant_pre(NULL, &iter->cgrp->self);
-		css_task_iter_start(iter->css_pos, 0, &iter->css_iter);
+		css_task_iter_start(iter->css_pos, CSS_TASK_ITER_WITH_DEAD,
+				    &iter->css_iter);
 		return;
 	}
 #endif
@@ -866,7 +867,8 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
 			iter->css_pos = css_next_descendant_pre(iter->css_pos,
 								&iter->cgrp->self);
 			if (iter->css_pos)
-				css_task_iter_start(iter->css_pos, 0, &iter->css_iter);
+				css_task_iter_start(iter->css_pos, CSS_TASK_ITER_WITH_DEAD,
+						    &iter->css_iter);
 		}
 		return NULL;
 	}
-- 
2.53.0



* [PATCH 2/2] sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
  2026-04-28  0:16 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks Tejun Heo
  2026-04-28  0:16 ` [PATCH 1/2] sched_ext: Include exiting tasks in cgroup iter Tejun Heo
@ 2026-04-28  0:16 ` Tejun Heo
  2026-05-04 19:10 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks Tejun Heo
  2 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2026-04-28  0:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, sched-ext, linux-kernel

scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes the
task from cset->tasks after sched_ext_dead() in finish_task_switch()
and is irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes the task from scx_tasks via list_del_init()
first.

Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.

Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.
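
For illustration only, the setter/reader serialization can be modeled
in userspace with a per-task mutex standing in for the rq lock. This
is a sketch under that assumption, not the kernel implementation:
model_scx_task, the model_* functions, MODEL_TASK_OFF_TASKS and the
state_valid field are all hypothetical names introduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_TASK_OFF_TASKS	(1U << 6)	/* stand-in for SCX_TASK_OFF_TASKS */

/* Hypothetical, reduced model of a task with per-task scheduler state. */
struct model_scx_task {
	pthread_mutex_t lock;	/* stands in for the task's rq lock */
	unsigned int flags;
	bool state_valid;	/* false once per-task state is torn down */
};

void model_scx_task_init(struct model_scx_task *t)
{
	pthread_mutex_init(&t->lock, NULL);
	t->flags = 0;
	t->state_valid = true;
}

/* Models sched_ext_dead(): tear down and set the flag under the lock. */
void model_scx_task_dead(struct model_scx_task *t)
{
	pthread_mutex_lock(&t->lock);
	t->state_valid = false;
	t->flags |= MODEL_TASK_OFF_TASKS;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Models scx_task_iter_next_locked(): return the next unflagged task with
 * its lock held, or NULL at the end. The caller unlocks the returned task.
 */
struct model_scx_task *model_iter_next_locked(struct model_scx_task *tasks,
					      size_t n, size_t *pos)
{
	while (*pos < n) {
		struct model_scx_task *t = &tasks[(*pos)++];

		pthread_mutex_lock(&t->lock);
		if (t->flags & MODEL_TASK_OFF_TASKS) {
			/* Already dead from the scheduler's POV; skip it. */
			pthread_mutex_unlock(&t->lock);
			continue;
		}
		/* Flag and state change together under the lock. */
		assert(t->state_valid);
		return t;
	}
	return NULL;
}
```

Because the flag is both written and tested with the same lock held,
the reader in this model either sees a task before teardown starts or
sees the flag and skips it; it never returns a half-torn-down task.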

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h |  1 +
 kernel/sched/ext.c        | 33 +++++++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 1a3af2ea2a79..adb9a4de068a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -101,6 +101,7 @@ enum scx_ent_flags {
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 	SCX_TASK_SUB_INIT	= 1 << 4, /* task being initialized for a sub sched */
 	SCX_TASK_IMMED		= 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */
+	SCX_TASK_OFF_TASKS	= 1 << 6, /* removed from scx_tasks by sched_ext_dead() */
 
 	/*
 	 * Bits 8 and 9 are used to carry task state:
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf43be8ac1aa..6c3c40499404 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -928,16 +928,27 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 		 *
 		 * Test for idle_sched_class as only init_tasks are on it.
 		 */
-		if (p->sched_class != &idle_sched_class)
-			break;
-	}
-	if (!p)
-		return NULL;
+		if (p->sched_class == &idle_sched_class)
+			continue;
 
-	iter->rq = task_rq_lock(p, &iter->rf);
-	iter->locked_task = p;
+		iter->rq = task_rq_lock(p, &iter->rf);
+		iter->locked_task = p;
 
-	return p;
+		/*
+		 * cgroup_task_dead() removes the dead tasks from cset->tasks
+		 * after sched_ext_dead() and cgroup iteration may see tasks
+		 * which already finished sched_ext_dead(). %SCX_TASK_OFF_TASKS
+		 * is set by sched_ext_dead() under @p's rq lock. Test it to
+		 * avoid visiting tasks which are already dead from SCX POV.
+		 */
+		if (p->scx.flags & SCX_TASK_OFF_TASKS) {
+			__scx_task_iter_rq_unlock(iter);
+			continue;
+		}
+
+		return p;
+	}
+	return NULL;
 }
 
 /**
@@ -3816,6 +3827,11 @@ void sched_ext_dead(struct task_struct *p)
 	/*
 	 * @p is off scx_tasks and wholly ours. scx_root_enable()'s READY ->
 	 * ENABLED transitions can't race us. Disable ops for @p.
+	 *
+	 * %SCX_TASK_OFF_TASKS synchronizes against cgroup task iteration - see
+	 * scx_task_iter_next_locked(). NONE tasks need no marking: cgroup
+	 * iteration is only used from sub-sched paths, which require root
+	 * enabled. Root enable transitions every live task to at least READY.
 	 */
 	if (scx_get_task_state(p) != SCX_TASK_NONE) {
 		struct rq_flags rf;
@@ -3823,6 +3839,7 @@ void sched_ext_dead(struct task_struct *p)
 
 		rq = task_rq_lock(p, &rf);
 		scx_disable_and_exit_task(scx_task_sched(p), p);
+		p->scx.flags |= SCX_TASK_OFF_TASKS;
 		task_rq_unlock(rq, p, &rf);
 	}
 }
-- 
2.53.0



* Re: [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks
  2026-04-28  0:16 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Fix cgroup iter coverage of in-do_exit tasks Tejun Heo
  2026-04-28  0:16 ` [PATCH 1/2] sched_ext: Include exiting tasks in cgroup iter Tejun Heo
  2026-04-28  0:16 ` [PATCH 2/2] sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() Tejun Heo
@ 2026-05-04 19:10 ` Tejun Heo
  2 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2026-05-04 19:10 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, sched-ext, linux-kernel

Hello,

Applied 1-2 to sched_ext/for-7.1-fixes with the first patch's prefix
changed to "cgroup, sched_ext: " to reflect that it touches cgroup
core.

Thanks.

--
tejun
