cgroups.vger.kernel.org archive mirror
* [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
@ 2025-10-29  6:19 Tejun Heo
  2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Tejun Heo @ 2025-10-29  6:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
	Tejun Heo

Hello,

This series fixes a cgroup task exit ordering issue that is generally
suboptimal for all cgroup controllers and has caused real breakage for
sched_ext schedulers.

Currently, when a task exits, cgroup_task_exit() in do_exit() immediately
unlinks the task from its cgroup via css_set_move_task(). From the cgroup's
perspective, the task is now gone. If this makes the cgroup empty, it can be
destroyed, triggering ->css_offline() callbacks that notify controllers the
cgroup is going offline resource-wise.

However, the exiting task continues to run, perform memory operations, and
schedule until the final context switch in finish_task_switch(). This creates
a problematic window where controllers are told a cgroup is offline while
resource activities are still occurring in it. While this hasn't broken
existing controllers, it's clearly suboptimal and has caused real breakage
for sched_ext schedulers.

The sched_ext breakage manifests in two ways:

1. When a sched_ext scheduler is loaded, it walks all tasks and calls
   ops.init_task() on each. For tasks in a cgroup, it first ensures the
   cgroup has been initialized via ops.cgroup_init(). However, if the task
   is in the dying state (still running but already unlinked from its
   cgroup), the cgroup may already be offline. This results in
   ops.init_task() being called on a cgroup that never received
   ops.cgroup_init(), breaking the initialization invariant.

   This broke the scx_mitosis scheduler with errors like "cgrp_ctx lookup
   failed for cgid 3869" where the BPF program couldn't find cgroup context
   that was never created. See: https://github.com/sched-ext/scx/issues/2846

2. Because sched_ext_free() was called from __put_task_struct() (which can
   happen long after the task stops running), ops.cgroup_exit() could be
   called before ops.exit_task() was called on all member tasks, violating
   the expected ordering where all tasks exit before their cgroup does.

The fix defers the cgroup unlinking from do_exit() to finish_task_switch(),
ensuring the task remains linked to its cgroup until it's truly done running.
For sched_ext specifically, we also move the cleanup earlier to
finish_task_switch() to ensure proper ordering with cgroup operations.

This adds two new calls to finish_task_switch() that operate on the dead task
after the final switch: cgroup_task_dead() and sched_ext_dead(). It may make
sense to factor these into a helper function located in kernel/exit.c if this
pattern continues to grow.

Patch 0004 changes sched_ext and can be applied to sched_ext/for-6.19 after
pulling cgroup/for-6.19. Alternatively, would it be easier for some or all of
this series to go through the tip tree?

Based on cgroup/for-6.19 (d5cf4d34a333).

 0001 cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
 0002 cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
 0003 cgroup: Defer task cgroup unlink until after the task is done switching out
 0004 sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-fix-exit-ordering

 include/linux/cgroup.h    | 14 ++++++++------
 include/linux/sched/ext.h |  4 ++--
 kernel/cgroup/cgroup.c    | 39 +++++++++++++++++++++++----------------
 kernel/exit.c             |  4 ++--
 kernel/fork.c             |  3 +--
 kernel/sched/autogroup.c  |  4 ++--
 kernel/sched/core.c       |  8 ++++++++
 kernel/sched/ext.c        |  2 +-
 8 files changed, 47 insertions(+), 31 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
@ 2025-10-29  6:19 ` Tejun Heo
  2025-10-31  2:25   ` Chen Ridong
  2025-10-29  6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29  6:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
	Tejun Heo

The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h   | 12 ++++++------
 kernel/cgroup/cgroup.c   | 11 ++++++-----
 kernel/exit.c            |  4 ++--
 kernel/fork.c            |  2 +-
 kernel/sched/autogroup.c |  4 ++--
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6ed477338b16..4068035176c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -137,9 +137,9 @@ extern void cgroup_cancel_fork(struct task_struct *p,
 			       struct kernel_clone_args *kargs);
 extern void cgroup_post_fork(struct task_struct *p,
 			     struct kernel_clone_args *kargs);
-void cgroup_exit(struct task_struct *p);
-void cgroup_release(struct task_struct *p);
-void cgroup_free(struct task_struct *p);
+void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_release(struct task_struct *p);
+void cgroup_task_free(struct task_struct *p);
 
 int cgroup_init_early(void);
 int cgroup_init(void);
@@ -680,9 +680,9 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
 				      struct kernel_clone_args *kargs) {}
 static inline void cgroup_post_fork(struct task_struct *p,
 				    struct kernel_clone_args *kargs) {}
-static inline void cgroup_exit(struct task_struct *p) {}
-static inline void cgroup_release(struct task_struct *p) {}
-static inline void cgroup_free(struct task_struct *p) {}
+static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_release(struct task_struct *p) {}
+static inline void cgroup_task_free(struct task_struct *p) {}
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6ae5f48cf64e..826b7fd2f85d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -944,7 +944,8 @@ static void css_set_move_task(struct task_struct *task,
 		/*
 		 * We are synchronized through cgroup_threadgroup_rwsem
 		 * against PF_EXITING setting such that we can't race
-		 * against cgroup_exit()/cgroup_free() dropping the css_set.
+		 * against cgroup_task_exit()/cgroup_task_free() dropping
+		 * the css_set.
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
@@ -6972,13 +6973,13 @@ void cgroup_post_fork(struct task_struct *child,
 }
 
 /**
- * cgroup_exit - detach cgroup from exiting task
+ * cgroup_task_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
  * Description: Detach cgroup from @tsk.
  *
  */
-void cgroup_exit(struct task_struct *tsk)
+void cgroup_task_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
@@ -7010,7 +7011,7 @@ void cgroup_exit(struct task_struct *tsk)
 	} while_each_subsys_mask();
 }
 
-void cgroup_release(struct task_struct *task)
+void cgroup_task_release(struct task_struct *task)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
@@ -7027,7 +7028,7 @@ void cgroup_release(struct task_struct *task)
 	}
 }
 
-void cgroup_free(struct task_struct *task)
+void cgroup_task_free(struct task_struct *task)
 {
 	struct css_set *cset = task_css_set(task);
 	put_css_set(cset);
diff --git a/kernel/exit.c b/kernel/exit.c
index 9f74e8f1c431..46173461e8de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -257,7 +257,7 @@ void release_task(struct task_struct *p)
 	rcu_read_unlock();
 
 	pidfs_exit(p);
-	cgroup_release(p);
+	cgroup_task_release(p);
 
 	/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
 	thread_pid = task_pid(p);
@@ -967,7 +967,7 @@ void __noreturn do_exit(long code)
 	exit_thread(tsk);
 
 	sched_autogroup_exit_task(tsk);
-	cgroup_exit(tsk);
+	cgroup_task_exit(tsk);
 
 	/*
 	 * FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..960c39c9c264 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -738,7 +738,7 @@ void __put_task_struct(struct task_struct *tsk)
 	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
-	cgroup_free(tsk);
+	cgroup_task_free(tsk);
 	task_numa_free(tsk, true);
 	security_task_free(tsk);
 	exit_creds(tsk);
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index cdea931aae30..954137775f38 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -178,8 +178,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 	 * this process can already run with task_group() == prev->tg or we can
 	 * race with cgroup code which can read autogroup = prev under rq->lock.
 	 * In the latter case for_each_thread() can not miss a migrating thread,
-	 * cpu_cgroup_attach() must not be possible after cgroup_exit() and it
-	 * can't be removed from thread list, we hold ->siglock.
+	 * cpu_cgroup_attach() must not be possible after cgroup_task_exit()
+	 * and it can't be removed from thread list, we hold ->siglock.
 	 *
 	 * If an exiting thread was already removed from thread list we rely on
 	 * sched_autogroup_exit_task().
-- 
2.51.1



* [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
  2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
@ 2025-10-29  6:19 ` Tejun Heo
  2025-11-14 17:48   ` Michal Koutný
  2025-10-29  6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29  6:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
	Tejun Heo

Currently, cgroup_task_exit() adds thread group leaders with live member
threads to their css_set's dying_tasks list (so cgroup.procs iteration can
still see the leader), and cgroup_task_release() later removes them with
list_del_init(&task->cg_list).

An upcoming patch will defer the dying_tasks list addition, moving it from
cgroup_task_exit() (called from do_exit()) to a new function called from
finish_task_switch(). However, release_task() (which calls
cgroup_task_release()) can run either before or after finish_task_switch(),
creating a race where cgroup_task_release() might try to remove the task from
dying_tasks before or while it's being added.

Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
fix this race. cgroup_task_free() runs from __put_task_struct(), which is
always after both paths, making the cleanup safe.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 826b7fd2f85d..b3c27900c5d2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7019,6 +7019,11 @@ void cgroup_task_release(struct task_struct *task)
 	do_each_subsys_mask(ss, ssid, have_release_callback) {
 		ss->release(task);
 	} while_each_subsys_mask();
+}
+
+void cgroup_task_free(struct task_struct *task)
+{
+	struct css_set *cset = task_css_set(task);
 
 	if (!list_empty(&task->cg_list)) {
 		spin_lock_irq(&css_set_lock);
@@ -7026,11 +7031,7 @@ void cgroup_task_release(struct task_struct *task)
 		list_del_init(&task->cg_list);
 		spin_unlock_irq(&css_set_lock);
 	}
-}
 
-void cgroup_task_free(struct task_struct *task)
-{
-	struct css_set *cset = task_css_set(task);
 	put_css_set(cset);
 }
 
-- 
2.51.1



* [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
  2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
  2025-10-29  6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
@ 2025-10-29  6:19 ` Tejun Heo
  2025-11-14 17:48   ` Michal Koutný
  2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29  6:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
	Tejun Heo

When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.

However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused real breakage for sched_ext schedulers.

Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.

This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup/cgroup.c | 23 ++++++++++++++---------
 kernel/sched/core.c    |  2 ++
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4068035176c4..bc892e3b37ee 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -138,6 +138,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
 extern void cgroup_post_fork(struct task_struct *p,
 			     struct kernel_clone_args *kargs);
 void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_dead(struct task_struct *p);
 void cgroup_task_release(struct task_struct *p);
 void cgroup_task_free(struct task_struct *p);
 
@@ -681,6 +682,7 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
 static inline void cgroup_post_fork(struct task_struct *p,
 				    struct kernel_clone_args *kargs) {}
 static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_dead(struct task_struct *p) {}
 static inline void cgroup_task_release(struct task_struct *p) {}
 static inline void cgroup_task_free(struct task_struct *p) {}
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b3c27900c5d2..aae180d56c8c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -944,7 +944,7 @@ static void css_set_move_task(struct task_struct *task,
 		/*
 		 * We are synchronized through cgroup_threadgroup_rwsem
 		 * against PF_EXITING setting such that we can't race
-		 * against cgroup_task_exit()/cgroup_task_free() dropping
+		 * against cgroup_task_dead()/cgroup_task_free() dropping
 		 * the css_set.
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
@@ -6982,10 +6982,20 @@ void cgroup_post_fork(struct task_struct *child,
 void cgroup_task_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
-	struct css_set *cset;
 	int i;
 
-	spin_lock_irq(&css_set_lock);
+	/* see cgroup_post_fork() for details */
+	do_each_subsys_mask(ss, i, have_exit_callback) {
+		ss->exit(tsk);
+	} while_each_subsys_mask();
+}
+
+void cgroup_task_dead(struct task_struct *tsk)
+{
+	struct css_set *cset;
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);
 
 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
@@ -7003,12 +7013,7 @@ void cgroup_task_exit(struct task_struct *tsk)
 		     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
 		cgroup_update_frozen(task_dfl_cgroup(tsk));
 
-	spin_unlock_irq(&css_set_lock);
-
-	/* see cgroup_post_fork() for details */
-	do_each_subsys_mask(ss, i, have_exit_callback) {
-		ss->exit(tsk);
-	} while_each_subsys_mask();
+	spin_unlock_irqrestore(&css_set_lock, flags);
 }
 
 void cgroup_task_release(struct task_struct *task)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67b48e2..40f12e37f60f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		cgroup_task_dead(prev);
+
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-- 
2.51.1



* [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
                   ` (2 preceding siblings ...)
  2025-10-29  6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
@ 2025-10-29  6:19 ` Tejun Heo
  2025-10-31 19:32   ` Andrea Righi
  2025-11-03 20:25   ` [PATCH v2 " Tejun Heo
  2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
  2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
  5 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2025-10-29  6:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
	Tejun Heo

sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:

- ops.task_init() can be called on a cgroup which didn't get ops.cgroup_init()'d
  during scheduler load.

- ops.cgroup_exit() could be called before ops.exit_task() is called on all
  member tasks, leading to incorrect exit ordering.

Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.

By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:

- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
  as cgroup_mutex prevents cgroup destruction while the task is still linked.

- All member tasks have ops.exit_task() called and are removed from scx_tasks
  before the cgroup can be destroyed and trigger ops.cgroup_exit().

This fix is made possible by the cgroup_task_dead() split in the previous patch.

This also makes more sense resource-wise as there's no point in keeping
scheduler-side resources around for dead tasks.

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h | 4 ++--
 kernel/fork.c             | 1 -
 kernel/sched/core.c       | 6 ++++++
 kernel/sched/ext.c        | 2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d82b7a9b0658..d7dd77be571f 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -207,14 +207,14 @@ struct sched_ext_entity {
 	struct list_head	tasks_node;
 };
 
-void sched_ext_free(struct task_struct *p);
+void sched_ext_dead(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
 void scx_softlockup(u32 dur_s);
 bool scx_rcu_cpu_stall(void);
 
 #else	/* !CONFIG_SCHED_CLASS_EXT */
 
-static inline void sched_ext_free(struct task_struct *p) {}
+static inline void sched_ext_dead(struct task_struct *p) {}
 static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
 static inline void scx_softlockup(u32 dur_s) {}
 static inline bool scx_rcu_cpu_stall(void) { return false; }
diff --git a/kernel/fork.c b/kernel/fork.c
index 960c39c9c264..5ae37909a813 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,7 +736,6 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(tsk == current);
 
 	unwind_task_free(tsk);
-	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_task_free(tsk);
 	task_numa_free(tsk, true);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 40f12e37f60f..d4dbffb27a66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		/*
+		 * sched_ext_dead() must come before cgroup_task_dead() to
+		 * prevent a cgroup from being removed while its member tasks are
+		 * visible to SCX schedulers.
+		 */
+		sched_ext_dead(prev);
 		cgroup_task_dead(prev);
 
 		/* Task is done with its stack. */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2b0e88206d07..840bc76210c8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct *p)
 	percpu_up_read(&scx_fork_rwsem);
 }
 
-void sched_ext_free(struct task_struct *p)
+void sched_ext_dead(struct task_struct *p)
 {
 	unsigned long flags;
 
-- 
2.51.1



* Re: [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
  2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
@ 2025-10-31  2:25   ` Chen Ridong
  0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2025-10-31  2:25 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext



On 2025/10/29 14:19, Tejun Heo wrote:
> The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
> confusing because they look like they're operating on cgroups themselves when
> they're actually task lifecycle hooks. For example, cgroup_init() initializes
> the cgroup subsystem while cgroup_exit() is a task exit notification to
> cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
> cgroup_task_free() to make it clear that these operate on tasks.
> 

This makes sense.

> Cc: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/cgroup.h   | 12 ++++++------
>  kernel/cgroup/cgroup.c   | 11 ++++++-----
>  kernel/exit.c            |  4 ++--
>  kernel/fork.c            |  2 +-
>  kernel/sched/autogroup.c |  4 ++--
>  5 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 6ed477338b16..4068035176c4 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -137,9 +137,9 @@ extern void cgroup_cancel_fork(struct task_struct *p,
>  			       struct kernel_clone_args *kargs);
>  extern void cgroup_post_fork(struct task_struct *p,
>  			     struct kernel_clone_args *kargs);
> -void cgroup_exit(struct task_struct *p);
> -void cgroup_release(struct task_struct *p);
> -void cgroup_free(struct task_struct *p);
> +void cgroup_task_exit(struct task_struct *p);
> +void cgroup_task_release(struct task_struct *p);
> +void cgroup_task_free(struct task_struct *p);
>  
>  int cgroup_init_early(void);
>  int cgroup_init(void);
> @@ -680,9 +680,9 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
>  				      struct kernel_clone_args *kargs) {}
>  static inline void cgroup_post_fork(struct task_struct *p,
>  				    struct kernel_clone_args *kargs) {}
> -static inline void cgroup_exit(struct task_struct *p) {}
> -static inline void cgroup_release(struct task_struct *p) {}
> -static inline void cgroup_free(struct task_struct *p) {}
> +static inline void cgroup_task_exit(struct task_struct *p) {}
> +static inline void cgroup_task_release(struct task_struct *p) {}
> +static inline void cgroup_task_free(struct task_struct *p) {}
>  
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 6ae5f48cf64e..826b7fd2f85d 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -944,7 +944,8 @@ static void css_set_move_task(struct task_struct *task,
>  		/*
>  		 * We are synchronized through cgroup_threadgroup_rwsem
>  		 * against PF_EXITING setting such that we can't race
> -		 * against cgroup_exit()/cgroup_free() dropping the css_set.
> +		 * against cgroup_task_exit()/cgroup_task_free() dropping
> +		 * the css_set.
>  		 */
>  		WARN_ON_ONCE(task->flags & PF_EXITING);
>  
> @@ -6972,13 +6973,13 @@ void cgroup_post_fork(struct task_struct *child,
>  }
>  
>  /**
> - * cgroup_exit - detach cgroup from exiting task
> + * cgroup_task_exit - detach cgroup from exiting task
>   * @tsk: pointer to task_struct of exiting process
>   *
>   * Description: Detach cgroup from @tsk.
>   *
>   */
> -void cgroup_exit(struct task_struct *tsk)
> +void cgroup_task_exit(struct task_struct *tsk)
>  {
>  	struct cgroup_subsys *ss;
>  	struct css_set *cset;
> @@ -7010,7 +7011,7 @@ void cgroup_exit(struct task_struct *tsk)
>  	} while_each_subsys_mask();
>  }
>  
> -void cgroup_release(struct task_struct *task)
> +void cgroup_task_release(struct task_struct *task)
>  {
>  	struct cgroup_subsys *ss;
>  	int ssid;
> @@ -7027,7 +7028,7 @@ void cgroup_release(struct task_struct *task)
>  	}
>  }
>  
> -void cgroup_free(struct task_struct *task)
> +void cgroup_task_free(struct task_struct *task)
>  {
>  	struct css_set *cset = task_css_set(task);
>  	put_css_set(cset);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9f74e8f1c431..46173461e8de 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -257,7 +257,7 @@ void release_task(struct task_struct *p)
>  	rcu_read_unlock();
>  
>  	pidfs_exit(p);
> -	cgroup_release(p);
> +	cgroup_task_release(p);
>  
>  	/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
>  	thread_pid = task_pid(p);
> @@ -967,7 +967,7 @@ void __noreturn do_exit(long code)
>  	exit_thread(tsk);
>  
>  	sched_autogroup_exit_task(tsk);
> -	cgroup_exit(tsk);
> +	cgroup_task_exit(tsk);
>  
>  	/*
>  	 * FIXME: do that only when needed, using sched_exit tracepoint
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3da0f08615a9..960c39c9c264 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -738,7 +738,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	unwind_task_free(tsk);
>  	sched_ext_free(tsk);
>  	io_uring_free(tsk);
> -	cgroup_free(tsk);
> +	cgroup_task_free(tsk);
>  	task_numa_free(tsk, true);
>  	security_task_free(tsk);
>  	exit_creds(tsk);
> diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
> index cdea931aae30..954137775f38 100644
> --- a/kernel/sched/autogroup.c
> +++ b/kernel/sched/autogroup.c
> @@ -178,8 +178,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
>  	 * this process can already run with task_group() == prev->tg or we can
>  	 * race with cgroup code which can read autogroup = prev under rq->lock.
>  	 * In the latter case for_each_thread() can not miss a migrating thread,
> -	 * cpu_cgroup_attach() must not be possible after cgroup_exit() and it
> -	 * can't be removed from thread list, we hold ->siglock.
> +	 * cpu_cgroup_attach() must not be possible after cgroup_task_exit()
> +	 * and it can't be removed from thread list, we hold ->siglock.
>  	 *
>  	 * If an exiting thread was already removed from thread list we rely on
>  	 * sched_autogroup_exit_task().

Reviewed-by: Chen Ridong <chenridong@huawei.com>

-- 
Best regards,
Ridong



* Re: [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
@ 2025-10-31 19:32   ` Andrea Righi
  2025-11-03 20:25   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 16+ messages in thread
From: Andrea Righi @ 2025-10-31 19:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Peter Zijlstra,
	linux-kernel, cgroups, sched-ext

Hi Tejun,

On Tue, Oct 28, 2025 at 08:19:18PM -1000, Tejun Heo wrote:
> sched_ext_free() was called from __put_task_struct() when the last reference
> to the task is dropped, which could be long after the task has finished
> running. This causes cgroup-related problems:
> 
> - ops.task_init() can be called on a cgroup which didn't get ops.cgroup_init()'d
>   during scheduler load.

s/task_init/init_task/

Also, it took me a bit to understand this point, maybe we could add
something like this to make it more clear:

 - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
   during scheduler load, because the cgroup might be destroyed/unlinked
   while the zombie task is still lingering on the scx_tasks list.

> 
> - ops.cgroup_exit() could be called before ops.exit_task() is called on all
>   member tasks, leading to incorrect exit ordering.
> 
> Fix by moving it to finish_task_switch() to be called right after the final
> context switch away from the dying task, matching when sched_class->task_dead()
> is called. Rename it to sched_ext_dead() to match the new calling context.
> 
> By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
> 
> - Tasks visible on scx_tasks list have valid cgroups during scheduler load,
>   as cgroup_mutex prevents cgroup destruction while the task is still linked.
> 
> - All member tasks have ops.exit_task() called and are removed from scx_tasks
>   before the cgroup can be destroyed and trigger ops.cgroup_exit().
> 
> This fix is made possible by the cgroup_task_dead() split in the previous patch.
> 
> This also makes more sense resource-wise as there's no point in keeping
> scheduler side resources around for dead tasks.
> 
> Reported-by: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Apart from the minor comment in the description above, everything else
looks good to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  include/linux/sched/ext.h | 4 ++--
>  kernel/fork.c             | 1 -
>  kernel/sched/core.c       | 6 ++++++
>  kernel/sched/ext.c        | 2 +-
>  4 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index d82b7a9b0658..d7dd77be571f 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -207,14 +207,14 @@ struct sched_ext_entity {
>  	struct list_head	tasks_node;
>  };
>  
> -void sched_ext_free(struct task_struct *p);
> +void sched_ext_dead(struct task_struct *p);
>  void print_scx_info(const char *log_lvl, struct task_struct *p);
>  void scx_softlockup(u32 dur_s);
>  bool scx_rcu_cpu_stall(void);
>  
>  #else	/* !CONFIG_SCHED_CLASS_EXT */
>  
> -static inline void sched_ext_free(struct task_struct *p) {}
> +static inline void sched_ext_dead(struct task_struct *p) {}
>  static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
>  static inline void scx_softlockup(u32 dur_s) {}
>  static inline bool scx_rcu_cpu_stall(void) { return false; }
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 960c39c9c264..5ae37909a813 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,7 +736,6 @@ void __put_task_struct(struct task_struct *tsk)
>  	WARN_ON(tsk == current);
>  
>  	unwind_task_free(tsk);
> -	sched_ext_free(tsk);
>  	io_uring_free(tsk);
>  	cgroup_task_free(tsk);
>  	task_numa_free(tsk, true);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 40f12e37f60f..d4dbffb27a66 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
>  
> +		/*
> +		 * sched_ext_dead() must come before cgroup_task_dead() to
> +		 * prevent cgroups from being removed while its member tasks are
> +		 * visible to SCX schedulers.
> +		 */
> +		sched_ext_dead(prev);
>  		cgroup_task_dead(prev);
>  
>  		/* Task is done with its stack. */
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 2b0e88206d07..840bc76210c8 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct *p)
>  	percpu_up_read(&scx_fork_rwsem);
>  }
>  
> -void sched_ext_free(struct task_struct *p)
> +void sched_ext_dead(struct task_struct *p)
>  {
>  	unsigned long flags;
>  
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
  2025-10-31 19:32   ` Andrea Righi
@ 2025-11-03 20:25   ` Tejun Heo
  2025-11-03 20:28     ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:25 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext

sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:

- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
  during scheduler load, because the cgroup might be destroyed/unlinked
  while the zombie or dead task is still lingering on the scx_tasks list.

- ops.cgroup_exit() could be called before ops.exit_task() is called on all
  member tasks, leading to incorrect exit ordering.

Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.

By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:

- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
  as cgroup_mutex prevents cgroup destruction while the task is still linked.

- All member tasks have ops.exit_task() called and are removed from scx_tasks
  before the cgroup can be destroyed and trigger ops.cgroup_exit().

This fix is made possible by the cgroup_task_dead() split in the previous patch.

This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: - Description correction and update (Andrea Righi).

 include/linux/sched/ext.h |    4 ++--
 kernel/fork.c             |    1 -
 kernel/sched/core.c       |    6 ++++++
 kernel/sched/ext.c        |    2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -207,14 +207,14 @@ struct sched_ext_entity {
 	struct list_head	tasks_node;
 };
 
-void sched_ext_free(struct task_struct *p);
+void sched_ext_dead(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
 void scx_softlockup(u32 dur_s);
 bool scx_rcu_cpu_stall(void);
 
 #else	/* !CONFIG_SCHED_CLASS_EXT */
 
-static inline void sched_ext_free(struct task_struct *p) {}
+static inline void sched_ext_dead(struct task_struct *p) {}
 static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
 static inline void scx_softlockup(u32 dur_s) {}
 static inline bool scx_rcu_cpu_stall(void) { return false; }
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,7 +736,6 @@ void __put_task_struct(struct task_struc
 	WARN_ON(tsk == current);
 
 	unwind_task_free(tsk);
-	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_task_free(tsk);
 	task_numa_free(tsk, true);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		/*
+		 * sched_ext_dead() must come before cgroup_task_dead() to
+		 * prevent cgroups from being removed while its member tasks are
+		 * visible to SCX schedulers.
+		 */
+		sched_ext_dead(prev);
 		cgroup_task_dead(prev);
 
 		/* Task is done with its stack. */
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct
 	percpu_up_read(&scx_fork_rwsem);
 }
 
-void sched_ext_free(struct task_struct *p)
+void sched_ext_dead(struct task_struct *p)
 {
 	unsigned long flags;
 

* Re: [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
                   ` (3 preceding siblings ...)
  2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
@ 2025-11-03 20:26 ` Tejun Heo
  2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
  5 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:26 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext

On Tue, Oct 28, 2025 at 08:19:14PM -1000, Tejun Heo wrote:
> This series fixes a cgroup task exit ordering issue that is generally
> suboptimal for all cgroup controllers and has caused real breakage for
> sched_ext schedulers.

Peter, I'm planning to apply these patches soon. Please holler if you have
any objections.

Thanks.

-- 
tejun

* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-11-03 20:25   ` [PATCH v2 " Tejun Heo
@ 2025-11-03 20:28     ` Peter Zijlstra
  2025-11-03 20:31       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-11-03 20:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	linux-kernel, cgroups, sched-ext

On Mon, Nov 03, 2025 at 10:25:13AM -1000, Tejun Heo wrote:
> sched_ext_free() was called from __put_task_struct() when the last reference
> to the task is dropped, which could be long after the task has finished
> running. This causes cgroup-related problems:
> 
> - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
>   during scheduler load, because the cgroup might be destroyed/unlinked
>   while the zombie or dead task is still lingering on the scx_tasks list.
> 
> - ops.cgroup_exit() could be called before ops.exit_task() is called on all
>   member tasks, leading to incorrect exit ordering.
> 
> Fix by moving it to finish_task_switch() to be called right after the final
> context switch away from the dying task, matching when sched_class->task_dead()
> is called. Rename it to sched_ext_dead() to match the new calling context.
> 
> By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
> 
> - Tasks visible on scx_tasks list have valid cgroups during scheduler load,
>   as cgroup_mutex prevents cgroup destruction while the task is still linked.
> 
> - All member tasks have ops.exit_task() called and are removed from scx_tasks
>   before the cgroup can be destroyed and trigger ops.cgroup_exit().
> 
> This fix is made possible by the cgroup_task_dead() split in the previous patch.
> 
> This also makes more sense resource-wise as there's no point in keeping
> scheduler side resources around for dead tasks.
> 
> Reported-by: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: - Description correction and update (Andrea Righi).
> 
>  include/linux/sched/ext.h |    4 ++--
>  kernel/fork.c             |    1 -
>  kernel/sched/core.c       |    6 ++++++
>  kernel/sched/ext.c        |    2 +-
>  4 files changed, 9 insertions(+), 4 deletions(-)
> 
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -207,14 +207,14 @@ struct sched_ext_entity {
>  	struct list_head	tasks_node;
>  };
>  
> -void sched_ext_free(struct task_struct *p);
> +void sched_ext_dead(struct task_struct *p);
>  void print_scx_info(const char *log_lvl, struct task_struct *p);
>  void scx_softlockup(u32 dur_s);
>  bool scx_rcu_cpu_stall(void);
>  
>  #else	/* !CONFIG_SCHED_CLASS_EXT */
>  
> -static inline void sched_ext_free(struct task_struct *p) {}
> +static inline void sched_ext_dead(struct task_struct *p) {}
>  static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
>  static inline void scx_softlockup(u32 dur_s) {}
>  static inline bool scx_rcu_cpu_stall(void) { return false; }
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,7 +736,6 @@ void __put_task_struct(struct task_struc
>  	WARN_ON(tsk == current);
>  
>  	unwind_task_free(tsk);
> -	sched_ext_free(tsk);
>  	io_uring_free(tsk);
>  	cgroup_task_free(tsk);
>  	task_numa_free(tsk, true);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);

^^^ can you not use task_dead_scx() ?

> +		/*
> +		 * sched_ext_dead() must come before cgroup_task_dead() to
> +		 * prevent cgroups from being removed while its member tasks are
> +		 * visible to SCX schedulers.
> +		 */
> +		sched_ext_dead(prev);
>  		cgroup_task_dead(prev);
>  
>  		/* Task is done with its stack. */
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct
>  	percpu_up_read(&scx_fork_rwsem);
>  }
>  
> -void sched_ext_free(struct task_struct *p)
> +void sched_ext_dead(struct task_struct *p)
>  {
>  	unsigned long flags;
>  

* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-11-03 20:28     ` Peter Zijlstra
@ 2025-11-03 20:31       ` Tejun Heo
  2025-11-03 20:38         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	linux-kernel, cgroups, sched-ext

Hello,

On Mon, Nov 03, 2025 at 09:28:43PM +0100, Peter Zijlstra wrote:
> > @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
> >  		if (prev->sched_class->task_dead)
> >  			prev->sched_class->task_dead(prev);
> 
> ^^^ can you not use task_dead_scx() ?

Unfortunately not. Because sched class switches are atomic infallible
operations, all tasks in the system must be prepped on scheduler attach &
fork regardless of their current sched class and thus have to be cleaned up
in the same way on detach & exit.

Thanks.

-- 
tejun

* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
  2025-11-03 20:31       ` Tejun Heo
@ 2025-11-03 20:38         ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-11-03 20:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	linux-kernel, cgroups, sched-ext

On Mon, Nov 03, 2025 at 10:31:36AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 03, 2025 at 09:28:43PM +0100, Peter Zijlstra wrote:
> > > @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
> > >  		if (prev->sched_class->task_dead)
> > >  			prev->sched_class->task_dead(prev);
> > 
> > ^^^ can you not use task_dead_scx() ?
> 
> Unfortunately not. Because sched class switches are atomic infallible
> operations, all tasks in the system must be prepped on scheduler attach &
> fork regardless of their current sched class and thus have to be cleaned up
> in the same way on detach & exit.

Ah, I see scx_post_fork() and sched_ext_free^H^H^H^Hdead() effectively
duplicate the tasklist.

Oh well.

* Re: [PATCH v2 0/4] cgroup: Fix task exit ordering
  2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
                   ` (4 preceding siblings ...)
  2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
@ 2025-11-03 22:04 ` Tejun Heo
  5 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 22:04 UTC (permalink / raw)
  To: David Vernet
  Cc: Andrea Righi, Changwoo Min, Dan Schatzberg, Peter Zijlstra,
	linux-kernel, cgroups, sched-ext

> Tejun Heo (4):
>   cgroup: Rename cgroup_{exit|release|free}() to cgroup_task_{exit|release|free}()
>   cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
>   cgroup: Defer task unlinking from cgroup during exit to finish_task_switch()
>   sched_ext: Rename sched_ext_free() to sched_ext_dead() and move earlier in finish_task_switch()

Applied 1-3 to cgroup/for-6.19 and 4 to sched_ext/for-6.19.

Thanks.
--
tejun

* Re: [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
  2025-10-29  6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
@ 2025-11-14 17:48   ` Michal Koutný
  2025-11-14 18:18     ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2025-11-14 17:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	Peter Zijlstra, linux-kernel, cgroups, sched-ext

On Tue, Oct 28, 2025 at 08:19:16PM -1000, Tejun Heo <tj@kernel.org> wrote:
> Currently, cgroup_task_exit() adds thread group leaders with live member
> threads to their css_set's dying_tasks list (so cgroup.procs iteration can
> still see the leader), and cgroup_task_release() later removes them with
> list_del_init(&task->cg_list).
> 
> An upcoming patch will defer the dying_tasks list addition, moving it from
> cgroup_task_exit() (called from do_exit()) to a new function called from
> finish_task_switch().
> However, release_task() (which calls
> cgroup_task_release()) can run either before or after finish_task_switch(),

Just for better understanding -- when can release_task() run before
finish_task_switch()?

> creating a race where cgroup_task_release() might try to remove the task from
> dying_tasks before or while it's being added.
> 
> Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
> fix this race. cgroup_task_free() runs from __put_task_struct(), which is
> always after both paths, making the cleanup safe.

(Ah, now I get the reasoning of more likely pids '0' for CSS_TASK_ITER_PROCS.)

Thanks,
Michal

* Re: [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out
  2025-10-29  6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
@ 2025-11-14 17:48   ` Michal Koutný
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Koutný @ 2025-11-14 17:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	Peter Zijlstra, linux-kernel, cgroups, sched-ext

On Tue, Oct 28, 2025 at 08:19:17PM -1000, Tejun Heo <tj@kernel.org> wrote:
> When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
> from its cgroup. From the cgroup's perspective, the task is now gone. If this
> makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
> that notify controllers the cgroup is going offline resource-wise.
> 
> However, the exiting task can still run, perform memory operations, and schedule
> until the final context switch in finish_task_switch().
> This creates a confusing
> situation where controllers are told a cgroup is offline while resource
> activities are still happening in it.

(FWIW, I've always considered it impossible to (mm) charge into offlined
memcgs. Although I don't remember whether anything _relied_ on that...

> While this hasn't broken existing controllers,

... so hopefully not.)

> 
> Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
> the subsystem exit callbacks and continues to be called from do_exit(). The
> css_set cleanup is moved to the new cgroup_task_dead() which is called from
> finish_task_switch() after the final context switch, so that the cgroup only
> appears empty after the task is truly done running.
> 
> This also reorders operations so that subsys->exit() is now called before
> unlinking from the cgroup, which shouldn't break anything.
> 
> Cc: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/cgroup.h |  2 ++
>  kernel/cgroup/cgroup.c | 23 ++++++++++++++---------
>  kernel/sched/core.c    |  2 ++
>  3 files changed, 18 insertions(+), 9 deletions(-)

Acked-by: Michal Koutný <mkoutny@suse.com>

> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4068035176c4..bc892e3b37ee 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -138,6 +138,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
>  extern void cgroup_post_fork(struct task_struct *p,
>  			     struct kernel_clone_args *kargs);
>  void cgroup_task_exit(struct task_struct *p);
> +void cgroup_task_dead(struct task_struct *p);
>  void cgroup_task_release(struct task_struct *p);
>  void cgroup_task_free(struct task_struct *p);

"Hi, I'm four-stage process, you may remember me such callbacks as
css_killed and css_release."

* Re: [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
  2025-11-14 17:48   ` Michal Koutný
@ 2025-11-14 18:18     ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-14 18:18 UTC (permalink / raw)
  To: Michal Koutný
  Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
	Peter Zijlstra, linux-kernel, cgroups, sched-ext

Hello,

On Fri, Nov 14, 2025 at 06:48:17PM +0100, Michal Koutný wrote:
> On Tue, Oct 28, 2025 at 08:19:16PM -1000, Tejun Heo <tj@kernel.org> wrote:
> > An upcoming patch will defer the dying_tasks list addition, moving it from
> > cgroup_task_exit() (called from do_exit()) to a new function called from
> > finish_task_switch().
> > However, release_task() (which calls
> > cgroup_task_release()) can run either before or after finish_task_switch(),
> 
> Just for better understanding -- when can release_task() run before
> finish_task_switch()?

I didn't test explicitly, so please take it with a grain of salt, but I
think both autoreap and !autoreap cases can run before the final task
switch.

- When autoreap, the dying task calls exit_notify() and eventually calls
  release_task() on self. This is obviously before the final switch.

- When !autoreap, it's a race. After exit_notify(), the parent can wait on the
  zombie task at any time, which will call release_task() through
  wait_task_zombie(). This can happen either before or after
  finish_task_switch().

> > creating a race where cgroup_task_release() might try to remove the task from
> > dying_tasks before or while it's being added.
> > 
> > Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
> > fix this race. cgroup_task_free() runs from __put_task_struct(), which is
> > always after both paths, making the cleanup safe.
> 
> (Ah, now I get the reasoning of more likely pids '0' for CSS_TASK_ITER_PROCS.)

Yeah, I thought about filtering it out better but if we can already show 0
pid for foreign ns tasks, maybe this is okay. What do you think?

Thanks.

-- 
tejun
