* [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
@ 2025-10-29 6:19 Tejun Heo
2025-10-29 6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
` (5 more replies)
0 siblings, 6 replies; 16+ messages in thread
From: Tejun Heo @ 2025-10-29 6:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
Tejun Heo
Hello,
This series fixes a cgroup task exit ordering issue that is generally
suboptimal for all cgroup controllers and has caused real breakage for
sched_ext schedulers.
Currently, when a task exits, cgroup_task_exit() in do_exit() immediately
unlinks the task from its cgroup via css_set_move_task(). From the cgroup's
perspective, the task is now gone. If this makes the cgroup empty, it can be
destroyed, triggering ->css_offline() callbacks that notify controllers the
cgroup is going offline resource-wise.
However, the exiting task continues to run, perform memory operations, and
schedule until the final context switch in finish_task_switch(). This creates
a problematic window where controllers are told a cgroup is offline while
resource activities are still occurring in it. While this hasn't broken
existing controllers, it's clearly suboptimal and has caused real breakage
for sched_ext schedulers.
The sched_ext breakage manifests in two ways:
1. When a sched_ext scheduler is loaded, it walks all tasks and calls
ops.init_task() on each. For tasks in a cgroup, it first ensures the
cgroup has been initialized via ops.cgroup_init(). However, if the task
is in the dying state (still running but already unlinked from its
cgroup), the cgroup may already be offline. This results in
ops.init_task() being called on a cgroup that never received
ops.cgroup_init(), breaking the initialization invariant.
This broke the scx_mitosis scheduler with errors like "cgrp_ctx lookup
failed for cgid 3869" where the BPF program couldn't find cgroup context
that was never created. See: https://github.com/sched-ext/scx/issues/2846
2. Because sched_ext_free() was called from __put_task_struct() (which can
happen long after the task stops running), ops.cgroup_exit() could be
called before ops.exit_task() was called on all member tasks, violating
the expected ordering where all tasks exit before their cgroup does.
The fix defers the cgroup unlinking from do_exit() to finish_task_switch(),
ensuring the task remains linked to its cgroup until it's truly done running.
For sched_ext specifically, we also move the cleanup earlier to
finish_task_switch() to ensure proper ordering with cgroup operations.
This adds two new calls to finish_task_switch() that operate on the dead task
after the final switch: cgroup_task_dead() and sched_ext_dead(). It may make
sense to factor these into a helper function located in kernel/exit.c if this
pattern continues to grow.
Patch 0004 changes sched_ext and can be applied to sched_ext/for-6.19 after
pulling cgroup/for-6.19. Alternatively, would it be easier for some or all of
this series to go through the tip tree?
Based on cgroup/for-6.19 (d5cf4d34a333).
0001 cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
0002 cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
0003 cgroup: Defer task cgroup unlink until after the task is done switching out
0004 sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-fix-exit-ordering
include/linux/cgroup.h | 14 ++++++++------
include/linux/sched/ext.h | 4 ++--
kernel/cgroup/cgroup.c | 39 +++++++++++++++++++++++----------------
kernel/exit.c | 4 ++--
kernel/fork.c | 3 +--
kernel/sched/autogroup.c | 4 ++--
kernel/sched/core.c | 8 ++++++++
kernel/sched/ext.c | 2 +-
8 files changed, 47 insertions(+), 31 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
@ 2025-10-29 6:19 ` Tejun Heo
2025-10-31 2:25 ` Chen Ridong
2025-10-29 6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29 6:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
Tejun Heo
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 12 ++++++------
kernel/cgroup/cgroup.c | 11 ++++++-----
kernel/exit.c | 4 ++--
kernel/fork.c | 2 +-
kernel/sched/autogroup.c | 4 ++--
5 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6ed477338b16..4068035176c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -137,9 +137,9 @@ extern void cgroup_cancel_fork(struct task_struct *p,
struct kernel_clone_args *kargs);
extern void cgroup_post_fork(struct task_struct *p,
struct kernel_clone_args *kargs);
-void cgroup_exit(struct task_struct *p);
-void cgroup_release(struct task_struct *p);
-void cgroup_free(struct task_struct *p);
+void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_release(struct task_struct *p);
+void cgroup_task_free(struct task_struct *p);
int cgroup_init_early(void);
int cgroup_init(void);
@@ -680,9 +680,9 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
struct kernel_clone_args *kargs) {}
static inline void cgroup_post_fork(struct task_struct *p,
struct kernel_clone_args *kargs) {}
-static inline void cgroup_exit(struct task_struct *p) {}
-static inline void cgroup_release(struct task_struct *p) {}
-static inline void cgroup_free(struct task_struct *p) {}
+static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_release(struct task_struct *p) {}
+static inline void cgroup_task_free(struct task_struct *p) {}
static inline int cgroup_init_early(void) { return 0; }
static inline int cgroup_init(void) { return 0; }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6ae5f48cf64e..826b7fd2f85d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -944,7 +944,8 @@ static void css_set_move_task(struct task_struct *task,
/*
* We are synchronized through cgroup_threadgroup_rwsem
* against PF_EXITING setting such that we can't race
- * against cgroup_exit()/cgroup_free() dropping the css_set.
+ * against cgroup_task_exit()/cgroup_task_free() dropping
+ * the css_set.
*/
WARN_ON_ONCE(task->flags & PF_EXITING);
@@ -6972,13 +6973,13 @@ void cgroup_post_fork(struct task_struct *child,
}
/**
- * cgroup_exit - detach cgroup from exiting task
+ * cgroup_task_exit - detach cgroup from exiting task
* @tsk: pointer to task_struct of exiting process
*
* Description: Detach cgroup from @tsk.
*
*/
-void cgroup_exit(struct task_struct *tsk)
+void cgroup_task_exit(struct task_struct *tsk)
{
struct cgroup_subsys *ss;
struct css_set *cset;
@@ -7010,7 +7011,7 @@ void cgroup_exit(struct task_struct *tsk)
} while_each_subsys_mask();
}
-void cgroup_release(struct task_struct *task)
+void cgroup_task_release(struct task_struct *task)
{
struct cgroup_subsys *ss;
int ssid;
@@ -7027,7 +7028,7 @@ void cgroup_release(struct task_struct *task)
}
}
-void cgroup_free(struct task_struct *task)
+void cgroup_task_free(struct task_struct *task)
{
struct css_set *cset = task_css_set(task);
put_css_set(cset);
diff --git a/kernel/exit.c b/kernel/exit.c
index 9f74e8f1c431..46173461e8de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -257,7 +257,7 @@ void release_task(struct task_struct *p)
rcu_read_unlock();
pidfs_exit(p);
- cgroup_release(p);
+ cgroup_task_release(p);
/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
thread_pid = task_pid(p);
@@ -967,7 +967,7 @@ void __noreturn do_exit(long code)
exit_thread(tsk);
sched_autogroup_exit_task(tsk);
- cgroup_exit(tsk);
+ cgroup_task_exit(tsk);
/*
* FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..960c39c9c264 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -738,7 +738,7 @@ void __put_task_struct(struct task_struct *tsk)
unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
- cgroup_free(tsk);
+ cgroup_task_free(tsk);
task_numa_free(tsk, true);
security_task_free(tsk);
exit_creds(tsk);
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index cdea931aae30..954137775f38 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -178,8 +178,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
* this process can already run with task_group() == prev->tg or we can
* race with cgroup code which can read autogroup = prev under rq->lock.
* In the latter case for_each_thread() can not miss a migrating thread,
- * cpu_cgroup_attach() must not be possible after cgroup_exit() and it
- * can't be removed from thread list, we hold ->siglock.
+ * cpu_cgroup_attach() must not be possible after cgroup_task_exit()
+ * and it can't be removed from thread list, we hold ->siglock.
*
* If an exiting thread was already removed from thread list we rely on
* sched_autogroup_exit_task().
--
2.51.1
* [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-10-29 6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
@ 2025-10-29 6:19 ` Tejun Heo
2025-11-14 17:48 ` Michal Koutný
2025-10-29 6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
` (3 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29 6:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
Tejun Heo
Currently, cgroup_task_exit() adds thread group leaders with live member
threads to their css_set's dying_tasks list (so cgroup.procs iteration can
still see the leader), and cgroup_task_release() later removes them with
list_del_init(&task->cg_list).
An upcoming patch will defer the dying_tasks list addition, moving it from
cgroup_task_exit() (called from do_exit()) to a new function called from
finish_task_switch(). However, release_task() (which calls
cgroup_task_release()) can run either before or after finish_task_switch(),
creating a race where cgroup_task_release() might try to remove the task from
dying_tasks before or while it's being added.
Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
fix this race. cgroup_task_free() runs from __put_task_struct(), which is
always after both paths, making the cleanup safe.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 826b7fd2f85d..b3c27900c5d2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7019,6 +7019,11 @@ void cgroup_task_release(struct task_struct *task)
do_each_subsys_mask(ss, ssid, have_release_callback) {
ss->release(task);
} while_each_subsys_mask();
+}
+
+void cgroup_task_free(struct task_struct *task)
+{
+ struct css_set *cset = task_css_set(task);
if (!list_empty(&task->cg_list)) {
spin_lock_irq(&css_set_lock);
@@ -7026,11 +7031,7 @@ void cgroup_task_release(struct task_struct *task)
list_del_init(&task->cg_list);
spin_unlock_irq(&css_set_lock);
}
-}
-void cgroup_task_free(struct task_struct *task)
-{
- struct css_set *cset = task_css_set(task);
put_css_set(cset);
}
--
2.51.1
* [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-10-29 6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
2025-10-29 6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
@ 2025-10-29 6:19 ` Tejun Heo
2025-11-14 17:48 ` Michal Koutný
2025-10-29 6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
` (2 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-10-29 6:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
Tejun Heo
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.
However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused real breakage for sched_ext schedulers.
Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.
This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 2 ++
kernel/cgroup/cgroup.c | 23 ++++++++++++++---------
kernel/sched/core.c | 2 ++
3 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4068035176c4..bc892e3b37ee 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -138,6 +138,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
extern void cgroup_post_fork(struct task_struct *p,
struct kernel_clone_args *kargs);
void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_dead(struct task_struct *p);
void cgroup_task_release(struct task_struct *p);
void cgroup_task_free(struct task_struct *p);
@@ -681,6 +682,7 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
static inline void cgroup_post_fork(struct task_struct *p,
struct kernel_clone_args *kargs) {}
static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_dead(struct task_struct *p) {}
static inline void cgroup_task_release(struct task_struct *p) {}
static inline void cgroup_task_free(struct task_struct *p) {}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b3c27900c5d2..aae180d56c8c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -944,7 +944,7 @@ static void css_set_move_task(struct task_struct *task,
/*
* We are synchronized through cgroup_threadgroup_rwsem
* against PF_EXITING setting such that we can't race
- * against cgroup_task_exit()/cgroup_task_free() dropping
+ * against cgroup_task_dead()/cgroup_task_free() dropping
* the css_set.
*/
WARN_ON_ONCE(task->flags & PF_EXITING);
@@ -6982,10 +6982,20 @@ void cgroup_post_fork(struct task_struct *child,
void cgroup_task_exit(struct task_struct *tsk)
{
struct cgroup_subsys *ss;
- struct css_set *cset;
int i;
- spin_lock_irq(&css_set_lock);
+ /* see cgroup_post_fork() for details */
+ do_each_subsys_mask(ss, i, have_exit_callback) {
+ ss->exit(tsk);
+ } while_each_subsys_mask();
+}
+
+void cgroup_task_dead(struct task_struct *tsk)
+{
+ struct css_set *cset;
+ unsigned long flags;
+
+ spin_lock_irqsave(&css_set_lock, flags);
WARN_ON_ONCE(list_empty(&tsk->cg_list));
cset = task_css_set(tsk);
@@ -7003,12 +7013,7 @@ void cgroup_task_exit(struct task_struct *tsk)
test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
cgroup_update_frozen(task_dfl_cgroup(tsk));
- spin_unlock_irq(&css_set_lock);
-
- /* see cgroup_post_fork() for details */
- do_each_subsys_mask(ss, i, have_exit_callback) {
- ss->exit(tsk);
- } while_each_subsys_mask();
+ spin_unlock_irqrestore(&css_set_lock, flags);
}
void cgroup_task_release(struct task_struct *task)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67b48e2..40f12e37f60f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
+ cgroup_task_dead(prev);
+
/* Task is done with its stack. */
put_task_stack(prev);
--
2.51.1
* [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
` (2 preceding siblings ...)
2025-10-29 6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
@ 2025-10-29 6:19 ` Tejun Heo
2025-10-31 19:32 ` Andrea Righi
2025-11-03 20:25 ` [PATCH v2 " Tejun Heo
2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
5 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2025-10-29 6:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext,
Tejun Heo
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.task_init() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler-side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 4 ++--
kernel/fork.c | 1 -
kernel/sched/core.c | 6 ++++++
kernel/sched/ext.c | 2 +-
4 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d82b7a9b0658..d7dd77be571f 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -207,14 +207,14 @@ struct sched_ext_entity {
struct list_head tasks_node;
};
-void sched_ext_free(struct task_struct *p);
+void sched_ext_dead(struct task_struct *p);
void print_scx_info(const char *log_lvl, struct task_struct *p);
void scx_softlockup(u32 dur_s);
bool scx_rcu_cpu_stall(void);
#else /* !CONFIG_SCHED_CLASS_EXT */
-static inline void sched_ext_free(struct task_struct *p) {}
+static inline void sched_ext_dead(struct task_struct *p) {}
static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
static inline void scx_softlockup(u32 dur_s) {}
static inline bool scx_rcu_cpu_stall(void) { return false; }
diff --git a/kernel/fork.c b/kernel/fork.c
index 960c39c9c264..5ae37909a813 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,7 +736,6 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(tsk == current);
unwind_task_free(tsk);
- sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_task_free(tsk);
task_numa_free(tsk, true);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 40f12e37f60f..d4dbffb27a66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
+ /*
+ * sched_ext_dead() must come before cgroup_task_dead() to
+ * prevent cgroups from being removed while its member tasks are
+ * visible to SCX schedulers.
+ */
+ sched_ext_dead(prev);
cgroup_task_dead(prev);
/* Task is done with its stack. */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2b0e88206d07..840bc76210c8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct *p)
percpu_up_read(&scx_fork_rwsem);
}
-void sched_ext_free(struct task_struct *p)
+void sched_ext_dead(struct task_struct *p)
{
unsigned long flags;
--
2.51.1
* Re: [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
2025-10-29 6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
@ 2025-10-31 2:25 ` Chen Ridong
0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2025-10-31 2:25 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext
On 2025/10/29 14:19, Tejun Heo wrote:
> The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
> confusing because they look like they're operating on cgroups themselves when
> they're actually task lifecycle hooks. For example, cgroup_init() initializes
> the cgroup subsystem while cgroup_exit() is a task exit notification to
> cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
> cgroup_task_free() to make it clear that these operate on tasks.
>
This makes sense.
Reviewed-by: Chen Ridong <chenridong@huawei.com>
--
Best regards,
Ridong
* Re: [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-10-29 6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
@ 2025-10-31 19:32 ` Andrea Righi
2025-11-03 20:25 ` [PATCH v2 " Tejun Heo
1 sibling, 0 replies; 16+ messages in thread
From: Andrea Righi @ 2025-10-31 19:32 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Dan Schatzberg, Peter Zijlstra,
linux-kernel, cgroups, sched-ext
Hi Tejun,
On Tue, Oct 28, 2025 at 08:19:18PM -1000, Tejun Heo wrote:
> sched_ext_free() was called from __put_task_struct() when the last reference
> to the task is dropped, which could be long after the task has finished
> running. This causes cgroup-related problems:
>
> - ops.task_init() can be called on a cgroup which didn't get ops.cgroup_init()'d
> during scheduler load.
s/task_init/init_task/
Also, it took me a bit to understand this point, maybe we could add
something like this to make it more clear:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie task is still lingering on the scx_tasks list.
>
> - ops.cgroup_exit() could be called before ops.exit_task() is called on all
> member tasks, leading to incorrect exit ordering.
>
> Fix by moving it to finish_task_switch() to be called right after the final
> context switch away from the dying task, matching when sched_class->task_dead()
> is called. Rename it to sched_ext_dead() to match the new calling context.
>
> By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
>
> - Tasks visible on scx_tasks list have valid cgroups during scheduler load,
> as cgroup_mutex prevents cgroup destruction while the task is still linked.
>
> - All member tasks have ops.exit_task() called and are removed from scx_tasks
> before the cgroup can be destroyed and trigger ops.cgroup_exit().
>
> This fix is made possible by the cgroup_task_dead() split in the previous patch.
>
> This also makes more sense resource-wise as there's no point in keeping
> scheduler side resources around for dead tasks.
>
> Reported-by: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Apart from the minor comment in the description above, everything else
looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-10-29 6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
2025-10-31 19:32 ` Andrea Righi
@ 2025-11-03 20:25 ` Tejun Heo
2025-11-03 20:28 ` Peter Zijlstra
1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:25 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: - Description correction and update (Andrea Righi).
include/linux/sched/ext.h | 4 ++--
kernel/fork.c | 1 -
kernel/sched/core.c | 6 ++++++
kernel/sched/ext.c | 2 +-
4 files changed, 9 insertions(+), 4 deletions(-)
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -207,14 +207,14 @@ struct sched_ext_entity {
struct list_head tasks_node;
};
-void sched_ext_free(struct task_struct *p);
+void sched_ext_dead(struct task_struct *p);
void print_scx_info(const char *log_lvl, struct task_struct *p);
void scx_softlockup(u32 dur_s);
bool scx_rcu_cpu_stall(void);
#else /* !CONFIG_SCHED_CLASS_EXT */
-static inline void sched_ext_free(struct task_struct *p) {}
+static inline void sched_ext_dead(struct task_struct *p) {}
static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
static inline void scx_softlockup(u32 dur_s) {}
static inline bool scx_rcu_cpu_stall(void) { return false; }
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,7 +736,6 @@ void __put_task_struct(struct task_struc
WARN_ON(tsk == current);
unwind_task_free(tsk);
- sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_task_free(tsk);
task_numa_free(tsk, true);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
+ /*
+ * sched_ext_dead() must come before cgroup_task_dead() to
+ * prevent cgroups from being removed while its member tasks are
+ * visible to SCX schedulers.
+ */
+ sched_ext_dead(prev);
cgroup_task_dead(prev);
/* Task is done with its stack. */
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct
percpu_up_read(&scx_fork_rwsem);
}
-void sched_ext_free(struct task_struct *p)
+void sched_ext_dead(struct task_struct *p)
{
unsigned long flags;
* Re: [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
` (3 preceding siblings ...)
2025-10-29 6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
@ 2025-11-03 20:26 ` Tejun Heo
2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
5 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:26 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Dan Schatzberg, Peter Zijlstra, linux-kernel, cgroups, sched-ext
On Tue, Oct 28, 2025 at 08:19:14PM -1000, Tejun Heo wrote:
> This series fixes a cgroup task exit ordering issue that is generally
> suboptimal for all cgroup controllers and has caused real breakage for
> sched_ext schedulers.
Peter, I'm planning to apply these patches soon. Please holler if you have
any objections.
Thanks.
--
tejun
* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-11-03 20:25 ` [PATCH v2 " Tejun Heo
@ 2025-11-03 20:28 ` Peter Zijlstra
2025-11-03 20:31 ` Tejun Heo
0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-11-03 20:28 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
linux-kernel, cgroups, sched-ext
On Mon, Nov 03, 2025 at 10:25:13AM -1000, Tejun Heo wrote:
> sched_ext_free() was called from __put_task_struct() when the last reference
> to the task is dropped, which could be long after the task has finished
> running. This causes cgroup-related problems:
>
> - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
> during scheduler load, because the cgroup might be destroyed/unlinked
> while the zombie or dead task is still lingering on the scx_tasks list.
>
> - ops.cgroup_exit() could be called before ops.exit_task() is called on all
> member tasks, leading to incorrect exit ordering.
>
> Fix by moving it to finish_task_switch() to be called right after the final
> context switch away from the dying task, matching when sched_class->task_dead()
> is called. Rename it to sched_ext_dead() to match the new calling context.
>
> By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
>
> - Tasks visible on scx_tasks list have valid cgroups during scheduler load,
> as cgroup_mutex prevents cgroup destruction while the task is still linked.
>
> - All member tasks have ops.exit_task() called and are removed from scx_tasks
> before the cgroup can be destroyed and trigger ops.cgroup_exit().
>
> This fix is made possible by the cgroup_task_dead() split in the previous patch.
>
> This also makes more sense resource-wise as there's no point in keeping
> scheduler side resources around for dead tasks.
>
> Reported-by: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: - Description correction and update (Andrea Righi).
>
> include/linux/sched/ext.h | 4 ++--
> kernel/fork.c | 1 -
> kernel/sched/core.c | 6 ++++++
> kernel/sched/ext.c | 2 +-
> 4 files changed, 9 insertions(+), 4 deletions(-)
>
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -207,14 +207,14 @@ struct sched_ext_entity {
> struct list_head tasks_node;
> };
>
> -void sched_ext_free(struct task_struct *p);
> +void sched_ext_dead(struct task_struct *p);
> void print_scx_info(const char *log_lvl, struct task_struct *p);
> void scx_softlockup(u32 dur_s);
> bool scx_rcu_cpu_stall(void);
>
> #else /* !CONFIG_SCHED_CLASS_EXT */
>
> -static inline void sched_ext_free(struct task_struct *p) {}
> +static inline void sched_ext_dead(struct task_struct *p) {}
> static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
> static inline void scx_softlockup(u32 dur_s) {}
> static inline bool scx_rcu_cpu_stall(void) { return false; }
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,7 +736,6 @@ void __put_task_struct(struct task_struc
> WARN_ON(tsk == current);
>
> unwind_task_free(tsk);
> - sched_ext_free(tsk);
> io_uring_free(tsk);
> cgroup_task_free(tsk);
> task_numa_free(tsk, true);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
> if (prev->sched_class->task_dead)
> prev->sched_class->task_dead(prev);
^^^ can you not use task_dead_scx() ?
> + /*
> + * sched_ext_dead() must come before cgroup_task_dead() to
> + * prevent cgroups from being removed while its member tasks are
> + * visible to SCX schedulers.
> + */
> + sched_ext_dead(prev);
> cgroup_task_dead(prev);
>
> /* Task is done with its stack. */
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct
> percpu_up_read(&scx_fork_rwsem);
> }
>
> -void sched_ext_free(struct task_struct *p)
> +void sched_ext_dead(struct task_struct *p)
> {
> unsigned long flags;
>
* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-11-03 20:28 ` Peter Zijlstra
@ 2025-11-03 20:31 ` Tejun Heo
2025-11-03 20:38 ` Peter Zijlstra
0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 20:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
linux-kernel, cgroups, sched-ext
Hello,
On Mon, Nov 03, 2025 at 09:28:43PM +0100, Peter Zijlstra wrote:
> > @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
> > if (prev->sched_class->task_dead)
> > prev->sched_class->task_dead(prev);
>
> ^^^ can you not use task_dead_scx() ?
Unfortunately not. Because sched class switches are atomic infallible
operations, all tasks in the system must be prepped on scheduler attach &
fork regardless of their current sched class and thus have to be cleaned up
in the same way on detach & exit.
Thanks.
--
tejun
* Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
2025-11-03 20:31 ` Tejun Heo
@ 2025-11-03 20:38 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-11-03 20:38 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
linux-kernel, cgroups, sched-ext
On Mon, Nov 03, 2025 at 10:31:36AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Nov 03, 2025 at 09:28:43PM +0100, Peter Zijlstra wrote:
> > > @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
> > > if (prev->sched_class->task_dead)
> > > prev->sched_class->task_dead(prev);
> >
> > ^^^ can you not use task_dead_scx() ?
>
> Unfortunately not. Because sched class switches are atomic infallible
> operations, all tasks in the system must be prepped on scheduler attach &
> fork regardless of their current sched class and thus have to be cleaned up
> in the same way on detach & exit.
Ah, I see scx_post_fork() and sched_ext_free^H^H^H^Hdead() effectively
duplicate the tasklist.
Oh well.
* Re: [PATCH v2 0/4] cgroup: Fix task exit ordering
2025-10-29 6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
` (4 preceding siblings ...)
2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
@ 2025-11-03 22:04 ` Tejun Heo
5 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-03 22:04 UTC (permalink / raw)
To: David Vernet
Cc: Andrea Righi, Changwoo Min, Dan Schatzberg, Peter Zijlstra,
linux-kernel, cgroups, sched-ext
> Tejun Heo (4):
> cgroup: Rename cgroup_{exit|release|free}() to cgroup_task_{exit|release|free}()
> cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
> cgroup: Defer task unlinking from cgroup during exit to finish_task_switch()
> sched_ext: Rename sched_ext_free() to sched_ext_dead() and move earlier in finish_task_switch()
Applied 1-3 to cgroup/for-6.19 and 4 to sched_ext/for-6.19.
Thanks.
--
tejun
* Re: [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
2025-10-29 6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
@ 2025-11-14 17:48 ` Michal Koutný
2025-11-14 18:18 ` Tejun Heo
0 siblings, 1 reply; 16+ messages in thread
From: Michal Koutný @ 2025-11-14 17:48 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
Peter Zijlstra, linux-kernel, cgroups, sched-ext
On Tue, Oct 28, 2025 at 08:19:16PM -1000, Tejun Heo <tj@kernel.org> wrote:
> Currently, cgroup_task_exit() adds thread group leaders with live member
> threads to their css_set's dying_tasks list (so cgroup.procs iteration can
> still see the leader), and cgroup_task_release() later removes them with
> list_del_init(&task->cg_list).
>
> An upcoming patch will defer the dying_tasks list addition, moving it from
> cgroup_task_exit() (called from do_exit()) to a new function called from
> finish_task_switch().
> However, release_task() (which calls
> cgroup_task_release()) can run either before or after finish_task_switch(),
Just for better understanding -- when can release_task() run before
finish_task_switch()?
> creating a race where cgroup_task_release() might try to remove the task from
> dying_tasks before or while it's being added.
>
> Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
> fix this race. cgroup_task_free() runs from __put_task_struct(), which is
> always after both paths, making the cleanup safe.
(Ah, now I get the reasoning of more likely pids '0' for CSS_TASK_ITER_PROCS.)
Thanks,
Michal
* Re: [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out
2025-10-29 6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
@ 2025-11-14 17:48 ` Michal Koutný
0 siblings, 0 replies; 16+ messages in thread
From: Michal Koutný @ 2025-11-14 17:48 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
Peter Zijlstra, linux-kernel, cgroups, sched-ext
On Tue, Oct 28, 2025 at 08:19:17PM -1000, Tejun Heo <tj@kernel.org> wrote:
> When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
> from its cgroup. From the cgroup's perspective, the task is now gone. If this
> makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
> that notify controllers the cgroup is going offline resource-wise.
>
> However, the exiting task can still run, perform memory operations, and schedule
> until the final context switch in finish_task_switch().
> This creates a confusing
> situation where controllers are told a cgroup is offline while resource
> activities are still happening in it.
(FWIW, I've always considered it impossible to (mm) charge into offlined
memcgs. Although I don't remember whether anything _relied_ on that...
> While this hasn't broken existing controllers,
... so hopefully not.)
>
> Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
> the subsystem exit callbacks and continues to be called from do_exit(). The
> css_set cleanup is moved to the new cgroup_task_dead() which is called from
> finish_task_switch() after the final context switch, so that the cgroup only
> appears empty after the task is truly done running.
>
> This also reorders operations so that subsys->exit() is now called before
> unlinking from the cgroup, which shouldn't break anything.
>
> Cc: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> include/linux/cgroup.h | 2 ++
> kernel/cgroup/cgroup.c | 23 ++++++++++++++---------
> kernel/sched/core.c | 2 ++
> 3 files changed, 18 insertions(+), 9 deletions(-)
Acked-by: Michal Koutný <mkoutny@suse.com>
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4068035176c4..bc892e3b37ee 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -138,6 +138,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
> extern void cgroup_post_fork(struct task_struct *p,
> struct kernel_clone_args *kargs);
> void cgroup_task_exit(struct task_struct *p);
> +void cgroup_task_dead(struct task_struct *p);
> void cgroup_task_release(struct task_struct *p);
> void cgroup_task_free(struct task_struct *p);
"Hi, I'm a four-stage process, you may remember me from such callbacks as
css_killed and css_release."
* Re: [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
2025-11-14 17:48 ` Michal Koutný
@ 2025-11-14 18:18 ` Tejun Heo
0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2025-11-14 18:18 UTC (permalink / raw)
To: Michal Koutný
Cc: David Vernet, Andrea Righi, Changwoo Min, Dan Schatzberg,
Peter Zijlstra, linux-kernel, cgroups, sched-ext
Hello,
On Fri, Nov 14, 2025 at 06:48:17PM +0100, Michal Koutný wrote:
> On Tue, Oct 28, 2025 at 08:19:16PM -1000, Tejun Heo <tj@kernel.org> wrote:
> > An upcoming patch will defer the dying_tasks list addition, moving it from
> > cgroup_task_exit() (called from do_exit()) to a new function called from
> > finish_task_switch().
> > However, release_task() (which calls
> > cgroup_task_release()) can run either before or after finish_task_switch(),
>
> Just for better understanding -- when can release_task() run before
> finish_task_switch()?
I didn't test explicitly, so please take it with a grain of salt, but I
think both autoreap and !autoreap cases can run before the final task
switch.
- When autoreap, the dying task calls exit_notify() and eventually calls
release_task() on self. This is obviously before the final switch.
- When !autoreap, it's a race. After exit_notify(), the parent can reap the
zombie task at any time, which will call release_task() through
wait_task_zombie(). This can happen either before or after
finish_task_switch().
> > creating a race where cgroup_task_release() might try to remove the task from
> > dying_tasks before or while it's being added.
> >
> > Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
> > fix this race. cgroup_task_free() runs from __put_task_struct(), which is
> > always after both paths, making the cleanup safe.
>
> (Ah, now I get the reasoning of more likely pids '0' for CSS_TASK_ITER_PROCS.)
Yeah, I thought about filtering it out better but if we can already show 0
pid for foreign ns tasks, maybe this is okay. What do you think?
Thanks.
--
tejun